ScaffBench 2.1: P1 — reproducibility metadata (toolchains + generator version) #257
… version)
collectMetadata now records what's needed to qualify a published pass/fail:
- bfGeneratorVersion: the resolved create-better-fullstack@latest version actually exercised by the assisted paths (gitHead only describes the local checkout; best-effort via `npm view`, undefined offline).
- toolchains: host rustc/cargo/go/dotnet/python/uv/protoc/psql versions, since validation runs non-frozen network installs against them.
- environmentQualified: true, which flags that pass/fail is environment-dependent.

Verified end-to-end via the matrix run (bfGeneratorVersion=2.1.1; all local toolchains captured). Additive only: no effect on scoring/validation; 26 tests still pass, type-check unchanged.
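As a rough illustration of the toolchain capture described above, the following sketch probes each tool and keeps only a version-like token from its output. The names `TOOLCHAIN_PROBES`, `extractVersion`, `collectToolchains`, and the injected `Run` callback are hypothetical and not taken from the PR; the real `collectMetadata` may store raw output or use a different probe set. The command runner is injected so the sketch stays runnable without any of the tools installed.

```typescript
// Hypothetical sketch; not the PR's actual implementation.
// `Run` stands in for whatever spawns a command and returns its stdout.
type Run = (cmd: string, args: string[]) => string | undefined;

// Version flag per tool (assumed probe set; note that Go uses `go version`).
const TOOLCHAIN_PROBES: Record<string, string[]> = {
  rustc: ["--version"],
  cargo: ["--version"],
  go: ["version"],
  dotnet: ["--version"],
  python: ["--version"],
  uv: ["--version"],
  protoc: ["--version"],
  psql: ["--version"],
};

// Pull the first dotted number out of output such as "rustc 1.79.0 (...)".
function extractVersion(output: string | undefined): string | undefined {
  return output?.match(/\d+(?:\.\d+)+/)?.[0];
}

// Best-effort: a missing or failing tool is recorded as undefined, never thrown.
function collectToolchains(run: Run): Record<string, string | undefined> {
  const versions: Record<string, string | undefined> = {};
  for (const [tool, args] of Object.entries(TOOLCHAIN_PROBES)) {
    let text: string | undefined;
    try {
      text = run(tool, args);
    } catch {
      text = undefined;
    }
    versions[tool] = extractVersion(text);
  }
  return versions;
}

// Usage with canned output instead of real processes.
const cannedOutput: Record<string, string> = {
  rustc: "rustc 1.79.0 (129f3b996 2024-06-10)",
  go: "go version go1.22.4 linux/amd64",
  psql: "psql (PostgreSQL) 16.3",
};
const fakeRun: Run = (cmd) => cannedOutput[cmd];
const toolchains = collectToolchains(fakeRun);
```

Tools absent from the canned map come back as `undefined`, which mirrors the best-effort behaviour the commit message describes.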
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c720478d8f
    const bfGeneratorVersion = await tryCommandText(
      "npm",
      ["view", "create-better-fullstack@latest", "version"],
      process.cwd(),
Pin the recorded generator version
For long or resumed publishable runs, this records the latest dist-tag once at startup, but the CLI/MCP paths still invoke better-fullstack@latest / create-better-fullstack@latest later for each assisted run. If a new CLI is published while the matrix is running or before a resume, later assisted runs exercise a different package than bfGeneratorVersion, making the reproducibility metadata misleading. Either substitute this resolved version into the assisted commands/MCP config or record the resolved package version per run.
Fixed by actually pinning the version. resolveBfVersion() resolves the @latest dist-tag once at run start (RESOLVED_BF_VERSION), and every harness-controlled assisted invocation — canonicalCommand, the MCP config, the doctor lane, and the CLI prompt instruction — now references that exact version via bfSpec() rather than @latest. Metadata reuses the same resolved value, so a publish mid-run can no longer make later runs diverge from bfGeneratorVersion. Verified: spec.json and the MCP config pin better-fullstack@2.1.1.
…dex #257) Recording bfGeneratorVersion alone was misleading: the CLI/MCP paths still invoked @latest per run, so a publish mid-matrix/resume would exercise a different package than metadata claims. Now resolveBfVersion() resolves the @latest dist-tag ONCE at run start (RESOLVED_BF_VERSION) and every harness-controlled assisted invocation — canonicalCommand, the MCP config, the doctor lane, and the CLI prompt instruction — references that pinned version via bfSpec(). Falls back to "latest" offline; metadata reuses the same value (no duplicate npm view). Verified: spec.json canonical commands and the MCP config pin better-fullstack@2.1.1.
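The resolve-once-and-pin approach described in this commit message can be sketched roughly as follows. The function names `resolveBfVersion` and `bfSpec` come from the discussion, but their bodies here are assumptions, as are the `createVersionPin` factory, the `VersionLookup` callback, and the semver check. In the harness the lookup would presumably wrap `npm view create-better-fullstack@latest version`; it is injected here so the sketch runs without network access.

```typescript
// Hypothetical sketch; bodies are assumed, not copied from the PR.
// `VersionLookup` stands in for the registry query (e.g. an `npm view` call).
type VersionLookup = () => string | undefined;

// Accept only plain semver-like output so npm error text is never pinned.
const SEMVER = /^\d+\.\d+\.\d+(?:-[0-9A-Za-z.-]+)?(?:\+[0-9A-Za-z.-]+)?$/;

function createVersionPin(lookup: VersionLookup) {
  let resolved: string | undefined;

  // Resolve the dist-tag at most once; fall back to "latest" when the
  // lookup fails, returns nothing, or returns something that is not a version.
  function resolveBfVersion(): string {
    if (resolved === undefined) {
      let raw: string | undefined;
      try {
        raw = lookup()?.trim();
      } catch {
        raw = undefined;
      }
      resolved = raw !== undefined && SEMVER.test(raw) ? raw : "latest";
    }
    return resolved;
  }

  // Every assisted invocation builds its package spec from the pinned value.
  function bfSpec(pkg: string): string {
    return `${pkg}@${resolveBfVersion()}`;
  }

  return { resolveBfVersion, bfSpec };
}

// Usage: a stub registry that "publishes" 2.2.0 after the first lookup.
let lookups = 0;
const pin = createVersionPin(() => (lookups++ === 0 ? "2.1.1\n" : "2.2.0\n"));
const first = pin.bfSpec("create-better-fullstack");
const second = pin.bfSpec("better-fullstack");

// Offline case: no version available, so the spec falls back to the tag.
const offline = createVersionPin(() => undefined).bfSpec("create-better-fullstack");
```

Because the lookup result is cached, both specs carry the version seen at the first call, and the metadata can reuse `resolveBfVersion()` without a second registry query.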
Summary
Phase 5: reproducibility metadata. Stacked on #256; retargets to `Development` as the stack merges.

A published pass/fail is currently a function of (host toolchain versions, network state, the resolved generator, date), none of them recorded.

`collectMetadata` now captures:
- `bfGeneratorVersion`: the resolved `create-better-fullstack@latest` version actually exercised by the assisted paths (`gitHead` only describes the local checkout). Best-effort via `npm view`; `undefined` offline.
- `toolchains`: host `rustc`/`cargo`/`go`/`dotnet`/`python`/`uv`/`protoc`/`psql` versions, since validation runs non-frozen network installs against them.
- `environmentQualified: true`: flags pass/fail as environment-dependent.

Validation
`bun run scaffbench:2.1:matrix` → `summary.json` metadata shows `bfGeneratorVersion: 2.1.1` and all local toolchain versions (additive; collected once at startup, best-effort).

Benchmark precedent
SWE-bench pins the exact repo commit + harness version per run; Artificial Analysis pins each result to a dated snapshot. Reproducible benchmarks never test a `package@latest` whose version they don't record.