ScaffBench 2.1: P1 — reproducibility metadata (toolchains + generator version) #257

Merged
Marve10s merged 2 commits into Development from ibrahim/scaffbench-2.1-p1-metadata
Jun 25, 2026

@Marve10s

Owner

Summary

Phase 5 — reproducibility metadata. Stacked on #256; retargets to Development as the stack merges.

A published pass/fail currently depends on host toolchain versions, network state, the resolved generator, and the date, none of which is recorded. collectMetadata now captures:

  • bfGeneratorVersion — the resolved create-better-fullstack@latest version actually exercised by the assisted paths (gitHead only describes the local checkout). Best-effort via npm view; undefined offline.
  • toolchains — host rustc/cargo/go/dotnet/python/uv/protoc/psql versions, since validation runs non-frozen network installs against them.
  • environmentQualified: true — flags pass/fail as environment-dependent.
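
A minimal sketch of how best-effort collection along these lines could look. This is not the code from scripts/scaffbench-v2-lib.ts: `Runner`, `defaultRunner`, `TOOLCHAIN_PROBES`, `RunMetadata`, and `collectMetadataSketch` are hypothetical names, the probe flags are each tool's conventional version flag, and `python3` as the Python binary name is an assumption. Any probe that fails (missing binary, non-zero exit, timeout, no network) yields undefined instead of aborting the run.

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

// Hypothetical runner type: resolves to trimmed stdout, or undefined on any failure.
type Runner = (cmd: string, args: string[]) => Promise<string | undefined>;

// Best-effort default: a missing binary, non-zero exit, or timeout all yield undefined.
const defaultRunner: Runner = async (cmd, args) => {
  try {
    const { stdout } = await execFileAsync(cmd, args, { encoding: "utf8", timeout: 10_000 });
    const text = stdout.trim();
    return text.length > 0 ? text : undefined;
  } catch {
    return undefined;
  }
};

// Version probes for the toolchains listed above (assumed commands and flags).
const TOOLCHAIN_PROBES: Record<string, [string, string[]]> = {
  rustc: ["rustc", ["--version"]],
  cargo: ["cargo", ["--version"]],
  go: ["go", ["version"]],
  dotnet: ["dotnet", ["--version"]],
  python: ["python3", ["--version"]],
  uv: ["uv", ["--version"]],
  protoc: ["protoc", ["--version"]],
  psql: ["psql", ["--version"]],
};

interface RunMetadata {
  bfGeneratorVersion: string | undefined; // undefined when npm cannot be reached
  toolchains: Record<string, string | undefined>;
  environmentQualified: true;
}

// Collects everything once; the runner is injectable so the logic can be tested offline.
async function collectMetadataSketch(run: Runner = defaultRunner): Promise<RunMetadata> {
  const entries = await Promise.all(
    Object.entries(TOOLCHAIN_PROBES).map(
      async ([name, [cmd, args]]) => [name, await run(cmd, args)] as const,
    ),
  );
  const bfGeneratorVersion = await run("npm", ["view", "create-better-fullstack@latest", "version"]);
  return { bfGeneratorVersion, toolchains: Object.fromEntries(entries), environmentQualified: true };
}
```

Keeping the runner injectable means the metadata shape can be unit-tested without any of the toolchains installed.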

Validation

  • End-to-end via bun run scaffbench:2.1:matrix → summary.json metadata shows bfGeneratorVersion: 2.1.1 and all local toolchain versions (additive; collected once at startup, best-effort).
  • 26 unit tests still pass; type-check unchanged. No effect on scoring/validation.

Benchmark precedent

SWE-bench pins the exact repo commit + harness version per run; Artificial Analysis pins each result to a dated snapshot. Reproducible benchmarks never test a package@latest whose version they don't record.

… version)

collectMetadata now records what's needed to qualify a published pass/fail:
- bfGeneratorVersion: the resolved create-better-fullstack@latest version
  actually exercised by the assisted paths (gitHead only describes the local
  checkout; best-effort via `npm view`, undefined offline).
- toolchains: host rustc/cargo/go/dotnet/python/uv/protoc/psql versions, since
  validation runs non-frozen network installs against them.
- environmentQualified: true — flags that pass/fail is environment-dependent.

Verified end-to-end via the matrix run (bfGeneratorVersion=2.1.1; all local
toolchains captured). Additive only — no effect on scoring/validation; 26 tests
still pass, type-check unchanged.
@vercel

vercel Bot commented Jun 25, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
better-fullstack-web Ready Ready Preview, Comment Jun 25, 2026 4:21pm

@github-actions github-actions Bot added size:M 30-99 effective changed lines (test files excluded in mixed PRs). vouch:trusted PR author is trusted by repo permissions or the VOUCHED list. labels Jun 25, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c720478d8f


Comment thread scripts/scaffbench-v2-lib.ts Outdated
Comment on lines +2498 to +2501
const bfGeneratorVersion = await tryCommandText(
"npm",
["view", "create-better-fullstack@latest", "version"],
process.cwd(),

P2: Pin the recorded generator version

For long or resumed publishable runs, this records the latest dist-tag once at startup, but the CLI/MCP paths still invoke better-fullstack@latest / create-better-fullstack@latest later for each assisted run. If a new CLI is published while the matrix is running or before a resume, later assisted runs exercise a different package than bfGeneratorVersion, making the reproducibility metadata misleading. Either substitute this resolved version into the assisted commands/MCP config or record the resolved package version per run.
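
The second option (recording the resolved version per run) could be audited roughly as follows. This is only an illustration of the idea: `AssistedRunRecord` and `auditGeneratorVersions` are hypothetical names, and the record shape is an assumption, not the harness's actual run format.

```typescript
// Hypothetical per-run record: the version each assisted run actually resolved.
interface AssistedRunRecord {
  runId: string;
  generatorVersion: string | undefined; // undefined when it could not be resolved
}

// Compares per-run versions against the version recorded once at startup.
function auditGeneratorVersions(startupVersion: string, runs: AssistedRunRecord[]) {
  // Runs that exercised a known version different from the one in metadata.
  const drifted = runs.filter(
    (r) => r.generatorVersion !== undefined && r.generatorVersion !== startupVersion,
  );
  // Runs whose version is unknown and therefore cannot be vouched for.
  const unknown = runs.filter((r) => r.generatorVersion === undefined);
  return { drifted, unknown, consistent: drifted.length === 0 && unknown.length === 0 };
}
```

A publish partway through a matrix would then show up as a non-empty `drifted` list instead of silently contradicting bfGeneratorVersion.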

Owner Author

Fixed by actually pinning the version. resolveBfVersion() resolves the @latest dist-tag once at run start (RESOLVED_BF_VERSION), and every harness-controlled assisted invocation — canonicalCommand, the MCP config, the doctor lane, and the CLI prompt instruction — now references that exact version via bfSpec() rather than @latest. Metadata reuses the same resolved value, so a publish mid-run can no longer make later runs diverge from bfGeneratorVersion. Verified: spec.json and the MCP config pin better-fullstack@2.1.1.
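
A rough sketch of the resolve-once pattern described here, under stated assumptions. The real resolveBfVersion() and bfSpec() signatures are not shown in this thread, so `VersionLookup`, `createVersionPin`, `resolve`, and `spec` are hypothetical stand-ins. The sketch also assumes better-fullstack and create-better-fullstack are published at the same version, which the thread implies but does not state. The fallback to "latest" when the lookup fails mirrors the offline behavior noted in the commit message.

```typescript
// Hypothetical lookup: returns the version behind a dist-tag, or undefined/rejects offline.
type VersionLookup = (pkg: string, tag: string) => Promise<string | undefined>;

function createVersionPin(lookup: VersionLookup) {
  // Cached promise: the dist-tag is resolved at most once per pin.
  let pending: Promise<string> | undefined;

  function resolve(): Promise<string> {
    if (!pending) {
      pending = lookup("create-better-fullstack", "latest").then(
        (version) => version ?? "latest", // lookup succeeded but returned nothing
        () => "latest", // lookup failed, e.g. offline
      );
    }
    return pending;
  }

  // Builds a package spec that every assisted invocation and the metadata can share.
  async function spec(pkg: string): Promise<string> {
    return `${pkg}@${await resolve()}`;
  }

  return { resolve, spec };
}
```

Because every caller awaits the same cached promise, a release published after the first resolution cannot change the spec that later invocations receive.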

…dex #257)

Recording bfGeneratorVersion alone was misleading: the CLI/MCP paths still
invoked @latest per run, so a publish mid-matrix/resume would exercise a
different package than metadata claims. Now resolveBfVersion() resolves the
@latest dist-tag ONCE at run start (RESOLVED_BF_VERSION) and every
harness-controlled assisted invocation — canonicalCommand, the MCP config, the
doctor lane, and the CLI prompt instruction — references that pinned version via
bfSpec(). Falls back to "latest" offline; metadata reuses the same value (no
duplicate npm view). Verified: spec.json canonical commands and the MCP config
pin better-fullstack@2.1.1.
@Marve10s Marve10s changed the base branch from ibrahim/scaffbench-2.1-p2-index to Development June 25, 2026 16:19
@Marve10s Marve10s merged commit a31e839 into Development Jun 25, 2026
19 checks passed