ScaffBench 2.1: P1 — discovery lane + per-capability acceptance sets by Marve10s · Pull Request #258 · Marve10s/Better-Fullstack

Marve10s · 2026-06-25T16:27:28Z

Summary

This phase is the construct-validity centerpiece. Stacked on #257; retargets to Development as the stack merges.

Closes the de-leak/scoring coupling the review flagged: with the natural lane scored against one fixed canonical id, reasonable alternatives were mis-penalized. Now:

- For specs with curated acceptanceSets, the natural prompt drops the library names, so the agent must infer them from the described capabilities (true discovery). Specs without acceptance sets keep their explicit notes, so nothing is mis-penalized.
- scoreAcceptance() credits a capability when any accepted library is wired (semantic-search ∈ {qdrant, pgvector, weaviate, …}), reported as a new Acceptance column alongside the strict canonical Wired score.
- ai-search-workbench curated with 12 capability sets; promptStyle threaded to scoreProject; RunResult.acceptanceScore + aggregate acceptancePercent added.
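The "any accepted library" rule can be illustrated with a minimal sketch. This is not the PR's scoreAcceptance() implementation: the type shapes, the function name `scoreAcceptanceSketch`, and the `job-queue` capability name are assumptions made for illustration only.

```typescript
// Hypothetical shapes; the real types in the PR may differ.
type AcceptanceSets = Record<string, string[]>; // capability -> accepted libraries

interface AcceptanceScoreSketch {
  satisfied: string[];
  missing: string[];
  percent: number;
}

// Credit a capability when ANY of its accepted libraries is a wired dependency.
function scoreAcceptanceSketch(
  sets: AcceptanceSets,
  wiredDependencies: Set<string>,
): AcceptanceScoreSketch {
  const satisfied: string[] = [];
  const missing: string[] = [];
  for (const [capability, accepted] of Object.entries(sets)) {
    const hit = accepted.some((lib) => wiredDependencies.has(lib));
    (hit ? satisfied : missing).push(capability);
  }
  const total = satisfied.length + missing.length;
  const percent = total === 0 ? 0 : (satisfied.length / total) * 100;
  return { satisfied, missing, percent };
}

// "job-queue" is an invented capability name used only for this example.
const exampleSets: AcceptanceSets = {
  "semantic-search": ["qdrant", "pgvector", "weaviate"],
  "job-queue": ["bullmq"],
};

// pgvector is not the canonical choice here, yet it still earns credit.
const exampleResult = scoreAcceptanceSketch(
  exampleSets,
  new Set(["pgvector", "react"]),
);
console.log(exampleResult.satisfied, exampleResult.missing, exampleResult.percent);
```

With these inputs one of the two capabilities is satisfied, so the sketch reports 50 percent. A strict canonical score would give no credit if the canonical id were qdrant.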

The other four specs keep explicit notes in the natural lane until their acceptance sets are curated (follow-up) — the mechanism is in place.
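The prompt rule described above (names dropped only in the natural lane, and only for curated specs) might look roughly like the following. The `SpecSketch` shape, `buildPromptSketch`, and the `"explicit" | "natural"` values for promptStyle are assumptions; only the names acceptanceSets, rightLibraryNotes, and promptStyle come from this PR.

```typescript
type PromptStyle = "explicit" | "natural"; // assumed lane names

// Hypothetical spec shape for illustration.
interface SpecSketch {
  id: string;
  description: string; // capability-level description, no library names
  rightLibraryNotes: string[]; // notes that name specific libraries
  acceptanceSets?: Record<string, string[]>;
}

function buildPromptSketch(spec: SpecSketch, style: PromptStyle): string {
  const curated =
    spec.acceptanceSets !== undefined &&
    Object.keys(spec.acceptanceSets).length > 0;
  // Library names are withheld only when the spec can be scored by acceptance sets.
  const dropNames = style === "natural" && curated;
  const lines = [spec.description];
  if (!dropNames) lines.push(...spec.rightLibraryNotes);
  return lines.join("\n");
}

const curatedSpec: SpecSketch = {
  id: "ai-search-workbench",
  description: "Build a workbench with semantic search over documents.",
  rightLibraryNotes: ["Use qdrant for semantic search."],
  acceptanceSets: { "semantic-search": ["qdrant", "pgvector", "weaviate"] },
};

// Same content, but not yet curated, so its notes stay in every lane.
const uncuratedSpec: SpecSketch = {
  ...curatedSpec,
  id: "uncurated-example",
  acceptanceSets: undefined,
};

const naturalCurated = buildPromptSketch(curatedSpec, "natural");
const naturalUncurated = buildPromptSketch(uncuratedSpec, "natural");
const explicitCurated = buildPromptSketch(curatedSpec, "explicit");
console.log(naturalCurated);
```

Gating on curation keeps the two changes coupled: a spec never loses its library hints unless alternatives can also be credited.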

Validation

- bun test scripts/scaffbench-v2-lib.test.ts: 28 pass (+2: natural lane omits names only for curated specs; pgvector/meilisearch/bullmq alternatives satisfy all 12 capabilities; explicit lane returns no acceptance score).
- bun run scaffbench:2.1:matrix render verified (Acceptance column). Type-check unchanged.

Benchmark precedent

MMLU/AA never embed the answer in the prompt; SWE-bench accepts equivalent-but-different solutions rather than one canonical patch.

Closes the de-leak/scoring coupling the review flagged: with the canonical command and (for explicit lanes) rightLibraryNotes still present, the natural lane was scored against one fixed canonical id, mis-penalizing reasonable alternatives.

- For specs with curated `acceptanceSets`, the natural prompt drops the library names (the agent infers them from capabilities); other specs keep their notes until curated, so nothing is mis-penalized.
- scoreAcceptance() credits a capability when ANY accepted library is wired (semantic-search ∈ {qdrant, pgvector, weaviate, …}); reported as a new `Acceptance` column alongside the strict canonical `Wired` score.
- ai-search-workbench curated with 12 capability sets; promptStyle threaded to scoreProject; RunResult gains acceptanceScore; aggregate gains acceptancePercent.

+2 tests (natural lane omits names only for curated specs; pgvector/meilisearch/bullmq alternatives satisfy all 12 capabilities). 28 pass; type-check unchanged.

vercel · 2026-06-25T16:27:35Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
better-fullstack-web	Ready	Preview, Comment	Jun 25, 2026 4:29pm

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4d37edb6d9


chatgpt-codex-connector · 2026-06-25T16:30:16Z

+    const satisfied = accepted.some(
+      (pattern) => index.dependencies.has(pattern) || haystack.includes(pattern),
+    );


Match acceptance libraries by token, not substring

For natural ai-search runs, this substring scan over the entire project text can credit capabilities that are not wired because the new acceptance sets include very short/broad tokens such as ai, vite, and next. For example, a project with Tailwind but no AI SDK still contains ai in tailwindcss, and a project with Vitest can satisfy the web-framework vite alternative, inflating the Acceptance column even when those capabilities are missing. Please restrict text matching to dependency/import/file tokens or use boundaries for these alternatives.
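The false positive described here is easy to reproduce in isolation. The sketch below contrasts substring matching with two stricter options the comment suggests (exact dependency tokens, or boundary-aware text matching). All function names are hypothetical and none of this is taken from the repository.

```typescript
// Substring matching over project text: short patterns hit unrelated names.
function matchesBySubstring(haystack: string, pattern: string): boolean {
  return haystack.includes(pattern);
}

// Token matching: only an exact dependency name counts.
function matchesByToken(dependencies: Set<string>, pattern: string): boolean {
  return dependencies.has(pattern);
}

// Boundary matching for free text: the pattern must not be embedded in a
// longer identifier (letters, digits, "_" or "-" on either side).
function matchesWithBoundaries(text: string, pattern: string): boolean {
  const escaped = pattern.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const re = new RegExp(`(^|[^A-Za-z0-9_-])${escaped}($|[^A-Za-z0-9_-])`);
  return re.test(text);
}

const deps = new Set(["tailwindcss", "vitest", "react"]);
const projectText = Array.from(deps).join("\n");

const substringAi = matchesBySubstring(projectText, "ai"); // inside "tailwindcss"
const substringVite = matchesBySubstring(projectText, "vite"); // inside "vitest"
const tokenAi = matchesByToken(deps, "ai");
const tokenVite = matchesByToken(deps, "vite");
const boundaryMiss = matchesWithBoundaries('import "tailwindcss";', "ai");
const boundaryHit = matchesWithBoundaries('import { generateText } from "ai";', "ai");

console.log({ substringAi, substringVite, tokenAi, tokenVite, boundaryMiss, boundaryHit });
```

Substring matching credits both `ai` and `vite` for a project that has neither, while the token and boundary variants reject them and still accept a real `"ai"` import.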



chatgpt-codex-connector · 2026-06-25T16:30:16Z

          const scored = generatedDir
-            ? await scoreProject(spec, generatedDir)
+            ? await scoreProject(spec, generatedDir, options.promptStyle)
            : { artifact: emptyArtifactScore(spec), faithfulness: undefined };


Score missing projects as zero acceptance

When a natural discovery run fails to produce a project directory, the run gets a 0 artifact score but no acceptanceScore; aggregateBy then uses maybeAverage, which drops undefined values from the Acceptance average. In cells with any project-not-found/model-failure runs, the reported acceptance percentage is therefore computed only over runs that produced a directory, overstating capability satisfaction instead of counting those runs as 0/12.


Fixed in follow-up #261: a discovery run that produces no project now scores 0 acceptance (emptyAcceptanceScore) instead of undefined, so maybeAverage counts it rather than dropping it.
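The arithmetic behind this fix can be shown with a small sketch. `maybeAverageSketch` is a stand-in written for this example, assuming only what the review states: that the aggregate skips undefined values. The three-run cell is invented data.

```typescript
// Average that skips undefined entries, as the review describes for maybeAverage.
function maybeAverageSketch(
  values: Array<number | undefined>,
): number | undefined {
  const present = values.filter((v): v is number => v !== undefined);
  if (present.length === 0) return undefined;
  return present.reduce((sum, v) => sum + v, 0) / present.length;
}

// Invented cell: three runs, 12 capabilities each; the third produced no project.
const beforeFix: Array<number | undefined> = [12 / 12, 6 / 12, undefined];
const afterFix: Array<number | undefined> = [12 / 12, 6 / 12, 0 / 12];

const averageBefore = maybeAverageSketch(beforeFix); // failed run dropped
const averageAfter = maybeAverageSketch(afterFix); // failed run counted as 0/12

console.log(averageBefore, averageAfter);
```

In this example the cell reports 0.75 when the failed run is dropped and 0.5 when it is counted as zero, which is the overstatement the review describes.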


github-actions Bot added labels Jun 25, 2026: size:L (100-499 effective changed lines; test files excluded in mixed PRs) and vouch:trusted (PR author is trusted by repo permissions or the VOUCHED list).

vercel Bot deployed to Preview June 25, 2026 16:29

chatgpt-codex-connector Bot reviewed Jun 25, 2026


Marve10s mentioned this pull request Jun 25, 2026

ScaffBench 2.1: P1 — typecheck can't be dodged (tsc --noEmit fallback) #259

Merged

Marve10s changed the base branch from ibrahim/scaffbench-2.1-p1-metadata to Development June 25, 2026 16:43

Marve10s merged commit e6d6f59 into Development Jun 25, 2026
19 checks passed

Marve10s mentioned this pull request Jun 25, 2026

ScaffBench 2.1: address Codex feedback on #258/#259 #261

Merged

Marve10s added a commit that referenced this pull request Jun 25, 2026

Merge pull request #261 from Marve10s/ibrahim/scaffbench-2.1-codex-fo…

0e66e6f

…llowup ScaffBench 2.1: address Codex feedback on #258/#259

Marve10s merged 1 commit into Development from ibrahim/scaffbench-2.1-p1-discovery
