ScaffBench 2.1: P1 — discovery lane + per-capability acceptance sets #258
Closes the de-leak/scoring coupling the review flagged: with the canonical
command and (for explicit lanes) rightLibraryNotes still present, the natural
lane was scored against one fixed canonical id, mis-penalizing reasonable
alternatives.
- For specs with curated `acceptanceSets`, the natural prompt drops the library
names (the agent infers them from capabilities); other specs keep their notes
until curated, so nothing is mis-penalized.
- scoreAcceptance() credits a capability when ANY accepted library is wired
(semantic-search ∈ {qdrant, pgvector, weaviate, …}); reported as a new
`Acceptance` column alongside the strict canonical `Wired` score.
- ai-search-workbench curated with 12 capability sets; promptStyle threaded to
scoreProject; RunResult gains acceptanceScore; aggregate gains acceptancePercent.
+2 tests (natural lane omits names only for curated specs; pgvector/meilisearch/
bullmq alternatives satisfy all 12 capabilities). 28 pass; type-check unchanged.
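The any-of rule in the second bullet can be sketched as follows. This is a minimal illustration, not the PR's `scoreAcceptance()`: the `AcceptanceSets` and `AcceptanceScore` shapes, the function name `scoreAcceptanceSketch`, and the `job-queue` capability label are all assumptions made for the example.

```typescript
// Hypothetical shape: capability name -> libraries accepted for that capability.
type AcceptanceSets = Record<string, string[]>;

// Hypothetical result shape; the real acceptanceScore fields are not shown in this PR.
interface AcceptanceScore {
  satisfied: string[]; // capabilities with at least one accepted library wired
  missing: string[];
  percent: number; // 0..100
}

function scoreAcceptanceSketch(
  sets: AcceptanceSets,
  dependencies: Set<string>,
): AcceptanceScore {
  const satisfied: string[] = [];
  const missing: string[] = [];
  for (const [capability, accepted] of Object.entries(sets)) {
    // Credit the capability if ANY accepted library is present.
    const ok = accepted.some((lib) => dependencies.has(lib));
    (ok ? satisfied : missing).push(capability);
  }
  const total = satisfied.length + missing.length;
  const percent = total === 0 ? 0 : (satisfied.length / total) * 100;
  return { satisfied, missing, percent };
}

// "job-queue" is an illustrative capability name, not taken from the spec.
const sets: AcceptanceSets = {
  "semantic-search": ["qdrant", "pgvector", "weaviate"],
  "job-queue": ["bullmq"],
};
const result = scoreAcceptanceSketch(sets, new Set(["pgvector", "express"]));
console.log(result.satisfied, result.missing, result.percent);
```

Here a project that wires `pgvector` instead of the canonical library still earns the semantic-search capability, which is the behavior the strict `Wired` score cannot express.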
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4d37edb6d9
    const satisfied = accepted.some(
      (pattern) => index.dependencies.has(pattern) || haystack.includes(pattern),
    );
Match acceptance libraries by token, not substring
For natural ai-search runs, this substring scan over the entire project text can credit capabilities that are not wired because the new acceptance sets include very short/broad tokens such as ai, vite, and next. For example, a project with Tailwind but no AI SDK still contains ai in tailwindcss, and a project with Vitest can satisfy the web-framework vite alternative, inflating the Acceptance column even when those capabilities are missing. Please restrict text matching to dependency/import/file tokens or use boundaries for these alternatives.
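One way to apply this suggestion is to compare accepted names against whole package names rather than scanning raw text. The sketch below is an assumption about how that could look, not the fix that landed: `packageName`, `importedPackages`, and `isWired` are hypothetical helpers, and the import regex is a heuristic rather than a parser.

```typescript
// Reduce an import specifier to its package name:
// "next/router" -> "next", "@scope/pkg/sub" -> "@scope/pkg".
function packageName(specifier: string): string {
  const parts = specifier.split("/");
  return specifier.startsWith("@")
    ? parts.slice(0, 2).join("/")
    : parts[0] ?? specifier;
}

// Collect package names from `from "x"`, `import "x"`, and `require("x")`.
// Relative imports (starting with ".") are skipped.
function importedPackages(source: string): Set<string> {
  const found = new Set<string>();
  const re = /(?:from\s+|import\s+|require\(\s*)["']([^"']+)["']/g;
  let match: RegExpExecArray | null;
  while ((match = re.exec(source)) !== null) {
    const spec = match[1];
    if (spec && !spec.startsWith(".")) found.add(packageName(spec));
  }
  return found;
}

// A capability is wired only if an accepted name equals a whole dependency
// or imported package name; substrings never count.
function isWired(
  accepted: string[],
  dependencies: Set<string>,
  source: string,
): boolean {
  const imports = importedPackages(source);
  return accepted.some((lib) => dependencies.has(lib) || imports.has(lib));
}

const tailwindOnly = 'import "tailwindcss/base";';
console.log(tailwindOnly.includes("ai")); // substring scan: true (false positive)
console.log(isWired(["ai"], new Set(["tailwindcss"]), tailwindOnly)); // false
```

Exact matching on package names trades recall for precision: it will miss libraries referenced only in config files, so a real fix might add further token sources.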
Fixed in follow-up #261: a discovery run that produces no project now scores 0 acceptance (emptyAcceptanceScore) instead of undefined, so maybeAverage counts it rather than dropping it.
      const scored = generatedDir
    -   ? await scoreProject(spec, generatedDir)
    +   ? await scoreProject(spec, generatedDir, options.promptStyle)
        : { artifact: emptyArtifactScore(spec), faithfulness: undefined };
Score missing projects as zero acceptance
When a natural discovery run fails to produce a project directory, the run gets a 0 artifact score but no acceptanceScore; aggregateBy then uses maybeAverage, which drops undefined values from the Acceptance average. In cells with any project-not-found/model-failure runs, the reported acceptance percentage is therefore computed only over runs that produced a directory, overstating capability satisfaction instead of counting those runs as 0/12.
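The inflation described here is easy to see with a small numeric sketch. `maybeAverageSketch` below is a hypothetical stand-in that only mirrors the behavior the comment attributes to `maybeAverage` (skipping `undefined` values); the real helper is not shown in this PR.

```typescript
// Averages defined values only; returns undefined when nothing is defined.
function maybeAverageSketch(
  values: Array<number | undefined>,
): number | undefined {
  const defined = values.filter((v): v is number => v !== undefined);
  if (defined.length === 0) return undefined;
  return defined.reduce((sum, v) => sum + v, 0) / defined.length;
}

// Three runs over 12 capabilities; the third produced no project directory.
const dropped = maybeAverageSketch([12 / 12, 6 / 12, undefined]);
const counted = maybeAverageSketch([12 / 12, 6 / 12, 0 / 12]);

console.log(dropped); // 0.75: the failed run is silently excluded
console.log(counted); // 0.5: the failed run counts as 0/12
```

Scoring the missing project as an explicit zero keeps the denominator equal to the number of runs, so failures lower the cell's average instead of vanishing from it.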
Fixed in follow-up #261: a discovery run that produces no project now scores 0 acceptance (emptyAcceptanceScore) instead of undefined, so maybeAverage counts it rather than dropping it.
Summary
Phase: the construct-validity centerpiece. Stacked on #257; retargets to `Development` as the stack merges.

Closes the de-leak/scoring coupling the review flagged: with the natural lane scored against one fixed canonical id, reasonable alternatives were mis-penalized. Now:
- For specs with curated `acceptanceSets`, the natural prompt drops the library names; the agent infers them from the described capabilities (true discovery). Specs without acceptance sets keep their explicit notes, so nothing is mis-penalized.
- `scoreAcceptance()` credits a capability when any accepted library is wired (semantic-search ∈ {qdrant, pgvector, weaviate, …}), reported as a new `Acceptance` column alongside the strict canonical `Wired` score.
- `ai-search-workbench` curated with 12 capability sets; `promptStyle` threaded to `scoreProject`; `RunResult.acceptanceScore` + aggregate `acceptancePercent` added.

The other four specs keep explicit notes in the natural lane until their acceptance sets are curated (follow-up); the mechanism is in place.
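The prompt-side gating can be pictured with a short sketch. Everything here is assumed for illustration: the `SpecSketch` shape, the `"natural" | "explicit"` literal values, the `buildPromptSketch` name, and the `demo-spec` data are not taken from the PR; only the rule (library notes are omitted solely in the natural lane of a curated spec) comes from the summary.

```typescript
// Assumed literal values for the two lanes.
type PromptStyle = "natural" | "explicit";

// Hypothetical spec shape for this example.
interface SpecSketch {
  id: string;
  capabilities: string[]; // plain-language capability descriptions
  rightLibraryNotes: string[]; // notes that name specific libraries
  acceptanceSets?: Record<string, string[]>;
}

function buildPromptSketch(spec: SpecSketch, style: PromptStyle): string {
  const lines = [`Build: ${spec.id}`, ...spec.capabilities.map((c) => `- ${c}`)];
  const curated =
    spec.acceptanceSets !== undefined &&
    Object.keys(spec.acceptanceSets).length > 0;
  // Only the natural lane of a curated spec omits library names.
  if (!(style === "natural" && curated)) {
    lines.push(...spec.rightLibraryNotes.map((n) => `Note: ${n}`));
  }
  return lines.join("\n");
}

const curatedSpec: SpecSketch = {
  id: "demo-spec",
  capabilities: ["semantic search over documents"],
  rightLibraryNotes: ["use qdrant for vector search"],
  acceptanceSets: { "semantic-search": ["qdrant", "pgvector", "weaviate"] },
};

console.log(buildPromptSketch(curatedSpec, "natural")); // no library names
console.log(buildPromptSketch(curatedSpec, "explicit")); // includes the note
```

Tying the omission to the presence of acceptance sets means a spec never loses its library notes before there is a scoring rule that can credit alternatives.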
Validation
- `bun test scripts/scaffbench-v2-lib.test.ts`: 28 pass (+2: natural lane omits names only for curated specs; pgvector/meilisearch/bullmq alternatives satisfy all 12 capabilities; explicit lane returns no acceptance score).
- `bun run scaffbench:2.1:matrix` render verified (Acceptance column).
- Type-check unchanged.

Benchmark precedent
MMLU/AA never embed the answer in the prompt; SWE-bench accepts equivalent-but-different solutions rather than one canonical patch.