ScaffBench 2.1: P1 — discovery lane + per-capability acceptance sets#258

Merged: Marve10s merged 1 commit into Development from ibrahim/scaffbench-2.1-p1-discovery on Jun 25, 2026

@Marve10s

Owner

Summary

Phase — the construct-validity centerpiece. Stacked on #257; retargets to Development as the stack merges.

Closes the de-leak/scoring coupling the review flagged: with the natural lane scored against one fixed canonical id, reasonable alternatives were mis-penalized. Now:

  • For specs with curated acceptanceSets, the natural prompt drops the library names — the agent infers them from the described capabilities (true discovery). Specs without acceptance sets keep their explicit notes, so nothing is mis-penalized.
  • scoreAcceptance() credits a capability when any accepted library is wired (semantic-search ∈ {qdrant, pgvector, weaviate, …}), reported as a new Acceptance column alongside the strict canonical Wired score.
  • ai-search-workbench curated with 12 capability sets; promptStyle threaded to scoreProject; RunResult.acceptanceScore + aggregate acceptancePercent added.

The other four specs keep explicit notes in the natural lane until their acceptance sets are curated (follow-up) — the mechanism is in place.
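
The any-of crediting rule described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `AcceptanceSets` and `AcceptanceScore` shapes, the function name `scoreAcceptanceSketch`, and the `job-queue` capability name are assumptions made for the example.

```typescript
// Hypothetical shapes; the real ScaffBench types may differ.
type AcceptanceSets = Record<string, string[]>; // capability -> accepted library ids

interface AcceptanceScore {
  satisfied: string[];
  missing: string[];
  percent: number;
}

// Credit a capability when ANY of its accepted libraries is wired.
function scoreAcceptanceSketch(
  sets: AcceptanceSets,
  wiredDependencies: Set<string>,
): AcceptanceScore {
  const satisfied: string[] = [];
  const missing: string[] = [];
  for (const [capability, accepted] of Object.entries(sets)) {
    const ok = accepted.some((lib) => wiredDependencies.has(lib));
    if (ok) {
      satisfied.push(capability);
    } else {
      missing.push(capability);
    }
  }
  const total = satisfied.length + missing.length;
  const percent = total === 0 ? 0 : (satisfied.length / total) * 100;
  return { satisfied, missing, percent };
}

// Example: pgvector is not the canonical id but still satisfies semantic-search.
const exampleSets: AcceptanceSets = {
  "semantic-search": ["qdrant", "pgvector", "weaviate"],
  "job-queue": ["bullmq"], // capability name assumed for illustration
};
const exampleResult = scoreAcceptanceSketch(
  exampleSets,
  new Set(["pgvector", "react"]),
);
console.log(exampleResult.satisfied, exampleResult.missing, exampleResult.percent);
```

Under this rule the strict Wired score and the Acceptance score can diverge: a project that wires pgvector where the canonical id is a different library loses Wired credit but keeps Acceptance credit for that capability.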

Validation

  • bun test scripts/scaffbench-v2-lib.test.ts: 28 pass (+2: natural lane omits names only for curated specs; pgvector/meilisearch/bullmq alternatives satisfy all 12 capabilities; explicit lane returns no acceptance score).
  • bun run scaffbench:2.1:matrix render verified (Acceptance column). Type-check unchanged.

Benchmark precedent

MMLU/AA never embed the answer in the prompt; SWE-bench accepts equivalent-but-different solutions rather than one canonical patch.

Closes the de-leak/scoring coupling the review flagged: with the canonical
command and (for explicit lanes) rightLibraryNotes still present, the natural
lane was scored against one fixed canonical id, mis-penalizing reasonable
alternatives.

- For specs with curated `acceptanceSets`, the natural prompt drops the library
  names (the agent infers them from capabilities); other specs keep their notes
  until curated, so nothing is mis-penalized.
- scoreAcceptance() credits a capability when ANY accepted library is wired
  (semantic-search ∈ {qdrant, pgvector, weaviate, …}); reported as a new
  `Acceptance` column alongside the strict canonical `Wired` score.
- ai-search-workbench curated with 12 capability sets; promptStyle threaded to
  scoreProject; RunResult gains acceptanceScore; aggregate gains acceptancePercent.

+2 tests (natural lane omits names only for curated specs; pgvector/meilisearch/
bullmq alternatives satisfy all 12 capabilities). 28 pass; type-check unchanged.
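
The natural-lane rule in the first bullet above could look something like the sketch below. The `SpecSketch` shape, the `PromptStyle` union, and the `libraryHints` helper are hypothetical names for illustration; only `acceptanceSets`, `rightLibraryNotes`, and the natural/explicit lanes come from the description.

```typescript
// Hypothetical spec shape; field names other than acceptanceSets and
// rightLibraryNotes are assumptions.
interface SpecSketch {
  id: string;
  rightLibraryNotes: string[];
  acceptanceSets?: Record<string, string[]>;
}

type PromptStyle = "natural" | "explicit"; // assumed value set

// Return the library notes that should appear in the prompt.
// Natural lane + curated acceptance sets: drop the names (true discovery).
// Everything else: keep the explicit notes so nothing is mis-penalized.
function libraryHints(spec: SpecSketch, style: PromptStyle): string[] {
  const curated =
    spec.acceptanceSets !== undefined &&
    Object.keys(spec.acceptanceSets).length > 0;
  if (style === "natural" && curated) {
    return [];
  }
  return spec.rightLibraryNotes;
}

const curatedSpec: SpecSketch = {
  id: "ai-search-workbench",
  rightLibraryNotes: ["use qdrant for semantic search"], // illustrative note
  acceptanceSets: { "semantic-search": ["qdrant", "pgvector", "weaviate"] },
};
const uncuratedSpec: SpecSketch = {
  id: "some-other-spec", // placeholder id
  rightLibraryNotes: ["illustrative note"],
};

console.log(libraryHints(curatedSpec, "natural").length);   // names dropped
console.log(libraryHints(uncuratedSpec, "natural").length); // notes kept
```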
@vercel

vercel Bot commented Jun 25, 2026


The latest updates on your projects.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| better-fullstack-web | Ready | Preview, Comment | Jun 25, 2026 4:29pm |

github-actions bot added labels on Jun 25, 2026: size:L (100-499 effective changed lines; test files excluded in mixed PRs) and vouch:trusted (PR author is trusted by repo permissions or the VOUCHED list).

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4d37edb6d9

Comment on lines +1886 to +1888
    const satisfied = accepted.some(
      (pattern) => index.dependencies.has(pattern) || haystack.includes(pattern),
    );


P2: Match acceptance libraries by token, not substring

For natural ai-search runs, this substring scan over the entire project text can credit capabilities that are not wired because the new acceptance sets include very short/broad tokens such as ai, vite, and next. For example, a project with Tailwind but no AI SDK still contains ai in tailwindcss, and a project with Vitest can satisfy the web-framework vite alternative, inflating the Acceptance column even when those capabilities are missing. Please restrict text matching to dependency/import/file tokens or use boundaries for these alternatives.
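
One way to apply the boundary suggestion is sketched below. It is an illustration only and not necessarily what the follow-up implemented; `matchesAsToken` and `escapeRegExp` are hypothetical helpers, and the choice of boundary characters (letters, digits, `_`, `-`) is an assumption that would need tuning against real package names.

```typescript
// Escape regex metacharacters so a library id is matched literally.
function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Match `pattern` only when it is not embedded in a longer identifier.
// Boundary characters are an assumption: letters, digits, "_" and "-".
function matchesAsToken(haystack: string, pattern: string): boolean {
  const re = new RegExp(
    `(?<![A-Za-z0-9_-])${escapeRegExp(pattern)}(?![A-Za-z0-9_-])`,
  );
  return re.test(haystack);
}

const tailwindOnly = 'import "tailwindcss";';
const vitestOnly = '"vitest": "^1.0.0"';
const realAiImport = 'import { generateText } from "ai";';

// Plain substring search credits "ai" inside "tailwindcss".
console.log(tailwindOnly.includes("ai"));         // true
// Token matching does not.
console.log(matchesAsToken(tailwindOnly, "ai"));  // false
console.log(matchesAsToken(vitestOnly, "vite"));  // false
console.log(matchesAsToken(realAiImport, "ai"));  // true
```

The stricter alternative named in the review, matching only against parsed dependency and import tokens, avoids scanning free text at all and may be preferable where a dependency index is already available.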

Owner Author

Fixed in follow-up #261: a discovery run that produces no project now scores 0 acceptance (emptyAcceptanceScore) instead of undefined, so maybeAverage counts it rather than dropping it.

Comment on lines 1187 to 1189
  const scored = generatedDir
-   ? await scoreProject(spec, generatedDir)
+   ? await scoreProject(spec, generatedDir, options.promptStyle)
    : { artifact: emptyArtifactScore(spec), faithfulness: undefined };


P2: Score missing projects as zero acceptance

When a natural discovery run fails to produce a project directory, the run gets a 0 artifact score but no acceptanceScore; aggregateBy then uses maybeAverage, which drops undefined values from the Acceptance average. In cells with any project-not-found/model-failure runs, the reported acceptance percentage is therefore computed only over runs that produced a directory, overstating capability satisfaction instead of counting those runs as 0/12.
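
The effect is easy to see with a small stand-in for the averaging helper. `maybeAverageSketch` below is an assumed reimplementation based only on the behavior described in this comment (undefined values are dropped), and the per-run percentages are made up for illustration.

```typescript
// Stand-in for maybeAverage as described: undefined entries are dropped
// from both the sum and the count.
function maybeAverageSketch(
  values: Array<number | undefined>,
): number | undefined {
  const defined = values.filter((v): v is number => v !== undefined);
  if (defined.length === 0) {
    return undefined;
  }
  return defined.reduce((sum, v) => sum + v, 0) / defined.length;
}

// Three runs in one cell: two produced a project, one did not.
// Illustrative acceptance percentages, not measured results.
const failedRunDropped = maybeAverageSketch([100, 50, undefined]);
const failedRunAsZero = maybeAverageSketch([100, 50, 0]);

console.log(failedRunDropped); // 75: averaged over 2 runs, overstated
console.log(failedRunAsZero);  // 50: averaged over all 3 runs
```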

Owner Author

Fixed in follow-up #261: a discovery run that produces no project now scores 0 acceptance (emptyAcceptanceScore) instead of undefined, so maybeAverage counts it rather than dropping it.

@Marve10s Marve10s changed the base branch from ibrahim/scaffbench-2.1-p1-metadata to Development June 25, 2026 16:43
@Marve10s Marve10s merged commit e6d6f59 into Development Jun 25, 2026
19 checks passed
Marve10s added a commit that referenced this pull request Jun 25, 2026
…llowup

ScaffBench 2.1: address Codex feedback on #258/#259