feat: methodology v0.2 - 5-frontier judge panel + deterministic lint/typing + board --fill-missing #59
Merged
Adds --fill-missing flag that runs only the (model, stack) cells absent from the current board.json and merges them in, avoiding re-spend on cells that are already present.

Key changes:
- Extract _merge_into_board() shared helper so --add-stack and --fill-missing use identical merge logic (no duplication)
- Add _compute_gap() that computes desired - present from _STACK_MODELS and board.json; supports optional --stacks subset filter
- --fill-missing --dry-run prints gap grouped by stack and exits 0 without spending; --confirm-spend required for real runs
- 17 new tests in test_fill_missing.py covering gap computation, merge idempotency, and dry-run exit codes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
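The gap computation and merge described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the board.json layout (a top-level "cells" list of dicts with "model" and "stack" keys), the sample STACK_MODELS contents, and the un-prefixed function names are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical stand-in for _STACK_MODELS: stack name -> models desired on it.
STACK_MODELS = {
    "stack-a": ["model-x", "model-y"],
    "stack-b": ["model-x"],
}


def compute_gap(stack_models, board, stacks=None):
    """Return the sorted (model, stack) cells that are desired but not on the board."""
    desired = {
        (model, stack)
        for stack, models in stack_models.items()
        if stacks is None or stack in stacks  # optional --stacks style subset filter
        for model in models
    }
    # Assumed board layout: {"cells": [{"model": ..., "stack": ..., ...}, ...]}
    present = {(cell["model"], cell["stack"]) for cell in board.get("cells", [])}
    return sorted(desired - present)


def merge_into_board(board, new_cells):
    """Merge cells keyed by (model, stack); repeating the same merge changes nothing."""
    by_key = {(c["model"], c["stack"]): c for c in board.get("cells", [])}
    for cell in new_cells:
        by_key[(cell["model"], cell["stack"])] = cell
    return {**board, "cells": [by_key[key] for key in sorted(by_key)]}


def group_by_stack(gap):
    """Group gap cells by stack, the shape a dry-run preview could print."""
    grouped = defaultdict(list)
    for model, stack in gap:
        grouped[stack].append(model)
    return dict(grouped)
```

Keying the merge on (model, stack) is what makes it idempotent in this sketch: a cell merged twice simply overwrites itself.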
…c lint/typing

User decision (2026-06-03): judge candidate code on the 5 most powerful June-2026 models, 5 distinct vendor families: Claude Opus 4.8 · GPT-5.5 · Gemini 3.1 Pro · Grok 4 · DeepSeek V4 Pro. Diversity over count (prior art: frontier judges show Krippendorff α 0.80-0.91 vs our 2-judge 0.17-0.36; family diversity beats raw count for de-biasing).

- infra/litellm-config.yaml: 5 *-judge routes billed to OPENROUTER_API_KEY_JUDGE (NFR-005), reasoning_effort=low so the rubric JSON fits the 2048-tok cap. Verified: proxy reload OK, all 5 register, readiness 200.
- judge_panel.py: _FAMILY_ALIASES += xai/deepseek/minimax so the self-judging guard correctly normalises grok-4 / deepseek-* (previously fell through to raw names → a grok-judges-grok clash would slip past the guard).
- build_real_board.py: _JUDGES → the 5 frontier roster; panel rubric_version 2.0.
- be_01 rubric.yaml: type_safety (judge criterion 0.15) → design_appropriateness. Resolves the dual-path double-count: lint/typing stay DETERMINISTIC (eslint/tsc); judges score only what a compiler can't (design/idioms/boundaries). rubric_version 2.0.

Follow-up: per-eval self-judging exclusion to re-admit grok-4/deepseek-* AS candidates (currently reference/judge tier). Live route ping pending a funded LITELLM_MASTER_KEY in the shell env.

750 tests pass.

Refs: methodology-v0.2 (ADR pending).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
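To illustrate why the alias additions matter for the self-judging guard, here is a small sketch of family normalisation. The alias table contents, the prefix-matching rule, the model slugs, and the function names are assumptions for illustration only; the actual judge_panel.py may normalise differently.

```python
# Hypothetical alias table: vendor or model-name prefix -> canonical family.
FAMILY_ALIASES = {
    "anthropic": "anthropic", "claude": "anthropic",
    "openai": "openai", "gpt": "openai",
    "google": "google", "gemini": "google",
    "xai": "xai", "x-ai": "xai", "grok": "xai",
    "deepseek": "deepseek",
    "minimax": "minimax",
}


def family_of(model_id):
    """Map a model id such as 'x-ai/grok-4' or 'grok-4' to a vendor family.

    Ids with no known alias fall through to the raw lowercased name, which is
    the failure mode the commit describes for grok/deepseek before the fix.
    """
    lowered = model_id.lower()
    for segment in lowered.split("/"):
        for alias, family in FAMILY_ALIASES.items():
            if segment.startswith(alias):
                return family
    return lowered


def is_self_judging(judge_id, candidate_id):
    """True when judge and candidate normalise to the same vendor family."""
    return family_of(judge_id) == family_of(candidate_id)


def eligible_judges(judge_ids, candidate_id):
    """Drop any judge that shares a family with the candidate."""
    return [j for j in judge_ids if not is_self_judging(j, candidate_id)]
```

Without the "grok" and "x-ai" entries, "x-ai/grok-4" and "grok-4.20" would normalise to two different raw strings and the guard would not flag the clash.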
…eline

methodology v0.2: lint (eslint/ruff) and type_safety (tsc) now flow into the coding-task final_score as DETERMINISTIC components (0.10 each, frozen weights), so "linting + typing" count for real; judges no longer grade them (the be_01 rubric's type_safety criterion became design_appropriateness in the prior commit; no double-count).

- auto_metrics.py (new): run_auto_evaluators() runs LintEvaluator + TypeSafetyEvaluator concurrently on the harness's real file-tree snapshot; compute_final_score() applies the frozen formula (0.40 correctness + 0.15 coverage + 0.10 complexity + 0.10 lint + 0.10 type_safety + 0.15 pattern_match), reading auto metrics for the deterministic slots and the judge median for pattern_match. Skipped evaluators contribute 0.0 (no invented scores).
- stack_caller.py: after a harness run, evaluate repo_snapshot_dir (the real files the harness wrote, not the text blob) → automatic_metrics.
- grid_runner.py: after JudgePanel.aggregate, compute + set final_score.
- lint_evaluator.py: drop the wrong "be_" → python map entry (be_01 is TypeScript/Express; ruff-on-TS would silently score 0). Extension scan handles it.

Verified: 766 tests pass (incl. 16 new auto_metrics tests). Salvaged + verified from a parallel build agent's worktree.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
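The frozen formula quoted in this commit can be written out as a short sketch. The weights are taken from the commit message; everything else is an assumption for illustration: scores are taken to lie in [0, 1], the non-judge components are assumed to arrive in a single dict, and a skipped evaluator is represented as a missing key or None.

```python
from statistics import median

# Weights as quoted in the commit message; they sum to 1.0.
WEIGHTS = {
    "correctness": 0.40,
    "coverage": 0.15,
    "complexity": 0.10,
    "lint": 0.10,
    "type_safety": 0.10,
    "pattern_match": 0.15,
}


def compute_final_score(auto_metrics, judge_scores):
    """Weighted sum of deterministic metrics plus the judge median.

    auto_metrics: dict of component name -> score, or None/missing if skipped
                  (assumed shape, not the PR's actual data model).
    judge_scores: list of per-judge pattern_match scores.
    """
    components = {
        name: auto_metrics.get(name) for name in WEIGHTS if name != "pattern_match"
    }
    # Judges only feed the pattern_match slot, via their median.
    components["pattern_match"] = median(judge_scores) if judge_scores else None

    total = 0.0
    for name, weight in WEIGHTS.items():
        value = components[name]
        # A skipped evaluator contributes 0.0 rather than an invented score.
        total += weight * (0.0 if value is None else value)
    return round(total, 4)
```

For example, with correctness 1.0, coverage 0.8, complexity 0.5, lint 1.0, a skipped type check, and judge scores of 0.6, 0.8, 0.7, 0.9, 0.5 (median 0.7), the sum is 0.40 + 0.12 + 0.05 + 0.10 + 0.0 + 0.105 = 0.775.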
…ic lint/typing

Documents the decision implemented in this branch (refines PRD-002, ADR-005): 5 frontier judges (Opus 4.8 / GPT-5.5 / Gemini 3.1 Pro / Grok 4 / DeepSeek V4 Pro), lint/typing deterministic, judges score subjective axes only. New MethodologyVersion v0.2.0 (re-score on new runs; old runs immutable per ADR-0002). Draft: MUST-clean; add Invariants/Rollback/Affected-Files before activation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
TL;DR
Methodology v0.2 infra wave: the judge panel becomes the 5 most powerful June-2026 models (5 distinct vendor families), linting/typing become deterministic scored components, and the board gains a --fill-missing mode. 766 tests green.
What's in this PR (3 commits)
- --fill-missing: run only the un-scored (model, stack) grid gaps instead of whole columns (--add-stack re-ran everything = wasteful spend). --dry-run previews the gap set. + tests.
- 5 *-judge routes in litellm-config.yaml (billed to OPENROUTER_API_KEY_JUDGE, reasoning_effort=low). Proxy reload OK, all 5 register, readiness 200. _FAMILY_ALIASES += xai/deepseek/minimax (self-judging-guard correctness: grok/deepseek previously fell through to raw names). be_01 rubric: type_safety (judge criterion 0.15) → design_appropriateness, which resolves the dual-path double-count; rubric_version 2.0.
- auto_metrics.py runs eslint/ruff + tsc on the harness's real file tree and feeds the frozen coding formula (0.10 lint + 0.10 type_safety). Also fixes a latent be_ → python lint-lang bug (ruff-on-TS silently scored 0).
Why
Our 2-judge panel gave Krippendorff α 0.17–0.36; prior art shows frontier judges reach 0.80–0.91, and family diversity beats raw count for de-biasing. Linting/typing stay compiler-decided (objective) while judges score only what a compiler can't (design/idioms/boundaries), so nothing is double-counted.
Verification
LITELLM_MASTER_KEY in-shell: slugs confirmed via OpenRouter pages; grok-4.20 + deepseek-v4-pro mirror proven candidate routes.
Not done / follow-ups
Refs: prd-002 (methodology-v0.2)
🤖 Generated with Claude Code