feat: methodology v0.2 — 5-frontier judge panel + deterministic lint/typing + board --fill-missing#59

Merged
explosivebit merged 4 commits into
mainfrom
feat/v0.2-infra-wave
Jun 3, 2026

@explosivebit

Contributor

TL;DR

Methodology v0.2 infra wave: the judge panel becomes the 5 most powerful June-2026 models (5 distinct vendor families), linting/typing become deterministic scored components, and the board gains a --fill-missing mode. 766 tests green.

What's in this PR (3 commits)

  1. board --fill-missing — run only the un-scored (model,stack) grid gaps instead of whole columns (--add-stack re-ran everything = wasteful spend). --dry-run previews the gap set. + tests.
  2. 5-frontier judge panel (user decision) — Claude Opus 4.8 · GPT-5.5 · Gemini 3.1 Pro · Grok 4 · DeepSeek V4 Pro, 5 distinct families.
    • 5 *-judge routes in litellm-config.yaml (billed to OPENROUTER_API_KEY_JUDGE, reasoning_effort=low). Proxy reload OK, all 5 register, readiness 200.
    • _FAMILY_ALIASES += xai/deepseek/minimax (self-judging-guard correctness — grok/deepseek previously fell through to raw names).
    • be_01 rubric: type_safety (judge criterion 0.15) → design_appropriateness — resolves the dual-path double-count; rubric_version 2.0.
  3. deterministic lint + type_safety in the scored pipeline — new auto_metrics.py runs eslint/ruff + tsc on the harness's real file tree and feeds the frozen coding formula (0.10 lint + 0.10 type_safety). Also fixes a latent be_→python lint-lang bug (ruff-on-TS silently scored 0).

Why

Our 2-judge panel gave a Krippendorff α of 0.17–0.36; prior art shows frontier judges reach 0.80–0.91, and family diversity beats raw count for de-biasing. Linting/typing stay compiler-decided (objective) while judges score only what a compiler can't (design/idioms/boundaries), so nothing is double-counted.

Verification

  • 766 tests pass (+16 new auto_metrics, + fill-missing tests).
  • Proxy: 5 judge routes register, readiness 200. ⚠️ Live route ping pending a funded LITELLM_MASTER_KEY in-shell — slugs confirmed via OpenRouter pages; grok-4.20 + deepseek-v4-pro mirror proven candidate routes.

Not done / follow-ups

  • methodology-v0.2 ADR (refines PRD-002 / ADR-005) → new MethodologyVersion v0.2.0 (re-score on new runs; old runs stay v0.1.0 per ADR-0002).
  • Per-eval self-judging exclusion to re-admit grok-4 / deepseek-* AS candidates (currently reference/judge tier).
  • Board re-score under the v0.2 panel (~$5–15) — spend-gated.
  • codex harness needs a relay sidecar (WebSocket→api.openai.com); goose×gemini re-smoke (worktree path bug).

Refs: prd-002 (methodology-v0.2)

🤖 Generated with Claude Code

explosivebit and others added 4 commits June 3, 2026 22:57
Adds --fill-missing flag that runs only the (model, stack) cells absent
from the current board.json and merges them in, avoiding re-spend on
cells that are already present.

Key changes:
- Extract _merge_into_board() shared helper so --add-stack and
  --fill-missing use identical merge logic (no duplication)
- Add _compute_gap() that computes desired - present from _STACK_MODELS
  and board.json; supports optional --stacks subset filter
- --fill-missing --dry-run prints gap grouped by stack and exits 0
  without spending; --confirm-spend required for real runs
- 17 new tests in test_fill_missing.py covering gap computation,
  merge idempotency, and dry-run exit codes
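
For illustration, the "desired - present" gap computation described above can be sketched as a plain set difference. This is a minimal sketch, not the code in this commit: the function names, the shape of the stack-to-models mapping, and the representation of board cells as (model, stack) tuples are all assumptions.

```python
from typing import Iterable, Mapping, Optional

Cell = tuple[str, str]  # (model, stack); assumed cell representation


def compute_gap(
    stack_models: Mapping[str, Iterable[str]],
    present: Iterable[Cell],
    stacks: Optional[Iterable[str]] = None,
) -> set[Cell]:
    """Return the desired cells that are absent from the board.

    stack_models maps a stack name to the models it should be run with
    (hypothetical stand-in for _STACK_MODELS). `stacks` optionally
    restricts the desired set to a subset of stacks.
    """
    wanted = set(stacks) if stacks is not None else set(stack_models)
    desired = {
        (model, stack)
        for stack, models in stack_models.items()
        if stack in wanted
        for model in models
    }
    return desired - set(present)


def group_by_stack(gap: Iterable[Cell]) -> dict[str, list[str]]:
    """Group gap cells by stack, sorted, as a dry-run preview might print them."""
    grouped: dict[str, list[str]] = {}
    for model, stack in sorted(gap, key=lambda cell: (cell[1], cell[0])):
        grouped.setdefault(stack, []).append(model)
    return grouped
```

Because the result is a set difference, re-running after the missing cells are merged yields an empty gap, which is the property the idempotency tests would rely on.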

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…c lint/typing

User decision (2026-06-03): judge candidate code on the 5 most powerful June-2026
models, 5 distinct vendor families — Claude Opus 4.8 · GPT-5.5 · Gemini 3.1 Pro ·
Grok 4 · DeepSeek V4 Pro. Diversity over count (prior-art: frontier judges show
Krippendorff α 0.80-0.91 vs our 2-judge 0.17-0.36; family diversity beats raw
count for de-biasing).

- infra/litellm-config.yaml: 5 *-judge routes billed to OPENROUTER_API_KEY_JUDGE
  (NFR-005), reasoning_effort=low so the rubric JSON fits the 2048-tok cap.
  Verified: proxy reload OK, all 5 register, readiness 200.
- judge_panel.py: _FAMILY_ALIASES += xai/deepseek/minimax so the self-judging
  guard correctly normalises grok-4 / deepseek-* (previously fell through to raw
  names → a grok-judges-grok clash would slip past the guard).
- build_real_board.py: _JUDGES → the 5 frontier roster; panel rubric_version 2.0.
- be_01 rubric.yaml: type_safety (judge criterion 0.15) → design_appropriateness.
  Resolves the dual-path double-count — lint/typing stay DETERMINISTIC (eslint/tsc);
  judges score only what a compiler can't (design/idioms/boundaries). rubric_version 2.0.
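
To show why the alias table matters for the self-judging guard, here is a rough sketch of family normalisation. It is an assumption-laden illustration rather than the contents of judge_panel.py: the alias entries, the model ids, and the helper names `family_of` and `is_self_judging` are hypothetical.

```python
# Hypothetical alias table: vendor prefix or model-name stem -> family.
FAMILY_ALIASES = {
    "anthropic": "anthropic",
    "claude": "anthropic",
    "openai": "openai",
    "gpt": "openai",
    "google": "google",
    "gemini": "google",
    "xai": "xai",
    "x-ai": "xai",
    "grok": "xai",
    "deepseek": "deepseek",
    "minimax": "minimax",
}


def family_of(model_id: str) -> str:
    """Map a model id to its vendor family.

    Each "/"-separated segment is checked against the alias prefixes.
    Unknown ids fall through to the raw lower-cased name, which is the
    failure mode the commit message describes for grok/deepseek.
    """
    name = model_id.lower()
    for segment in name.split("/"):
        for alias, family in FAMILY_ALIASES.items():
            if segment.startswith(alias):
                return family
    return name


def is_self_judging(judge_id: str, candidate_id: str) -> bool:
    """True when judge and candidate resolve to the same family."""
    return family_of(judge_id) == family_of(candidate_id)
```

Without the xai entries, a judge id such as "grok-4-judge" and a candidate id such as "x-ai/grok-4.20" would be compared as raw, unequal strings, so the clash would not be detected.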

Follow-up: per-eval self-judging exclusion to re-admit grok-4/deepseek-* AS
candidates (currently reference/judge tier). Live route ping pending a funded
LITELLM_MASTER_KEY in the shell env.

750 tests pass. Refs: methodology-v0.2 (ADR pending).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…eline

methodology v0.2: lint (eslint/ruff) and type_safety (tsc) now flow into the
coding-task final_score as DETERMINISTIC components (0.10 each, frozen weights),
so "linting + typing" count for real — judges no longer grade them (the be_01
rubric's type_safety criterion became design_appropriateness in the prior commit;
no double-count).

- auto_metrics.py (new): run_auto_evaluators() runs LintEvaluator +
  TypeSafetyEvaluator concurrently on the harness's real file-tree snapshot;
  compute_final_score() applies the frozen formula (0.40 correctness +
  0.15 coverage + 0.10 complexity + 0.10 lint + 0.10 type_safety +
  0.15 pattern_match), reading auto metrics for the deterministic slots and the
  judge median for pattern_match. Skipped evaluators contribute 0.0 (no invented
  scores).
- stack_caller.py: after a harness run, evaluate repo_snapshot_dir (the real
  files the harness wrote, not the text blob) → automatic_metrics.
- grid_runner.py: after JudgePanel.aggregate, compute + set final_score.
- lint_evaluator.py: drop the wrong "be_" → python map entry (be_01 is
  TypeScript/Express; ruff-on-TS would silently score 0). Extension scan handles it.
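
The frozen formula above can be illustrated with a short weighted sum. This is a sketch under stated assumptions, not the body of compute_final_score(): it assumes every component is on a 0..1 scale, that a skipped evaluator is reported as None or is missing, and that pattern_match is the median of the judges' scores. The names `WEIGHTS` and `final_score` are hypothetical.

```python
from statistics import median
from typing import Mapping, Optional, Sequence

# Frozen weights as listed in the commit message; they sum to 1.0.
WEIGHTS = {
    "correctness": 0.40,
    "coverage": 0.15,
    "complexity": 0.10,
    "lint": 0.10,
    "type_safety": 0.10,
    "pattern_match": 0.15,
}


def final_score(
    auto_metrics: Mapping[str, Optional[float]],
    judge_scores: Sequence[float],
) -> float:
    """Combine automatic metrics and the judge median into one score.

    Skipped or missing automatic metrics contribute 0.0 instead of an
    invented value; pattern_match comes from the judge median
    (0.0 if no judge scored the run).
    """
    components = {
        name: (auto_metrics.get(name) or 0.0)
        for name in WEIGHTS
        if name != "pattern_match"
    }
    components["pattern_match"] = median(judge_scores) if judge_scores else 0.0
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
```

For example, with every automatic metric at 1.0 except a skipped lint run, and five judge scores whose median is 0.7, the sketch gives 0.40 + 0.15 + 0.10 + 0.00 + 0.10 + 0.15 × 0.7 = 0.855.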

Verified: 766 tests pass (incl. 16 new auto_metrics tests). Salvaged + verified
from a parallel build agent's worktree.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ic lint/typing

Documents the decision implemented in this branch (refines PRD-002, ADR-005):
5 frontier judges (Opus 4.8 / GPT-5.5 / Gemini 3.1 Pro / Grok 4 / DeepSeek V4 Pro),
lint/typing deterministic, judges score subjective axes only. New MethodologyVersion
v0.2.0 (re-score on new runs; old runs immutable per ADR-0002). Draft — MUST-clean;
add Invariants/Rollback/Affected-Files before activation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@explosivebit explosivebit merged commit 0158f77 into main Jun 3, 2026
@explosivebit explosivebit deleted the feat/v0.2-infra-wave branch June 3, 2026 21:01