feat: methodology v0.2 — 5-frontier judge panel + deterministic lint/typing + board --fill-missing by explosivebit · Pull Request #59 · ForgePlan/pollmevals

explosivebit · 2026-06-03T20:58:08Z

TL;DR

Methodology v0.2 infra wave: the judge panel becomes the 5 most powerful June-2026 models (5 distinct vendor families), linting/typing become deterministic scored components, and the board gains a --fill-missing mode. 766 tests green.

What's in this PR (3 commits)

- board --fill-missing: run only the un-scored (model, stack) grid gaps instead of whole columns (--add-stack re-ran everything, which wasted spend). --dry-run previews the gap set. Includes tests.
- 5-frontier judge panel (user decision): Claude Opus 4.8 · GPT-5.5 · Gemini 3.1 Pro · Grok 4 · DeepSeek V4 Pro, 5 distinct families.
  - 5 *-judge routes in litellm-config.yaml (billed to OPENROUTER_API_KEY_JUDGE, reasoning_effort=low). Proxy reload OK, all 5 register, readiness 200.
  - _FAMILY_ALIASES += xai/deepseek/minimax (self-judging-guard correctness: grok/deepseek previously fell through to raw names).
  - be_01 rubric: type_safety (judge criterion 0.15) → design_appropriateness. Resolves the dual-path double-count; rubric_version 2.0.
- deterministic lint + type_safety in the scored pipeline: new auto_metrics.py runs eslint/ruff + tsc on the harness's real file tree and feeds the frozen coding formula (0.10 lint + 0.10 type_safety). Also fixes a latent be_→python lint-lang bug (ruff-on-TS silently scored 0).

Why

Our 2-judge panel gave Krippendorff α 0.17–0.36; prior art shows frontier judges reach 0.80–0.91, and family diversity beats raw count for de-biasing. Linting/typing stay compiler-decided (objective) while judges score only what a compiler can't (design/idioms/boundaries), so nothing is double-counted.

Verification

766 tests pass (+16 new auto_metrics, + fill-missing tests).
Proxy: 5 judge routes register, readiness 200. ⚠️ Live route ping pending a funded LITELLM_MASTER_KEY in-shell — slugs confirmed via OpenRouter pages; grok-4.20 + deepseek-v4-pro mirror proven candidate routes.

Not done / follow-ups

methodology-v0.2 ADR (refines PRD-002 / ADR-005) → new MethodologyVersion v0.2.0 (re-score on new runs; old runs stay v0.1.0 per ADR-0002).
Per-eval self-judging exclusion to re-admit grok-4 / deepseek-* AS candidates (currently reference/judge tier).
Board re-score under the v0.2 panel (~$5–15) — spend-gated.
codex harness needs a relay sidecar (WebSocket→api.openai.com); goose×gemini re-smoke (worktree path bug).

Refs: prd-002 (methodology-v0.2)

🤖 Generated with Claude Code

Adds --fill-missing flag that runs only the (model, stack) cells absent from the current board.json and merges them in, avoiding re-spend on cells that are already present.

Key changes:
- Extract _merge_into_board() shared helper so --add-stack and --fill-missing use identical merge logic (no duplication)
- Add _compute_gap() that computes desired - present from _STACK_MODELS and board.json; supports optional --stacks subset filter
- --fill-missing --dry-run prints gap grouped by stack and exits 0 without spending; --confirm-spend required for real runs
- 17 new tests in test_fill_missing.py covering gap computation, merge idempotency, and dry-run exit codes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
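The gap computation and merge described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: the stack and model names, the list-of-dicts shape assumed for board.json cells, and the function names without leading underscores are all hypothetical.

```python
from collections import defaultdict

# Hypothetical stand-in for _STACK_MODELS: stack name -> models wanted on the board.
STACK_MODELS = {
    "stack-a": ["model-x", "model-y"],
    "stack-b": ["model-x"],
}


def compute_gap(stack_models, board_cells, stacks=None):
    """Return the desired (model, stack) cells that are absent from the board.

    `stacks` is an optional subset filter, mirroring the --stacks option.
    """
    desired = {
        (model, stack)
        for stack, models in stack_models.items()
        if stacks is None or stack in stacks
        for model in models
    }
    present = {(cell["model"], cell["stack"]) for cell in board_cells}
    return desired - present


def merge_into_board(board_cells, new_cells):
    """Merge cells keyed by (model, stack); merging the same cells twice is a no-op."""
    merged = {(c["model"], c["stack"]): c for c in board_cells}
    for cell in new_cells:
        merged[(cell["model"], cell["stack"])] = cell
    return sorted(merged.values(), key=lambda c: (c["stack"], c["model"]))


def group_by_stack(gap):
    """Group a gap set by stack, as a dry-run preview might print it."""
    grouped = defaultdict(list)
    for model, stack in sorted(gap):
        grouped[stack].append(model)
    return dict(grouped)
```

Keying the merge on the (model, stack) pair is what makes a repeated merge idempotent in this sketch: a cell that is already present is overwritten by an identical value rather than appended.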

…c lint/typing

User decision (2026-06-03): judge candidate code on the 5 most powerful June-2026 models, 5 distinct vendor families: Claude Opus 4.8 · GPT-5.5 · Gemini 3.1 Pro · Grok 4 · DeepSeek V4 Pro. Diversity over count (prior art: frontier judges show Krippendorff α 0.80-0.91 vs our 2-judge 0.17-0.36; family diversity beats raw count for de-biasing).

- infra/litellm-config.yaml: 5 *-judge routes billed to OPENROUTER_API_KEY_JUDGE (NFR-005), reasoning_effort=low so the rubric JSON fits the 2048-tok cap. Verified: proxy reload OK, all 5 register, readiness 200.
- judge_panel.py: _FAMILY_ALIASES += xai/deepseek/minimax so the self-judging guard correctly normalises grok-4 / deepseek-* (previously fell through to raw names → a grok-judges-grok clash would slip past the guard).
- build_real_board.py: _JUDGES → the 5 frontier roster; panel rubric_version 2.0.
- be_01 rubric.yaml: type_safety (judge criterion 0.15) → design_appropriateness. Resolves the dual-path double-count: lint/typing stay DETERMINISTIC (eslint/tsc); judges score only what a compiler can't (design/idioms/boundaries). rubric_version 2.0.

Follow-up: per-eval self-judging exclusion to re-admit grok-4/deepseek-* AS candidates (currently reference/judge tier). Live route ping pending a funded LITELLM_MASTER_KEY in the shell env.

750 tests pass. Refs: methodology-v0.2 (ADR pending).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
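To illustrate why a missing family alias lets a same-family judge slip past the self-judging guard, here is a small sketch. The alias table, its prefix-matching scheme, and the helper names are assumptions made for illustration; the real _FAMILY_ALIASES structure in judge_panel.py is not shown in this PR.

```python
# Hypothetical alias table: model-name prefix -> vendor family.
FAMILY_ALIASES = {
    "claude": "anthropic",
    "gpt": "openai",
    "gemini": "google",
    "grok": "xai",
    "deepseek": "deepseek",
}


def family_of(model_name, aliases=FAMILY_ALIASES):
    """Normalise a model name to its vendor family.

    Unknown names fall through to the raw name, which is the failure mode
    the commit describes for grok/deepseek before the aliases were added.
    """
    name = model_name.lower().split("/")[-1]  # drop any "vendor/" route prefix
    for prefix, family in aliases.items():
        if name.startswith(prefix):
            return family
    return name


def eligible_judges(candidate, judges, aliases=FAMILY_ALIASES):
    """Keep only judges whose family differs from the candidate's family."""
    candidate_family = family_of(candidate, aliases)
    return [j for j in judges if family_of(j, aliases) != candidate_family]
```

With an empty alias table, "grok-4" and "grok-4-judge" normalise to two different raw names, so the guard sees no clash. With the "grok" entry present, both map to "xai" and the judge is excluded.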

…eline

methodology v0.2: lint (eslint/ruff) and type_safety (tsc) now flow into the coding-task final_score as DETERMINISTIC components (0.10 each, frozen weights), so "linting + typing" count for real. Judges no longer grade them (the be_01 rubric's type_safety criterion became design_appropriateness in the prior commit; no double-count).

- auto_metrics.py (new): run_auto_evaluators() runs LintEvaluator + TypeSafetyEvaluator concurrently on the harness's real file-tree snapshot; compute_final_score() applies the frozen formula (0.40 correctness + 0.15 coverage + 0.10 complexity + 0.10 lint + 0.10 type_safety + 0.15 pattern_match), reading auto metrics for the deterministic slots and the judge median for pattern_match. Skipped evaluators contribute 0.0 (no invented scores).
- stack_caller.py: after a harness run, evaluate repo_snapshot_dir (the real files the harness wrote, not the text blob) → automatic_metrics.
- grid_runner.py: after JudgePanel.aggregate, compute + set final_score.
- lint_evaluator.py: drop the wrong "be_" → python map entry (be_01 is TypeScript/Express; ruff-on-TS would silently score 0). Extension scan handles it.

Verified: 766 tests pass (incl. 16 new auto_metrics tests). Salvaged + verified from a parallel build agent's worktree.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
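The frozen formula quoted above can be written out as a short sketch. The weights are taken from the commit message; everything else is an assumption for illustration: that all components are scores in [0, 1], that correctness, coverage, and complexity arrive in the same metrics dict as lint and type_safety, that a skipped evaluator is represented by a missing key or None, and the function signature itself.

```python
from statistics import median

# Frozen weights as listed in the commit message; they sum to 1.0.
WEIGHTS = {
    "correctness": 0.40,
    "coverage": 0.15,
    "complexity": 0.10,
    "lint": 0.10,
    "type_safety": 0.10,
    "pattern_match": 0.15,
}


def compute_final_score(auto_metrics, judge_scores):
    """Weighted sum of deterministic metrics plus the judge median.

    auto_metrics: dict of component name -> score, or None/missing if skipped.
    judge_scores: per-judge pattern_match scores; the median fills that slot.
    """
    components = dict(auto_metrics)
    components["pattern_match"] = median(judge_scores) if judge_scores else 0.0
    # A skipped evaluator (missing or None) contributes 0.0 rather than a guess.
    return sum(
        weight * (components.get(name) or 0.0)
        for name, weight in WEIGHTS.items()
    )
```

For example, with correctness, coverage, complexity, and type_safety at 1.0, lint skipped, and five judge scores whose median is 0.6, the sketch yields 0.40 + 0.15 + 0.10 + 0.0 + 0.10 + 0.15 × 0.6 = 0.84.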

…ic lint/typing

Documents the decision implemented in this branch (refines PRD-002, ADR-005): 5 frontier judges (Opus 4.8 / GPT-5.5 / Gemini 3.1 Pro / Grok 4 / DeepSeek V4 Pro), lint/typing deterministic, judges score subjective axes only. New MethodologyVersion v0.2.0 (re-score on new runs; old runs immutable per ADR-0002).

Draft: MUST-clean; add Invariants/Rollback/Affected-Files before activation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

explosivebit and others added 4 commits June 3, 2026 22:57

explosivebit merged commit 0158f77 into main Jun 3, 2026

explosivebit deleted the feat/v0.2-infra-wave branch June 3, 2026 21:01

explosivebit merged 4 commits into main from feat/v0.2-infra-wave
