feat(harness): goose CLI harness — real L2 matrix column (RFC-006 Phase 5) by explosivebit · Pull Request #53 · ForgePlan/pollmevals

explosivebit · 2026-06-03T14:41:49Z

What

Second real harness after aider — goose (Block's agentic CLI). Model-agnostic
(OpenAI-compatible), so it runs the same coder models as aider → a clean
L2-vs-L4 harness comparison on identical models (raw-llm L0 | goose L2 | aider L4).

Result (live on the board)

board.json now 22/25 cells scored. New goose column on be_01_jwt_auth:

| model | goose score |
| --- | --- |
| qwen3-coder-30b | 7.17 |
| devstral | 7.12 |
| qwen-3-14b | 6.0 |
| codestral | failed (shown honestly, not hidden) |

How (the proven aider pattern, per harness)

- Image infra/docker/harness-goose: pinned goose v1.36.0 binary fetched directly (the official install script falls back to a mirror lacking the arm64 asset). Non-root, GOOSE_DISABLE_KEYRING, GOOSE_MODE=auto. Built + isolation-smoked end-to-end (goose → proxy → wrote a file in the sandbox).
- Recipe _goose_invocation: env GOOSE_PROVIDER/GOOSE_MODEL + goose's split OPENAI_HOST / OPENAI_BASE_PATH; prompt via -t. Promoted goose proven → out of _PENDING_RECIPES.
- stack.yaml: --with-builtin developer (file tools) + loop bounds.
- `--add-stack`: runs ONE harness's grid and merges its column (cells + harness metadata) into the existing board.json, with no re-spend on the rest.
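The endpoint split in the recipe can be sketched roughly as follows. This is a minimal illustration, not the code in stack_executor.py: the helper names `split_openai_endpoint`, `goose_env`, and `goose_argv` are hypothetical, the provider value `openai` and the `chat/completions` suffix are assumptions about how goose expects its endpoint, and the proxy URL in the usage note is made up.

```python
from urllib.parse import urlsplit


def split_openai_endpoint(base_url: str) -> tuple[str, str]:
    """Split an OpenAI-compatible base URL into (host, base_path).

    Assumption: goose wants the scheme+host in OPENAI_HOST and the full
    request path (ending in chat/completions) in OPENAI_BASE_PATH.
    """
    # Strip trailing slashes so ".../v1" and ".../v1/" give the same result.
    parts = urlsplit(base_url.rstrip("/"))
    host = f"{parts.scheme}://{parts.netloc}"
    prefix = parts.path.strip("/")
    base_path = f"{prefix}/chat/completions" if prefix else "chat/completions"
    return host, base_path


def goose_env(base_url: str, model: str) -> dict[str, str]:
    """Environment for one goose run (hypothetical helper)."""
    host, base_path = split_openai_endpoint(base_url)
    return {
        "GOOSE_PROVIDER": "openai",  # assumed value for an OpenAI-compatible proxy
        "GOOSE_MODEL": model,
        "OPENAI_HOST": host,
        "OPENAI_BASE_PATH": base_path,
    }


def goose_argv(prompt: str) -> list[str]:
    """Command line: developer builtin for file tools, prompt passed via -t."""
    return ["goose", "run", "--with-builtin", "developer", "-t", prompt]
```

For example, `goose_env("http://proxy:4000/v1/", "devstral")` yields `OPENAI_HOST=http://proxy:4000` and `OPENAI_BASE_PATH=v1/chat/completions`, with or without the trailing slash on the input.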

Honest unmetered cost

goose is a black-box CLI that doesn't self-report tokens, so its harness cost is
unmetered (aider's token-regex doesn't apply). Rather than a misleading "$0 /
free", cost `0` renders as "—" and the cell is excluded from the cost axis /
Pareto frontier / "cheapest" highlight. The real fix — proxy-side cost
reconciliation (LiteLLM `/spend/logs` already returns real per-request tokens
by `model_group`; needs pagination + flush-delay handling) — is a follow-up that
meters every black-box harness (goose, codex, claude-code).
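The unmetered-cost rule can be illustrated with a short Python sketch. The site's own helpers are presumably not Python, so this only mirrors the logic described here: the function names, the cell dictionary shape (`key`, `cost`, `score`), and the dollar formatting are assumptions for illustration.

```python
from typing import Optional


def format_cost(cost: float) -> str:
    """Render a cost; exactly 0 means unmetered, not free."""
    if cost == 0:
        return "—"
    return f"${cost:.4f}"  # assumed display precision


def metric_value_cost(cost: float) -> Optional[float]:
    """Value used for heat / best / range; None makes callers skip the cell."""
    return None if cost == 0 else cost


def pareto_frontier(cells: list[dict]) -> list[str]:
    """Keys of metered cells not dominated on (lower cost, higher score)."""
    metered = [c for c in cells if metric_value_cost(c["cost"]) is not None]
    frontier = []
    for c in metered:
        dominated = any(
            o["cost"] <= c["cost"]
            and o["score"] >= c["score"]
            and (o["cost"] < c["cost"] or o["score"] > c["score"])
            for o in metered
        )
        if not dominated:
            frontier.append(c["key"])
    return frontier
```

With made-up cells, an unmetered cell never reaches the frontier however high its score, because it is filtered out before any dominance check runs.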

Verify

728 eval-core tests green; ruff + mypy --strict clean; 12/12 stack specs valid.
Site build + Playwright verify clean: goose column renders, no hydration error.

Refs: rfc-006-stack-executor

🤖 Generated with Claude Code

…ase 5)

Second harness after aider. goose is model-agnostic (OpenAI-compatible provider), so it runs the same coder models: the cleanest "swap the harness, hold the model" comparison column (raw-llm L0 | goose L2 | aider L4).

- infra/docker/harness-goose: pinned goose v1.36.0 binary, fetched directly (the official install script falls back to a mirror that lacks the arm64 asset). Non-root, GOOSE_DISABLE_KEYRING, GOOSE_MODE=auto. Built + isolation-smoked end-to-end (goose run → proxy → wrote a file in the sandbox workspace).
- _goose_invocation recipe in stack_executor.py: env GOOSE_PROVIDER/GOOSE_MODEL + split OPENAI_HOST/OPENAI_BASE_PATH (goose's endpoint shape), prompt via -t. Promoted goose proven → removed from _PENDING_RECIPES.
- stacks/goose/stack.yaml: --with-builtin developer (file tools) + loop bounds.
- tests: goose recipe-proven + trailing-slash; supported_harnesses={aider,goose}.

728 tests green; ruff + mypy --strict clean; 12/12 stack specs valid.

Refs: rfc-006-stack-executor

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Runs the goose grid (--add-stack goose: goose × {qwen-3-14b, qwen3-coder-30b, codestral, devstral} × be_01, 2 seeds) and merges the new harness column into board.json without re-spending on raw-llm/aider. Board now 22/25 scored; goose × {qwen3-coder-30b 7.17, devstral 7.12, qwen-3-14b 6.0}, a clean L2-vs-L4 harness comparison on identical models (codestral failed, shown honestly).

Honest cost: goose is a black-box CLI that doesn't self-report tokens, so its harness cost is unmetered (aider's token-regex doesn't apply). Rather than render a misleading "$0 / free":

- formatCost(): cost 0 → "—" (a metered eval is never exactly $0).
- metricValue(cost): 0 → null, so the matrix heat / best / range skip it.
- frontierKeys + Pareto: exclude cost 0 (can't place on the cost axis honestly).
- matrix tooltip / master table / drawer / per-task winners use formatCost.

Real harness cost lands via proxy-side reconciliation (LiteLLM /spend/logs already returns real per-request tokens by model_group; needs pagination + flush-delay handling), a follow-up that fixes cost for every black-box harness.

--add-stack merges ONE harness column (cells + harness metadata) into an existing board.json, so future harnesses don't re-spend on the rest.

Site build + Playwright verify clean (goose column renders, no hydration error).

Refs: rfc-006-stack-executor

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
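The column merge that --add-stack performs could look roughly like this. It is a sketch under an assumed board.json shape (a top-level `harnesses` mapping and a flat `cells` list whose entries carry a `harness` field); the real schema and the function name `merge_harness_column` are not taken from the repository.

```python
import copy
from typing import Any


def merge_harness_column(
    board: dict[str, Any],
    harness: str,
    harness_meta: dict[str, Any],
    new_cells: list[dict[str, Any]],
) -> dict[str, Any]:
    """Return a new board with one harness column added or replaced.

    Cells belonging to other harnesses are carried over untouched, so
    nothing else has to be re-run.
    """
    merged = copy.deepcopy(board)  # never mutate the caller's board
    merged.setdefault("harnesses", {})[harness] = harness_meta
    # Drop stale cells for this harness so a re-run replaces, not duplicates.
    kept = [c for c in merged.get("cells", []) if c.get("harness") != harness]
    merged["cells"] = kept + list(new_cells)
    return merged
```

Replacing the harness's old cells rather than appending keeps the merge idempotent: running the same grid twice leaves one column, not two.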

#58)

Pick-up doc for a fresh session. Continues HANDOFF-2026-06-03.md (which named the harness batch as next). This session shipped it: 2 → 8 harness columns (goose/opencode/crush/cline/PI/gptme/mini-swe, PRs #53-57), board 32/45, the config_files launcher, parallelization-via-agents, and THE tool-call-parser finding (native vs text-format tool_calls → explains the whole matrix).

Lays out the v0.2 phase: ADR (stronger/more judges + linting/typing criteria) → --fill-missing mode → fill the grid + stronger OpenRouter models → vendor trio (codex-relay / claude-code×opus / gemini-2.5-pro×goose). + the harness-add pattern, dev workflow, every hard-won gotcha, key-files map, mission.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

explosivebit and others added 2 commits June 3, 2026 17:18

explosivebit merged commit 98bc5fe into main Jun 3, 2026
6 checks passed

explosivebit deleted the feat/harness-goose branch June 3, 2026 14:42

explosivebit mentioned this pull request Jun 3, 2026

docs(handoff): 2026-06-03 harness batch — 8 columns, v0.2 plan #58

Merged
