feat(harness): goose CLI harness — real L2 matrix column (RFC-006 Phase 5)#53
Merged
Conversation
…ase 5)
Second harness after aider. goose is model-agnostic (OpenAI-compatible
provider), so it runs the same coder models — the cleanest "swap the harness,
hold the model" comparison column (raw-llm L0 | goose L2 | aider L4).
- infra/docker/harness-goose: pinned goose v1.36.0 binary, fetched directly
(the official install script falls back to a mirror that lacks the arm64
asset). Non-root, GOOSE_DISABLE_KEYRING, GOOSE_MODE=auto. Built + isolation-
smoked end-to-end (goose run → proxy → wrote a file in the sandbox workspace).
- _goose_invocation recipe in stack_executor.py: env GOOSE_PROVIDER/GOOSE_MODEL
+ split OPENAI_HOST/OPENAI_BASE_PATH (goose's endpoint shape), prompt via -t.
Promoted goose proven → removed from _PENDING_RECIPES.
- stacks/goose/stack.yaml: --with-builtin developer (file tools) + loop bounds.
- tests: goose recipe-proven + trailing-slash; supported_harnesses={aider,goose}.
728 tests green; ruff + mypy --strict clean; 12/12 stack specs valid.
Refs: rfc-006-stack-executor
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Runs the goose grid (--add-stack goose: goose × {qwen-3-14b, qwen3-coder-30b,
codestral, devstral} × be_01, 2 seeds) and merges the new harness column into
board.json without re-spending on raw-llm/aider. Board now 22/25 scored; goose ×
{qwen3-coder-30b 7.17, devstral 7.12, qwen-3-14b 6.0} — a clean L2-vs-L4 harness
comparison on identical models (codestral failed, shown honestly).
Honest cost: goose is a black-box CLI that doesn't self-report tokens, so its
harness cost is unmetered (aider's token-regex doesn't apply). Rather than render
a misleading "$0 / free":
- formatCost(): cost 0 → "—" (a metered eval is never exactly $0).
- metricValue(cost): 0 → null, so the matrix heat / best / range skip it.
- frontierKeys + Pareto: exclude cost 0 (can't place on the cost axis honestly).
- matrix tooltip / master table / drawer / per-task winners use formatCost.
Real harness cost lands via proxy-side reconciliation (LiteLLM /spend/logs
already returns real per-request tokens by model_group; needs pagination +
flush-delay handling) — a follow-up that fixes cost for every black-box harness.
--add-stack merges ONE harness column (cells + harness metadata) into an existing
board.json, so future harnesses don't re-spend on the rest.
Site build + Playwright verify clean (goose column renders, no hydration error).
Refs: rfc-006-stack-executor
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
explosivebit
added a commit
that referenced
this pull request
Jun 3, 2026
#58) Pick-up doc for a fresh session. Continues HANDOFF-2026-06-03.md (which named the harness batch as next). This session shipped it: 2 → 8 harness columns (goose/opencode/crush/cline/PI/gptme/mini-swe, PRs #53-57), board 32/45, the config_files launcher, parallelization-via-agents, and THE tool-call-parser finding (native vs text-format tool_calls → explains the whole matrix). Lays out the v0.2 phase: ADR (stronger/more judges + linting/typing criteria) → --fill-missing mode → fill the grid + stronger OpenRouter models → vendor trio (codex-relay / claude-code×opus / gemini-2.5-pro×goose). + the harness-add pattern, dev workflow, every hard-won gotcha, key-files map, mission. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Second real harness after aider — goose (Block's agentic CLI). Model-agnostic
(OpenAI-compatible), so it runs the same coder models as aider → a clean
L2-vs-L4 harness comparison on identical models (raw-llm L0 | goose L2 | aider L4).
Result (live on the board)
board.jsonnow 22/25 cells scored. New goose column onbe_01_jwt_auth:How (the proven aider pattern, per harness)
infra/docker/harness-goose: pinned goose v1.36.0 binary fetcheddirectly (the official install script falls back to a mirror lacking the arm64
asset). Non-root,
GOOSE_DISABLE_KEYRING,GOOSE_MODE=auto. Built +isolation-smoked end-to-end (goose → proxy → wrote a file in the sandbox).
_goose_invocation: envGOOSE_PROVIDER/GOOSE_MODEL+ goose'ssplit
OPENAI_HOST/OPENAI_BASE_PATH; prompt via-t. Promoted gooseproven → out of
_PENDING_RECIPES.--with-builtin developer(file tools) + loop bounds.--add-stack: runs ONE harness's grid and merges its column (cells +harness metadata) into the existing board.json — no re-spend on the rest.
Honest unmetered cost
goose is a black-box CLI that doesn't self-report tokens, so its harness cost is
unmetered (aider's token-regex doesn't apply). Rather than a misleading "$0 /
free", cost `0` renders as "—" and the cell is excluded from the cost axis /
Pareto frontier / "cheapest" highlight. The real fix — proxy-side cost
reconciliation (LiteLLM `/spend/logs` already returns real per-request tokens
by `model_group`; needs pagination + flush-delay handling) — is a follow-up that
meters every black-box harness (goose, codex, claude-code).
Verify
Refs: rfc-006-stack-executor
🤖 Generated with Claude Code