feat(harness): goose CLI harness — real L2 matrix column (RFC-006 Phase 5)#53

Merged: explosivebit merged 2 commits into main from feat/harness-goose on Jun 3, 2026
@explosivebit

What

Second real harness after aider — goose (Block's agentic CLI). Model-agnostic
(OpenAI-compatible), so it runs the same coder models as aider → a clean
L2-vs-L4 harness comparison on identical models (raw-llm L0 | goose L2 | aider L4).

Result (live on the board)

board.json now 22/25 cells scored. New goose column on be_01_jwt_auth:

| model | goose score |
| --- | --- |
| qwen3-coder-30b | 7.17 |
| devstral | 7.12 |
| qwen-3-14b | 6.0 |
| codestral | failed (shown honestly, not hidden) |

How (the proven aider pattern, per harness)

  • Image infra/docker/harness-goose: pinned goose v1.36.0 binary fetched
    directly (the official install script falls back to a mirror lacking the arm64
    asset). Non-root, GOOSE_DISABLE_KEYRING, GOOSE_MODE=auto. Built +
    isolation-smoked end-to-end (goose → proxy → wrote a file in the sandbox).
  • Recipe _goose_invocation: sets env GOOSE_PROVIDER/GOOSE_MODEL plus goose's
    split OPENAI_HOST / OPENAI_BASE_PATH; passes the prompt via -t. Promoted
    goose to proven and removed it from _PENDING_RECIPES.
  • stack.yaml: --with-builtin developer (file tools) + loop bounds.
  • --add-stack: runs ONE harness's grid and merges its column (cells +
    harness metadata) into the existing board.json — no re-spend on the rest.
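
The endpoint split mentioned in the recipe bullet can be sketched roughly as below. This is an illustration under stated assumptions, not the code in stack_executor.py: the helper names `split_openai_endpoint` and `goose_env`, the `openai` provider value, and the `v1` fallback prefix are hypothetical. Only the four environment variable names come from this PR.

```python
from urllib.parse import urlsplit


def split_openai_endpoint(base_url: str) -> tuple[str, str]:
    """Split an OpenAI-compatible base URL into (host, base_path)."""
    parts = urlsplit(base_url)
    host = f"{parts.scheme}://{parts.netloc}"
    # Strip slashes so ".../v1" and ".../v1/" produce the same base path.
    # Falling back to "v1" when the URL has no path is an assumption.
    prefix = parts.path.strip("/") or "v1"
    return host, f"{prefix}/chat/completions"


def goose_env(base_url: str, model: str) -> dict[str, str]:
    """Build the environment for one goose run (hypothetical helper)."""
    host, base_path = split_openai_endpoint(base_url)
    return {
        "GOOSE_PROVIDER": "openai",  # assumed provider value
        "GOOSE_MODEL": model,
        "OPENAI_HOST": host,
        "OPENAI_BASE_PATH": base_path,
    }
```

Normalizing the trailing slash in one place keeps the two variables consistent regardless of how the proxy URL is written in the stack config.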

Honest unmetered cost

goose is a black-box CLI that doesn't self-report tokens, so its harness cost is
unmetered (aider's token-regex doesn't apply). Rather than a misleading "$0 /
free", cost `0` renders as "—" and the cell is excluded from the cost axis /
Pareto frontier / "cheapest" highlight. The real fix is proxy-side cost
reconciliation: LiteLLM `/spend/logs` already returns real per-request tokens
by `model_group`, but needs pagination + flush-delay handling. That is a
follow-up that meters every black-box harness (goose, codex, claude-code).
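
The display rule described above (a cost of exactly 0 means "unmetered", not "free") can be sketched as follows. The site's real helpers are formatCost and metricValue; the Python names, the four-decimal dollar format, and the `cheapest` helper here are assumptions made only to illustrate the rule.

```python
from typing import Optional

UNMETERED_LABEL = "\u2014"  # the dash placeholder shown instead of "$0"


def format_cost(cost_usd: float) -> str:
    """Render a cell's cost; exactly 0 is treated as unmetered."""
    if cost_usd == 0:
        return UNMETERED_LABEL
    return f"${cost_usd:.4f}"  # display precision is an assumption


def metric_value(cost_usd: float) -> Optional[float]:
    """Return None for unmetered cells so heat/best/range skip them."""
    return None if cost_usd == 0 else cost_usd


def cheapest(costs: dict[str, float]) -> Optional[str]:
    """Pick the cheapest metered cell, ignoring unmetered ones."""
    metered = {k: v for k, v in costs.items() if metric_value(v) is not None}
    if not metered:
        return None
    return min(metered, key=lambda name: metered[name])
```

Mapping 0 to None at the metric layer means every downstream consumer (ranking, frontier, highlight) excludes unmetered cells without needing its own special case.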

Verify

  • 728 eval-core tests green; ruff + mypy --strict clean; 12/12 stack specs valid.
  • Site build + Playwright verify clean: goose column renders, no hydration error.

Refs: rfc-006-stack-executor

🤖 Generated with Claude Code

explosivebit and others added 2 commits June 3, 2026 17:18
…ase 5)

Second harness after aider. goose is model-agnostic (OpenAI-compatible
provider), so it runs the same coder models — the cleanest "swap the harness,
hold the model" comparison column (raw-llm L0 | goose L2 | aider L4).

- infra/docker/harness-goose: pinned goose v1.36.0 binary, fetched directly
  (the official install script falls back to a mirror that lacks the arm64
  asset). Non-root, GOOSE_DISABLE_KEYRING, GOOSE_MODE=auto. Built + isolation-
  smoked end-to-end (goose run → proxy → wrote a file in the sandbox workspace).
- _goose_invocation recipe in stack_executor.py: env GOOSE_PROVIDER/GOOSE_MODEL
  + split OPENAI_HOST/OPENAI_BASE_PATH (goose's endpoint shape), prompt via -t.
  Promoted goose proven → removed from _PENDING_RECIPES.
- stacks/goose/stack.yaml: --with-builtin developer (file tools) + loop bounds.
- tests: goose recipe-proven + trailing-slash; supported_harnesses={aider,goose}.

728 tests green; ruff + mypy --strict clean; 12/12 stack specs valid.

Refs: rfc-006-stack-executor

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Runs the goose grid (--add-stack goose: goose × {qwen-3-14b, qwen3-coder-30b,
codestral, devstral} × be_01, 2 seeds) and merges the new harness column into
board.json without re-spending on raw-llm/aider. Board now 22/25 scored; goose ×
{qwen3-coder-30b 7.17, devstral 7.12, qwen-3-14b 6.0} — a clean L2-vs-L4 harness
comparison on identical models (codestral failed, shown honestly).

Honest cost: goose is a black-box CLI that doesn't self-report tokens, so its
harness cost is unmetered (aider's token-regex doesn't apply). Rather than render
a misleading "$0 / free":
- formatCost(): cost 0 → "—" (a metered eval is never exactly $0).
- metricValue(cost): 0 → null, so the matrix heat / best / range skip it.
- frontierKeys + Pareto: exclude cost 0 (can't place on the cost axis honestly).
- matrix tooltip / master table / drawer / per-task winners use formatCost.

Real harness cost lands via proxy-side reconciliation (LiteLLM /spend/logs
already returns real per-request tokens by model_group; needs pagination +
flush-delay handling) — a follow-up that fixes cost for every black-box harness.
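
A rough sketch of the aggregation step such a reconciliation would need, once the log rows have been fetched. It deliberately leaves out the HTTP call, pagination, and flush-delay handling named above. The row field names (`model_group`, `prompt_tokens`, `completion_tokens`) and the function name are assumptions about the log shape, not confirmed against LiteLLM.

```python
from collections import defaultdict
from typing import Any, Iterable, Mapping


def tokens_by_model_group(
    rows: Iterable[Mapping[str, Any]],
) -> dict[str, dict[str, int]]:
    """Sum per-request token counts for each model group."""
    totals: dict[str, dict[str, int]] = defaultdict(
        lambda: {"prompt_tokens": 0, "completion_tokens": 0}
    )
    for row in rows:
        # Rows without a group are bucketed rather than dropped.
        group = str(row.get("model_group") or "unknown")
        for key in ("prompt_tokens", "completion_tokens"):
            # Missing or null counts are treated as 0.
            totals[group][key] += int(row.get(key) or 0)
    return dict(totals)
```

Token totals per group could then be priced with the same rates used for metered harnesses, which would replace the unmetered placeholder with a real cost.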

--add-stack merges ONE harness column (cells + harness metadata) into an existing
board.json, so future harnesses don't re-spend on the rest.
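
The column merge can be illustrated with a minimal sketch. The board layout used here (a `harnesses` mapping plus a flat `cells` list keyed by a `harness` field) and the function name `merge_stack_column` are hypothetical; the actual board.json schema is not shown in this PR.

```python
import copy
from typing import Any


def merge_stack_column(
    board: dict[str, Any],
    harness: str,
    metadata: dict[str, Any],
    cells: list[dict[str, Any]],
) -> dict[str, Any]:
    """Return a copy of the board with one harness column replaced."""
    merged = copy.deepcopy(board)  # never mutate the caller's board
    merged.setdefault("harnesses", {})[harness] = metadata
    # Keep every cell that belongs to other harnesses untouched.
    kept = [c for c in merged.get("cells", []) if c.get("harness") != harness]
    merged["cells"] = kept + [dict(c, harness=harness) for c in cells]
    return merged
```

Replacing only the cells tagged with the target harness makes the merge idempotent: re-running the same column overwrites it instead of duplicating cells, and the other columns are never re-scored.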

Site build + Playwright verify clean (goose column renders, no hydration error).

Refs: rfc-006-stack-executor

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@explosivebit explosivebit merged commit 98bc5fe into main Jun 3, 2026
6 checks passed
@explosivebit explosivebit deleted the feat/harness-goose branch June 3, 2026 14:42
explosivebit added a commit that referenced this pull request Jun 3, 2026
#58)

Pick-up doc for a fresh session. Continues HANDOFF-2026-06-03.md (which named the
harness batch as next). This session shipped it: 2 → 8 harness columns
(goose/opencode/crush/cline/PI/gptme/mini-swe, PRs #53-57), board 32/45, the
config_files launcher, parallelization-via-agents, and THE tool-call-parser
finding (native vs text-format tool_calls → explains the whole matrix). Lays out
the v0.2 phase: ADR (stronger/more judges + linting/typing criteria) →
--fill-missing mode → fill the grid + stronger OpenRouter models → vendor trio
(codex-relay / claude-code×opus / gemini-2.5-pro×goose). + the harness-add
pattern, dev workflow, every hard-won gotcha, key-files map, mission.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>