feat(harness): PI + gptme + mini-SWE wave — 3 columns, board 32/45 (RFC-006 Phase 5) by explosivebit · Pull Request #57 · ForgePlan/pollmevals

explosivebit · 2026-06-03T17:42:40Z

What

Three model-agnostic harnesses — PI, gptme, mini-SWE-agent — built + smoked
in parallel by a 3-agent team (offloading per-harness Docker recipe-discovery),
integrated serially. Board now 32/45 scored, 8 harness columns.

Results (be_01)

harness	scored	notes
PI	3/4	devstral 7.46 · codestral 6.67 · qwen3-235b 6.17 (native tool_calls only)
gptme	1/4	codestral 7.17 (no turn cap → verify-loop timeouts elsewhere)
mini-SWE	2/4	qwen3-coder-30b 7.08 · qwen-3-14b 6.54 (textbased loop)

★ The tool-call-parser finding (explains the whole matrix)

A model's proxy/vLLM backend decides whether it emits native tool_calls or
text-format (XML/markdown fences). qwen3-coder-30b emits text → native-tool
harnesses (PI, opencode) no-op on it (this is why opencode is 1/4!). Native-tool
models on the proxy: devstral, codestral, qwen3-235b, glm-4-32b. Text-tolerant
harnesses (aider, goose, crush, gptme, mini-SWE-textbased) work on text-format models.
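
A minimal sketch of the finding above, assuming an OpenAI-style chat message dict in which native calls arrive under a `tool_calls` key. The function names (`classify_tool_format`, `native_parser_mismatch`) and the text-call patterns are hypothetical illustrations, not code from this PR. The harness groupings are the ones listed in the description.

```python
import re
from typing import Any, Literal

ToolFormat = Literal["native", "text", "none"]

FENCE = "`" * 3  # built at runtime so this example does not nest a literal fence

# Illustrative text-format patterns (assumptions): a fenced bash/sh block
# or an XML-style <tool_call> element embedded in the message content.
_TEXT_CALL = re.compile(
    FENCE + r"(?:bash|sh)\n.*?" + FENCE + r"|<tool_call>.*?</tool_call>",
    re.DOTALL,
)

# Groupings taken from the PR description.
NATIVE_ONLY = {"pi", "opencode"}
TEXT_TOLERANT = {"aider", "goose", "crush", "gptme", "mini-swe"}


def classify_tool_format(message: dict[str, Any]) -> ToolFormat:
    """Classify one assistant message by how it expresses a tool call."""
    if message.get("tool_calls"):
        return "native"
    content = message.get("content") or ""
    if _TEXT_CALL.search(content):
        return "text"
    return "none"


def native_parser_mismatch(harness: str, fmt: ToolFormat) -> bool:
    """True when a native-tool harness receives a reply without native tool_calls.

    This is the no-op case described above. The function makes no claim
    about how text-tolerant harnesses behave.
    """
    return harness in NATIVE_ONLY and fmt != "native"
```

For example, a reply whose content holds only a fenced bash block classifies as `"text"`, so `native_parser_mismatch("pi", "text")` is true while `native_parser_mismatch("gptme", "text")` is false.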

How

- PI: models.json via config_files + PI_CODING_AGENT_DIR; native tool_calls.
- gptme: env-only; tiktoken cache baked (no-egress); wall-clock-bounded.
- mini-SWE: bare validator loop (L0+L2+L7); --environment-class local (no nested
  Docker) + litellm_textbased; MSWEA guards; trajectory → /tmp.
- config_files launcher (from the opencode wave) handles PI/mini config.
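
The wall-clock bound used for gptme can be sketched roughly as follows, assuming the harness is launched as a child process and the patch is read from the host working tree afterwards. `run_bounded` and `capture_patch` are hypothetical names, not the executor's actual functions.

```python
import subprocess
from pathlib import Path


def run_bounded(cmd: list[str], cwd: Path, timeout_s: float) -> bool:
    """Run cmd in cwd; return True if it exited by itself, False if it was killed.

    subprocess.run kills the child when the timeout expires and then
    raises TimeoutExpired, which is treated here as a normal outcome.
    """
    try:
        subprocess.run(cmd, cwd=cwd, timeout=timeout_s, check=False)
    except subprocess.TimeoutExpired:
        return False
    return True


def capture_patch(cwd: Path) -> str:
    """Return the working-tree diff of the repository at cwd.

    Edits the harness wrote before a timeout kill are already on disk,
    so the diff is still usable after run_bounded returns False.
    """
    result = subprocess.run(
        ["git", "diff"], cwd=cwd, capture_output=True, text=True, check=True
    )
    return result.stdout
```

Calling `capture_patch` regardless of what `run_bounded` returns mirrors the behavior where a timeout still yields a patch. Note that plain `git diff` omits untracked files, so a real capture step would need to account for files the harness creates.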

Verify

734 tests green; ruff + mypy --strict clean; 15/15 stack specs valid.

Refs: rfc-006-stack-executor

🤖 Generated with Claude Code

…006 Phase 5)

Three model-agnostic harnesses built + smoked IN PARALLEL by a 3-agent team
(offloading the per-harness Docker recipe-discovery from the main context).
All edit the real cwd; the patch is captured by host git-diff.

- pi (@earendil-works/pi-coding-agent): models.json via config_files +
  PI_CODING_AGENT_DIR; native tool_calls → runs
  devstral/codestral/qwen3-235b/glm-4-32b (qwen3-coder-30b emits text-format
  tool calls its proxy backend can't parse → pi no-ops; a real compat finding).
- gptme: env-only (OPENAI_BASE_URL + MODEL=local/<m>, prefix stripped);
  tiktoken cache baked (no-egress); no turn cap → wall-clock bounds it (patch
  written before the verify loop, so a timeout-kill still yields a valid patch).
- mini-SWE-agent: the bare validator loop (L0+L2+L7). --environment-class local
  (bash in /workspace, NO nested Docker) + litellm_textbased (fenced bash, not
  native tool_calls) + MSWEA_CONFIGURED/COST_TRACKING guards. Trajectory → /tmp.

Cross-harness finding: a model's PROXY BACKEND tool-call-parser determines
native-tool support: qwen3-coder-30b returns text-format calls (explains
opencode 1/4); devstral/codestral/qwen3-235b/glm-4-32b emit native tool_calls.

ruff + mypy --strict clean; 15/15 stack specs valid.
supported_harnesses = {aider, goose, opencode, crush, cline, pi, gptme, mini-swe}.

Refs: rfc-006-stack-executor
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The 3-harness wave's scored columns (--add-stack each), merged into board.json:

- PI (native tool_calls): devstral 7.46, codestral 6.67, qwen3-235b 6.17
  (3/4; glm-4-32b fail). Native-tool models only; the tool-parser finding holds.
- gptme (text-tolerant): codestral 7.17 (1/4: no turn cap → verify-loop
  timeouts on the other models; wall-clock kills before a clean finish).
- mini-SWE (textbased loop): qwen3-coder-30b 7.08, qwen-3-14b 6.54 (2/4).

Board now 32/45 scored, 8 harness columns. Harness display name pi→PI.

Refs: rfc-006-stack-executor
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

explosivebit and others added 2 commits June 3, 2026 20:12

explosivebit merged commit 164f14f into main Jun 3, 2026
3 checks passed

explosivebit deleted the feat/harness-wave-pi-gptme-mini branch June 3, 2026 17:42

explosivebit merged 2 commits into main from feat/harness-wave-pi-gptme-mini


1 participant
