#22: gemini cmdCancel + cancel-marker sentinel + capture_error discrimination#23
Conversation
…re-error Three valid sub-tasks from issue #22, each reproduced before fixing. Sub-task 1 — Wire Gemini cmdCancel Pre-existing gap since af303f0 (PR #14, original plugin import): gemini-companion.mjs's dispatch routed \`cancel\` to fail("not_implemented"), so users had no way to cancel a Gemini background job through the documented interface. The infrastructure was already in place (spawnGemini captures pid_info via the byte-identical capturePidInfo; verifyPidInfo is shared) — just needed the cmdCancel wiring. This commit mirrors Claude's cmdCancel into Gemini (verify pid_info, signal SIGTERM/SIGKILL, emit signaled / already_dead / stale_pid / no_pid_info). Smoke tests cover the active-cancel path, already_terminal, and not_found. Sub-task 2 — Cancel-marker sentinel Reproduced: a target CLI that handles SIGTERM and exits 0 with valid output is mis-classified as \`completed\` (or as \`failed/parse_error\` with no output). Three of four scenarios in my Node-handler test confirm this; the operator's cancel intent is silently lost. Fix: cmdCancel writes <jobsDir>/<jobId>/cancel-requested.flag (mode 0600) BEFORE process.kill. executeRun's finalization reads-and-deletes the marker; if present, it passes status="cancelled" to buildJobRecord and classifyExecution short-circuits (regardless of exitCode/signal). Both companions, with mirrored helpers (cancelMarkerPath, consumeCancelMarker). Smoke regression in both targets uses the new {CLAUDE,GEMINI}_MOCK_TRAP_SIGTERM env hook on the mocks (target traps SIGTERM and exits 0 with valid fixture output) — without the marker, status would be \`completed\`. Sub-task 3 — captureDarwin/Linux capture_error discrimination Reproduced in 4 hostile environments (PATH stripped, hostile ps stub exit 1 + stderr, hostile ps stub exit 0 + empty, sandbox-exec deny-ps): the previous "every error is process_gone" wrapper would falsely promote a LIVE worker to stale via reconcile, and cmdCancel would return already_dead instead of signaling. Fix: only ps-exit-1-with-empty-stderr (BSD ps's "no such process" silent signal) and /proc/<pid>/stat ENOENT remain process_gone. Spawn errors, ps non-zero with stderr text, ps zero with empty output, /proc EACCES all become capture_error. cmdCancel adds an explicit \`unverifiable\` branch that refuses to signal (conservative-correct) instead of falling through to \`stale_pid\` (which would mislead operators). lib/identity.mjs is byte-identical between plugins, so one fix covers both. Coverage: 4 new tests probe each failure mode against a real live pid spawned by the test, plus updated assertions on the existing mocked /proc tests (malformed → capture_error, EACCES → capture_error, ENOENT race → process_gone). Investigated a 4th candidate (Gemini reviewer's "state-lock gate-kidnap → fatal EEXIST crash") and dropped it — see issue #22 comment with the two-stage stress test (1000 acquisitions + 20-victim deliberate-stretch race; zero violations) plus the mkdirSync-atomicity walkthrough. Net: 524 unit/smoke tests / 0 fail (was 519). New tests: 4 capture-error scenarios, 3 Gemini cancel scenarios, 2 SIGTERM-trap cancel scenarios. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request implements the cancel subcommand for the Gemini companion and enhances the cancellation mechanism for both Claude and Gemini. It introduces a 'cancel-marker' file to ensure jobs are correctly classified as cancelled even if the target process traps SIGTERM and exits with code 0. Additionally, it refines process liveness detection by distinguishing between a process being genuinely gone and errors encountered while attempting to capture process info (e.g., sandbox restrictions). Review feedback suggests using path.dirname() when creating directories for marker files to improve maintainability.
Three real failures observed on the first CI run; addressing two of them (SonarCloud is architectural — see below). 1. coverage gate — job-record.mjs lines coverage dropped 91.38 -> 89.23 because the new execution.status === "cancelled" short-circuit (added by the cancel-marker fix) had no unit test in coverage scope. The smoke test exercises it but coverage runs with CODEX_PLUGIN_SKIP_SMOKE=1. Add a pair of unit tests (claude + gemini) that drive classifyExecution directly with status=cancelled + a "successful" execution tuple, asserting the lifecycle override wins. Local coverage now reads 93.85. 2. claude smoke — T7.4 sidecar deletion test flaked on Linux CI with ENOTEMPTY on rmdir of state/<subdir>/. Pre-existing race: the worker's finalization writes meta.json (status=completed) BEFORE upsertJob updates state.json AND BEFORE writeSidecar emits stdout.log/stderr.log. The test polls only meta.json; once it sees completed it runs recursive cleanup, racing the worker's tail writes. macOS local doesn't surface it because timer + fs scheduling lines up favorably; Linux CI hits it. Add a 250ms settle delay after the test detects completion so the worker's tail writes have flushed before the recursive rmSync starts walking. Annotated as the Linux flake mitigation for the next reader. 3. pre-commit test gate timeout — scope.test.mjs (155 real-git tests, ~140s) blows the 60s pre-commit npm test window. Port the SLOW_TEST_BASENAMES skip pattern from PR #21: default npm test is fast (~14s); CI sets CODEX_PLUGIN_FULL_TESTS=1 to include scope.test.mjs. Adds an npm run test:full convenience script for local pre-PR validation. The third CI failure (SonarCloud) flags the byte-identical Claude/Gemini lib pattern as duplication (41.9% on new code, >3% threshold) and 13 Security Hotspots. Both are architectural intent (parity-by-construction + dual-use process spawn for the target CLIs); not fixable without breaking the parity guarantee. Sonar isn't a merge gate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses gemini-code-assist comments 3151882114 / 3151882124 on PR #23. Both companions now derive the marker directory from the canonical cancelMarkerPath helper instead of reconstructing it inline. Single source of truth — if cancelMarkerPath ever moves, the mkdir follows. The deeper drift gap (parallel block in divergent files, no automated parity guard) is tracked in #24 for systematic fix on this branch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pre-fix: capturePidInfo(child.pid) ran synchronously after spawn() returned. On Linux, /proc/<pid>/cmdline still reflected the parent's argv during the fork→execve race window, so recorded argv0 mismatched the live argv0 at cancel time and verifyPidInfo refused to signal with 'argv0_mismatch'. Surfaced as a flake on slower CI runners (PR #23 run 25039611325, smoke claude job). Fix: extract attachPidCapture(child, onSpawn) into lib/identity.mjs (VERBATIM_FILES — byte-identity enforced across plugins). Helper wraps capture in child.once('spawn', ...) so /proc reflects the post-execve target argv. Both dispatcher libs collapse to a single call site; no parallel logic, no dual path. Regression test in both claude/gemini dispatcher unit suites asserts onSpawn fires asynchronously via the 'spawn' event, not synchronously in the spawn() executor turn — fails deterministically on the pre-fix code, passes on the post-fix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Issue #22 sub-task 2 introduced the cancel-marker sentinel as inline duplicates in claude-companion.mjs and gemini-companion.mjs — three pure functions plus the inline write block in cmdCancel. The two companions are NOT in VERBATIM_FILES (they import different dispatchers), so nothing automated catches drift between these parallel blocks. Today's gemini-code-assist nit (dirname(markerPath)) had to be applied twice manually; a one-sided edit would have silently drifted. Extract cancelMarkerPath, writeCancelMarker, consumeCancelMarker into lib/cancel-marker.mjs. Add to VERBATIM_FILES so byte-identity is enforced across both copies. Both companions collapse to import calls; no dual path. writeCancelMarker throws on failure so the caller's target-specific stderr warning stays at the call site. Adds tests/unit/cancel-marker.test.mjs covering happy path, mode 0600, ISO timestamp body, fs-error propagation, and missing-marker consume. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The prior commit's unit test imported only from the claude side, so the coverage gate (CODEX_PLUGIN_COVERAGE=1) marked plugins/gemini/scripts/lib/cancel-marker.mjs at 0% line/function and failed the 85% target. Mirrors the existing `* as GeminiIdentity` pattern in identity.test.mjs: import the gemini side and drive a minimal happy-path so the byte-identical copy gets coverage credit. Local coverage gate now passes — both cancel-marker copies at 100/100/100. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implements Option Y+ from the external consultation round on cancel
exit-code semantics, plus the pre-spawn marker checks that came out of
adversarial review (Vectors 1, 3, 5, 8).
Exit-code contract (both companions, byte-symmetric):
exit 0: signaled, already_dead, already_terminal, cancel_pending (NEW)
exit 1: bad_args, not_found, bad_state (NEW), signal_failed, cancel_failed (NEW)
exit 2: stale_pid, unverifiable, no_pid_info (when status=running)
cmdCancel now distinguishes truly-terminal jobs from queued jobs:
queued + cancel writes a cancel marker and returns cancel_pending exit 0,
durably recording operator intent. Unknown statuses surface as bad_state
exit 1 instead of being silently treated as queued. Marker write failure
in the queued branch returns cancel_failed exit 1, since the marker IS
the cancel mechanism for queued jobs (no SIGTERM fallback).
cmdRunWorker and executeRun both consume the cancel marker before
spawning the target binary. The cmdRunWorker check is a fast path; the
executeRun check is load-bearing — it narrows the queued race window
from "containment + scope copy + pre-snapshot + spawn" (potentially
seconds with a large repo) to the microseconds between the check and
child.once('spawn') firing. The executeRun check also covers foreground
runs that bypass cmdRunWorker.
Tests:
- queued cancel → cancel_pending, marker present, exit 0
- _run-worker honors marker → status=cancelled, no spawn
- queued + marker write fails → cancel_failed exit 1
- unknown status → bad_state exit 1
- capture_error on running job → exit 2 (was: exit 0; pre-existing test
encoded the bug per consultation Round 2)
Docs:
- commands/{claude,gemini}-cancel.md: full status × exit-code taxonomy
including cancel_failed and bad_state
- README.md: removed stale "Gemini cancel deferred" claims
- tests/unit/docs-contracts.test.mjs: updated to match wired-cancel
reality (gemini cancel was wired in PR #22 but doc-tests still
asserted "not_implemented")
Deferred to follow-up commits per the agreed 6-commit sequence:
- Class 2: assertSafeJobId at marker boundary (path traversal)
- Class 3: /proc precondition + Darwin /bin/ps absolute
- Class 4: automated docs-vs-emitted-statuses contract test
- Class 5: argv0 child-binary assertion + configureState keys
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reviewer Vector 4 (adversarial round): cmdCancel parses state.json and passes job.id straight to writeCancelMarker. cancelMarkerPath did string concat with no validation, so a tampered state.json carrying a traversal id (e.g. "../../../escape") would land the marker outside the jobs dir. Fix: export assertSafeJobId from lib/state.mjs (already used internally for writeJobFile / resolveJobFile / state.json upserts) and call it at the top of cancelMarkerPath. writeCancelMarker and consumeCancelMarker inherit the validation because they go through cancelMarkerPath. End-to-end behavior with tampered state.json: cmdCancel queued branch calls writeCancelMarker → throws "Unsafe jobId: ..." → caught by the Fix 2 path from the prior commit → cancel_failed exit 1. No marker escapes the jobs dir. Tests: - cancelMarkerPath rejects 8 tampered values (../../../, foo/bar, .., null bytes, empty, etc.) - writeCancelMarker / consumeCancelMarker inherit the gate - Same coverage on the gemini side via * as GeminiCancelMarker Class 2 in the agreed 6-commit Class sequence (1 → 2 → 3 → 4 → 5a → 5b). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two platform-specific identity-capture bugs from reviewer round 2 (Finding B).
Linux (3a): captureLinux treated existsSync('/proc/<pid>/stat') === false as
proof of process_gone. In a hardened container or sandbox where /proc is
unmounted, every existsSync of a per-pid stat returns false — falsely
classifying LIVE pids as process_gone, which cmdCancel would then return
as already_dead exit 0. Fix: precondition existsSync('/proc') first; if
absent, throw capture_error (cmdCancel maps to unverifiable + exit 2).
Darwin (3b): captureDarwin invoked PATH-resolved 'ps'. A stripped or
shimmed PATH (sandbox env scrubbing, hostile stub) could break ownership
capture in two ways: (1) ps not findable → capture_error (already
handled, but undesirable); (2) silent-exit-1 stub on PATH mimics BSD
'no such pid' signal, falsely classifying LIVE pids as process_gone.
Fix: pin /bin/ps absolute. /bin/ps is part of the macOS base install
and stable; if it's missing or denied, spawnSync.error → capture_error
(safe answer).
Tests:
- Linux: /proc mounted but per-pid stat missing → process_gone (existing)
- Linux: /proc unmounted → capture_error (NEW)
- Darwin: identity.test.mjs's PATH-shim parser tests rewritten to use
spawnSync mocking via withPlatformAndChild (the new helper). Adds an
explicit assert.equal(observedBinary, "/bin/ps") regression guard.
- Darwin: identity-capture-error.test.mjs's hostile-PATH tests inverted
— they now prove PATH manipulation has NO effect on capture (Class 3b
regression guards), since /bin/ps is invoked absolute.
Class 3 in the agreed 6-commit Class sequence (1 → 2 → 3 → 4 → 5a → 5b).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reviewer round 2 Finding H: the existing #25 dispatcher tests ("spawnClaude / spawnGemini: onSpawn fires asynchronously via 'spawn' event, not synchronously") only assert async timing of the callback. A regressed implementation that captures pid_info synchronously but defers the onSpawn callback would still pass them — yet the captured argv0 would be the PARENT's (the test runner = "node"), not the child binary. The reviewer's repro confirmed this: a hypothetical attachPidCapture that records `{argv0: "node /parent/argv.mjs"}` synchronously and fires onSpawn from child.once('spawn') passes every assertion the existing tests make. Fix: add a focused regression test that spawns a child with a different binary from the parent (/bin/sleep, not node) and asserts the captured argv0 reflects the child binary post-execve. Pre-execve capture would record argv0 starting with "node ..."; post-execve capture records "sleep" (Darwin ps -o comm=) or "/bin/sleep" (Linux /proc/<pid>/cmdline). The dispatcher #25 tests stay as-is — they prove a different (still valuable) property: that onSpawn fires asynchronously through the spawnClaude/spawnGemini integration path. This new test pins the underlying lib API (attachPidCapture) to post-execve correctness. Skips: non-linux/darwin platforms, missing /bin/sleep, and macOS-under- coverage (sandbox can deny ps; Linux CI run covers it). Class 5a in the agreed 6-commit Class sequence (1 → 2 → 3 → 4 → 5a → 5b). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ing I)
Reviewer round 2 Finding I: cancel-marker.test.mjs called
configureState({envVar, fallbackBaseName, sessionEnvVar}) — but the
actual configureState API uses pluginDataEnv, fallbackStateRootDir,
sessionIdEnv. The wrong keys are silently ignored (the function only
applies recognized keys), so the call was a complete no-op. The tests
still passed because process.env.CLAUDE_PLUGIN_DATA was set on the
next line, providing the actual config — the configureState call was
dead code masking the API misuse.
Fix: use correct keys at all three call sites in cancel-marker.test.mjs:
- freshWorkspace() helper (claude side, line 23)
- "gemini cancelMarkerPath: same path-traversal defense" (gemini side, line 163, added in Class 2)
- "gemini cancel-marker: byte-identical helper exercised on the gemini side" (gemini side, line 186)
The configureState call now meaningfully resets the module config to
its claude-side / gemini-side defaults — the original intent of the
helpers, which the wrong keys were silently defeating.
The other configureState callsite in tests/smoke/gemini-companion.smoke.test.mjs:805
already uses correct keys (pluginDataEnv, sessionIdEnv) — no change there.
Class 5b in the agreed 6-commit Class sequence (1 → 2 → 3 → 4 → 5a → 5b).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Follow-up commit
Verification on the rebased commit:
|
|
Implemented the repo-owned review reliability fixes from #28. What changed:
Verification:
Pushed commit: |
|
Captures the load-bearing architectural decisions made during the early development of this repo so future contributors can understand why the implementation differs from a simple upstream port of openai/codex-plugin-cc. Topics covered: - Containment vs. scope as orthogonal dimensions (the most important architectural choice) - ModeProfile as the single source of mode defaults (model tier, permission mode, disallowed tools, containment, scope, dispose, context-stripping, add-dir) - One JobRecord shape across foreground/background/status/result paths - Distinct identity types (job_id vs target session IDs vs resume chains vs PID ownership) - Shared-library checks need behavior tests, not only byte identity - Upstream relationship to openai/codex-plugin-cc Doc was written during fix/022-followups development but never committed before PR #23 merged. This commit lands it on main as a standalone docs addition. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>



Closes #22. Closes #24. Closes #25. Closes #26. Closes #28.
Summary
Three sub-tasks from #22, each reproduced before fixing. For reviewers in a hurry, the parts that need adversarial attention are #2 and #3. #1 is a mechanical mirror of existing Claude code.
cmdCancelclaude-companion.mjs's existingcmdCancel(verify pid_info → SIGTERM/SIGKILL → emit signaled/already_dead/stale_pid/no_pid_info/unverifiable). Any bug here also exists in the unmodified Claude original.plugins/gemini/scripts/gemini-companion.mjs(+~70 lines), 3 smoke testscmdCancelwrites<jobsDir>/<jobId>/cancel-requested.flag(mode 0600) BEFOREprocess.kill.executeRun's finalization reads-and-deletes the marker and forcesstatus=cancelled. NewclassifyExecutionshort-circuit onexecution.status === "cancelled".lib/job-record.mjs, both mocks (MOCK_TRAP_SIGTERM), 2 smoke testscapture_error/process_gonediscriminationverifyPidInfo.captureDarwin/captureLinuxpreviously wrapped every failure asprocess_gone. New rule: onlyps exit 1 + empty stderrand/proc/<pid>/statENOENT remainprocess_gone. Spawn errors, ps exit non-zero with stderr, ps exit 0 with empty output, /proc EACCES becomecapture_error.cmdCanceladds anunverifiablebranch.lib/identity.mjs(byte-identical pair), 4 new unit tests, plus updated assertions on existing mocked /proc testsWhy each was filed
af303f0— Gemini users had no documented way to cancel a background job.completed. 3 of 4 Node-handler scenarios in my repro confirm it. Operator's cancel intent is silently lost.PATH="", hostile ps stub exit 1+stderr, hostile ps stub exit 0+empty,sandbox-exec deny-ps). All four mis-classified a LIVE pid asprocess_gone. In production this would causereconcileActiveJobsto falsely promote a live worker tostale, andcmdCancelto silently no-op asalready_dead.A 4th candidate from the original adversarial review on PR #21 (state-lock gate-kidnap → fatal
EEXIST) was investigated and dropped — see issue #22 for the two-stage stress test (1000 + 20-victim race) and the mkdirSync-atomicity walkthrough.Relationship to PR #21
PR #23 is branched from
origin/main, not from PR #21. Practical implications for reviewers:finalization_failedclassification, sharedlib/git-env.mjs, stale orphan continuability, legacy git/workspace test scrub,cmdResultEISDIR — all already on PR #16: post-merge hardening (Items 1-9) #21's branch.lib/job-record.mjsstatus === "cancelled"short-circuit is adjacent to (but doesn't depend on) PR #16: post-merge hardening (Items 1-9) #21'sstatus === "stale"short-circuit. Both will land in the same function via separate PRs; expect a small textual conflict during whichever lands second.scripts/ci/run-tests.mjs(added here as a CI-fixup commit) is also in PR #16: post-merge hardening (Items 1-9) #21. Whichever lands second will have a no-op duplicate to drop during rebase.Test plan
npm test— 526 unit/smoke tests / 0 failCODEX_PLUGIN_FULL_TESTS=1 npm test— full matrix passesnpm run lint— manifests validlib/identity.mjsverified)🤖 Generated with Claude Code