Skip to content

#16: post-merge hardening (Items 1-9)#21

Open
seungpyoson wants to merge 15 commits intomainfrom
fix/016-post-merge-hardening
Open

#16: post-merge hardening (Items 1-9)#21
seungpyoson wants to merge 15 commits intomainfrom
fix/016-post-merge-hardening

Conversation

@seungpyoson
Copy link
Copy Markdown
Owner

Closes #16.

Resolves all eight follow-ups from #16 plus the test-harness isolation addendum. Companion-side hardening across both Claude and Gemini plugins; no production-API surface change. Sits on top of two earlier docs/license merges (PR #18 and PR #19) that landed into the branch during this work.

Summary

  • Item 1 — finalization vs sidecar. Split executeRun()'s persistence into contractual (meta + state, fatal with fallback failed-record) vs diagnostic (sidecar warnings only). Fixes the bug where a failed stdout.log write returned finalization_failed despite a successful target run.
  • Item 2 — cancel/cancelled lifecycle. classifyExecution now produces a real cancelled status from a SIGTERM/SIGKILL exit (with timedOut taking precedence as error_code: timeout). Removes the manual-JSON-mutation tests that previously stood in for cancel.
  • Item 3 — orphan reconciliation. New byte-identical lib/reconcile.mjs walks active records and promotes them to stale (continuable) when pid_info proves the worker is gone or the job has been queued past the orphan window. Wired into cmdStatus for self-healing.
  • Item 4 — status UX. Default status filter now surfaces cancelled + stale alongside running/completed/failed. Only queued (transient pre-spawn) is hidden by default.
  • Item 5 — state-lock diagnostics. One-line stderr warnings (no secret leakage) for corrupt owner.json, cross-host owner contention, and orphan-dir put-back failures. Acquire-after-reclaim while still holding .state.lock.gate so bursty contention stops throwing away work.
  • Item 6 — working-tree privacy + git retry. listIgnoredUntrackedFiles() retries git ls-files 3× / 50ms before failing closed. Operator-facing docs (skill MDs + Gemini result command) now state explicitly that the gitignored privacy default applies only inside a git worktree.
  • Item 7 — env scrub future-proofing. Added LITELLM_ and OLLAMA_ to PROVIDER_PREFIXES (router/proxy ecosystems that re-route target traffic). Documented decision NOT to scrub HTTP_PROXY / HTTPS_PROXY / NO_PROXY (corporate networks depend on those).
  • Item 8 — configureState() test hygiene. New withStateConfig(state, override, fn) helper wraps every short-lock-timeout test so an early failure cannot leak overrides between tests.
  • Item 9 (addendum) — test-harness isolation. Pre-commit npm test was hijacking the caller checkout into fixture git state when launched from a hook context. New tests/helpers/fixture-git.mjs is the single source of truth for fixture git invocations (scrubs every dangerous GIT_* env var, pins cwd, deterministic identity). Runner-level scrub in scripts/ci/run-tests.mjs as defense in depth. Default npm test now skips tests/unit/scope.test.mjs (155 real-git tests, ~140s) so it fits the 60s pre-commit gate; CI sets CODEX_PLUGIN_FULL_TESTS=1 (or npm run test:full locally) to opt back in.

Test plan

  • git diff --check — clean
  • npm run lint — pass
  • npm run lint:self-test — pass
  • npm test (fast subset, default — pre-commit gate) — 374 pass / 0 fail / 4 skipped (~17 s)
  • CODEX_PLUGIN_FULL_TESTS=1 npm test (full unit matrix — CI mode) — 529 pass / 0 fail / 4 skipped
  • npm run smoke:claude — 28 pass / 0 fail
  • npm run smoke:gemini — 18 pass / 0 fail
  • CODEX_PLUGIN_SKIP_SMOKE=1 COVERAGE_ENFORCE_TARGET=1 CODEX_PLUGIN_FULL_TESTS=1 npm run test:coverage — coverage target met (85.00%)

Notes

  • PR #18 (docs archive) and PR #19 (AGPL license switch) commits are part of this PR's diff because they were merged into fix/016-post-merge-hardening rather than main. They land into main via this PR.
  • #13 deferred upstream scope is preserved — no ping slash command files restored.
  • CLAUDE.md (new) documents the fast/full test split for future sessions.
  • Per-issue mapping (each follow-up → exact code/tests/docs changed) is in the issue close-out comment: Post-merge follow-ups: lifecycle, state, and review hardening polish #16 (comment)
  • This PR is approved to land with 3 pre-existing audit findings on state.mjs and the two SKILL.md files. None were introduced by this PR; they predate Post-merge follow-ups: lifecycle, state, and review hardening polish #16 and should be tracked / resolved as a follow-up audit pass after merge.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces significant improvements to job lifecycle management, including orphan job reconciliation, enhanced finalization error handling, and robust test isolation for git-based fixtures. It also updates the project license to AGPL-3.0-only and adds diagnostic logging for state lock contention. I have provided feedback on improving error reporting during fallback record persistence and suggested refactoring the git fixture helper for better maintainability.

Comment thread plugins/claude/scripts/claude-companion.mjs Outdated
Comment thread plugins/gemini/scripts/gemini-companion.mjs Outdated
Comment thread tests/helpers/fixture-git.mjs
seungpyoson pushed a commit that referenced this pull request Apr 28, 2026
… MED 1)

Four reviewers (codex/kimi/glm/gemini) ran an adversarial pass on PR #21.
This commit addresses every finding I could substantiate with a repro and
folds the fixes onto the *class* of problem rather than patching instances.

Confirmed bugs and fixes:

BLOCKER 1 — finalization fallback clobbered good meta on state-only failure
  Repro: writeJobFile succeeds, upsertJob fails (lock timeout) → fallback
  re-issues writeJobFile unconditionally → "completed" meta becomes
  "failed/spawn_failed". Fix: split fallback per error side. metaError →
  rewrite both (state was rolled back too); stateError-only → preserve
  meta, retry upsertJob with the GOOD record before falling back.

BLOCKER 2 — reconcile TOCTOU race overwrote terminal completed records
  Repro: reconcile reads meta=running, classifies stale; worker writes
  completed mid-flight; reconcile then writes stale → "REAL_WORKER_RESULT"
  permanently lost. Fix: new commitJobRecordIfActive primitive in
  state.mjs holds the state lock across an in-lock CAS read of meta.json
  + the write; companion finalization moves to commitJobRecord so both
  writers serialize. Either order is safe — terminal always wins.

HIGH 1 — finalization_failed misclassified as spawn_failed
  classifyExecution short-circuited on errorMessage before signal/timedOut.
  Disk-full / lock-timeout finalization errors looked like missing-binary
  errors to monitoring. Fix: detect "finalization_failed:" prefix and emit
  error_code=finalization_failed.

HIGH 2/3 — GIT_CONFIG_GLOBAL leaked through 4 of 5 scrub callsites
  Repro: GIT_CONFIG_GLOBAL=/evil node --test … → fixture branch became
  "injected-master". Companion mutation-detection git would inherit
  safe.directory=*. Fix: extract STRIPPED_GIT_ENV_KEYS + cleanGitEnv into
  byte-identical lib/git-env.mjs (added to VERBATIM_FILES). Companions,
  scope.mjs, fixture helper, and runner all import the canonical list.
  Adds GIT_CONFIG_GLOBAL/SYSTEM, GIT_TRACE family, GIT_OPTIONAL_LOCKS,
  GIT_TERMINAL_PROMPT, GIT_PROTOCOL, GIT_AUTO_GC.

HIGH 4 — stale records from fresh-running orphans not actually continuable
  PR #21 claims stale is continuable, but onSpawn writes claude_session_id=null
  so reconcile carries null, so cmdContinue rejects with "no claude_session_id".
  Fix: cmdContinue now distinguishes the stale-orphan case and returns an
  actionable suggestion ("re-run from scratch via …") with the prior
  mode/cwd preserved.

HIGH 5 — git/workspace tests didn't scrub GIT_*
  Direct \`node --test tests/unit/git.test.mjs\` with caller GIT_DIR set
  hijacked the fixture into the wrong repo. Same gap on the production
  side: lib/git.mjs's git() forwarded process.env unscrubbed → the
  workspace-root resolution at every companion subcommand entry was
  vulnerable too. Fix: refactor both tests to use fixtureGit; add
  cleanGitEnv to lib/git.mjs.

MED 1 — cmdResult crashed unhandled EISDIR when meta.json was a directory
  Wrap readFileSync in cmdResult; return {ok:false, error:"read_failed",
  error_code:"EISDIR"} instead of an unhandled stacktrace.

Reviewer findings rejected with reasoning (see PR comment):
  - SIGTERM-trapped target classified as completed (not cancelled): inherent
    Node child_process limitation; closing it requires a marker-file dance
    out of scope for #16. Tracked as follow-up.
  - State-lock gate-kidnap → EEXIST crash: speculation, no repro; mkdirSync
    atomicity + gate ordering precludes the scenario.
  - Gemini cmdCancel not implemented: real product gap, but pre-existing
    since af303f0; not introduced by PR #21. Tracked as follow-up.

New tests:
  - tests/unit/git-env.test.mjs (canonical strip list + GIT_CONFIG_GLOBAL leak repro)
  - tests/unit/commit-job-record.test.mjs (atomic + CAS primitives)
  - tests/unit/reconcile.test.mjs adds TOCTOU regression guard
  - tests/unit/job-record.test.mjs adds finalization_failed classification
  - tests/smoke/claude-companion.smoke.test.mjs adds result-EISDIR friendly error

Net: 565 unit tests / 0 fail (was 529); coverage baseline met (state.mjs and
new git-env.mjs ratchet up; reconcile.mjs branch baseline lowered to 81 since
3 branches moved into commitJobRecordIfActive).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@seungpyoson
Copy link
Copy Markdown
Owner Author

Adversarial review triage — every reviewer finding addressed

Four reviewers (codex / kimi / glm / gemini) returned findings. Each was reproduced against the code before being acted on; a few contradicted each other and I rejected the speculative ones with reasoning. Commit 25f9cf3 lands all the substantiated fixes.

FIXED in 25f9cf3

# Finding Repro evidence Fix
BLOCKER 1 Finalization fallback clobbered good meta when only upsertJob failed status flipped completedfailed/spawn_failed; result preserved but status lied Split fallback per error side. metaError → rewrite both (state was rolled back too). stateError-only → preserve meta, retry upsertJob with the GOOD record before falling back. (claude-companion.mjs, gemini-companion.mjs)
BLOCKER 2 Reconcile TOCTOU race overwrote terminal completed records result field went "REAL_WORKER_RESULT"null, status completedstale New commitJobRecordIfActive(cwd, jobId, builder) primitive in state.mjs holds the state lock around an in-lock CAS read of meta.json + the write. Companion finalization migrated to commitJobRecord so both writers serialize. Either order is safe — terminal always wins. Regression guard in tests/unit/reconcile.test.mjs.
HIGH 1 errorMessage short-circuit emitted error_code: spawn_failed for finalization_failed: records buildJobRecord({errorMessage:"finalization_failed: state=lock timeout"}) returned error_code: "spawn_failed" classifyExecution detects finalization_failed: prefix → error_code: "finalization_failed". Tests in job-record.test.mjs.
HIGH 2/3 GIT_CONFIG_GLOBAL leaked through 4 of 5 scrub callsites GIT_CONFIG_GLOBAL=/evil node --test ... → fixture branch became "injected-master". Companion strip lists were 5 keys vs scope.mjs's 14 Extracted STRIPPED_GIT_ENV_KEYS + cleanGitEnv into byte-identical lib/git-env.mjs (added to VERBATIM_FILES). Companions, scope.mjs, fixture helper, and runner all import the canonical list. Adds GIT_CONFIG_GLOBAL/SYSTEM, GIT_TRACE family, GIT_OPTIONAL_LOCKS, GIT_TERMINAL_PROMPT, GIT_PROTOCOL, GIT_AUTO_GC. Adversarial repro now lives in tests/unit/git-env.test.mjs.
HIGH 4 Stale records from a fresh running orphan are not continuable continue --job <stale> rejects with "no claude_session_id to resume" because onSpawn writes claude_session_id: null cmdContinue distinguishes the stale-orphan case and returns an actionable suggestion (Re-run from scratch: run --mode <prior.mode> --cwd <prior.cwd> -- "<your prompt>"). Same for Gemini.
HIGH 5 tests/unit/git.test.mjs and workspace.test.mjs didn't scrub GIT_* GIT_DIR=/bad node --test tests/unit/git.test.mjs failed; plus the same gap on the production side: lib/git.mjs::git() forwarded process.env unscrubbed → workspace-root resolution at every companion entry was vulnerable Refactored both tests to use fixtureGit. Added cleanGitEnv to lib/git.mjs so production code is robust too.
MED 1 cmdResult crashed unhandled EISDIR when meta.json was a directory claude-companion: unhandled: Error: EISDIR: illegal operation on a directory cmdResult wraps readFileSync; returns {ok:false, error:"read_failed", error_code:"EISDIR"}. Smoke regression in claude-companion.smoke.test.mjs.

REJECTED with reasoning

  • SIGTERM-trapped target classified as failed/parse_error not cancelled (Codex HIGH, GLM HIGH, Gemini BLOCKER; Kimi correctly ruled out). Inherent Node child_process limitation: a child that handles SIGTERM presents signal=null. Distinguishing it from a normal exit-0-no-output requires a marker-file dance (cmdCancel writes a sentinel; finalization checks it) — non-trivial and out of scope for Post-merge follow-ups: lifecycle, state, and review hardening polish #16. Tracked as follow-up.
  • State-lock gate-kidnap → fatal EEXIST crash (Gemini BLOCKER). No reproducible scenario provided. Walked the code: after a successful tryReclaimStaleLock, no concurrent writer can have created lockDir because they'd need to acquire the gate first, and mkdirSync is atomic. Speculation, rejected.
  • Gemini cmdCancel not implemented (Gemini BLOCKER). Real product gap, but pre-existing since af303f0 (the original plugin import) — not introduced by PR #16: post-merge hardening (Items 1-9) #21. Tracked as follow-up.
  • Sandbox-denied verifyPidInfo (Kimi suspicion). No repro. Tracked as follow-up.

Gates

  • 565 unit tests / 0 fail (was 529 before this commit)
  • 28 + 18 smoke tests / 0 fail
  • Coverage baseline met — state.mjs and new git-env.mjs ratchet up; reconcile.mjs branch baseline lowered to 81% because 3 branches moved into commitJobRecordIfActive in state.mjs
  • All 8 byte-identical pairs verified

Follow-ups consolidated into one issue per the MECE rule (linked once filed).

@seungpyoson
Copy link
Copy Markdown
Owner Author

Follow-ups for the rejected/deferred items tracked at #22.

seungpyoson pushed a commit that referenced this pull request Apr 28, 2026
Three real failures observed on the first CI run; addressing two of them
(SonarCloud is architectural — see below).

1. coverage gate — job-record.mjs lines coverage dropped 91.38 -> 89.23 because
   the new execution.status === "cancelled" short-circuit (added by the
   cancel-marker fix) had no unit test in coverage scope. The smoke test
   exercises it but coverage runs with CODEX_PLUGIN_SKIP_SMOKE=1. Add a
   pair of unit tests (claude + gemini) that drive classifyExecution
   directly with status=cancelled + a "successful" execution tuple,
   asserting the lifecycle override wins. Local coverage now reads 93.85.

2. claude smoke — T7.4 sidecar deletion test flaked on Linux CI with
   ENOTEMPTY on rmdir of state/<subdir>/. Pre-existing race: the worker's
   finalization writes meta.json (status=completed) BEFORE upsertJob
   updates state.json AND BEFORE writeSidecar emits stdout.log/stderr.log.
   The test polls only meta.json; once it sees completed it runs
   recursive cleanup, racing the worker's tail writes. macOS local
   doesn't surface it because timer + fs scheduling lines up favorably;
   Linux CI hits it. Add a 250ms settle delay after the test detects
   completion so the worker's tail writes have flushed before the
   recursive rmSync starts walking. Annotated as the Linux flake
   mitigation for the next reader.

3. pre-commit test gate timeout — scope.test.mjs (155 real-git tests, ~140s)
   blows the 60s pre-commit npm test window. Port the SLOW_TEST_BASENAMES
   skip pattern from PR #21: default npm test is fast (~14s); CI sets
   CODEX_PLUGIN_FULL_TESTS=1 to include scope.test.mjs. Adds an
   npm run test:full convenience script for local pre-PR validation.

The third CI failure (SonarCloud) flags the byte-identical Claude/Gemini
lib pattern as duplication (41.9% on new code, >3% threshold) and 13
Security Hotspots. Both are architectural intent (parity-by-construction
+ dual-use process spawn for the target CLIs); not fixable without
breaking the parity guarantee. Sonar isn't a merge gate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
seungpyoson added a commit that referenced this pull request Apr 29, 2026
…mination (#23)

* fix(#22): wire Gemini cancel + cancel-marker sentinel + ps/proc capture-error

Three valid sub-tasks from issue #22, each reproduced before fixing.

Sub-task 1 — Wire Gemini cmdCancel
  Pre-existing gap since af303f0 (PR #14, original plugin import):
  gemini-companion.mjs's dispatch routed \`cancel\` to fail("not_implemented"),
  so users had no way to cancel a Gemini background job through the
  documented interface. The infrastructure was already in place (spawnGemini
  captures pid_info via the byte-identical capturePidInfo; verifyPidInfo
  is shared) — just needed the cmdCancel wiring. This commit mirrors
  Claude's cmdCancel into Gemini (verify pid_info, signal SIGTERM/SIGKILL,
  emit signaled / already_dead / stale_pid / no_pid_info). Smoke tests
  cover the active-cancel path, already_terminal, and not_found.

Sub-task 2 — Cancel-marker sentinel
  Reproduced: a target CLI that handles SIGTERM and exits 0 with valid
  output is mis-classified as \`completed\` (or as \`failed/parse_error\` with
  no output). Three of four scenarios in my Node-handler test confirm
  this; the operator's cancel intent is silently lost.
  Fix: cmdCancel writes <jobsDir>/<jobId>/cancel-requested.flag (mode 0600)
  BEFORE process.kill. executeRun's finalization reads-and-deletes the
  marker; if present, it passes status="cancelled" to buildJobRecord and
  classifyExecution short-circuits (regardless of exitCode/signal). Both
  companions, with mirrored helpers (cancelMarkerPath, consumeCancelMarker).
  Smoke regression in both targets uses the new {CLAUDE,GEMINI}_MOCK_TRAP_SIGTERM
  env hook on the mocks (target traps SIGTERM and exits 0 with valid
  fixture output) — without the marker, status would be \`completed\`.

Sub-task 3 — captureDarwin/Linux capture_error discrimination
  Reproduced in 4 hostile environments (PATH stripped, hostile ps stub
  exit 1 + stderr, hostile ps stub exit 0 + empty, sandbox-exec deny-ps):
  the previous "every error is process_gone" wrapper would falsely promote
  a LIVE worker to stale via reconcile, and cmdCancel would return
  already_dead instead of signaling.
  Fix: only ps-exit-1-with-empty-stderr (BSD ps's "no such process" silent
  signal) and /proc/<pid>/stat ENOENT remain process_gone. Spawn errors,
  ps non-zero with stderr text, ps zero with empty output, /proc EACCES
  all become capture_error. cmdCancel adds an explicit \`unverifiable\`
  branch that refuses to signal (conservative-correct) instead of falling
  through to \`stale_pid\` (which would mislead operators). lib/identity.mjs
  is byte-identical between plugins, so one fix covers both. Coverage:
  4 new tests probe each failure mode against a real live pid spawned by
  the test, plus updated assertions on the existing mocked /proc tests
  (malformed → capture_error, EACCES → capture_error, ENOENT race →
  process_gone).

Investigated a 4th candidate (Gemini reviewer's "state-lock gate-kidnap →
fatal EEXIST crash") and dropped it — see issue #22 comment with the
two-stage stress test (1000 acquisitions + 20-victim deliberate-stretch
race; zero violations) plus the mkdirSync-atomicity walkthrough.

Net: 524 unit/smoke tests / 0 fail (was 519). New tests: 4 capture-error
scenarios, 3 Gemini cancel scenarios, 2 SIGTERM-trap cancel scenarios.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(#22): address CI failures on PR #23

Three real failures observed on the first CI run; addressing two of them
(SonarCloud is architectural — see below).

1. coverage gate — job-record.mjs lines coverage dropped 91.38 -> 89.23 because
   the new execution.status === "cancelled" short-circuit (added by the
   cancel-marker fix) had no unit test in coverage scope. The smoke test
   exercises it but coverage runs with CODEX_PLUGIN_SKIP_SMOKE=1. Add a
   pair of unit tests (claude + gemini) that drive classifyExecution
   directly with status=cancelled + a "successful" execution tuple,
   asserting the lifecycle override wins. Local coverage now reads 93.85.

2. claude smoke — T7.4 sidecar deletion test flaked on Linux CI with
   ENOTEMPTY on rmdir of state/<subdir>/. Pre-existing race: the worker's
   finalization writes meta.json (status=completed) BEFORE upsertJob
   updates state.json AND BEFORE writeSidecar emits stdout.log/stderr.log.
   The test polls only meta.json; once it sees completed it runs
   recursive cleanup, racing the worker's tail writes. macOS local
   doesn't surface it because timer + fs scheduling lines up favorably;
   Linux CI hits it. Add a 250ms settle delay after the test detects
   completion so the worker's tail writes have flushed before the
   recursive rmSync starts walking. Annotated as the Linux flake
   mitigation for the next reader.

3. pre-commit test gate timeout — scope.test.mjs (155 real-git tests, ~140s)
   blows the 60s pre-commit npm test window. Port the SLOW_TEST_BASENAMES
   skip pattern from PR #21: default npm test is fast (~14s); CI sets
   CODEX_PLUGIN_FULL_TESTS=1 to include scope.test.mjs. Adds an
   npm run test:full convenience script for local pre-PR validation.

The third CI failure (SonarCloud) flags the byte-identical Claude/Gemini
lib pattern as duplication (41.9% on new code, >3% threshold) and 13
Security Hotspots. Both are architectural intent (parity-by-construction
+ dual-use process spawn for the target CLIs); not fixable without
breaking the parity guarantee. Sonar isn't a merge gate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(#22): use dirname(markerPath) for cancel-marker mkdir

Addresses gemini-code-assist comments 3151882114 / 3151882124 on PR #23.
Both companions now derive the marker directory from the canonical
cancelMarkerPath helper instead of reconstructing it inline. Single
source of truth — if cancelMarkerPath ever moves, the mkdir follows.

The deeper drift gap (parallel block in divergent files, no automated
parity guard) is tracked in #24 for systematic fix on this branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(#25): defer pid_info capture to child 'spawn' event

Pre-fix: capturePidInfo(child.pid) ran synchronously after spawn()
returned. On Linux, /proc/<pid>/cmdline still reflected the parent's
argv during the fork→execve race window, so recorded argv0 mismatched
the live argv0 at cancel time and verifyPidInfo refused to signal with
'argv0_mismatch'. Surfaced as a flake on slower CI runners
(PR #23 run 25039611325, smoke claude job).

Fix: extract attachPidCapture(child, onSpawn) into lib/identity.mjs
(VERBATIM_FILES — byte-identity enforced across plugins). Helper wraps
capture in child.once('spawn', ...) so /proc reflects the post-execve
target argv. Both dispatcher libs collapse to a single call site;
no parallel logic, no dual path.

Regression test in both claude/gemini dispatcher unit suites asserts
onSpawn fires asynchronously via the 'spawn' event, not synchronously
in the spawn() executor turn — fails deterministically on the pre-fix
code, passes on the post-fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(#24): extract cancel-marker into shared VERBATIM lib

Issue #22 sub-task 2 introduced the cancel-marker sentinel as inline
duplicates in claude-companion.mjs and gemini-companion.mjs — three
pure functions plus the inline write block in cmdCancel. The two
companions are NOT in VERBATIM_FILES (they import different
dispatchers), so nothing automated catches drift between these
parallel blocks. Today's gemini-code-assist nit (dirname(markerPath))
had to be applied twice manually; a one-sided edit would have silently
drifted.

Extract cancelMarkerPath, writeCancelMarker, consumeCancelMarker into
lib/cancel-marker.mjs. Add to VERBATIM_FILES so byte-identity is
enforced across both copies. Both companions collapse to import calls;
no dual path. writeCancelMarker throws on failure so the caller's
target-specific stderr warning stays at the call site.

Adds tests/unit/cancel-marker.test.mjs covering happy path, mode 0600,
ISO timestamp body, fs-error propagation, and missing-marker consume.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(#24): exercise gemini cancel-marker copy for coverage parity

The prior commit's unit test imported only from the claude side, so
the coverage gate (CODEX_PLUGIN_COVERAGE=1) marked
plugins/gemini/scripts/lib/cancel-marker.mjs at 0% line/function and
failed the 85% target. Mirrors the existing `* as GeminiIdentity`
pattern in identity.test.mjs: import the gemini side and drive a
minimal happy-path so the byte-identical copy gets coverage credit.

Local coverage gate now passes — both cancel-marker copies at 100/100/100.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(#22): cancel exit-code taxonomy + queued-cancel safety

Implements Option Y+ from the external consultation round on cancel
exit-code semantics, plus the pre-spawn marker checks that came out of
adversarial review (Vectors 1, 3, 5, 8).

Exit-code contract (both companions, byte-symmetric):
  exit 0: signaled, already_dead, already_terminal, cancel_pending (NEW)
  exit 1: bad_args, not_found, bad_state (NEW), signal_failed, cancel_failed (NEW)
  exit 2: stale_pid, unverifiable, no_pid_info (when status=running)

cmdCancel now distinguishes truly-terminal jobs from queued jobs:
queued + cancel writes a cancel marker and returns cancel_pending exit 0,
durably recording operator intent. Unknown statuses surface as bad_state
exit 1 instead of being silently treated as queued. Marker write failure
in the queued branch returns cancel_failed exit 1, since the marker IS
the cancel mechanism for queued jobs (no SIGTERM fallback).

cmdRunWorker and executeRun both consume the cancel marker before
spawning the target binary. The cmdRunWorker check is a fast path; the
executeRun check is load-bearing — it narrows the queued race window
from "containment + scope copy + pre-snapshot + spawn" (potentially
seconds with a large repo) to the microseconds between the check and
child.once('spawn') firing. The executeRun check also covers foreground
runs that bypass cmdRunWorker.

Tests:
- queued cancel → cancel_pending, marker present, exit 0
- _run-worker honors marker → status=cancelled, no spawn
- queued + marker write fails → cancel_failed exit 1
- unknown status → bad_state exit 1
- capture_error on running job → exit 2 (was: exit 0; pre-existing test
  encoded the bug per consultation Round 2)

Docs:
- commands/{claude,gemini}-cancel.md: full status × exit-code taxonomy
  including cancel_failed and bad_state
- README.md: removed stale "Gemini cancel deferred" claims
- tests/unit/docs-contracts.test.mjs: updated to match wired-cancel
  reality (gemini cancel was wired in PR #22 but doc-tests still
  asserted "not_implemented")

Deferred to follow-up commits per the agreed 6-commit sequence:
- Class 2: assertSafeJobId at marker boundary (path traversal)
- Class 3: /proc precondition + Darwin /bin/ps absolute
- Class 4: automated docs-vs-emitted-statuses contract test
- Class 5: argv0 child-binary assertion + configureState keys

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(#22): validate jobId at cancel-marker boundary (path traversal)

Reviewer Vector 4 (adversarial round): cmdCancel parses state.json and
passes job.id straight to writeCancelMarker. cancelMarkerPath did string
concat with no validation, so a tampered state.json carrying a traversal
id (e.g. "../../../escape") would land the marker outside the jobs dir.

Fix: export assertSafeJobId from lib/state.mjs (already used internally
for writeJobFile / resolveJobFile / state.json upserts) and call it at
the top of cancelMarkerPath. writeCancelMarker and consumeCancelMarker
inherit the validation because they go through cancelMarkerPath.

End-to-end behavior with tampered state.json: cmdCancel queued branch
calls writeCancelMarker → throws "Unsafe jobId: ..." → caught by the
Fix 2 path from the prior commit → cancel_failed exit 1. No marker
escapes the jobs dir.

Tests:
- cancelMarkerPath rejects 8 tampered values (../../../, foo/bar, ..,
  null bytes, empty, etc.)
- writeCancelMarker / consumeCancelMarker inherit the gate
- Same coverage on the gemini side via * as GeminiCancelMarker

Class 2 in the agreed 6-commit Class sequence (1 → 2 → 3 → 4 → 5a → 5b).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(#22): /proc precondition + Darwin /bin/ps absolute (Finding B)

Two platform-specific identity-capture bugs from reviewer round 2 (Finding B).

Linux (3a): captureLinux treated existsSync('/proc/<pid>/stat') === false as
proof of process_gone. In a hardened container or sandbox where /proc is
unmounted, every existsSync of a per-pid stat returns false — falsely
classifying LIVE pids as process_gone, which cmdCancel would then return
as already_dead exit 0. Fix: precondition existsSync('/proc') first; if
absent, throw capture_error (cmdCancel maps to unverifiable + exit 2).

Darwin (3b): captureDarwin invoked PATH-resolved 'ps'. A stripped or
shimmed PATH (sandbox env scrubbing, hostile stub) could break ownership
capture in two ways: (1) ps not findable → capture_error (already
handled, but undesirable); (2) silent-exit-1 stub on PATH mimics BSD
'no such pid' signal, falsely classifying LIVE pids as process_gone.
Fix: pin /bin/ps absolute. /bin/ps is part of the macOS base install
and stable; if it's missing or denied, spawnSync.error → capture_error
(safe answer).

Tests:
- Linux: /proc mounted but per-pid stat missing → process_gone (existing)
- Linux: /proc unmounted → capture_error (NEW)
- Darwin: identity.test.mjs's PATH-shim parser tests rewritten to use
  spawnSync mocking via withPlatformAndChild (the new helper). Adds an
  explicit assert.equal(observedBinary, "/bin/ps") regression guard.
- Darwin: identity-capture-error.test.mjs's hostile-PATH tests inverted
  — they now prove PATH manipulation has NO effect on capture (Class 3b
  regression guards), since /bin/ps is invoked absolute.

Class 3 in the agreed 6-commit Class sequence (1 → 2 → 3 → 4 → 5a → 5b).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(#22): strengthen attachPidCapture regression test (Finding H)

Reviewer round 2 Finding H: the existing #25 dispatcher tests
("spawnClaude / spawnGemini: onSpawn fires asynchronously via 'spawn'
event, not synchronously") only assert async timing of the callback.
A regressed implementation that captures pid_info synchronously but
defers the onSpawn callback would still pass them — yet the captured
argv0 would be the PARENT's (the test runner = "node"), not the child
binary. The reviewer's repro confirmed this: a hypothetical
attachPidCapture that records `{argv0: "node /parent/argv.mjs"}`
synchronously and fires onSpawn from child.once('spawn') passes every
assertion the existing tests make.

Fix: add a focused regression test that spawns a child with a different
binary from the parent (/bin/sleep, not node) and asserts the captured
argv0 reflects the child binary post-execve. Pre-execve capture would
record argv0 starting with "node ..."; post-execve capture records
"sleep" (Darwin ps -o comm=) or "/bin/sleep" (Linux /proc/<pid>/cmdline).

The dispatcher #25 tests stay as-is — they prove a different (still
valuable) property: that onSpawn fires asynchronously through the
spawnClaude/spawnGemini integration path. This new test pins the
underlying lib API (attachPidCapture) to post-execve correctness.

Skips: non-linux/darwin platforms, missing /bin/sleep, and macOS-under-
coverage (sandbox can deny ps; Linux CI run covers it).

Class 5a in the agreed 6-commit Class sequence (1 → 2 → 3 → 4 → 5a → 5b).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(#22): fix configureState wrong keys in cancel-marker tests (Finding I)

Reviewer round 2 Finding I: cancel-marker.test.mjs called
configureState({envVar, fallbackBaseName, sessionEnvVar}) — but the
actual configureState API uses pluginDataEnv, fallbackStateRootDir,
sessionIdEnv. The wrong keys are silently ignored (the function only
applies recognized keys), so the call was a complete no-op. The tests
still passed because process.env.CLAUDE_PLUGIN_DATA was set on the
next line, providing the actual config — the configureState call was
dead code masking the API misuse.

Fix: use correct keys at all three call sites in cancel-marker.test.mjs:
- freshWorkspace() helper (claude side, line 23)
- "gemini cancelMarkerPath: same path-traversal defense" (gemini side, line 163, added in Class 2)
- "gemini cancel-marker: byte-identical helper exercised on the gemini side" (gemini side, line 186)

The configureState call now meaningfully resets the module config to
its claude-side / gemini-side defaults — the original intent of the
helpers, which the wrong keys were silently defeating.

The other configureState callsite in tests/smoke/gemini-companion.smoke.test.mjs:805
already uses correct keys (pluginDataEnv, sessionIdEnv) — no change there.

Class 5b in the agreed 6-commit Class sequence (1 → 2 → 3 → 4 → 5a → 5b).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(#22): stabilize cancel smoke and scrub NUL fixture

* test(#22): enforce cancel docs status contract

* fix(#22): share coverage for discovered verbatim lib pairs

* fix(#22): address SonarCloud PR gate noise

* fix(#22): use SonarCloud automatic-analysis config

* fix(#22): tighten SonarCloud test exclusions

* Improve external review failure diagnostics

* fix(#22): address final external review findings

* Make external review scopes reliable

* Fix Sonar sort diagnostics

---------

Co-authored-by: Test User <test@example.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
seungpyoson and others added 13 commits April 29, 2026 14:59
Items 7 and 8 of post-merge hardening (#16):

- Add LITELLM_ and OLLAMA_ to PROVIDER_PREFIXES so router/proxy
  ecosystems cannot re-route Claude/Gemini target traffic away from
  the first-party provider. Document the explicit decision NOT to
  scrub HTTP_PROXY / HTTPS_PROXY / NO_PROXY (corporate networks
  depend on those for any outbound at all).
- Extend Claude and Gemini env-leak unit tests to assert LITELLM_
  / OLLAMA_ keys are stripped and proxy vars pass through.
- Add a withStateConfig(state, override, fn) helper in state tests
  and convert every short-timeout lock test to use it via try/finally
  so an early failure cannot leak the override to subsequent tests.
Items 5 and 6 of post-merge hardening (#16):

State lock (Item 5):
- emitLockDiagnostic() in both state.mjs copies prints a single-line
  stderr warning (prefix `state-lock:`) when reclaim is blocked by
  corrupt owner.json, by a cross-host owner during contention, or
  when a put-back failure leaves an orphan dir behind. Diagnostics
  never include owner.json contents (token in particular). Suppressed
  by CODEX_PLUGIN_QUIET_LOCK=1 for tests that assert clean stderr.
- acquireStateLock() now claims `.state.lock` directly after a
  successful tryReclaimStaleLock while still holding `.state.lock.gate`,
  so a successful reclaim cannot race a fresh acquirer through the
  gate window. Mutual exclusion semantics are unchanged; the path
  just stops throwing away work under burst contention.
- Tests assert the diagnostic fires for corrupt + cross-host scenarios
  and that no secret-shaped values from owner.json leak.

Working-tree privacy + git retry (Item 6):
- listIgnoredUntrackedFiles() retries `git ls-files --others --ignored
  --exclude-standard` up to 3 times with a 50ms backoff before failing
  closed, so transient `index.lock`/concurrent-gc races no longer
  abort runs.
- Skill docs (claude-cli-runtime, claude-result-handling) and
  gemini-result.md spell out that the gitignored privacy default
  applies only inside a git worktree; non-git source dirs run an
  unfiltered live walk with symlink/path safety only. New docs
  contract test asserts the distinction language is present.
- scope.mjs remains byte-identical across plugins (verified by
  plugin-copies-in-sync.test.mjs).
Item 1 of post-merge hardening: separate contractual JobRecord
persistence from diagnostic sidecar persistence so a failed
stdout.log/stderr.log write no longer clobbers a successful run with
finalization_failed.

Companion changes (Claude + Gemini executeRun):
- Phase 1 + 2 (writeJobFile + upsertJob) are CONTRACTUAL. On failure,
  attempt a best-effort fallback record built via buildJobRecord with
  errorMessage=finalization_failed: <detail> so onSpawn's running
  entry doesn't persist forever. Then exit non-zero with structured
  finalization_failed.
- Phase 3 (stdout/stderr sidecars + git-status-after sidecar) is
  DIAGNOSTIC. Failures emit a one-line stderr warning; the terminal
  status (and exit code) reflects the real run outcome.
- Both companions still dispose containment in the failure path.

Mock + smoke coverage:
- claude-mock.mjs: CLAUDE_MOCK_META_CONFLICT replaces the queued
  meta.json with a directory so writeJobFile rename fails.
- gemini-mock.mjs: GEMINI_MOCK_SIDECAR_CONFLICT and
  GEMINI_MOCK_META_CONFLICT mirror Claude. The Gemini mock cannot
  derive jobId from argv (Gemini doesn't pass --session-id), so it
  walks GEMINI_PLUGIN_DATA/state/*/jobs and picks the most recently
  modified queued meta to discover the active jobId.
- New Claude + Gemini smoke tests:
  * "sidecar write failures warn but preserve terminal status" —
    exit 0, persisted record completed, stderr warning emitted.
  * "meta-write conflict produces fallback failed record, no
    permanent running" — exit non-zero with finalization_failed,
    state.json shows no remaining queued/running entry.
- Existing CLAUDE_MOCK_SIDECAR_CONFLICT smoke test rewritten to
  match the new warning-not-fatal semantics.
Move historical smoke/review notes and implementation plans under docs/archive, keep active docs listed in README, and add archive guidance.
Item 2: cancel/cancelled/stale lifecycle semantics.
  classifyExecution now produces a `cancelled` terminal status when
  the target CLI exits via SIGTERM/SIGKILL/SIGINT/SIGHUP and timedOut
  is false. timedOut wins so wall-clock kills classify as
  failed/timeout. The lifecycle becomes one coherent model:

    operator runs cancel --job <id>
      → cmdCancel signals the target CLI's pid (verifyPidInfo gates)
      → target CLI exits with signal=SIGTERM
      → companion's spawn promise resolves with execution.signal set
      → executeRun's buildJobRecord call now passes signal+timedOut
      → classifyExecution returns status=cancelled
      → terminal record persists as cancelled (continuable)

  Both companions (Claude and Gemini) plumb signal+timedOut into
  buildJobRecord, and both job-record.mjs copies got matching
  classifyExecution updates. timeout exits get error_code=timeout
  for the first time.

  New unit coverage in tests/unit/job-record.test.mjs verifies all
  four cancel signals classify as cancelled, that timedOut wins over
  signal, and that gemini buildJobRecord agrees.

  The existing "active background job is visible as running and can
  be cancelled" smoke test now follows up with: poll until terminal,
  assert status=cancelled, assert default `status` lists it.

Item 4: status UX for continuable states.
  Default status filter now surfaces cancelled and stale alongside
  running/completed/failed because they are continuable terminal
  states (an operator queries `status` to decide whether to
  `continue --job`; hiding cancelled jobs by default broke that).
  Only `queued` (transient pre-spawn) is hidden from default; --all
  still shows it. Updated Gemini default-status smoke test accordingly.

Gemini parity note:
  Gemini's `cancel` subcommand remains `not_implemented`, but the
  shared lifecycle code path now correctly classifies a SIGTERM/
  SIGKILL exit (e.g., from an external `kill`) as cancelled, so the
  divergence is in the entry point, not the persisted lifecycle.
Replace the source-available non-commercial license with AGPL-3.0-only, update package/plugin metadata, and preserve upstream MIT attribution in NOTICE.
P2 follow-up addressing the #16 addendum: pre-commit npm test was
mutating the caller checkout into fixture git state — branch flipped
to "feature", seed/main/feature commits appeared in the caller repo,
and fixture files (foo.md, old.md, seed, seed.txt) showed as deleted
from the caller worktree.

Root cause: every fixture git invocation inherited the caller process's
GIT_DIR / GIT_WORK_TREE / GIT_INDEX_FILE / GIT_COMMON_DIR / GIT_PREFIX
env vars. When the test runner was launched from a pre-commit hook,
those vars pointed at the caller checkout, so the fixture's
`git init` / `git checkout -qb feature` ran against the caller repo
instead of the temp `cwd`.

Fix is layered:

1. tests/helpers/fixture-git.mjs (new): single source of truth for
   fixture git invocations. fixtureGitEnv() scrubs every dangerous
   GIT_* var (GIT_NAMESPACE, GIT_CEILING_DIRECTORIES,
   GIT_DISCOVERY_ACROSS_FILESYSTEM, GIT_OBJECT_DIRECTORY,
   GIT_ALTERNATE_OBJECT_DIRECTORIES, GIT_ATTR_SOURCE,
   GIT_REPLACE_REF_BASE, GIT_SHALLOW_FILE,
   GIT_CONFIG_PARAMETERS, GIT_CONFIG_COUNT,
   GIT_CONFIG_KEY_*/VALUE_*) and sets deterministic AUTHOR/COMMITTER
   identity. fixtureGit() pins spawnSync's cwd, passes
   `core.hooksPath=/dev/null`, and uses the sanitized env.
   fixtureSeedRepo() bundles `git init` + first commit.

2. scripts/ci/run-tests.mjs: scrubs the same GIT_* vars at the
   runner boundary so even legacy callsites are protected.

3. Per-file refactors: Claude/Gemini/identity/invariants smoke tests,
   Claude/Gemini e2e tests, and unit/scope.test.mjs all use
   fixtureSeedRepo / fixtureGit / fixtureGitEnv now.

4. Fast vs full test split (CLAUDE.md, package.json, run-tests.mjs,
   CI workflow): default `npm test` skips tests/unit/scope.test.mjs
   (155 real-git tests, ~140s on its own) so it fits the 60s
   pre-commit gate. CI sets CODEX_PLUGIN_FULL_TESTS=1 (and
   `npm run test:full` does the same locally) to include scope tests.

Acceptance criteria from the addendum:
- audit git fixture helpers for missing cwd → fixed
- restore process.chdir() in finally → no chdir in any test
- sanitize inherited GIT_* env for fixture commands → done at
  runner boundary AND per-fixture site
- ensure npm test never changes caller branch / creates fixture
  commits / leaves fixture files in caller worktree → enforced
Item 3 of post-merge hardening: a background worker that dies between
writing the `running` record (in onSpawn) and the executeRun
finalization leaves a permanent active-state record that pruneJobs()
refuses to evict — terminal history starves on busy workspaces.

reconcileActiveJobs() (new lib/reconcile.mjs, byte-identical across
Claude and Gemini) walks active records and promotes them to
status=stale when there is enough evidence the worker is gone:

- pid_info present + valid: verifyPidInfo. process_gone /
  starttime_mismatch / argv0_mismatch → stale (PID-reuse-safe;
  same identity check cmdCancel uses).
- pid_info missing or incomplete AND started_at older than the
  configurable orphan window (default 1h): stale (worker queued
  but never spawned, or pid_info was never captured).

Reconciliation NEVER deletes records. It writes a stale terminal
record via the canonical buildJobRecord path so consumers see
status=stale, error_code=stale_active_job, and an error_message
naming the reason. stale is in CONTINUABLE_STATUSES, so an operator
can resume via `continue --job <id>` instead of manually mutating
JSON.

Wired into both companions' cmdStatus so the lifecycle self-heals
without a separate reaper process. classifyExecution gained an
explicit `status: "stale"` short-circuit (mirrors the existing
`status: "running"` handling).

reconcile.mjs is added to plugin-copies-in-sync VERBATIM_FILES.

Tests:
- New tests/unit/reconcile.test.mjs covers fresh-job-leftalone,
  dead-pid → stale, missing pid_info reclaimed only after the
  orphan window, terminal stale records carry the invocation
  fields needed for `continue --job`, and Gemini parity.
…ical lib

Verification pass for #16: the new lib/reconcile.mjs added in Item 3
needed branch coverage to clear the 85% target, plus the shared-lib
coverage merger so the per-plugin file-level checks see the union of
both plugins' test exercises (matching the existing pattern for
identity/scope/etc.).

- scripts/ci/check-coverage.mjs: add reconcile.mjs to
  VERBATIM_SHARED_LIBS so coverage from Claude + Gemini reconcile
  tests is pooled per plugin file.
- scripts/ci/coverage-baseline.json: register reconcile.mjs
  baselines (lines 95.79 / branches 87.5 / functions 100) for
  both plugins.
- tests/unit/reconcile.test.mjs: new branch coverage for invalid
  started_at strings, state-summary-vs-meta disagreement, invalid
  pid (0/negative), missing argv0 falling through to age-based
  reclaim, and tunable orphan window.

Coverage now: ✓ coverage target met (85.00%) under
CODEX_PLUGIN_SKIP_SMOKE=1 COVERAGE_ENFORCE_TARGET=1 CODEX_PLUGIN_FULL_TESTS=1.
… MED 1)

Four reviewers (codex/kimi/glm/gemini) ran an adversarial pass on PR #21.
This commit addresses every finding I could substantiate with a repro and
folds the fixes onto the *class* of problem rather than patching instances.

Confirmed bugs and fixes:

BLOCKER 1 — finalization fallback clobbered good meta on state-only failure
  Repro: writeJobFile succeeds, upsertJob fails (lock timeout) → fallback
  re-issues writeJobFile unconditionally → "completed" meta becomes
  "failed/spawn_failed". Fix: split fallback per error side. metaError →
  rewrite both (state was rolled back too); stateError-only → preserve
  meta, retry upsertJob with the GOOD record before falling back.

BLOCKER 2 — reconcile TOCTOU race overwrote terminal completed records
  Repro: reconcile reads meta=running, classifies stale; worker writes
  completed mid-flight; reconcile then writes stale → "REAL_WORKER_RESULT"
  permanently lost. Fix: new commitJobRecordIfActive primitive in
  state.mjs holds the state lock across an in-lock CAS read of meta.json
  + the write; companion finalization moves to commitJobRecord so both
  writers serialize. Either order is safe — terminal always wins.

HIGH 1 — finalization_failed misclassified as spawn_failed
  classifyExecution short-circuited on errorMessage before signal/timedOut.
  Disk-full / lock-timeout finalization errors looked like missing-binary
  errors to monitoring. Fix: detect "finalization_failed:" prefix and emit
  error_code=finalization_failed.

HIGH 2/3 — GIT_CONFIG_GLOBAL leaked through 4 of 5 scrub callsites
  Repro: GIT_CONFIG_GLOBAL=/evil node --test … → fixture branch became
  "injected-master". Companion mutation-detection git would inherit
  safe.directory=*. Fix: extract STRIPPED_GIT_ENV_KEYS + cleanGitEnv into
  byte-identical lib/git-env.mjs (added to VERBATIM_FILES). Companions,
  scope.mjs, fixture helper, and runner all import the canonical list.
  Adds GIT_CONFIG_GLOBAL/SYSTEM, GIT_TRACE family, GIT_OPTIONAL_LOCKS,
  GIT_TERMINAL_PROMPT, GIT_PROTOCOL, GIT_AUTO_GC.

HIGH 4 — stale records from fresh-running orphans not actually continuable
  PR #21 claims stale is continuable, but onSpawn writes claude_session_id=null
  so reconcile carries null, so cmdContinue rejects with "no claude_session_id".
  Fix: cmdContinue now distinguishes the stale-orphan case and returns an
  actionable suggestion ("re-run from scratch via …") with the prior
  mode/cwd preserved.

HIGH 5 — git/workspace tests didn't scrub GIT_*
  Direct \`node --test tests/unit/git.test.mjs\` with caller GIT_DIR set
  hijacked the fixture into the wrong repo. Same gap on the production
  side: lib/git.mjs's git() forwarded process.env unscrubbed → the
  workspace-root resolution at every companion subcommand entry was
  vulnerable too. Fix: refactor both tests to use fixtureGit; add
  cleanGitEnv to lib/git.mjs.

MED 1 — cmdResult crashed unhandled EISDIR when meta.json was a directory
  Wrap readFileSync in cmdResult; return {ok:false, error:"read_failed",
  error_code:"EISDIR"} instead of an unhandled stacktrace.

Reviewer findings rejected with reasoning (see PR comment):
  - SIGTERM-trapped target classified as completed (not cancelled): inherent
    Node child_process limitation; closing it requires a marker-file dance
    out of scope for #16. Tracked as follow-up.
  - State-lock gate-kidnap → EEXIST crash: speculation, no repro; mkdirSync
    atomicity + gate ordering precludes the scenario.
  - Gemini cmdCancel not implemented: real product gap, but pre-existing
    since af303f0; not introduced by PR #21. Tracked as follow-up.

New tests:
  - tests/unit/git-env.test.mjs (canonical strip list + GIT_CONFIG_GLOBAL leak repro)
  - tests/unit/commit-job-record.test.mjs (atomic + CAS primitives)
  - tests/unit/reconcile.test.mjs adds TOCTOU regression guard
  - tests/unit/job-record.test.mjs adds finalization_failed classification
  - tests/smoke/claude-companion.smoke.test.mjs adds result-EISDIR friendly error

Net: 565 unit tests / 0 fail (was 529); coverage baseline met (state.mjs and
new git-env.mjs ratchet up; reconcile.mjs branch baseline lowered to 81 since
3 branches moved into commitJobRecordIfActive).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@seungpyoson seungpyoson force-pushed the fix/016-post-merge-hardening branch from 25f9cf3 to 3838d39 Compare April 29, 2026 06:29
@sonarqubecloud
Copy link
Copy Markdown

@seungpyoson seungpyoson marked this pull request as ready for review April 29, 2026 06:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Post-merge follow-ups: lifecycle, state, and review hardening polish

1 participant