ci: fix cli-e2e flake via job isolation + parallelize chaos sims#626
Conversation
The recurring CI red was always the same test (`groups_leave_publishes_self_remove`), and it was load-induced rather than a logic bug: it passes in ~13s in isolation but blows its 90s convergence-poll timeout under CI. Root cause is CPU oversubscription — the subprocess-heavy `darkmatter-cli::cli` suite ran inside the 4-way unit matrix, where each test's in-process relay + cold `dm` subprocesses competed with the full-speed unit tests for 4 vCPUs, starving worker startup (issue #103). Separately, CI's critical path was the 9.7-min Simulator job, almost entirely two chaos-family tests that ran 22-24 independent 20-client simulations in a serial `for … .await` loop. Changes: - ci.yml: move `darkmatter-cli::cli` out of the unit matrix into its own dedicated `cli-e2e` job (capped at 2 threads by the `cli-e2e` test-group), so its subprocess fan-out never starves the fast unit tests. Run the nextest jobs under a new `ci` profile. - nextest.toml: add a `ci` profile with retries (1 global, 2 for the cli e2e suite) and a hung-test `slow-timeout` backstop as a safety net for transient infra hiccups; raise `leak-timeout` to 500ms so the multi-threaded chaos tests are not falsely flagged "leaky" when they wind down just past the 100ms default under load. - canonical_scenarios.rs: parallelize both chaos-family tests with a `for_each_chaos_case_concurrently` helper (one tokio task per independent case; case panics re-raised with their original message). Measured: chaos tests 176s/174s -> ~55s/53s each; full cli suite 73/73 green in 165s in isolation; previously-flaky test stable. `just fast-ci` green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
WalkthroughThe PR updates nextest configuration, CI workflow commands, and conformance simulator tests. It adds a ci profile with retries and CLI-specific overrides, splits CLI E2E execution into a dedicated CI job, and runs chaos-case simulator checks concurrently. ChangesNextest and CI orchestration
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
|
Ready to review this PR? Stage has broken it down into 3 individual chapters for you:
Chapters generated by Stage for commit ecf3442 on Jun 26, 2026 5:02pm UTC. |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
.github/workflows/ci.yml (1)
145-176: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick winWire the new
cli-e2ejob into required checks.Splitting the CLI suite into its own job only protects
mainifcli-e2eis part of the merge gate. Confirm it’s added to branch-protection required status checks (and any aggregatingneeds:gate job), otherwise a CLI e2e failure will no longer block merges.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/ci.yml around lines 145 - 176, The new cli-e2e workflow job is not yet guaranteed to block merges if it is missing from the required checks. Update the CI gating so cli-e2e is included in the branch-protection required status checks and in any aggregate needs-based gate job, using the cli-e2e job name as the stable identifier to wire it in alongside the existing CLI test jobs.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In @.github/workflows/ci.yml:
- Around line 145-176: The new cli-e2e workflow job is not yet guaranteed to
block merges if it is missing from the required checks. Update the CI gating so
cli-e2e is included in the branch-protection required status checks and in any
aggregate needs-based gate job, using the cli-e2e job name as the stable
identifier to wire it in alongside the existing CLI test jobs.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: fabfc3b3-f1a7-4f8f-8239-9103a266ac42
📒 Files selected for processing (3)
.config/nextest.toml.github/workflows/ci.ymlcrates/cgka-conformance-simulator/tests/canonical_scenarios.rs
|
Re: CodeRabbit's nitpick about wiring Verified against the repo's No code change: whether CI should block merges is a repo-policy decision that lives in the ruleset, not the workflow YAML. If you do want to start gating, add these contexts to a 🤖 Addressed by Claude Code |
Why
Two test-suite problems, tackled together.
Flakiness. Every recent red CI run was the same test —
groups_leave_publishes_self_remove(crates/cli/tests/cli.rs) — and it was load-induced, not a logic bug. It passes in ~13s in isolation but blows its 90s convergence-poll timeout under CI. Root cause is CPU oversubscription: the subprocess-heavydarkmatter-cli::clisuite ran inside the 4-way unit matrix, where each test's in-process relay + colddmsubprocesses competed with the full-speed unit tests for 4 vCPUs, starving account-worker startup (issue marmot-protocol/mdk#642).Slowness. CI's critical path was the ~9.7-min Simulator job, almost entirely two chaos-family tests that ran 22–24 independent 20-client simulations in a serial
for … .awaitloop, leaving every core but one idle.What changed
.github/workflows/ci.yml— pulldarkmatter-cli::cliout of the unit matrix into its own dedicatedcli-e2ejob (capped at 2 threads via thecli-e2etest-group), so its subprocess fan-out never starves the fast unit tests. Run the nextest jobs under a newciprofile..config/nextest.toml— add aciprofile with retries (1 global, 2 for the cli e2e suite) + a hung-testslow-timeoutbackstop as a safety net for transient infra hiccups; raiseleak-timeoutto 500ms so the multi-threaded chaos tests aren't falsely flagged "leaky" when they wind down just past the 100ms default under load.crates/cgka-conformance-simulator/tests/canonical_scenarios.rs— parallelize both chaos-family tests via afor_each_chaos_case_concurrentlyhelper (onetokio::spawnper independent case; case panics re-raised with their original message/backtrace intact).Results
groups_leave_publishes_self_removeconvergence_chaos…generates_specsconvergence_chaos…seed_changesjust fast-ciis green (fmt + check + clippy, default and otlp).Reviewer notes
cli-e2ejob keeps the existingcli-e2etest-group cap (2 threads); the real-relay test self-skips there and still has its ownRelay CLI E2Ejob with the docker relay stack.dmddaemon: the flake is already resolved by isolation + retries, the suite is off the critical path, and a daemon would race the background convergence timer the per-process design intentionally avoids — risking new flakiness. If the cli suite ever becomes the critical path, partitioning that job (pure CI config) is the safe next lever.🤖 Generated with Claude Code
Summary by CodeRabbit
Bug Fixes
Tests