Skip to content

ci: fix cli-e2e flake via job isolation + parallelize chaos sims#626

Merged
erskingardner merged 1 commit into
masterfrom
claude/strange-matsumoto-35683a
Jun 26, 2026
Merged

ci: fix cli-e2e flake via job isolation + parallelize chaos sims#626
erskingardner merged 1 commit into
masterfrom
claude/strange-matsumoto-35683a

Conversation

@erskingardner

@erskingardner erskingardner commented Jun 26, 2026

Copy link
Copy Markdown
Member

Why

Two test-suite problems, tackled together.

Flakiness. Every recent red CI run was the same test — groups_leave_publishes_self_remove (crates/cli/tests/cli.rs) — and it was load-induced, not a logic bug. It passes in ~13s in isolation but blows its 90s convergence-poll timeout under CI. Root cause is CPU oversubscription: the subprocess-heavy darkmatter-cli::cli suite ran inside the 4-way unit matrix, where each test's in-process relay + cold dm subprocesses competed with the full-speed unit tests for 4 vCPUs, starving account-worker startup (issue marmot-protocol/mdk#642).

Slowness. CI's critical path was the ~9.7-min Simulator job, almost entirely two chaos-family tests that ran 22–24 independent 20-client simulations in a serial for … .await loop, leaving every core but one idle.

What changed

  • .github/workflows/ci.yml — pull darkmatter-cli::cli out of the unit matrix into its own dedicated cli-e2e job (capped at 2 threads via the cli-e2e test-group), so its subprocess fan-out never starves the fast unit tests. Run the nextest jobs under a new ci profile.
  • .config/nextest.toml — add a ci profile with retries (1 global, 2 for the cli e2e suite) + a hung-test slow-timeout backstop as a safety net for transient infra hiccups; raise leak-timeout to 500ms so the multi-threaded chaos tests aren't falsely flagged "leaky" when they wind down just past the 100ms default under load.
  • crates/cgka-conformance-simulator/tests/canonical_scenarios.rs — parallelize both chaos-family tests via a for_each_chaos_case_concurrently helper (one tokio::spawn per independent case; case panics re-raised with their original message/backtrace intact).

Results

Before After
groups_leave_publishes_self_remove timeout/FAIL under load 13s, passes (isolated + retries)
convergence_chaos…generates_specs 176s ~55s
convergence_chaos…seed_changes 174s ~53s
Full cli e2e suite flaky in matrix 73/73 green, 165s, off critical path

just fast-ci is green (fmt + check + clippy, default and otlp).

Reviewer notes

  • The dedicated cli-e2e job keeps the existing cli-e2e test-group cap (2 threads); the real-relay test self-skips there and still has its own Relay CLI E2E job with the docker relay stack.
  • Retries are a safety net, not the fix — a retried-then-passed test is reported FLAKY, so masking still surfaces. The real fix is the job isolation.
  • I deliberately did not rewrite the cli poll helpers to drive convergence through a long-lived dmd daemon: the flake is already resolved by isolation + retries, the suite is off the critical path, and a daemon would race the background convergence timer the per-process design intentionally avoids — risking new flakiness. If the cli suite ever becomes the critical path, partitioning that job (pure CI config) is the safe next lever.

🤖 Generated with Claude Code


Open in Stage

Summary by CodeRabbit

  • Bug Fixes

    • Improved CI stability for subprocess-heavy CLI tests by reducing “startup timed out” flakes and retrying hung runs sooner.
    • Split CLI end-to-end tests into a dedicated job so they no longer slow down or interfere with the main test run.
    • Updated test execution settings for better handling of long-running and flaky scenarios in CI.
  • Tests

    • Made simulator scenario checks run concurrently, improving test coverage execution efficiency and surfacing failures more consistently.

The recurring CI red was always the same test
(`groups_leave_publishes_self_remove`), and it was load-induced rather
than a logic bug: it passes in ~13s in isolation but blows its 90s
convergence-poll timeout under CI. Root cause is CPU oversubscription —
the subprocess-heavy `darkmatter-cli::cli` suite ran inside the 4-way
unit matrix, where each test's in-process relay + cold `dm` subprocesses
competed with the full-speed unit tests for 4 vCPUs, starving worker
startup (issue #103).

Separately, CI's critical path was the 9.7-min Simulator job, almost
entirely two chaos-family tests that ran 22-24 independent 20-client
simulations in a serial `for … .await` loop.

Changes:

- ci.yml: move `darkmatter-cli::cli` out of the unit matrix into its own
  dedicated `cli-e2e` job (capped at 2 threads by the `cli-e2e`
  test-group), so its subprocess fan-out never starves the fast unit
  tests. Run the nextest jobs under a new `ci` profile.
- nextest.toml: add a `ci` profile with retries (1 global, 2 for the cli
  e2e suite) and a hung-test `slow-timeout` backstop as a safety net for
  transient infra hiccups; raise `leak-timeout` to 500ms so the
  multi-threaded chaos tests are not falsely flagged "leaky" when they
  wind down just past the 100ms default under load.
- canonical_scenarios.rs: parallelize both chaos-family tests with a
  `for_each_chaos_case_concurrently` helper (one tokio task per
  independent case; case panics re-raised with their original message).

Measured: chaos tests 176s/174s -> ~55s/53s each; full cli suite 73/73
green in 165s in isolation; previously-flaky test stable. `just fast-ci`
green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jun 26, 2026

Copy link
Copy Markdown

Review Change Stack

Walkthrough

The PR updates nextest configuration, CI workflow commands, and conformance simulator tests. It adds a ci profile with retries and CLI-specific overrides, splits CLI E2E execution into a dedicated CI job, and runs chaos-case simulator checks concurrently.

Changes

Nextest and CI orchestration

Layer / File(s) Summary
Nextest timing and retry profile
.config/nextest.toml
profile.default adds leak-timeout, and profile.ci sets retries plus a darkmatter-cli override for CLI binary tests.
CI job routing
.github/workflows/ci.yml
The workspace test excludes darkmatter-cli::cli, the new cli-e2e job runs that suite, and the relay-e2e and conformance jobs use --profile ci.
Concurrent chaos-case simulator tests
crates/cgka-conformance-simulator/tests/canonical_scenarios.rs
Adds for_each_chaos_case_concurrently, switches the two chaos-family tests to it, and uses the multi-thread Tokio runtime.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the two main changes: isolating the CLI E2E job to fix CI flakes and parallelizing chaos simulations.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch claude/strange-matsumoto-35683a

Comment @coderabbitai help to get the list of available commands.

@stage-review

stage-review Bot commented Jun 26, 2026

Copy link
Copy Markdown

Ready to review this PR? Stage has broken it down into 3 individual chapters for you:

Title
1 Configure nextest CI profile and timeouts
2 Isolate CLI e2e tests into dedicated job
3 Parallelize chaos family simulator tests
Open in Stage

Chapters generated by Stage for commit ecf3442 on Jun 26, 2026 5:02pm UTC.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
.github/workflows/ci.yml (1)

145-176: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Wire the new cli-e2e job into required checks.

Splitting the CLI suite into its own job only protects main if cli-e2e is part of the merge gate. Confirm it’s added to branch-protection required status checks (and any aggregating needs: gate job), otherwise a CLI e2e failure will no longer block merges.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/ci.yml around lines 145 - 176, The new cli-e2e workflow
job is not yet guaranteed to block merges if it is missing from the required
checks. Update the CI gating so cli-e2e is included in the branch-protection
required status checks and in any aggregate needs-based gate job, using the
cli-e2e job name as the stable identifier to wire it in alongside the existing
CLI test jobs.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In @.github/workflows/ci.yml:
- Around line 145-176: The new cli-e2e workflow job is not yet guaranteed to
block merges if it is missing from the required checks. Update the CI gating so
cli-e2e is included in the branch-protection required status checks and in any
aggregate needs-based gate job, using the cli-e2e job name as the stable
identifier to wire it in alongside the existing CLI test jobs.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: fabfc3b3-f1a7-4f8f-8239-9103a266ac42

📥 Commits

Reviewing files that changed from the base of the PR and between f62b645 and ecf3442.

📒 Files selected for processing (3)
  • .config/nextest.toml
  • .github/workflows/ci.yml
  • crates/cgka-conformance-simulator/tests/canonical_scenarios.rs

@erskingardner

Copy link
Copy Markdown
Member Author

Re: CodeRabbit's nitpick about wiring cli-e2e into required checks —

Verified against the repo's master ruleset ("Master Safe"): it enforces pull_request, required_linear_history, non_fast_forward, and deletion, but has no required_status_checks rule. So no CI job gates merges today — not the new cli-e2e job, and not the Rust tests (N/4) jobs the CLI suite used to live in. The split therefore introduces no regression in merge-gating; there's no required-checks list to fall out of sync with.

No code change: whether CI should block merges is a repo-policy decision that lives in the ruleset, not the workflow YAML. If you do want to start gating, add these contexts to a required_status_checks rule: CLI e2e (mock relay), Rust tests (1/4)(4/4), Relay runtime tests, Relay CLI E2E, Simulator and vectors, Rust check, Rust clippy, Rust format, Tamarin proofs.

🤖 Addressed by Claude Code

@erskingardner erskingardner merged commit 745959e into master Jun 26, 2026
13 checks passed
@erskingardner erskingardner deleted the claude/strange-matsumoto-35683a branch June 26, 2026 17:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant