[HOLD] [TRUNK-17917] Add failure count monitor documentation #529

Draft
samgutentag wants to merge 2 commits into main from sam-gutentag/failure-count-monitor-docs

@samgutentag
Member

@samgutentag samgutentag commented Apr 6, 2026

On hold: The failure count monitor is behind a LaunchDarkly feature flag and not yet available to customers. This PR is approved and ready to merge when the flag is flipped. See TRUNK-17920 for screenshot follow-up work.

Summary

  • New docs page for the failure count monitor (flaky-tests/detection/failure-count-monitor.md), a detection mode that flags tests as soon as they accumulate a configured number of failures on monitored branches
  • Updated detection overview (flaky-tests/detection/README.md) to include failure count in the monitor types table and status priority matrix
  • Updated summary.md table of contents

Linear tickets

Engineering authors

  • @mb1206 (Meredith Burkle) -- all implementation PRs

Context links

Files changed

  • flaky-tests/detection/failure-count-monitor.md (new)
  • flaky-tests/detection/README.md (modified)
  • summary.md (modified)

Open questions

  • Separate PR needed for the "Threshold Monitor" → "Failure Rate Monitor" UI rename (trunk2 PRs #3459, #3463).

Test plan

  • Check GitBook preview renders correctly on the change request branch
  • Verify internal links between failure count monitor page and detection overview
  • Confirm branch pattern cross-reference to threshold monitor page works
  • @mb1206 reviews for technical accuracy

🤖 Generated with Claude Code

@trunk-staging-io

Merging to main in this repository is managed by Trunk.

  • To merge this pull request, check the box to the left or comment /trunk merge below.

After your PR is submitted to the merge queue, this comment will be automatically updated with its status. If the PR fails, failure details will also be posted here.

@samgutentag samgutentag requested review from mb1206 April 6, 2026 18:31
@claude
Contributor

claude bot commented Apr 6, 2026

Claude finished @samgutentag's task in 1m 57s


PR Review

  • Gather context
  • Read changed files
  • Review for typos, grammar, formatting, and documentation best practices
  • Post findings

Overall this is a well-written, thorough doc. Structure, cross-links, and content are all solid. Found 4 minor issues:

flaky-tests/detection/failure-count-monitor.md

  1. 3× double hyphens (--) that should be em dashes (lines 23, 71, 75): inconsistent with threshold-monitor.md, which uses em dashes in the exact same sentence patterns. Suggestions posted inline.
  2. Line 54, step 4: "The developer fixes the test and merges the fix." — "fixes" / "fix" is repetitive. Suggestion posted inline.

flaky-tests/detection/README.md and summary.md — no issues.

@samgutentag samgutentag changed the title Add failure count monitor documentation [TRUNK-17917] Add failure count monitor documentation Apr 6, 2026

@github-actions github-actions bot left a comment


Auto-approved: Claude code review passed.


For example, with a resolution timeout of 2 hours, a test that was flagged at 3:00 PM will resolve at 5:00 PM if no new failures occur. If a new failure arrives at 4:30 PM, the clock resets, and the test will not resolve until 6:30 PM.

Choose a resolution timeout that gives your team enough time to verify a fix has landed. A short timeout (e.g., 30 minutes) resolves quickly but may prematurely clear tests that fail intermittently. A longer timeout (e.g., 24 hours) is more conservative and ensures the test stays flagged until it has been clean for a full day.
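
A minimal sketch of the clock-reset arithmetic described above, for reviewers who want to sanity-check the example times. The function name `resolution_time` and its parameters are hypothetical and not taken from the Trunk implementation; the sketch assumes each failure that arrives before the current deadline restarts the full timeout.

```python
from datetime import datetime, timedelta


def resolution_time(flagged_at, later_failures, timeout):
    """Return when a flagged test would resolve (illustrative only).

    Assumption: every failure that arrives before the current deadline
    restarts the clock, so the deadline becomes that failure time plus
    the timeout. Failures at or after the deadline are ignored here,
    since the test would already have resolved by then.
    """
    deadline = flagged_at + timeout
    for failure in sorted(later_failures):
        if failure >= deadline:
            break  # already resolved; a later failure would be a new detection
        deadline = failure + timeout
    return deadline


flagged = datetime(2026, 4, 6, 15, 0)    # flagged at 3:00 PM
two_hours = timedelta(hours=2)

no_new_failures = resolution_time(flagged, [], two_hours)
one_new_failure = resolution_time(
    flagged, [datetime(2026, 4, 6, 16, 30)], two_hours  # failure at 4:30 PM
)
```

With the values from the example, `no_new_failures` is 5:00 PM and `one_new_failure` is 6:30 PM on the same day.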
Contributor


we may want to flag here that the resolution timeout cannot be shorter than the detection lookback window. its maybe obvious but could trip folks up

Member Author


Great call, added in 15adbad: 'The resolution timeout must be at least as long as the detection window.'
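
For illustration, the constraint could be expressed as a simple settings check like the one below. This is only a sketch: the `MonitorSettings` class and its field names are hypothetical and do not reflect the product's actual configuration schema. The only rule it encodes is the one quoted above, that the resolution timeout must be at least as long as the detection window.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class MonitorSettings:
    """Hypothetical settings holder; names are assumptions for illustration."""

    failure_threshold: int
    detection_window: timedelta
    resolution_timeout: timedelta

    def __post_init__(self):
        # The documented constraint: timeout >= detection window.
        if self.resolution_timeout < self.detection_window:
            raise ValueError(
                "resolution timeout must be at least as long as the detection window"
            )


# Accepted: the timeout (2 hours) is longer than the window (1 hour).
valid = MonitorSettings(3, timedelta(hours=1), timedelta(hours=2))

# Rejected: the timeout (30 minutes) is shorter than the window (2 hours).
try:
    MonitorSettings(3, timedelta(hours=2), timedelta(minutes=30))
    rejected = False
except ValueError:
    rejected = True
```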


# Failure Count Monitor

The failure count monitor flags a test the moment it accumulates a configured number of failures on monitored branches within a rolling time window. Unlike the threshold monitor, which requires a failure *rate* calculated over many runs, the failure count monitor reacts to individual failures without needing a minimum sample size or a percentage calculation.
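
As a rough illustration of counting failures in a rolling window, the sketch below keeps recent failure timestamps for one test and reports when the count reaches a threshold. The class `FailureCountSketch` and its method names are hypothetical, failures are assumed to be recorded in chronological order, and whether a failure exactly one window old still counts is an assumption (it does here); none of this is taken from the actual implementation.

```python
from collections import deque
from datetime import datetime, timedelta


class FailureCountSketch:
    """Illustrative rolling-window failure counter for a single test."""

    def __init__(self, threshold, window):
        self.threshold = threshold  # failures needed to flag the test
        self.window = window        # length of the rolling time window
        self.failures = deque()     # failure timestamps, oldest first

    def record_failure(self, at):
        """Record a failure at time `at`; return True if the test is flagged."""
        self.failures.append(at)
        cutoff = at - self.window
        # Drop failures that fell out of the window (strictly older than cutoff).
        while self.failures and self.failures[0] < cutoff:
            self.failures.popleft()
        return len(self.failures) >= self.threshold


monitor = FailureCountSketch(threshold=3, window=timedelta(hours=1))
day = datetime(2026, 4, 6)
results = [
    monitor.record_failure(day.replace(hour=12, minute=0)),
    monitor.record_failure(day.replace(hour=12, minute=30)),
    monitor.record_failure(day.replace(hour=13, minute=15)),  # 12:00 has expired
    monitor.record_failure(day.replace(hour=13, minute=20)),  # 3 within the hour
]
```

Note that no run count or percentage appears anywhere in the sketch, which is the contrast with a rate-based monitor drawn in the paragraph above.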
Contributor


i think we're moving toward renaming "threshold monitors" to be "failure rate monitors" so that it has parity with "failure count monitors" https://trunk-io.slack.com/archives/C08AEDGMZNH/p1775066930429269?thread_ts=1775066814.530469&cid=C08AEDGMZNH

Member Author

@samgutentag samgutentag Apr 6, 2026


Yep, tracking this as a separate PR. trunk2 #3459 (UI rename) is already merged and #3463 (enum rename) is open. Will do a docs pass once the enum rename lands.

☝️ notes for me to do another pass when ready!

New docs page for the failure count monitor, a new flake detection mode
that flags tests as soon as they accumulate a configured number of failures
on monitored branches. Updates the detection overview to include the new
monitor type in the comparison table and priority matrix.

Addresses review feedback: fixed repetitive wording in example walkthrough.
Em dash suggestions declined per style guide (double hyphens are intentional).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@samgutentag samgutentag force-pushed the sam-gutentag/failure-count-monitor-docs branch from 474c711 to f277770 Compare April 6, 2026 18:41
@claude
Contributor

claude bot commented Apr 6, 2026

Claude finished @samgutentag's task in 1m 30s


PR Review

  • Gather context
  • Read changed files
  • Review for typos, grammar, formatting, and documentation best practices
  • Post findings

Good overall — the docs are clear, well-structured, and the cross-links all check out. Three previous inline suggestions (lines 54, 71, 75) appear to have been resolved. One formatting issue remains:

flaky-tests/detection/failure-count-monitor.md

  • Line 23: -- double hyphens still present. Inline suggestion posted.

flaky-tests/detection/README.md and summary.md — no issues.

On mb1206's suggestions:

  • Line 79 (resolution timeout constraint): Adding a note that the resolution timeout cannot be shorter than the detection window would be a useful guardrail for users. Not a blocker, but worth addressing before merge.
  • Line 7 (threshold → failure rate rename): Acknowledged in the PR body as a separate PR — no action needed here.


@github-actions github-actions bot left a comment


Auto-approved: Claude code review passed.

Resolution timeout must be at least as long as the detection window.
Flagged by Meredith in PR review.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Contributor

@mb1206 mb1206 left a comment


looks great! you may also be able to use screenshots from the preview for editing? we show how many tests have N many failures in the lookback window, and also a histogram for when tests' last failure was if thats helpful

@samgutentag samgutentag changed the title [TRUNK-17917] Add failure count monitor documentation [HOLD] [TRUNK-17917] Add failure count monitor documentation Apr 6, 2026