77 changes: 77 additions & 0 deletions .github/docs/disk-benchmarks-aa.md
@@ -0,0 +1,77 @@
# DiskANN A/A Benchmark Stability Test

> Companion documentation for
> [`.github/workflows/disk-benchmarks-aa.yml`](../workflows/disk-benchmarks-aa.yml).

## Purpose

The A/A benchmark runs **main vs. main** — the same code is built once and
run twice. Its purpose is to detect **environment noise** on the CI runners,
not code regressions. If the results between two identical runs differ beyond
configured thresholds, it indicates that the runner environment is too noisy
for reliable benchmarking.

## Schedule

| Trigger | Details |
|---------|---------|
| **Cron** | Daily at **9:00 AM UTC** (`0 9 * * *`) |
| **Manual** | Can be triggered via `workflow_dispatch` for debugging |

Only one run is allowed at a time (`cancel-in-progress: true`).
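The trigger and concurrency settings described above can be sketched as follows (the key names follow standard GitHub Actions syntax; the exact `group` expression in the real workflow may differ):

```yaml
on:
  schedule:
    - cron: "0 9 * * *"    # daily at 9:00 AM UTC
  workflow_dispatch: {}    # manual runs for debugging

concurrency:
  group: disk-benchmarks-aa
  cancel-in-progress: true # at most one run at a time
```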

## Datasets

Two datasets are benchmarked in parallel via a matrix strategy
(with `fail-fast: false`, so both always run):

| Dataset | Config | Archive |
|---------|--------|---------|
| `wikipedia-100K` | `wikipedia-100K-disk-index.json` | `wikipedia-100K.tar.gz` |
| `openai-100K` | `openai-100K-disk-index.json` | `openai-100K.tar.gz` |

Config and tolerance files live in
[`diskann-benchmark/perf_test_inputs/`](../../diskann-benchmark/perf_test_inputs/).
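A minimal sketch of the matrix strategy (the dataset names come from the table above; the surrounding job definition is omitted):

```yaml
strategy:
  fail-fast: false         # one dataset failing does not cancel the other
  matrix:
    dataset: [wikipedia-100K, openai-100K]
```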

## Tolerance Thresholds

Defined in
[`disk-index-tolerances.json`](../../diskann-benchmark/perf_test_inputs/disk-index-tolerances.json):

| Metric | Allowed Regression |
|--------|--------------------|
| Build time | 10% |
| QPS | 10% |
| Recall | 1% |
| Mean I/Os | 1% |
| Mean comparisons | 1% |
| Mean latency | 15% |
| P95 latency | 15% |
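As a sketch of how such thresholds apply, the check reduces to a relative comparison between the two runs. The function below is illustrative, not the workflow's actual implementation; note that for metrics where higher is better (QPS, recall) a regression is a decrease, while for latency, I/Os, and comparisons it is an increase.

```javascript
// Illustrative tolerance check: returns true when run B regresses beyond
// `tolerance` (a fraction, e.g. 0.10 for 10%) relative to run A.
// `higherIsBetter` is true for QPS/recall, false for latency/IO metrics.
function exceedsTolerance(a, b, tolerance, higherIsBetter) {
  const change = (b - a) / a;                       // relative change from A to B
  const regression = higherIsBetter ? -change : change;
  return regression > tolerance;
}

// Example: QPS dropping from 1000 to 850 is a 15% regression,
// which exceeds the 10% QPS tolerance; 1000 to 950 does not.
console.log(exceedsTolerance(1000, 850, 0.10, true)); // true
console.log(exceedsTolerance(1000, 950, 0.10, true)); // false
```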

## Failure Notification

Our reliability target is **95%**, so one failure in 20 runs is expected from
environment noise alone. To avoid unnecessary alerts, a GitHub issue is only
created when the failure rate exceeds 5% across the last 20 completed runs
(that is, two or more failures).
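The gating logic reduces to a simple ratio check. A minimal sketch, mirroring (but not copied from) the workflow script shown further down:

```javascript
// Decide whether a notification should fire, given the conclusions of the
// last (up to) 20 completed runs. The budget is 5%: with 20 runs, a single
// failure is exactly 5% and stays within budget; two or more exceed it.
function shouldNotify(conclusions) {
  const total = conclusions.length;
  const failures = conclusions.filter(c => c === 'failure').length;
  const failureRate = total > 0 ? failures / total : 0;
  return failureRate > 0.05;
}

const history = Array(19).fill('success').concat(['failure']);
console.log(shouldNotify(history));                 // false: 1/20 = 5%
console.log(shouldNotify(['failure', ...history])); // true: 2/21 > 5%
```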

When the threshold is breached, the `notify-on-failure` job:

1. **Creates a GitHub issue** with:
- Title: `[Benchmark A/A] Daily stability test failed – <date>`
- A link to the failed workflow run
- The recent failure rate (e.g., `3/20 (15.0%)`)
- Labels: `benchmark`, `A/A-failure`
2. **Tags** `@microsoft/diskann-disk-maintainers` for review.

The team should then inspect the uploaded artifacts (retained 30 days) to
determine whether thresholds need tuning or there is a runner environment
issue.

## Comparison with A/B Benchmarks

The A/A test should be distinguished from the **A/B benchmark** workflow
([`disk-benchmarks.yml`](../workflows/disk-benchmarks.yml)), which compares
a **PR branch vs. a baseline** (usually `main`). The A/B workflow builds two
separate binaries and is designed to catch performance regressions in code
changes. In contrast, the A/A test builds only once and runs twice, so any
differences are purely due to environment variability.
30 changes: 29 additions & 1 deletion .github/workflows/disk-benchmarks-aa.yml
@@ -6,6 +6,8 @@
# Runs main vs main at 9 AM UTC every day to detect environment noise.
# If any threshold is breached, a GitHub issue is created to notify @microsoft/diskann-admin.
# Can also be triggered manually for debugging.
#
# For more details see .github/docs/disk-benchmarks-aa.md

name: Disk Benchmarks (A/A)

@@ -102,14 +104,39 @@ jobs:
diskann_rust/target/tmp/${{ matrix.dataset }}_baseline.json
retention-days: 30

# Notify diskann-admin on A/A failure
# Notify diskann-disk-maintainers on A/A failure — but only when the failure
# rate exceeds 5% over the last 20 runs. Our reliability promise is 95%, so
# 1 failure in 20 runs is expected and should not trigger a notification.
notify-on-failure:
name: Notify on A/A Failure
needs: [aa-benchmark]
runs-on: ubuntu-latest
if: failure()
steps:
- name: Check recent failure rate
id: check-rate
uses: actions/github-script@v7
with:
script: |
const { data: runs } = await github.rest.actions.listWorkflowRuns({
owner: context.repo.owner,
repo: context.repo.repo,
workflow_id: 'disk-benchmarks-aa.yml',
per_page: 20,
status: 'completed', // the run that triggered this job is still in progress, so it is not counted here
});
const total = runs.workflow_runs.length;
const failures = runs.workflow_runs.filter(r => r.conclusion === 'failure').length;
const failureRate = total > 0 ? failures / total : 0;
console.log(`Recent A/A runs: ${total}, failures: ${failures}, rate: ${(failureRate * 100).toFixed(1)}%`);
// Only notify if failure rate exceeds our 5% reliability budget
core.setOutput('should_notify', failureRate > 0.05 ? 'true' : 'false');
core.setOutput('failure_rate', `${(failureRate * 100).toFixed(1)}%`);
core.setOutput('failures', `${failures}`);
core.setOutput('total', `${total}`);

- name: Create GitHub issue for A/A failure
if: steps.check-rate.outputs.should_notify == 'true'
uses: actions/github-script@v7
with:
script: |
@@ -126,6 +153,7 @@
`This indicates environment noise exceeded the configured thresholds.`,
``,
`**Run:** ${runUrl}`,
`**Recent failure rate:** ${{ steps.check-rate.outputs.failures }}/${{ steps.check-rate.outputs.total }} (${{ steps.check-rate.outputs.failure_rate }}) — exceeds 5% reliability budget`,
``,
`Please review the benchmark artifacts and determine if thresholds need tuning`,
`or if there is a runner environment issue.`,