diff --git a/.github/docs/disk-benchmarks-aa.md b/.github/docs/disk-benchmarks-aa.md
new file mode 100644
index 000000000..4d5931280
--- /dev/null
+++ b/.github/docs/disk-benchmarks-aa.md
@@ -0,0 +1,77 @@
+# DiskANN A/A Benchmark Stability Test
+
+> Companion documentation for
+> [`.github/workflows/disk-benchmarks-aa.yml`](../workflows/disk-benchmarks-aa.yml).
+
+## Purpose
+
+The A/A benchmark runs **main vs. main** — the same code is built once and
+run twice. Its purpose is to detect **environment noise** on the CI runners,
+not code regressions. If the results between two identical runs differ beyond
+configured thresholds, it indicates that the runner environment is too noisy
+for reliable benchmarking.
+
+## Schedule
+
+| Trigger | Details |
+|---------|---------|
+| **Cron** | Daily at **9:00 AM UTC** (`0 9 * * *`) |
+| **Manual** | Can be triggered via `workflow_dispatch` for debugging |
+
+Only one run executes at a time; a newer trigger cancels any run still in progress (`cancel-in-progress: true`).
+
+## Datasets
+
+Two datasets are benchmarked in parallel via a matrix strategy
+(with `fail-fast: false`, so both always run):
+
+| Dataset | Config | Archive |
+|---------|--------|---------|
+| `wikipedia-100K` | `wikipedia-100K-disk-index.json` | `wikipedia-100K.tar.gz` |
+| `openai-100K` | `openai-100K-disk-index.json` | `openai-100K.tar.gz` |
+
+Config and tolerance files live in
+[`diskann-benchmark/perf_test_inputs/`](../../diskann-benchmark/perf_test_inputs/).
+
+## Tolerance Thresholds
+
+Defined in
+[`disk-index-tolerances.json`](../../diskann-benchmark/perf_test_inputs/disk-index-tolerances.json):
+
+| Metric | Allowed Regression |
+|--------|--------------------|
+| Build time | 10 % |
+| QPS | 10 % |
+| Recall | 1 % |
+| Mean I/Os | 1 % |
+| Mean comparisons | 1 % |
+| Mean latency | 15 % |
+| P95 latency | 15 % |
+
+## Failure Notification
+
+Our reliability promise is **95%** — meaning 1 failure in 20 runs is expected
+due to environment noise. To avoid unnecessary alerts, a GitHub issue is only
+created when the failure rate exceeds 5% across the last 20 completed runs.
+
+When the threshold is breached, the `notify-on-failure` job:
+
+1. **Creates a GitHub issue** with:
+   - Title: `[Benchmark A/A] Daily stability test failed – `
+   - A link to the failed workflow run
+   - The recent failure rate (e.g., `3/20 (15.0%)`)
+   - Labels: `benchmark`, `A/A-failure`
+2. **Tags** `@microsoft/diskann-disk-maintainers` for review.
+
+The team should then inspect the uploaded artifacts (retained 30 days) to
+determine whether thresholds need tuning or whether there is a runner
+environment issue.
+
+## Comparison with A/B Benchmarks
+
+The A/A test should be distinguished from the **A/B benchmark** workflow
+([`disk-benchmarks.yml`](../workflows/disk-benchmarks.yml)), which compares
+a **PR branch vs. a baseline** (usually `main`). The A/B workflow builds two
+separate binaries and is designed to catch performance regressions in code
+changes. In contrast, the A/A test builds only once and runs twice, so any
+differences are purely due to environment variability.
diff --git a/.github/workflows/disk-benchmarks-aa.yml b/.github/workflows/disk-benchmarks-aa.yml
index 7057cade8..800acbdfb 100644
--- a/.github/workflows/disk-benchmarks-aa.yml
+++ b/.github/workflows/disk-benchmarks-aa.yml
@@ -6,6 +6,8 @@
 # Runs main vs main at 9 AM UTC every day to detect environment noise.
 # If any threshold is breached, a GitHub issue is created to notify @microsoft/diskann-admin.
 # Can also be triggered manually for debugging.
+#
+# For more details see .github/docs/disk-benchmarks-aa.md
 
 name: Disk Benchmarks (A/A)
 
@@ -102,14 +104,39 @@ jobs:
           diskann_rust/target/tmp/${{ matrix.dataset }}_baseline.json
         retention-days: 30
 
-  # Notify diskann-admin on A/A failure
+  # Notify diskann-disk-maintainers on A/A failure — but only when the failure
+  # rate exceeds 5% over the last 20 runs. Our reliability promise is 95%, so
+  # 1 failure in 20 runs is expected and should not trigger a notification.
   notify-on-failure:
     name: Notify on A/A Failure
     needs: [aa-benchmark]
     runs-on: ubuntu-latest
     if: failure()
     steps:
+      - name: Check recent failure rate
+        id: check-rate
+        uses: actions/github-script@v7
+        with:
+          script: |
+            const { data: runs } = await github.rest.actions.listWorkflowRuns({
+              owner: context.repo.owner,
+              repo: context.repo.repo,
+              workflow_id: 'disk-benchmarks-aa.yml',
+              per_page: 20,
+              status: 'completed',
+            });
+            const total = runs.workflow_runs.length;
+            const failures = runs.workflow_runs.filter(r => r.conclusion === 'failure').length;
+            const failureRate = total > 0 ? failures / total : 0;
+            console.log(`Recent A/A runs: ${total}, failures: ${failures}, rate: ${(failureRate * 100).toFixed(1)}%`);
+            // Only notify if failure rate exceeds our 5% reliability budget
+            core.setOutput('should_notify', failureRate > 0.05 ? 'true' : 'false');
+            core.setOutput('failure_rate', `${(failureRate * 100).toFixed(1)}%`);
+            core.setOutput('failures', `${failures}`);
+            core.setOutput('total', `${total}`);
+
       - name: Create GitHub issue for A/A failure
+        if: steps.check-rate.outputs.should_notify == 'true'
         uses: actions/github-script@v7
         with:
           script: |
@@ -126,6 +153,7 @@
             `This indicates environment noise exceeded the configured thresholds.`,
             ``,
             `**Run:** ${runUrl}`,
+            `**Recent failure rate:** ${{ steps.check-rate.outputs.failures }}/${{ steps.check-rate.outputs.total }} (${{ steps.check-rate.outputs.failure_rate }}) — exceeds 5% reliability budget`,
             ``,
             `Please review the benchmark artifacts and determine if thresholds need tuning`,
             `or if there is a runner environment issue.`,
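
The failure-rate gate added in the `check-rate` step can be sanity-checked outside of Actions. Below is a minimal sketch in plain JavaScript, assuming each completed run is reduced to its `conclusion` string; `shouldNotify` is a hypothetical helper for illustration, not part of the workflow:

```javascript
// Standalone model of the notify gate in the notify-on-failure job.
// The real step reads `runs.workflow_runs` from the GitHub REST API;
// here each run is represented only by its `conclusion` string.
function shouldNotify(conclusions, budget = 0.05) {
  const total = conclusions.length;
  const failures = conclusions.filter(c => c === 'failure').length;
  const failureRate = total > 0 ? failures / total : 0;
  return { total, failures, failureRate, notify: failureRate > budget };
}

// 1 failure in 20 runs sits exactly at the 5% budget, so no issue is filed...
const quiet = shouldNotify(Array(19).fill('success').concat(['failure']));
console.log(quiet.notify); // false

// ...but 3 failures in 20 (15%, the doc's example rate) breaches it.
const noisy = shouldNotify(Array(17).fill('success').concat(Array(3).fill('failure')));
console.log(noisy.notify); // true
```

Note the strict `>` comparison: it is what preserves the 95% reliability promise, since the single "free" expected failure lands exactly on, not above, the 5% budget.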