77 changes: 77 additions & 0 deletions .github/docs/disk-benchmarks-aa.md
@@ -0,0 +1,77 @@
# DiskANN A/A Benchmark Stability Test

> Companion documentation for
> [`.github/workflows/disk-benchmarks-aa.yml`](../workflows/disk-benchmarks-aa.yml).

## Purpose

The A/A benchmark runs **main vs. main** — the same code is built once and
run twice. Its purpose is to detect **environment noise** on the CI runners,
not code regressions. If the results between two identical runs differ beyond
configured thresholds, it indicates that the runner environment is too noisy
for reliable benchmarking.

## Schedule

| Trigger | Details |
|---------|---------|
| **Cron** | Daily at **9:00 AM UTC** (`0 9 * * *`) |
| **Manual** | Can be triggered via `workflow_dispatch` for debugging |

Only one run is allowed at a time (`cancel-in-progress: true`).
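The trigger and concurrency settings described above can be sketched as follows (the key names follow standard GitHub Actions syntax; the exact `group` expression in the real workflow may differ):

```yaml
on:
  schedule:
    - cron: "0 9 * * *"    # daily at 9:00 AM UTC
  workflow_dispatch: {}    # manual runs for debugging

concurrency:
  group: disk-benchmarks-aa
  cancel-in-progress: true # at most one run at a time
```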

## Datasets

Two datasets are benchmarked in parallel via a matrix strategy
(with `fail-fast: false`, so both always run):

| Dataset | Config | Archive |
|---------|--------|---------|
| `wikipedia-100K` | `wikipedia-100K-disk-index.json` | `wikipedia-100K.tar.gz` |
| `openai-100K` | `openai-100K-disk-index.json` | `openai-100K.tar.gz` |

Config and tolerance files live in
[`diskann-benchmark/perf_test_inputs/`](../../diskann-benchmark/perf_test_inputs/).
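A minimal sketch of the matrix strategy (the dataset names come from the table above; the surrounding job definition is omitted):

```yaml
strategy:
  fail-fast: false         # one dataset failing does not cancel the other
  matrix:
    dataset: [wikipedia-100K, openai-100K]
```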

## Tolerance Thresholds

Defined in
[`disk-index-tolerances.json`](../../diskann-benchmark/perf_test_inputs/disk-index-tolerances.json):

| Metric | Allowed Regression |
|--------|--------------------|
| Build time | 10% |
| QPS | 10% |
| Recall | 1% |
| Mean I/Os | 1% |
| Mean comparisons | 1% |
| Mean latency | 15% |
| P95 latency | 15% |
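As a sketch of how such thresholds apply, the check reduces to a relative comparison between the two runs. The function below is illustrative, not the workflow's actual implementation; note that for metrics where higher is better (QPS, recall) a regression is a decrease, while for latency, I/Os, and comparisons it is an increase.

```javascript
// Illustrative tolerance check: returns true when run B regresses beyond
// `tolerance` (a fraction, e.g. 0.10 for 10%) relative to run A.
// `higherIsBetter` is true for QPS/recall, false for latency/IO metrics.
function exceedsTolerance(a, b, tolerance, higherIsBetter) {
  const change = (b - a) / a;                       // relative change from A to B
  const regression = higherIsBetter ? -change : change;
  return regression > tolerance;
}

// Example: QPS dropping from 1000 to 850 is a 15% regression,
// which exceeds the 10% QPS tolerance; 1000 to 950 does not.
console.log(exceedsTolerance(1000, 850, 0.10, true)); // true
console.log(exceedsTolerance(1000, 950, 0.10, true)); // false
```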

## Failure Notification

Our reliability target is **95%**, so one failure in 20 runs is expected from
environment noise alone. To avoid unnecessary alerts, a GitHub issue is only
created when the failure rate exceeds 5% across the last 20 completed runs
(that is, two or more failures).
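The gating logic reduces to a simple ratio check. A minimal sketch, mirroring (but not copied from) the workflow script shown further down:

```javascript
// Decide whether a notification should fire, given the conclusions of the
// last (up to) 20 completed runs. The budget is 5%: with 20 runs, a single
// failure is exactly 5% and stays within budget; two or more exceed it.
function shouldNotify(conclusions) {
  const total = conclusions.length;
  const failures = conclusions.filter(c => c === 'failure').length;
  const failureRate = total > 0 ? failures / total : 0;
  return failureRate > 0.05;
}

const history = Array(19).fill('success').concat(['failure']);
console.log(shouldNotify(history));                 // false: 1/20 = 5%
console.log(shouldNotify(['failure', ...history])); // true: 2/21 > 5%
```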

When the threshold is breached, the `notify-on-failure` job:

1. **Creates a GitHub issue** with:
- Title: `[Benchmark A/A] Daily stability test failed – <date>`
- A link to the failed workflow run
- The recent failure rate (e.g., `3/20 (15.0%)`)
- Labels: `benchmark`, `A/A-failure`
2. **Tags** `@microsoft/diskann-disk-maintainers` for review.

The team should then inspect the uploaded artifacts (retained 30 days) to
determine whether thresholds need tuning or there is a runner environment
issue.

## Comparison with A/B Benchmarks

The A/A test should be distinguished from the **A/B benchmark** workflow
([`disk-benchmarks.yml`](../workflows/disk-benchmarks.yml)), which compares
a **PR branch vs. a baseline** (usually `main`). The A/B workflow builds two
separate binaries and is designed to catch performance regressions in code
changes. In contrast, the A/A test builds only once and runs twice, so any
differences are purely due to environment variability.
30 changes: 29 additions & 1 deletion .github/workflows/disk-benchmarks-aa.yml
@@ -6,6 +6,8 @@
# Runs main vs main at 9 AM UTC every day to detect environment noise.
# If any threshold is breached, a GitHub issue is created to notify @microsoft/diskann-admin.
# Can also be triggered manually for debugging.
#
# For more details see .github/docs/disk-benchmarks-aa.md

name: Disk Benchmarks (A/A)

@@ -102,14 +104,39 @@ jobs:
diskann_rust/target/tmp/${{ matrix.dataset }}_baseline.json
retention-days: 30

# Notify diskann-admin on A/A failure
# Notify diskann-disk-maintainers on A/A failure — but only when the failure
# rate exceeds 5% over the last 20 runs. Our reliability promise is 95%, so
# 1 failure in 20 runs is expected and should not trigger a notification.
notify-on-failure:
name: Notify on A/A Failure
needs: [aa-benchmark]
runs-on: ubuntu-latest
if: failure()
steps:
- name: Check recent failure rate
id: check-rate
uses: actions/github-script@v7
with:
script: |
const { data: runs } = await github.rest.actions.listWorkflowRuns({
owner: context.repo.owner,
repo: context.repo.repo,
workflow_id: 'disk-benchmarks-aa.yml',
per_page: 20,
status: 'completed', // the run that triggered this job is still in progress, so it is not counted here
});
const total = runs.workflow_runs.length;
const failures = runs.workflow_runs.filter(r => r.conclusion === 'failure').length;
const failureRate = total > 0 ? failures / total : 0;
console.log(`Recent A/A runs: ${total}, failures: ${failures}, rate: ${(failureRate * 100).toFixed(1)}%`);
// Only notify if failure rate exceeds our 5% reliability budget
core.setOutput('should_notify', failureRate > 0.05 ? 'true' : 'false');
core.setOutput('failure_rate', `${(failureRate * 100).toFixed(1)}%`);
core.setOutput('failures', `${failures}`);
core.setOutput('total', `${total}`);

- name: Create GitHub issue for A/A failure
if: steps.check-rate.outputs.should_notify == 'true'
uses: actions/github-script@v7
with:
script: |
@@ -126,6 +153,7 @@
`This indicates environment noise exceeded the configured thresholds.`,
``,
`**Run:** ${runUrl}`,
`**Recent failure rate:** ${{ steps.check-rate.outputs.failures }}/${{ steps.check-rate.outputs.total }} (${{ steps.check-rate.outputs.failure_rate }}) — exceeds 5% reliability budget`,
``,
`Please review the benchmark artifacts and determine if thresholds need tuning`,
`or if there is a runner environment issue.`,