Add stats-v1 delta admission and branch-aware Git indexing#1050

Open
ebrevdo wants to merge 10 commits into sourcegraph:main from ebrevdo:main

ebrevdo commented Apr 25, 2026

Experimental stats-v1 delta admission and advisory repository stats

This adds an opt-in stats-v1 delta admission mode that decides whether to accept a delta build from measured write mass and accumulated read debt, rather than relying only on the previous fixed shard-count threshold. The empty mode preserves historical behavior. When stats-v1 is enabled, Zoekt computes the size of the candidate delta, the live indexed corpus, physical bytes already present in stacked shards, tombstone pressure, and shard fanout before either accepting the delta or falling back to a normal rebuild.

The new stats are advisory metadata, not correctness metadata. Full builds seed DeltaStats, accepted deltas update those stats forward, and older indexes without stats are backfilled from current Git trees plus existing shard statistics where possible. Correctness gates such as option-hash mismatch, unsupported ignore-file changes, unsupported submodules, and unsupported compound shard cases remain separate from the cost gates, so a bad or missing stats path falls back instead of weakening search correctness.

The admission path also writes structured JSONL decision logs for calibration. Each row includes the mode, accept/reject result, reason, candidate bytes/docs/paths, live and physical counters, tombstone ratio inputs, shard fanout, thresholds, and branch mapping context when branch sets change. This gives operators a stable data source for threshold tuning without scraping stderr.

  • api.go adds RepositoryDeltaStats to zoekt.Repository with live indexed bytes, live document/path counts, physical bytes/docs, tombstone path count, delta layer count, and last full-index time.
  • grpc/protos/zoekt/webserver/v1/webserver.proto and api_proto.go add proto support for the transferable delta-stats counters, while api_test.go covers JSON compatibility and MergeMutable behavior.
  • gitindex/index.go adds DeltaAdmissionModeStatsV1, DeltaAdmissionThresholds, default threshold values, mode validation, candidate/live/physical stats helpers, old-index backfill, and the prepareDeltaStatsV1 admission gate.
  • gitindex/index.go evaluates the stats-v1 gates in write-mass-first order: candidate indexed bytes ratio, next physical/live bytes ratio, tombstone path ratio, and shard fanout ratio.
  • index/builder.go preserves accepted delta stats by copying RepositoryDescription.DeltaStats into old shard .meta files during delta metadata rewrites.
  • cmd/zoekt-git-index/main.go wires -delta_admission_mode and -delta_admission_log_json; cmd/zoekt-sourcegraph-indexserver/main.go validates DELTA_ADMISSION_MODE, and cmd/zoekt-sourcegraph-indexserver/index.go passes the mode through to zoekt-git-index.
  • gitindex/index_test.go, index/builder_test.go, cmd/zoekt-git-index/main_test.go, and cmd/zoekt-sourcegraph-indexserver/*_test.go cover stats seeding, old-index backfill, debt-based fallback, empty-mode compatibility, builder metadata propagation, CLI wiring, and indexserver env parsing.
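The gate ordering listed above can be sketched as a short function that checks each ratio in turn and reports the first one that exceeds its limit. This is a simplified illustration, not the PR's `prepareDeltaStatsV1`: the type names, field names, reason strings, threshold values, and the exact numerators and denominators (in particular the fanout ratio, taken here as shards after the delta over shards at the last full build) are all assumptions.

```go
package main

import (
	"fmt"
	"math"
)

// deltaStats holds the counters the gates compare. All names are illustrative.
type deltaStats struct {
	CandidateBytes  uint64 // indexed bytes the delta would write
	LiveBytes       uint64 // live indexed corpus bytes
	PhysicalBytes   uint64 // bytes already present in stacked shards
	TombstonePaths  uint64 // paths tombstoned in older shards
	LivePaths       uint64 // live path count
	ShardCount      uint64 // shards currently in the stack
	FullBuildShards uint64 // assumed baseline: shards at the last full build
}

// thresholds are hypothetical limits; the PR defines its own defaults.
type thresholds struct {
	MaxCandidateRatio    float64
	MaxPhysicalLiveRatio float64
	MaxTombstoneRatio    float64
	MaxFanoutRatio       float64
}

// ratio returns num/den, treating 0/0 as 0 and x/0 as +Inf so that a missing
// denominator fails the gate instead of dividing by zero.
func ratio(num, den uint64) float64 {
	if den == 0 {
		if num == 0 {
			return 0
		}
		return math.Inf(1)
	}
	return float64(num) / float64(den)
}

// admit evaluates the gates in write-mass-first order and returns the name of
// the first gate that fails, or "accepted" when all pass.
func admit(s deltaStats, th thresholds) (bool, string) {
	gates := []struct {
		name  string
		value float64
		limit float64
	}{
		{"candidate_bytes_ratio", ratio(s.CandidateBytes, s.LiveBytes), th.MaxCandidateRatio},
		{"physical_live_ratio", ratio(s.PhysicalBytes+s.CandidateBytes, s.LiveBytes), th.MaxPhysicalLiveRatio},
		{"tombstone_path_ratio", ratio(s.TombstonePaths, s.LivePaths), th.MaxTombstoneRatio},
		{"shard_fanout_ratio", ratio(s.ShardCount+1, s.FullBuildShards), th.MaxFanoutRatio},
	}
	for _, g := range gates {
		if g.value > g.limit {
			return false, g.name
		}
	}
	return true, "accepted"
}

func main() {
	th := thresholds{MaxCandidateRatio: 0.25, MaxPhysicalLiveRatio: 2.0, MaxTombstoneRatio: 0.5, MaxFanoutRatio: 4.0}
	s := deltaStats{CandidateBytes: 10, LiveBytes: 1000, PhysicalBytes: 1100,
		TombstonePaths: 5, LivePaths: 100, ShardCount: 2, FullBuildShards: 1}
	ok, reason := admit(s, th)
	fmt.Println(ok, reason)
}
```

Checking write mass first means a delta that is already too large is rejected before any read-debt counters are consulted, which matches the ordering the description gives.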

Branch-set, HEAD, and linked-worktree delta indexing

This makes branch identity explicit for Git indexes that use HEAD or linked worktrees. ResolveHEADToBranch resolves an attached HEAD to the short branch name stored in repository metadata, while detached HEAD still stays as HEAD. That avoids treating two attached worktrees as an anonymous HEAD branch and makes subsequent delta decisions compare the actual branch set.
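A minimal sketch of that naming rule follows, assuming the caller already has the symbolic ref that `git symbolic-ref -q HEAD` would print (empty when HEAD is detached). The function names here are hypothetical and do not mirror the PR's `ResolveHEADToBranch` or `resolveHEADBranchName` helpers. The second function also drops duplicates while preserving order, which covers the case where HEAD resolves to a branch that is already listed.

```go
package main

import (
	"fmt"
	"strings"
)

// resolveHEAD maps HEAD's symbolic ref to the branch name to store.
// An attached HEAD ("refs/heads/<name>") yields "<name>"; anything else,
// including the empty string for a detached HEAD, stays "HEAD".
func resolveHEAD(symref string) string {
	const prefix = "refs/heads/"
	if strings.HasPrefix(symref, prefix) && len(symref) > len(prefix) {
		return strings.TrimPrefix(symref, prefix)
	}
	return "HEAD"
}

// resolveBranches replaces "HEAD" entries with the resolved branch name and
// removes duplicates while keeping the first occurrence of each name.
func resolveBranches(branches []string, symref string) []string {
	seen := make(map[string]bool, len(branches))
	out := make([]string, 0, len(branches))
	for _, b := range branches {
		if b == "HEAD" {
			b = resolveHEAD(symref)
		}
		if seen[b] {
			continue
		}
		seen[b] = true
		out = append(out, b)
	}
	return out
}

func main() {
	fmt.Println(resolveHEAD("refs/heads/feature/x"))
	fmt.Println(resolveHEAD(""))
	fmt.Println(resolveBranches([]string{"HEAD", "main", "dev"}, "refs/heads/main"))
}
```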

Branch-set changes remain opt-in. Without AllowDeltaBranchSetChange, a branch-set mismatch keeps the old behavior and falls back to a normal rebuild. With the flag enabled, the delta path can rewrite old shard metadata and tombstone old live paths before adding the final live content for the new branch set. The implementation also includes a smarter branch-set path for same-shape branch switches and renames, where only affected paths are reindexed instead of rewriting an unchanged large corpus.
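To make the same-shape case concrete, here is a rough sketch of pairing old and new branches by position and then finding the paths whose content differs between the paired trees. It is an illustration under assumptions, not the PR's `prepareSmartBranchSetDeltaBuild`: the `tree` type (path to blob ID), both function names, and the rule that duplicate names disqualify the mapping are all hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// tree maps a file path to its blob ID for one branch (illustrative type).
type tree map[string]string

// positionalMapping pairs old and new branch names slot by slot. It reports
// false when the lists differ in length or contain duplicates, in which case
// a caller would fall back to a more conservative path.
func positionalMapping(oldBranches, newBranches []string) (map[string]string, bool) {
	if len(oldBranches) == 0 || len(oldBranches) != len(newBranches) {
		return nil, false
	}
	mapping := make(map[string]string, len(oldBranches))
	seenNew := make(map[string]bool, len(newBranches))
	for i, old := range oldBranches {
		if _, dup := mapping[old]; dup || seenNew[newBranches[i]] {
			return nil, false
		}
		mapping[old] = newBranches[i]
		seenNew[newBranches[i]] = true
	}
	return mapping, true
}

// affectedPaths returns, sorted, every path that was added, removed, or
// changed between the old and new tree of one branch slot.
func affectedPaths(oldTree, newTree tree) []string {
	var paths []string
	for p, blob := range oldTree {
		if newBlob, ok := newTree[p]; !ok || newBlob != blob {
			paths = append(paths, p)
		}
	}
	for p := range newTree {
		if _, ok := oldTree[p]; !ok {
			paths = append(paths, p)
		}
	}
	sort.Strings(paths)
	return paths
}

func main() {
	mapping, ok := positionalMapping([]string{"main", "feat"}, []string{"main", "feat-renamed"})
	fmt.Println(mapping, ok)
	fmt.Println(affectedPaths(
		tree{"a.go": "1", "b.go": "2", "c.go": "3"},
		tree{"a.go": "1", "b.go": "9", "d.go": "4"},
	))
}
```

For a pure rename the paired trees are identical, so no paths are affected and only branch metadata would need rewriting.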

The CLI grows a -worktrees mode for indexing multiple linked worktrees as one repository. The mode infers each worktree's attached branch, requires all arguments to share one Git common directory, rejects explicit -branches, and rejects detached worktrees. It then indexes the inferred branch set through the same resolved-branch metadata path, so branch-specific queries see each worktree under its real branch name.
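The validation rules in this paragraph can be sketched as a pure function over already-discovered worktree facts. The `worktree` struct and `inferWorktreeBranches` name are hypothetical; the sketch assumes the caller has filled in each worktree's common directory (for example from `git rev-parse --git-common-dir`) and attached branch (from `git symbolic-ref -q --short HEAD`), and it does not reproduce the PR's actual helpers or error messages.

```go
package main

import (
	"errors"
	"fmt"
)

// worktree carries the facts needed for validation (illustrative type).
type worktree struct {
	Path      string
	CommonDir string // shared Git common directory
	Branch    string // attached branch short name; "" when HEAD is detached
}

// inferWorktreeBranches applies the rules described for worktree mode:
// no explicit branches, one shared common directory, no detached HEADs.
// It returns the inferred branch set in argument order.
func inferWorktreeBranches(wts []worktree, explicitBranches []string) ([]string, error) {
	if len(explicitBranches) > 0 {
		return nil, errors.New("explicit branches cannot be combined with worktree mode")
	}
	if len(wts) == 0 {
		return nil, errors.New("no worktrees given")
	}
	branches := make([]string, 0, len(wts))
	for _, wt := range wts {
		if wt.CommonDir != wts[0].CommonDir {
			return nil, fmt.Errorf("worktree %s uses common dir %s, want %s",
				wt.Path, wt.CommonDir, wts[0].CommonDir)
		}
		if wt.Branch == "" {
			return nil, fmt.Errorf("worktree %s has a detached HEAD", wt.Path)
		}
		branches = append(branches, wt.Branch)
	}
	return branches, nil
}

func main() {
	wts := []worktree{
		{Path: "/code/repo", CommonDir: "/code/repo/.git", Branch: "main"},
		{Path: "/code/wt/repo", CommonDir: "/code/repo/.git", Branch: "feature"},
	}
	branches, err := inferWorktreeBranches(wts, nil)
	fmt.Println(branches, err)
}
```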

  • gitindex/index.go adds ResolveHEADToBranch, AllowDeltaBranchSetChange, expandBranchesForOptions, resolveHEADBranchName, and order-preserving branch de-duplication for resolved or branch-set-aware modes.
  • gitindex/index.go adds prepareBranchSetDeltaBuild, prepareSmartBranchSetDeltaBuild, branch-set validation, indexed-path tombstone collection, and early compound-shard rejection for stats-v1 delta builds.
  • index/builder.go adds AllowDeltaBranchSetChange to builder options so gitindex can explicitly authorize branch metadata rewrites while default builder behavior remains strict.
  • cmd/zoekt-git-index/main.go adds -resolve_head_to_branch, -allow_delta_branch_set_change, and -worktrees, plus worktree common-dir and attached-branch discovery helpers.
  • cmd/zoekt-sourcegraph-indexserver/index.go carries the resolved-HEAD and branch-set-change flags into the child zoekt-git-index invocation when those options are enabled.
  • gitindex/delta_multibranch_*_test.go, gitindex/delta_smart_branchset_stats_test.go, and the expanded gitindex/index_test.go cover baseline branch-set compatibility, branch renames, branch adds/removes, combined branch-set changes, HEAD/worktree transitions, duplicate or ambiguous branch mappings, missing commits, compound-shard safety, stats after branch changes, and query-surface equivalence against clean rebuilds.
  • cmd/zoekt-git-index/main_test.go verifies that -worktrees produces one repository spec, infers attached branch names, rejects explicit branches, indexes both worktree branches, and keeps branch-filtered searches isolated.

Calibration tooling and regression coverage

The PR adds repeatable calibration scripts for understanding where stats-v1 should accept deltas and where it should force compaction through a full rebuild. These scripts run one-step and sequential history replays, capture JSONL admission decisions, collect timing and resource data, and compare stacked-delta query behavior against clean full indexes. They are intentionally separate from production code so threshold experiments can be repeated on disposable worktrees without changing runtime behavior.

The test surface is broadened around the new behavior rather than just asserting happy paths. There are compatibility tests for the default empty mode, explicit fallback tests for high write mass or unsafe branch mappings, tests that old indexes can be backfilled and then write stats forward, and tests that accepted branch-set deltas produce the same visible branch/query surface as a clean rebuild. Sourcegraph indexserver tests cover passing the experimental flags through the process boundary.

The generated protobuf file is updated with the new repository stats field, and the committed tests document the intended safety envelope for future work. In particular, the coverage distinguishes cheap branch switches and renames from branch additions or reorders that still need the conservative fallback path, and it records which cases are structural safety failures rather than tunable cost decisions.

  • scripts/delta-admission-one-step.sh runs a focused full-then-delta experiment for one commit transition and emits admission logs for inspection.
  • scripts/delta-admission-sequential.sh replays a commit range through one persistent index stack, recording each stats-v1 accept/fallback decision and timing/resource data.
  • scripts/query-stats-stacked-vs-clean.sh builds forced stacked and clean indexes at each step, runs a selected query through both, and records side-by-side query statistics.
  • scripts/analyze-query-stats-stacked-vs-clean.sh post-processes query-stat output so stacked-vs-clean ratios can inform shard fanout and physical/live threshold choices.
  • grpc/protos/zoekt/webserver/v1/webserver.pb.go is regenerated for the new Repository.delta_stats proto field.
  • The new and expanded tests span API/proto round trips, gitindex admission decisions, branch-set delta behavior, worktree indexing, builder metadata rewrites, sourcegraph-indexserver env parsing, and process-argument wiring.

Use case

Here's how I use this:

# zoekt-git-index-worktrees.sh

#!/usr/bin/env bash
set -euo pipefail

index_dir="${ZOEKT_INDEX_DIR:-$HOME/.zoekt}"
mkdir -p "$index_dir"

worktrees=()
for d in "$HOME/code/<monorepo>" "$HOME"/code/*/<monorepo>; do
  [[ -d "$d" ]] || continue
  if git -C "$d" symbolic-ref -q --short HEAD >/dev/null; then
    worktrees+=("$d")
  fi
done

if [[ ${#worktrees[@]} -eq 0 ]]; then
  echo "no attached worktrees found under ~/code" >&2
  exit 1
fi

exec /usr/local/bin/zoekt-git-index \
  -index "$index_dir" \
  -submodules=false \
  -delta \
  -delta_admission_mode stats-v1 \
  -allow_delta_branch_set_change \
  -worktrees \
  "$@" \
  "${worktrees[@]}"

and my local zoekt wrapper script, which adds repo and branch filters inferred from the current checkout:

#!/usr/bin/env bash
set -euo pipefail

if ! git rev-parse --show-toplevel >/dev/null 2>&1; then
  exec /usr/local/bin/zoekt "$@"
fi

branch="$(git symbolic-ref -q --short HEAD || true)"
if [[ -z "$branch" ]]; then
  echo "zoekt wrapper: current git checkout has detached HEAD; cannot infer branch filter" >&2
  exit 1
fi

remote="$(git config --get remote.origin.url || true)"
if [[ -z "$remote" ]]; then
  echo "zoekt wrapper: remote.origin.url is not set; cannot infer repo filter" >&2
  exit 1
fi

repo="$remote"
case "$repo" in
  git@*:*)
    host="${repo#git@}"
    host="${host%%:*}"
    path="${repo#*:}"
    repo="$host/${path%.git}"
    ;;
  ssh://git@*/*)
    repo="${repo#ssh://git@}"
    repo="${repo%.git}"
    ;;
  https://*/*)
    repo="${repo#https://}"
    repo="${repo%.git}"
    ;;
  http://*/*)
    repo="${repo#http://}"
    repo="${repo%.git}"
    ;;
  file://*)
    repo="${repo#file://}"
    repo="${repo%.git}"
    ;;
  *)
    repo="${repo%.git}"
    ;;
esac

query="r:$repo branch:$branch"
if [[ $# -gt 0 ]]; then
  query="$query $*"
fi

exec /usr/local/bin/zoekt -index_dir "${ZOEKT_INDEX_DIR:-$HOME/.zoekt}" "$query"

ebrevdo and others added 10 commits April 17, 2026 08:38
Add opt-in stats-v1 delta admission using persistent repository delta stats, JSONL decision logging, and calibration scripts. Keep empty admission mode on the historical behavior while letting stats-v1 use write-mass and read-pressure gates.

Co-authored-by: Codex <noreply@openai.com>
Add desired-future tests for branch-set delta updates across exact multi-branch updates, renames, additions, removals, combined changes, HEAD/worktree resolution, ambiguity handling, stats, and query surfaces. Most branch-set change tests are expected to fail until explicit branch mapping is implemented.

Co-authored-by: Codex <noreply@openai.com>
Support delta indexing across branch-set changes by explicitly permitting branch metadata rewrites only when gitindex prepares a full-live delta layer. The conservative path tombstones all old live paths and re-adds the final live branch set, giving correct branch-filtered lookups before optimizing branch mappings.

Co-authored-by: Codex <noreply@openai.com>
Add branch-count and branch-mapping fields to stats-v1 admission JSONL records, and update HEAD resolution tests for the branch-set delta behavior and existing branch:HEAD alias semantics.

Co-authored-by: Codex <noreply@openai.com>
Add an explicit gitindex option for branch-set-changing delta updates and restore the default branch-set mismatch fallback. Desired-future branch-set tests opt in, while ResolveHEADToBranch continues to govern HEAD naming behavior.

Co-authored-by: Codex <noreply@openai.com>
Document and test that AllowDeltaBranchSetChange is the explicit opt-in provenance for unmatched branch-slot updates. With the flag set, unmatched branch sets use the conservative full-live delta path; structurally unsafe cases remain guarded separately.

Co-authored-by: Codex <noreply@openai.com>
Preserve RepositoryDeltaStats across webserver protobuf conversion and keep the worktree test helper from overriding origin-derived repository identity.

Co-authored-by: Codex <noreply@openai.com>
Use positional branch mapping for same-length branch-set changes so near-identical HEAD branch switches and branch renames only rewrite affected paths. Expose the required experimental git-index flags through the CLI and indexserver command builder.

Co-authored-by: Codex <noreply@openai.com>
Allow zoekt-git-index to accept multiple linked worktree paths and index their attached HEAD branches together as one multi-branch repository.

Co-authored-by: Codex <noreply@openai.com>
@keegancsmith
Member

@ebrevdo thanks for the contribution. I will do a deeper review soon. Out of interest, is this contribution motivated by you using zoekt on your desktop, or are you running this in a server environment? Based on all the scripts, it looks like you use the local zoekt binary a bunch, but maybe that was just for testing?

Your PR is marked so that I can push to the branch, but for some reason GitHub won't let me. This patch should resolve your CI issues:

diff --git a/grpc/protos/zoekt/webserver/v1/webserver.pb.go b/grpc/protos/zoekt/webserver/v1/webserver.pb.go
index bc2091f..061ffea 100644
--- a/grpc/protos/zoekt/webserver/v1/webserver.pb.go
+++ b/grpc/protos/zoekt/webserver/v1/webserver.pb.go
@@ -1,7 +1,7 @@
 // Code generated by protoc-gen-go. DO NOT EDIT.
 // versions:
 // 	protoc-gen-go v1.29.1
-// 	protoc        v6.33.0
+// 	protoc        (unknown)
 // source: zoekt/webserver/v1/webserver.proto
 
 package v1
diff --git a/scripts/analyze-query-stats-stacked-vs-clean.sh b/scripts/analyze-query-stats-stacked-vs-clean.sh
index 5e98f1c..b2a842e 100755
--- a/scripts/analyze-query-stats-stacked-vs-clean.sh
+++ b/scripts/analyze-query-stats-stacked-vs-clean.sh
@@ -11,9 +11,13 @@ delta vs. full rebuild thresholds.
 EOF
 }
 
-if [[ $# -ne 1 || "${1:-}" == "-h" || "${1:-}" == "--help" ]]; then
+if [[ "${1:-}" == "-h" || "${1:-}" == "--help" ]]; then
   usage
-  exit $([[ $# -eq 1 ]] && 0 || 2)
+  exit 0
+fi
+if [[ $# -ne 1 ]]; then
+  usage
+  exit 2
 fi
 
 input="$1"
diff --git a/scripts/delta-admission-one-step.sh b/scripts/delta-admission-one-step.sh
index cb327fc..5d5748f 100755
--- a/scripts/delta-admission-one-step.sh
+++ b/scripts/delta-admission-one-step.sh
@@ -95,6 +95,7 @@ timed_run() {
   local stderr_file="$3"
   shift 3
 
+  # shellcheck disable=SC2016 # variables are intentionally expanded by the inner bash
   /usr/bin/time -l bash -c '
     stdout_file="$1"
     stderr_file="$2"
diff --git a/scripts/delta-admission-sequential.sh b/scripts/delta-admission-sequential.sh
index 331b79b..1774cc0 100755
--- a/scripts/delta-admission-sequential.sh
+++ b/scripts/delta-admission-sequential.sh
@@ -92,6 +92,7 @@ timed_run() {
   local stderr_file="$3"
   shift 3
 
+  # shellcheck disable=SC2016 # variables are intentionally expanded by the inner bash
   /usr/bin/time -l bash -c '
     stdout_file="$1"
     stderr_file="$2"
