Add stats-v1 delta admission and branch-aware Git indexing #1050
Open
ebrevdo wants to merge 10 commits into sourcegraph:main from
Add opt-in stats-v1 delta admission using persistent repository delta stats, JSONL decision logging, and calibration scripts. Keep empty admission mode on the historical behavior while letting stats-v1 use write-mass and read-pressure gates. Co-authored-by: Codex <noreply@openai.com>
Add desired-future tests for branch-set delta updates across exact multi-branch updates, renames, additions, removals, combined changes, HEAD/worktree resolution, ambiguity handling, stats, and query surfaces. Most branch-set change tests are expected to fail until explicit branch mapping is implemented. Co-authored-by: Codex <noreply@openai.com>
Support delta indexing across branch-set changes by explicitly permitting branch metadata rewrites only when gitindex prepares a full-live delta layer. The conservative path tombstones all old live paths and re-adds the final live branch set, giving correct branch-filtered lookups before optimizing branch mappings. Co-authored-by: Codex <noreply@openai.com>
Add branch-count and branch-mapping fields to stats-v1 admission JSONL records, and update HEAD resolution tests for the branch-set delta behavior and existing branch:HEAD alias semantics. Co-authored-by: Codex <noreply@openai.com>
Add an explicit gitindex option for branch-set-changing delta updates and restore the default branch-set mismatch fallback. Desired-future branch-set tests opt in, while ResolveHEADToBranch continues to govern HEAD naming behavior. Co-authored-by: Codex <noreply@openai.com>
Document and test that AllowDeltaBranchSetChange is the explicit opt-in provenance for unmatched branch-slot updates. With the flag set, unmatched branch sets use the conservative full-live delta path; structurally unsafe cases remain guarded separately. Co-authored-by: Codex <noreply@openai.com>
Preserve RepositoryDeltaStats across webserver protobuf conversion and keep the worktree test helper from overriding origin-derived repository identity. Co-authored-by: Codex <noreply@openai.com>
Use positional branch mapping for same-length branch-set changes so near-identical HEAD branch switches and branch renames only rewrite affected paths. Expose the required experimental git-index flags through the CLI and indexserver command builder. Co-authored-by: Codex <noreply@openai.com>
Allow zoekt-git-index to accept multiple linked worktree paths and index their attached HEAD branches together as one multi-branch repository. Co-authored-by: Codex <noreply@openai.com>
Member
@ebrevdo thanks for the contribution. I will take a deeper review soon. Out of interest, is this contribution motivated by you using zoekt on your desktop, or are you running this in a server environment? Based on all the scripts looks like you use the local zoekt binary a bunch, but maybe that was just for testing?
Your PR is marked that I can push to the branch, but for some reason github won't let me. This patch should resolve your CI issues:
diff --git a/grpc/protos/zoekt/webserver/v1/webserver.pb.go b/grpc/protos/zoekt/webserver/v1/webserver.pb.go
index bc2091f..061ffea 100644
--- a/grpc/protos/zoekt/webserver/v1/webserver.pb.go
+++ b/grpc/protos/zoekt/webserver/v1/webserver.pb.go
@@ -1,7 +1,7 @@
// Code generated by protoc-gen-go. DO NOT EDIT.
// versions:
// protoc-gen-go v1.29.1
-// protoc v6.33.0
+// protoc (unknown)
// source: zoekt/webserver/v1/webserver.proto
package v1
diff --git a/scripts/analyze-query-stats-stacked-vs-clean.sh b/scripts/analyze-query-stats-stacked-vs-clean.sh
index 5e98f1c..b2a842e 100755
--- a/scripts/analyze-query-stats-stacked-vs-clean.sh
+++ b/scripts/analyze-query-stats-stacked-vs-clean.sh
@@ -11,9 +11,13 @@ delta vs. full rebuild thresholds.
EOF
}
-if [[ $# -ne 1 || "${1:-}" == "-h" || "${1:-}" == "--help" ]]; then
+if [[ "${1:-}" == "-h" || "${1:-}" == "--help" ]]; then
usage
- exit $([[ $# -eq 1 ]] && 0 || 2)
+ exit 0
+fi
+if [[ $# -ne 1 ]]; then
+ usage
+ exit 2
fi
input="$1"
diff --git a/scripts/delta-admission-one-step.sh b/scripts/delta-admission-one-step.sh
index cb327fc..5d5748f 100755
--- a/scripts/delta-admission-one-step.sh
+++ b/scripts/delta-admission-one-step.sh
@@ -95,6 +95,7 @@ timed_run() {
local stderr_file="$3"
shift 3
+ # shellcheck disable=SC2016 # variables are intentionally expanded by the inner bash
/usr/bin/time -l bash -c '
stdout_file="$1"
stderr_file="$2"
diff --git a/scripts/delta-admission-sequential.sh b/scripts/delta-admission-sequential.sh
index 331b79b..1774cc0 100755
--- a/scripts/delta-admission-sequential.sh
+++ b/scripts/delta-admission-sequential.sh
@@ -92,6 +92,7 @@ timed_run() {
local stderr_file="$3"
shift 3
+ # shellcheck disable=SC2016 # variables are intentionally expanded by the inner bash
/usr/bin/time -l bash -c '
stdout_file="$1"
stderr_file="$2"
Experimental stats-v1 delta admission and advisory repository stats
This adds an opt-in stats-v1 delta admission mode that decides whether to accept a delta build from measured write mass and accumulated read debt, rather than relying only on the previous fixed shard-count threshold. The empty mode preserves historical behavior. When stats-v1 is enabled, Zoekt computes the size of the candidate delta, the live indexed corpus, physical bytes already present in stacked shards, tombstone pressure, and shard fanout before either accepting the delta or falling back to a normal rebuild.
The new stats are advisory metadata, not correctness metadata. Full builds seed DeltaStats, accepted deltas update those stats forward, and older indexes without stats are backfilled from current Git trees plus existing shard statistics where possible. Correctness gates such as option-hash mismatch, unsupported ignore-file changes, unsupported submodules, and unsupported compound shard cases remain separate from the cost gates, so a bad or missing stats path falls back instead of weakening search correctness.
The admission path also writes structured JSONL decision logs for calibration. Each row includes the mode, accept/reject result, reason, candidate bytes/docs/paths, live and physical counters, tombstone ratio inputs, shard fanout, thresholds, and branch mapping context when branch sets change. This gives operators a stable data source for threshold tuning without scraping stderr.
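The cost gates can be pictured as a short chain of ratio checks where the first failure, or any missing input, falls back to a normal rebuild. The following is a minimal sketch under stated assumptions, not the code in gitindex/index.go: the type names, field names, ratio definitions (in particular the denominators), threshold values, and JSON keys are all hypothetical, and only the gate order follows the PR description.

```go
package main

import (
	"encoding/json"
	"os"
)

// Stats holds hypothetical inputs to the admission decision.
type Stats struct {
	CandidateBytes uint64 // indexed bytes in the candidate delta
	LiveBytes      uint64 // live indexed bytes in the corpus
	PhysicalBytes  uint64 // bytes already present in stacked shards
	TombstonePaths uint64 // paths tombstoned across delta layers
	LivePaths      uint64 // live paths in the corpus
	ShardCount     uint64 // shards after accepting the delta
	BaselineShards uint64 // assumed shard count of a clean full build
}

// Thresholds are illustrative upper bounds, one per gate.
type Thresholds struct {
	MaxCandidateRatio    float64
	MaxPhysicalLiveRatio float64
	MaxTombstoneRatio    float64
	MaxFanoutRatio       float64
}

// Decision is one hypothetical JSONL log row.
type Decision struct {
	Mode   string `json:"mode"`
	Accept bool   `json:"accept"`
	Reason string `json:"reason"`
}

// decide checks the gates in write-mass-first order and stops at the
// first gate that fails or cannot be evaluated.
func decide(s Stats, t Thresholds) Decision {
	gates := []struct {
		name     string
		num, den uint64
		limit    float64
	}{
		{"candidate_bytes_ratio", s.CandidateBytes, s.LiveBytes, t.MaxCandidateRatio},
		{"next_physical_live_ratio", s.PhysicalBytes + s.CandidateBytes, s.LiveBytes, t.MaxPhysicalLiveRatio},
		{"tombstone_path_ratio", s.TombstonePaths, s.LivePaths, t.MaxTombstoneRatio},
		{"shard_fanout_ratio", s.ShardCount, s.BaselineShards, t.MaxFanoutRatio},
	}
	for _, g := range gates {
		if g.den == 0 {
			// Missing or unusable stats never weaken correctness: fall back.
			return Decision{Mode: "stats-v1", Accept: false, Reason: g.name + ": missing stats"}
		}
		if float64(g.num)/float64(g.den) > g.limit {
			return Decision{Mode: "stats-v1", Accept: false, Reason: g.name + ": over threshold"}
		}
	}
	return Decision{Mode: "stats-v1", Accept: true, Reason: "within thresholds"}
}

func main() {
	d := decide(
		Stats{CandidateBytes: 10, LiveBytes: 100, PhysicalBytes: 100,
			TombstonePaths: 1, LivePaths: 100, ShardCount: 2, BaselineShards: 1},
		Thresholds{MaxCandidateRatio: 0.5, MaxPhysicalLiveRatio: 2, MaxTombstoneRatio: 0.5, MaxFanoutRatio: 4},
	)
	// Emit the decision as a single JSON line, as a JSONL log would.
	_ = json.NewEncoder(os.Stdout).Encode(d)
}
```

Keeping the gates in a table makes the evaluation order explicit and lets each rejection carry the name of the gate that triggered it, which is the kind of reason string a calibration log needs.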
api.go adds RepositoryDeltaStats to zoekt.Repository with live indexed bytes, live document/path counts, physical bytes/docs, tombstone path count, delta layer count, and last full-index time.
grpc/protos/zoekt/webserver/v1/webserver.proto and api_proto.go add proto support for the transferable delta-stats counters, while api_test.go covers JSON compatibility and MergeMutable behavior.
gitindex/index.go adds DeltaAdmissionModeStatsV1, DeltaAdmissionThresholds, default threshold values, mode validation, candidate/live/physical stats helpers, old-index backfill, and the prepareDeltaStatsV1 admission gate.
gitindex/index.go evaluates the stats-v1 gates in write-mass-first order: candidate indexed bytes ratio, next physical/live bytes ratio, tombstone path ratio, and shard fanout ratio.
index/builder.go preserves accepted delta stats by copying RepositoryDescription.DeltaStats into old shard .meta files during delta metadata rewrites.
cmd/zoekt-git-index/main.go wires -delta_admission_mode and -delta_admission_log_json; cmd/zoekt-sourcegraph-indexserver/main.go validates DELTA_ADMISSION_MODE, and cmd/zoekt-sourcegraph-indexserver/index.go passes the mode through to zoekt-git-index.
gitindex/index_test.go, index/builder_test.go, cmd/zoekt-git-index/main_test.go, and cmd/zoekt-sourcegraph-indexserver/*_test.go cover stats seeding, old-index backfill, debt-based fallback, empty-mode compatibility, builder metadata propagation, CLI wiring, and indexserver env parsing.
Branch-set, HEAD, and linked-worktree delta indexing
This makes branch identity explicit for Git indexes that use HEAD or linked worktrees. ResolveHEADToBranch resolves an attached HEAD to the short branch name stored in repository metadata, while detached HEAD still stays as HEAD. That avoids treating two attached worktrees as an anonymous HEAD branch and makes subsequent delta decisions compare the actual branch set.
Branch-set changes remain opt-in. Without AllowDeltaBranchSetChange, a branch-set mismatch keeps the old behavior and falls back to a normal rebuild. With the flag enabled, the delta path can rewrite old shard metadata and tombstone old live paths before adding the final live content for the new branch set. The implementation also includes a smarter branch-set path for same-shape branch switches and renames, where only affected paths are reindexed instead of rewriting an unchanged large corpus.
The CLI grows a -worktrees mode for indexing multiple linked worktrees as one repository. The mode infers each worktree's attached branch, requires all arguments to share one Git common directory, rejects explicit -branches, and rejects detached worktrees. It then indexes the inferred branch set through the same resolved-branch metadata path, so branch-specific queries see each worktree under its real branch name.
gitindex/index.go adds ResolveHEADToBranch, AllowDeltaBranchSetChange, expandBranchesForOptions, resolveHEADBranchName, and order-preserving branch de-duplication for resolved or branch-set-aware modes.
gitindex/index.go adds prepareBranchSetDeltaBuild, prepareSmartBranchSetDeltaBuild, branch-set validation, indexed-path tombstone collection, and early compound-shard rejection for stats-v1 delta builds.
index/builder.go adds AllowDeltaBranchSetChange to builder options so gitindex can explicitly authorize branch metadata rewrites while default builder behavior remains strict.
cmd/zoekt-git-index/main.go adds -resolve_head_to_branch, -allow_delta_branch_set_change, and -worktrees, plus worktree common-dir and attached-branch discovery helpers.
cmd/zoekt-sourcegraph-indexserver/index.go carries the resolved-HEAD and branch-set-change flags into the child zoekt-git-index invocation when those options are enabled.
gitindex/delta_multibranch_*_test.go, gitindex/delta_smart_branchset_stats_test.go, and the expanded gitindex/index_test.go cover baseline branch-set compatibility, branch renames, branch adds/removes, combined branch-set changes, HEAD/worktree transitions, duplicate or ambiguous branch mappings, missing commits, compound-shard safety, stats after branch changes, and query-surface equivalence against clean rebuilds.
cmd/zoekt-git-index/main_test.go verifies that -worktrees produces one repository spec, infers attached branch names, rejects explicit branches, indexes both worktree branches, and keeps branch-filtered searches isolated.
Calibration tooling and regression coverage
The PR adds repeatable calibration scripts for understanding where stats-v1 should accept deltas and where it should force compaction through a full rebuild. These scripts run one-step and sequential history replays, capture JSONL admission decisions, collect timing and resource data, and compare stacked-delta query behavior against clean full indexes. They are intentionally separate from production code so threshold experiments can be repeated on disposable worktrees without changing runtime behavior.
The test surface is broadened around the new behavior rather than just asserting happy paths. There are compatibility tests for the default empty mode, explicit fallback tests for high write mass or unsafe branch mappings, tests that old indexes can be backfilled and then write stats forward, and tests that accepted branch-set deltas produce the same visible branch/query surface as a clean rebuild. Sourcegraph indexserver tests cover passing the experimental flags through the process boundary.
The generated protobuf file is updated with the new repository stats field, and the committed tests document the intended safety envelope for future work. In particular, the coverage distinguishes cheap branch switches and renames from branch additions or reorders that still need the conservative fallback path, and it records which cases are structural safety failures rather than tunable cost decisions.
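The distinction between cheap branch switches or renames and changes that need the conservative path can be illustrated with a small classifier over old and new branch lists. This is a hypothetical sketch, not the logic in prepareSmartBranchSetDeltaBuild: the `classify` function, the `Plan` values, and the exact rules (same length means positional mapping; reorders, duplicates, additions, and removals fall back) are assumptions drawn from the prose.

```go
package main

import "fmt"

// Plan names the path a branch-set change would take in this sketch.
type Plan string

const (
	Unchanged        Plan = "unchanged"
	PositionalRemap  Plan = "positional-remap"
	ConservativePath Plan = "conservative-fallback"
)

// classify compares branch lists slot by slot. Same-length lists whose
// differing slots hold names that are new to the set are treated as
// renames or switches; everything else takes the conservative path.
func classify(old, next []string) (Plan, map[string]string) {
	if len(old) != len(next) {
		return ConservativePath, nil // branch added or removed
	}
	oldNames := make(map[string]bool, len(old))
	for _, b := range old {
		oldNames[b] = true
	}
	seen := make(map[string]bool, len(next))
	remap := make(map[string]string)
	for i, name := range next {
		if seen[name] {
			return ConservativePath, nil // duplicate name: ambiguous mapping
		}
		seen[name] = true
		if old[i] == name {
			continue // slot unchanged
		}
		if oldNames[name] {
			return ConservativePath, nil // existing name moved slots: reorder
		}
		remap[old[i]] = name
	}
	if len(remap) == 0 {
		return Unchanged, nil
	}
	return PositionalRemap, remap
}

func main() {
	plan, remap := classify([]string{"main", "dev"}, []string{"main", "feature"})
	fmt.Println(plan, remap)
}
```

Only the slots listed in the returned map would need their paths rewritten, which is what makes a near-identical branch switch cheaper than tombstoning and re-adding the whole live set.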
scripts/delta-admission-one-step.sh runs a focused full-then-delta experiment for one commit transition and emits admission logs for inspection.
scripts/delta-admission-sequential.sh replays a commit range through one persistent index stack, recording each stats-v1 accept/fallback decision and timing/resource data.
scripts/query-stats-stacked-vs-clean.sh builds forced stacked and clean indexes at each step, runs a selected query through both, and records side-by-side query statistics.
scripts/analyze-query-stats-stacked-vs-clean.sh post-processes query-stat output so stacked-vs-clean ratios can inform shard fanout and physical/live threshold choices.
grpc/protos/zoekt/webserver/v1/webserver.pb.go is regenerated for the new Repository.delta_stats proto field.
Usecase
Here's how I use this:
and my local zoekt binary: