perf(index): bulk conjunction path for FTS AND and phrase queries #7624

Open
BubbleCal wants to merge 1 commit into yang/lan2-88-fts-maxscore-default from yang/lan2-88-fts-bulk-conjunction
BubbleCal (Contributor) commented Jul 4, 2026

Stacked on #7604 (base of this PR); part of the Lucene-parity series after #7600-#7605.

What

Top-k AND and phrase queries previously leapfrogged doc-at-a-time through boxed PostingIterator::next calls — 61% of the AND profile went to per-doc advance machinery (~25-40ns per advance). Phrase checks additionally decoded a whole 256-doc position block per candidate (39% of the phrase profile) and allocated cursor vectors per candidate.

This adds a bulk conjunction path (and_bulk_search, default on, LANCE_FTS_BULK_AND=0 opts back into the classic loop):

  • Same block-max window pruning as the classic loop, but candidates come from a k-pointer merge over the decompressed block slices — the window ends at the nearest next-block boundary across clauses, so each clause contributes exactly one block per window and the merge runs on plain u32 slices.
  • Two-pass batched scoring: the merge only records (doc, per-clause offsets); doc lengths are gathered back-to-back (cache misses overlap) and the prune/verify/score/insert pass replays the classic loop's exact semantics.
  • seek_packed_doc_positions: PackedDelta position groups are self-describing ([num_bits u8][16*num_bits bytes]), so group offsets are recovered by hopping headers — a phrase candidate decodes only the 1-2 groups overlapping its own doc instead of the whole block, with a lazily-built group index and a decoded-tail cache. No format change.
  • check_exact_positions_bulk: allocation-free slop=0 alignment check over the decoded scratch slices.
  • The doc-length gather goes through DocSet::scoring_num_tokens, so V3 partitions score with the quantized lengths that #7466 ("feat(fts)!: add configurable posting block size") defines. (The impact-score-cache dead-code cleanup originally carried here moved to #7603 ("perf(fts): bulk MAXSCORE search path for top-k disjunctions"), where the code actually becomes dead.)
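
For readers unfamiliar with the k-pointer merge named in the first bullet, here is a minimal sketch of the idea over plain sorted u32 slices. It is an illustration only, not the code of and_bulk_search: the function name intersect_blocks and its return shape are hypothetical, and it assumes each slice is one clause's decompressed doc ids for the current window, sorted and deduplicated. It records only (doc, per-clause offsets), in the spirit of the first pass described above.

```rust
/// Hypothetical helper (not the PR's code): intersect k sorted, deduplicated
/// doc-id slices and return each common doc id together with the offset at
/// which it was found in every clause.
fn intersect_blocks(blocks: &[&[u32]]) -> Vec<(u32, Vec<usize>)> {
    let mut out = Vec::new();
    if blocks.is_empty() || blocks.iter().any(|b| b.is_empty()) {
        return out;
    }
    let mut cursors = vec![0usize; blocks.len()];
    // Start from the largest first doc id: nothing smaller can match.
    let mut target = blocks.iter().map(|b| b[0]).max().unwrap();
    'outer: loop {
        let mut all_match = true;
        for (i, block) in blocks.iter().enumerate() {
            // Advance this clause to the first doc id >= target.
            while cursors[i] < block.len() && block[cursors[i]] < target {
                cursors[i] += 1;
            }
            if cursors[i] == block.len() {
                break 'outer; // one clause is exhausted, so the window is done
            }
            if block[cursors[i]] > target {
                // Overshot: this doc id becomes the new candidate.
                target = block[cursors[i]];
                all_match = false;
                break;
            }
        }
        if all_match {
            out.push((target, cursors.clone()));
            // Step past the match in the first clause and continue from there.
            cursors[0] += 1;
            if cursors[0] == blocks[0].len() {
                break;
            }
            target = blocks[0][cursors[0]];
        }
    }
    out
}
```

This sketch allocates an offsets vector per match for clarity; a production path would presumably write into reusable scratch buffers instead.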

Results (mmlb-200m warm, 8 concurrent, vs Lucene 10.4)

| query | before | after |
| --- | --- | --- |
| AND k10 @200M | 0.114s / 69 qps | 0.060s / 132 qps |
| AND k100 @200M | 0.240s / 33 qps | 0.118s / 67 qps |
| phrase 3w k10 @50m | 0.335s / 23.8 qps | 0.221s / 36.1 qps |
| phrase 2w k10 @50m | 0.098s / 81.6 qps | 0.048s / 165 qps |

Verification

  • Results are identical to the classic loop: bulk-vs-classic A/B on 200 query×k pairs, score_diff=0 (AND @200M and phrase @50m).
  • New rstest parity suite runs both paths on multi-block corpora (2/3/4 clauses × AND/phrase × k) and asserts identical candidates; encoding roundtrip test covers per-doc seek vs whole-block decode for every doc across tail/group-straddling shapes.
  • cargo test -p lance-index, clippy -D warnings, fmt --check clean.
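
The per-doc seek exercised by the roundtrip test depends on recovering group offsets by hopping headers. A minimal sketch of that step, assuming a buffer that holds only full groups laid out as [num_bits u8][16*num_bits bytes]; the name group_offsets is hypothetical, and the tail handling, lazy group index, and caches from the PR are not modeled:

```rust
/// Hypothetical helper (not the PR's code): return the start offset of every
/// group in a buffer of `[num_bits: u8][16 * num_bits bytes]` groups.
/// Returns `None` if the buffer ends in the middle of a group.
fn group_offsets(buf: &[u8]) -> Option<Vec<usize>> {
    let mut offsets = Vec::new();
    let mut pos = 0usize;
    while pos < buf.len() {
        offsets.push(pos);
        // The header byte alone determines the size of the payload.
        let num_bits = buf[pos] as usize;
        pos = pos.checked_add(1 + 16 * num_bits)?;
        if pos > buf.len() {
            return None; // truncated group
        }
    }
    Some(offsets)
}
```

With the offsets in hand, a seek can jump straight to the group that contains a given delta index and unpack only that group.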

🤖 Generated with Claude Code

AND and phrase queries previously leapfrogged doc-at-a-time through
boxed PostingIterator::next calls (~61% of the AND profile) and phrase
checks decoded a whole 256-doc position block per candidate (~39% of
the phrase profile).

- and_bulk_search: block-max window pruning plus a k-pointer merge over
  decompressed block slices; per-candidate advance cost drops to a few
  loads. Results are identical to the classic loop (LANCE_FTS_BULK_AND=0
  opts out). Phrase queries ride the same path.
- seek_packed_doc_positions: PackedDelta full groups are self-describing
  ([num_bits u8][16*num_bits bytes]), so group offsets are recovered by
  hopping headers; decode only the 1-2 groups overlapping the candidate
  doc's delta range, with a lazily-built group index, memoized unpacked
  group, and a decoded-tail cache per block.
- check_exact_positions_bulk: allocation-free slop=0 alignment check on
  the decoded scratch slices for parked lead clauses.
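
A slop=0 check like the one in the last item can be sketched without allocating, assuming each query term's positions for the candidate doc are already decoded into a sorted slice and the query terms are consecutive. The name has_exact_phrase is hypothetical and this is not the code of check_exact_positions_bulk:

```rust
/// Hypothetical helper (not the PR's code): true if some position `p` of the
/// first term has `p + i` among the positions of the i-th term for every i.
/// Slices must be sorted ascending; no heap allocation is performed.
fn has_exact_phrase(terms: &[&[u32]]) -> bool {
    let Some((lead, rest)) = terms.split_first() else {
        return false;
    };
    lead.iter().any(|&p| {
        rest.iter().enumerate().all(|(i, positions)| {
            // Term i+1 of the phrase must sit exactly i+1 slots after the lead.
            match p.checked_add(i as u32 + 1) {
                Some(want) => positions.binary_search(&want).is_ok(),
                None => false,
            }
        })
    })
}
```

Binary search keeps the sketch short; a merge-style scan with one cursor per term is an alternative when position lists are short.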

Warm mmlb benchmarks, 8 concurrent queries: AND@200M k10 0.114->0.060s,
k100 0.240->0.118s; phrase@50m 3-word k10 0.335->0.210s, 2-word k10
0.098->0.042s. All steps verified score-identical to the classic path.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Labels

A-index Vector index, linalg, tokenizer performance