perf(index): bulk conjunction path for FTS AND and phrase queries by BubbleCal · Pull Request #7624 · lance-format/lance

BubbleCal · 2026-07-04T12:31:36Z

Stacked on #7604 (base of this PR); part of the Lucene-parity series after #7600-#7605.

What

Top-k AND and phrase queries previously leapfrogged doc-at-a-time through boxed PostingIterator::next calls — 61% of the AND profile went to per-doc advance machinery (~25-40ns per advance). Phrase checks additionally decoded a whole 256-doc position block per candidate (39% of the phrase profile) and allocated cursor vectors per candidate.

This adds a bulk conjunction path (and_bulk_search, default on, LANCE_FTS_BULK_AND=0 opts back into the classic loop):

- Same block-max window pruning as the classic loop, but candidates come from a k-pointer merge over the decompressed block slices. The window ends at the nearest next-block boundary across clauses, so each clause contributes exactly one block per window and the merge runs on plain u32 slices.
- Two-pass batched scoring: the merge only records (doc, per-clause offsets); doc lengths are then gathered back-to-back (so cache misses overlap), and the prune/verify/score/insert pass replays the classic loop's exact semantics.
- seek_packed_doc_positions: PackedDelta position groups are self-describing ([num_bits u8][16*num_bits bytes]), so group offsets are recovered by hopping headers. A phrase candidate decodes only the 1-2 groups overlapping its own doc instead of the whole block, using a lazily built group index and a decoded-tail cache. No format change.
- check_exact_positions_bulk: allocation-free slop=0 alignment check over the decoded scratch slices.
- The doc-length gather goes through DocSet::scoring_num_tokens, so V3 partitions score with the quantized lengths that #7466 defines. (The impact-score-cache dead-code cleanup originally carried here moved to #7603, where the code actually becomes dead.)
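To make the first two bullets concrete, here is a minimal sketch of a k-pointer merge over one window of decompressed blocks. It is not the code in this PR: `Candidate` and `merge_window` are hypothetical names, block-max pruning and scoring are left out, and the per-candidate `Vec` clone is only there to keep the sketch short.

```rust
/// One surviving candidate: the doc id plus, for each clause, the offset of
/// that doc inside the clause's decompressed block slice.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq)]
pub struct Candidate {
    pub doc: u32,
    pub offsets: Vec<usize>,
}

/// Intersect k sorted, duplicate-free doc-id slices (one block per clause).
/// Only (doc, per-clause offsets) is recorded; scoring would be a later pass.
pub fn merge_window(blocks: &[&[u32]]) -> Vec<Candidate> {
    let mut out = Vec::new();
    if blocks.is_empty() {
        return out;
    }
    let mut cursors = vec![0usize; blocks.len()];
    'outer: loop {
        // The next possible match is the largest of the current heads.
        // If any clause is exhausted, the window is done.
        let mut target = 0u32;
        for (block, &cur) in blocks.iter().zip(&cursors) {
            match block.get(cur) {
                Some(&doc) => target = target.max(doc),
                None => break 'outer,
            }
        }
        // Advance every cursor to the first doc >= target.
        let mut all_match = true;
        for (block, cur) in blocks.iter().zip(cursors.iter_mut()) {
            while *cur < block.len() && block[*cur] < target {
                *cur += 1;
            }
            if *cur == block.len() {
                break 'outer;
            }
            if block[*cur] != target {
                all_match = false;
            }
        }
        if all_match {
            out.push(Candidate { doc: target, offsets: cursors.clone() });
            // Step past the match in every clause.
            for cur in cursors.iter_mut() {
                *cur += 1;
            }
        }
    }
    out
}
```

The recorded offsets are what a second pass would use to look up per-clause frequencies and gather doc lengths in one batch, instead of touching them while the merge is still running.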

Results (mmlb-200m warm, 8 concurrent, vs Lucene 10.4)

| query | before | after |
| --- | --- | --- |
| AND k10 @200M | 0.114s / 69 qps | 0.060s / 132 qps |
| AND k100 @200M | 0.240s / 33 qps | 0.118s / 67 qps |
| phrase 3w k10 @50m | 0.335s / 23.8 qps | 0.221s / 36.1 qps |
| phrase 2w k10 @50m | 0.098s / 81.6 qps | 0.048s / 165 qps |

Verification

- Results are identical to the classic loop: bulk-vs-classic A/B on 200 query×k pairs, score_diff=0 (AND @200M and phrase @50m).
- New rstest parity suite runs both paths on multi-block corpora (2/3/4 clauses × AND/phrase × k) and asserts identical candidates; an encoding roundtrip test covers per-doc seek vs whole-block decode for every doc across tail/group-straddling shapes.
- cargo test -p lance-index, clippy -D warnings, fmt --check clean.
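For readers unfamiliar with the per-doc seek that the roundtrip test exercises, the header-hopping idea can be sketched as below. This is an illustration under stated assumptions, not the PR's seek_packed_doc_positions: `group_offsets`, `groups_for_range`, and `GROUP_LEN` are hypothetical names, the 128-values-per-group figure is inferred from the 16*num_bits payload size (16 bytes per bit of width is 128 values), and the partial tail group is not handled.

```rust
/// Values per full group, inferred from the payload size:
/// 16 * num_bits bytes = 128 * num_bits bits = 128 values of num_bits each.
pub const GROUP_LEN: usize = 128;

/// Recover the byte offset of each full group by hopping headers.
/// Assumed layout per group: one `num_bits` header byte followed by
/// `16 * num_bits` payload bytes. Returns None if `buf` is too short.
pub fn group_offsets(buf: &[u8], num_groups: usize) -> Option<Vec<usize>> {
    let mut offsets = Vec::with_capacity(num_groups);
    let mut pos = 0usize;
    for _ in 0..num_groups {
        let num_bits = *buf.get(pos)? as usize;
        let next = pos + 1 + 16 * num_bits;
        if next > buf.len() {
            return None;
        }
        offsets.push(pos);
        pos = next;
    }
    Some(offsets)
}

/// Indices of the groups that overlap the half-open value range
/// [start, end) of a block's position stream.
pub fn groups_for_range(start: usize, end: usize) -> std::ops::Range<usize> {
    let first = start / GROUP_LEN;
    if end <= start {
        return first..first;
    }
    first..((end - 1) / GROUP_LEN + 1)
}
```

A doc whose positions occupy values 120..130 of the stream overlaps groups 0 and 1, so only those two groups would need unpacking; the offsets from `group_offsets` say where their bytes start.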

🤖 Generated with Claude Code


AND and phrase queries previously leapfrogged doc-at-a-time through boxed PostingIterator::next calls (~61% of the AND profile) and phrase checks decoded a whole 256-doc position block per candidate (~39% of the phrase profile).

- and_bulk_search: block-max window pruning plus a k-pointer merge over decompressed block slices; per-candidate advance cost drops to a few loads. Results are identical to the classic loop (LANCE_FTS_BULK_AND=0 opts out). Phrase queries ride the same path.
- seek_packed_doc_positions: PackedDelta full groups are self-describing ([num_bits u8][16*num_bits bytes]), so group offsets are recovered by hopping headers; decode only the 1-2 groups overlapping the candidate doc's delta range, with a lazily-built group index, memoized unpacked group, and a decoded-tail cache per block.
- check_exact_positions_bulk: allocation-free slop=0 alignment check on the decoded scratch slices for parked lead clauses.

Warm mmlb benchmarks, 8 concurrent queries: AND @200M k10 0.114->0.060s, k100 0.240->0.118s; phrase @50m 3-word k10 0.335->0.210s, 2-word k10 0.098->0.042s. All steps verified score-identical to the classic path.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
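As an illustration of what a slop=0 alignment check over decoded position slices can look like, here is a minimal sketch. It is not the PR's check_exact_positions_bulk: `has_exact_phrase` is a hypothetical name, it assumes clause i sits at query offset i, and it takes a caller-owned cursor scratch slice so the check itself does not allocate.

```rust
/// Returns true if some position p exists such that p + i appears in
/// positions[i] for every clause i (exact phrase, slop = 0).
/// Each slice must be sorted ascending. `cursors` is caller-owned scratch
/// with one slot per clause, so no allocation happens here.
pub fn has_exact_phrase(positions: &[&[u32]], cursors: &mut [usize]) -> bool {
    assert_eq!(positions.len(), cursors.len());
    cursors.fill(0);
    let Some((lead, rest)) = positions.split_first() else {
        return false;
    };
    'lead: for &p in lead.iter() {
        for (i, (list, cur)) in rest.iter().zip(cursors[1..].iter_mut()).enumerate() {
            // Clause i + 1 must appear exactly i + 1 positions after the lead.
            let Some(want) = p.checked_add(i as u32 + 1) else {
                return false;
            };
            // Cursors only move forward because lead positions are ascending.
            while *cur < list.len() && list[*cur] < want {
                *cur += 1;
            }
            if *cur == list.len() {
                // This clause has nothing at or past `want`; no later lead
                // position can match either.
                return false;
            }
            if list[*cur] != want {
                continue 'lead;
            }
        }
        return true;
    }
    false
}
```

Because every cursor only advances, the whole check is linear in the total number of positions for the candidate doc.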

github-actions bot added the A-index (Vector index, linalg, tokenizer) and performance labels Jul 4, 2026

BubbleCal mentioned this pull request Jul 4, 2026

perf(index): vectorized merge kernels and frequency-bound pruning for FTS conjunctions #7625

Open

BubbleCal force-pushed the yang/lan2-88-fts-maxscore-default branch from 4732787 to fec0a88 Compare July 4, 2026 18:17

BubbleCal force-pushed the yang/lan2-88-fts-bulk-conjunction branch from fc1af4b to 22e3522 Compare July 4, 2026 18:17

This was referenced Jul 4, 2026

feat(index)!: quantized doc-length scoring and slimmer posting blocks for FTS v3 #7626

Closed

feat(fts)!: add configurable posting block size #7466

Open

BubbleCal wants to merge 1 commit into yang/lan2-88-fts-maxscore-default from yang/lan2-88-fts-bulk-conjunction
