Skip to content

feat(cache): support batch get-or-insert single-flight#7523

Closed
zouhuajian wants to merge 1 commit into
lance-format:mainfrom
zouhuajian:feat/cache-get-or-insert-batch
Closed

feat(cache): support batch get-or-insert single-flight#7523
zouhuajian wants to merge 1 commit into
lance-format:mainfrom
zouhuajian:feat/cache-get-or-insert-batch

Conversation

@zouhuajian

@zouhuajian zouhuajian commented Jun 30, 2026

Copy link
Copy Markdown

Feature

What is the new feature?

This adds backend-level batch get-or-insert support to Lance cache primitives.

The new API allows callers to provide one batch loader for multiple missing keys while preserving backend-owned per-key single-flight semantics. This is intended as cache-layer groundwork for callers such as coalesced BTree page reads, where we need both:

  • coalesced loading: one loader can load multiple missing pages together
  • single-flight: concurrent cold overlapping requests do not duplicate loading for the same key

Related to #6759.

Why do we need this feature?

CacheBackend::get_or_insert currently supports one key at a time. That works for single-key cache misses because the backend can deduplicate concurrent loaders for the same key.

For batch-oriented callers, neither existing pattern provides both desired properties:

  • calling single-key get_or_insert in a loop preserves single-flight, but loses batch/coalesced loading
  • doing manual get + batch load + insert preserves batch loading, but bypasses backend single-flight

Under concurrent cold overlapping range reads, the manual batch pattern can duplicate page IO and materialization for the same logical cache keys.

How does it work?

  • Adds low-level batch cache types and API:
    • CacheLoadedEntry
    • CacheBatchEntry
    • CacheBatchLoader
    • CacheBackend::get_or_insert_many
  • Adds typed cache APIs:
    • LanceCache::get_or_insert_with_key_batch
    • WeakLanceCache::get_or_insert_with_key_batch
    • CacheBatchValue<T>
  • Implements true batch single-flight in MokaCacheBackend.
  • Uses one shared in-flight registry for both single-key and batch get-or-insert paths, so mixed callers still coalesce on the same key.
  • Ensures the batch loader only receives missing keys currently owned by this caller.
  • Preserves input key order in returned values.
  • Rejects duplicate input keys with InvalidInput.
  • Validates loader output exactly:
    • missing returned key => InvalidInput
    • extra returned key => InvalidInput
    • duplicate returned key => InvalidInput
  • Clarifies was_cached semantics:
    • true means this call did not run the loader for that key
    • this includes ordinary cache hits and values produced by another in-flight owner
  • Keeps the default trait implementation as a compatibility fallback that calls single-key get_or_insert one key at a time.

Compatibility

This is an additive cache API change.

Existing single-key get_or_insert behavior remains compatible. The default get_or_insert_many implementation preserves existing backend single-key semantics for custom backends, but does not preserve coalesced batch loading unless a backend overrides it.

This PR does not change BTree query logic directly. It only adds the cache primitive needed by future multi-page miss paths.

Testing

Local verification run on commit 50dd286d2:

  • cargo fmt --all -- --check
  • cargo test -p lance-core cache
    • 38 passed
  • cargo test -p lance-core --doc cache
    • 3 passed, 2 ignored
  • cargo test -p lance-core --bench cache_batch -- --test
    • all benchmark scenarios reported Success
  • cargo clippy -p lance-core --all-targets -- -D warnings
  • cargo bench -p lance-core --bench cache_batch -- --noplot

Benchmarks

These are cache primitive microbenchmarks, not end-to-end BTree or object-store benchmarks. They intentionally use small usize values and cheap / yielding loaders so the cache coordination overhead and duplicate-load behavior are visible.

Benchmark environment:

  • MacBook Pro, Apple M1 Pro, 16 GiB memory
  • macOS 15.6, arm64
  • Rust 1.94.0
  • Command: cargo bench -p lance-core --bench cache_batch -- --noplot
  • Criterion config: sample size 10, 100 ms warmup, 250 ms measurement time

Each row runs multiple concurrent tasks against the same cache. Batch size
is the number of keys requested by each task. Concurrent tasks is the number
of simultaneous cache callers. Key overlap controls how many keys are shared
across those concurrent callers.

Actual loads counts how many entries were actually produced by loaders across
all concurrent tasks. Lower load counts indicate less duplicate cold-miss work.

Batch size Concurrent tasks Key overlap Loader Actual loads: single / manual / new batch Single-key loop Manual get + batch load + insert Batch get_or_insert_many New vs manual
64 8 50% cheap 288 / 512 / 288 0.961 ms 0.843 ms 1.258 ms 1.49x
64 8 100% cheap 64 / 512 / 64 0.395 ms 0.851 ms 0.702 ms 0.82x
64 8 50% yield_once 288 / 512 / 288 1.269 ms 0.827 ms 1.335 ms 1.61x
64 8 100% yield_once 64 / 512 / 64 0.660 ms 0.936 ms 0.718 ms 0.77x
8 32 50% cheap 132 / 256 / 132 0.523 ms 0.532 ms 0.625 ms 1.18x
64 8 0% cheap n/a / 512 / 512 n/a 0.671 ms 1.815 ms 2.71x
64 8 0% yield_once n/a / 512 / 512 n/a 0.700 ms 1.894 ms 2.71x
512 32 0% cheap n/a / 16384 / 16384 n/a 20.414 ms 61.045 ms 2.99x

Interpretation:

  • The new batch API is not intended to be a universal faster path for cheap disjoint misses.
  • In no-overlap scenarios, it has expected overhead from per-key flight coordination, validation, and result ordering.
  • The value is semantic: overlapping cold batch requests preserve batch/coalesced loading while avoiding duplicate loads for shared keys.
  • In full-overlap scenarios, the new batch API avoids the duplicate work in the manual get + load + insert pattern and is faster in this microbenchmark.
  • The single-key loop still has strong single-flight behavior, but it cannot give a multi-key loader the full missing-key set for coalesced IO.

Add CacheBackend::get_or_insert_many and typed LanceCache /
WeakLanceCache batch get-or-insert APIs.

Implement true per-key single-flight for MokaCacheBackend using a shared
in-flight registry across single-key and batch paths. Batch loaders only
receive keys owned by the current caller, results preserve input order,
duplicate input keys are rejected, and loader output must exactly match the
owned keys.

Add tests for overlapping batches, mixed single/batch callers, owner failure
retry, validation errors, WeakLanceCache fallback behavior, and cache stats.
Add Criterion microbenchmarks for cache-level batch get-or-insert tradeoffs.
@github-actions github-actions Bot added A-deps Dependency updates enhancement New feature or request labels Jun 30, 2026
@wjones127 wjones127 self-requested a review June 30, 2026 15:03

@wjones127 wjones127 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

calling single-key get_or_insert in a loop preserves single-flight, but loses batch/coalesced loading

I don't understand this part. Why can't you call each key with get_or_insert() concurrently? Moka already handles the multiple in-flight requests?

My impression is this PR adds a lot of complexity to the cache API and implementation. So we'll need a good reason to add it.

I think the actual benchmark comparison I'd like to see is get_or_insert_with_key_batch vs get_or_insert_with_key concurrently for each key. Comparing vs a single threaded loop or the manual get insert feels like a straw man.

})?;
Ok((loaded.entry, loaded.size_bytes))
});
let (entry, was_cached) = self.get_or_insert(&key, single_loader, codec).await?;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: get_or_insert is async, so there's no reason you couldn't do this for all keys concurrently. Unless course, there's some reason that you don't want to call loader beyond a certain concurrency, but I suppose it's on the caller to call with a small enough batch.

@zouhuajian

Copy link
Copy Markdown
Author

I don't understand this part. Why can't you call each key with get_or_insert() concurrently? Moka already handles the multiple in-flight requests?

My impression is this PR adds a lot of complexity to the cache API and implementation. So we'll need a good reason to add it.

I think the actual benchmark comparison I'd like to see is get_or_insert_with_key_batch vs get_or_insert_with_key concurrently for each key. Comparing vs a single threaded loop or the manual get insert feels like a straw man.

Thanks for the review.

The original motivation for this PR came from #6759. That PR tried to coalesce BTree page_data reads by physical file/range, but the manual get + batch load + insert path would bypass the existing per-key single-flight behavior in get_or_insert.

After looking more carefully at the scalar index code, I don’t think this is a good generic cache abstraction. The concrete need is BTree-specific, and I also closed #6759 because I don’t yet have enough evidence that broad cold BTree range reads are common enough to justify this complexity.

Bitmap and FTS/FST do not seem to need this generic batch primitive: bitmap caches per value, and FTS already has grouped posting-list cache keys.

So I’m going to close this PR. If this comes back, I think it should start with BTree-specific metrics and a narrower page-materialization design.

@zouhuajian zouhuajian closed this Jul 4, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

A-deps Dependency updates enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants