prototype for negative caching in StoreCache #4042
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## main #4042 +/- ##
==========================================
+ Coverage 93.50% 93.57% +0.06%
==========================================
Files 90 90
Lines 11981 12030 +49
==========================================
+ Hits 11203 11257 +54
+ Misses 778 773 -5
IMO it might be better to use a single key: value mapping, where "present" and "absent" are two possible values, instead of completely separate mappings for present vs missing keys. Two mappings open the possibility that a key is cached AND marked as missing, which we should prevent structurally if we can.
(this is very cool btw, thank you for working on this)
Thanks @d-v-b — I switched to the single key mapping, and I think it came out much cleaner. There's now a single Since both entries in the store are tracked the same way now, present and absent entries share one I switched the default to |
hi @espg, that's great. claude had some concerns about the new defaults + default TTL of infinity:
Some of these concerns are about the change in behavior. Since this is marked as experimental, that's not as big a deal, but it's worth an impact assessment, I think.
I think a finite default max age is probably a good idea in any case, just to keep things bounded.
Adds opt-out negative caching to zarr.experimental.cache_store.CacheStore: when enabled, a full-key read that finds the key absent in the source store is remembered, so subsequent reads of that absent key return None immediately without a source round-trip. The remembered miss is evicted when the key is later written.
Default on (cache_missing=True, opt-out). Negative caching affects only full-key reads of keys absent in the source; results for keys that exist, byte-range reads, and exists() are unchanged. Pass cache_missing=False to disable. Follows from discussion on #4028.
Note
Revised since first draft. Per @d-v-b's review, present and absent keys are now tracked in a single structure (a key can never be both cached and marked missing), and negative markers share the max_size budget with cached values instead of being bounded by TTL alone.
Motivation
CacheStore caches present values only. On a full-key miss it deletes any stale entry and stores nothing, so a key absent in the source is a permanent cache miss, and every read re-pays a source round-trip. This is the dominant cost when reading sparse arrays through a CacheStore: most chunks are empty, and the positive cache structurally cannot help (there is no value to store, and "not in cache" is indistinguishable from "not cached yet"). Negative caching closes that gap.
It is intentionally narrow: it benefits the stock arr[:] path (which probes every chunk) read repeatedly through a CacheStore. Code using the #4028 discovery primitives (zarr.shards_initialized / zarr.read_regions) never issues the empty-chunk reads in the first place and does not need this.
API
cache_missing: bool = True remembers full-key misses (opt-out).
cache_stats() gains negative_hits; cache_info() gains cache_missing and missing_keys.
No new bounding parameter is introduced: negative markers share the existing max_size budget with cached values (each charged a small flat overhead for its index slot), so a single max_size bounds total cache memory. Under memory pressure miss-markers are evicted before any cached value, and a marker never displaces cached data. Markers also respect max_age_seconds. When max_size is None both caches are unbounded (as today), so set a finite max_size and/or max_age_seconds for scans over very large sparse key spaces.
Note
cache_info()["current_size"] (and __repr__) now include the small flat overhead charged per negative marker, since markers share the max_size budget.
Design
CacheStore wraps a whole store and sees opaque keys (no chunk-grid knowledge), so negative knowledge is tracked per full key. Present and absent keys live in one OrderedDict[key, _Entry] (_Entry = insert_time, size, present), so a key occupies exactly one slot and can never be simultaneously cached and marked missing; the invariant is structural, not maintained by eviction bookkeeping. Negative markers are charged a small flat overhead against max_size but are strictly lower priority than cached data: they are evicted first under pressure (LRU), only ever displace other markers when recording, and never evict a cached value.
Markers respect max_age_seconds, so a key written to the source out-of-band becomes visible again after expiry.
With max_size set, negatives are additionally bounded by the shared byte budget (see API). With max_size is None and an infinite TTL, a scan over a very large sparse key space accumulates one small entry per absent key, so set a finite max_size/TTL (or cache_missing=False) for such workloads. This is called out in the docstring.
set and an overridden set_if_not_exists drop any remembered miss for the key (reclaiming its charged bytes). delete does not create one (a delete is a mutation, not a checked-absence read).
Results for keys that exist, byte-range reads, and exists() are unchanged. exists() deliberately does not consult the negative cache (the default set_if_not_exists calls exists then set; a stale "missing" there could overwrite present data).
A negative hit is recorded under negative_hits and counts as neither a hit nor a miss, so the positive hit_rate is unaffected.
Correctness notes
With max_age_seconds="infinity" a remembered miss never expires, so a key written by another process stays invisible through the cache until eviction-on-write. Pair cache_missing=True with a finite max_age_seconds when the source may be written concurrently.
The source get runs outside the state lock, so a concurrent set can land between the source returning None and the miss being recorded. This is the same window the positive cache already has; it is TTL-bounded and self-heals. Documented as a known limitation rather than over-engineered away.
Testing
tests/test_experimental/test_cache_store.py adds a new TestCacheStoreNegativeCaching covering: enabled-by-default and cache_missing=False disable, basic negative hit (asserts the source is hit exactly once via monkeypatch), eviction on set and set_if_not_exists, TTL expiry with an out-of-band source write, byte-range reads unaffected, stats/info surfacing, and delete does not record. Additional tests cover the shared-budget behaviour: a negative marker is charged against max_size, misses are bounded by the budget (LRU eviction of the oldest markers), markers are evicted before cached values, and the marker charge is reclaimed on set/set_if_not_exists (no current_size leak). The existing test_cache_info key-set assertion is updated for the two new info keys. Full suite: 59 passed; ruff, mypy (strict), and numpydoc clean.
TODO:
docs/user-guide/*.md
changes/ (changes/4042.feature.md)
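For readers who want the Design section's single-mapping idea in concrete form, here is a minimal, stdlib-only toy model. It is not the code in this PR and does not use zarr: the class ToyNegativeCache, its attributes, the plain dict standing in for the source store, and the MARKER_COST value are all hypothetical, chosen only to illustrate one slot per key, a flat charge per miss marker, markers evicted before values, and a marker never displacing a cached value.

```python
from __future__ import annotations

import time
from collections import OrderedDict
from dataclasses import dataclass

MARKER_COST = 64  # assumed flat per-marker charge in bytes (hypothetical value)


@dataclass
class Entry:
    insert_time: float
    size: int
    present: bool  # True: cached value, False: remembered miss


class ToyNegativeCache:
    """Toy model: one mapping holds both cached values and miss markers."""

    def __init__(self, source: dict[str, bytes], max_size: int,
                 max_age: float = float("inf")) -> None:
        self.source = source  # stand-in for the wrapped store
        self.max_size = max_size
        self.max_age = max_age
        self.entries: OrderedDict[str, Entry] = OrderedDict()  # LRU order
        self.values: dict[str, bytes] = {}
        self.current_size = 0
        self.source_reads = 0
        self.negative_hits = 0

    def _drop(self, key: str) -> None:
        entry = self.entries.pop(key, None)
        if entry is not None:
            self.current_size -= entry.size  # reclaim the charged bytes
            self.values.pop(key, None)

    def _make_room(self, needed: int, markers_only: bool) -> bool:
        if needed > self.max_size:
            return False
        # Evict in LRU order: miss markers first, cached values only if allowed.
        for want_present in (False, True):
            if want_present and markers_only:
                break
            victims = [k for k, e in self.entries.items() if e.present is want_present]
            for key in victims:
                if self.current_size + needed <= self.max_size:
                    return True
                self._drop(key)
        return self.current_size + needed <= self.max_size

    def _store(self, key: str, value: bytes | None) -> None:
        self._drop(key)  # a key occupies at most one slot
        size = MARKER_COST if value is None else len(value)
        if not self._make_room(size, markers_only=value is None):
            return  # does not fit: leave the key uncached
        self.entries[key] = Entry(time.monotonic(), size, value is not None)
        if value is not None:
            self.values[key] = value
        self.current_size += size

    def get(self, key: str) -> bytes | None:
        entry = self.entries.get(key)
        if entry is not None and time.monotonic() - entry.insert_time > self.max_age:
            self._drop(key)  # expired: fall through to the source
            entry = None
        if entry is not None:
            self.entries.move_to_end(key)
            if not entry.present:
                self.negative_hits += 1  # remembered miss, no source round-trip
                return None
            return self.values[key]
        self.source_reads += 1
        value = self.source.get(key)
        self._store(key, value)  # caches the value, or records a miss marker
        return value

    def set(self, key: str, value: bytes) -> None:
        self.source[key] = value
        self._store(key, value)  # replaces any remembered miss for the key
```

In this sketch, reading an absent key twice costs one source read and one negative hit, and a later set on that key replaces the marker in the same slot, so the "cached and marked missing" state cannot be represented at all.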