Replace donfig with a statically-typed configuration object#4101
Replace donfig with a statically-typed configuration object#4101d-v-b wants to merge 27 commits into
Conversation
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01XKHgWSxDXtTmNgAebZg41U
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01XKHgWSxDXtTmNgAebZg41U
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01XKHgWSxDXtTmNgAebZg41U
Add _ENV_META_VARS frozenset containing ZARR_CONFIG and skip those
names in collect_env() before stripping the prefix. This prevents
ZARR_CONFIG=/path/to/cfg.yaml from becoming {"config": "..."} which
crashed build_config() with KeyError because ZarrConfig has no
'config' field. Add two regression tests covering the skip and the
no-raise behaviour.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XKHgWSxDXtTmNgAebZg41U
After any reset() or set() call, a ContextVar scope entry is planted in the current context. Without re-syncing _scope in refresh(), a subsequent rebuild of _base is invisible in that context because _current() always prefers the scope entry. Fix refresh() to call self._scope.set(self._base) after rebuilding, matching the pattern already used in reset(). Also remove the dead contextlib.suppress(LookupError) in reset() — ContextVar.set() never raises LookupError. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01XKHgWSxDXtTmNgAebZg41U
Remove the donfig Config subclass and its module-level instance. Promote ZarrConfigManager (the typed proxy built in Tasks 1-3) to the public `config` export. Update tests to the new nested-dict `defaults` form and fix a `codec_pipeline.name` typo (should be `codec_pipeline.path`). Also add **kwargs support to `set()` for call-sites that pass top-level keys as keyword arguments, and raise BadConfigError (ValueError) for removed deprecated keys to preserve the backwards-compat guarantee tested by `test_deprecated_config`. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01XKHgWSxDXtTmNgAebZg41U
`_apply_deprecation` now accepts a `raise_on_removed` parameter.
`set()` passes `raise_on_removed=True` (keeps raising `BadConfigError`
for removed keys). `get()` passes `raise_on_removed=False` so removed
keys are treated as absent: a caller-supplied default is returned, or
`KeyError` is raised when no default is given. This restores the
donfig-faithful behavior where `config.get("removed.key", default)`
never raised. Also removes the now-dead `if resolved is None: continue`
branch from `set()` and rewrites the module docstring to drop the
phrase "mirrors the old donfig interface".
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XKHgWSxDXtTmNgAebZg41U
apply_overrides() now catches KeyError from replace_path() for each unknown key, emits a ZarrUserWarning naming the skipped key, and continues. This prevents a stray ZARR_*=... env var or extra YAML key from crashing build_config() (and therefore import zarr). config.set() remains strict — it calls replace_path() directly, not through apply_overrides(). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01XKHgWSxDXtTmNgAebZg41U
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01XKHgWSxDXtTmNgAebZg41U
- Add _structured_leaf_keys helper (uses get_type_hints to handle from __future__ import annotations) and test_every_structured_key_has_a_get_overload that walks ZarrConfig recursively and asserts every leaf has a get() overload - Add TYPE_CHECKING smoke function using assert_type to prove precise static return types for config.get() and attribute access - Create changes/+statically-typed-config.misc.md changelog fragment - Update docs/user-guide/config.md: remove donfig prose, fix pprint call - Update docs/user-guide/installation.md: replace donfig dep with pyyaml - Pin pyyaml>=6 in pyproject.toml project dependencies Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01XKHgWSxDXtTmNgAebZg41U
…ch path Remove the `k != "codecs"` guard in `_flatten_mapping` so a YAML `codecs:` block produces flat dotted keys (e.g. `codecs.bytes`) that flow through `_replace_recursive`'s Mapping branch and MERGE into the existing codec dict rather than replacing it wholesale. Also fix `_config_search_paths` to accept and consult the `environ` mapping supplied by `build_config` instead of reading `os.environ` directly, making `build_config(environ=...)` self-consistent. Adds regression tests (RED/GREEN) and a changelog note about `config.defaults` now returning a plain `dict` instead of donfig's `list[dict]`. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01XKHgWSxDXtTmNgAebZg41U
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01XKHgWSxDXtTmNgAebZg41U
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01XKHgWSxDXtTmNgAebZg41U
The design spec and implementation plan live in a gist instead: https://gist.github.com/d-v-b/2a95ff0104824ef52545ed9baf1b66c3 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01XKHgWSxDXtTmNgAebZg41U
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #4101 +/- ##
==========================================
+ Coverage 93.50% 93.51% +0.01%
==========================================
Files 90 90
Lines 11981 12262 +281
==========================================
+ Hits 11203 11467 +264
- Misses 778 795 +17
🚀 New features to boost your workflow:
|
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01XKHgWSxDXtTmNgAebZg41U
…test Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01XKHgWSxDXtTmNgAebZg41U
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01XKHgWSxDXtTmNgAebZg41U
set() takes Mapping[str, Any] (no static value validation) because precise
typing needs an open TypedDict (declared keys + arbitrary codecs.* str keys),
which mypy 2.x does not support; a closed TypedDict would break the open
config.set({'codecs.<name>': ...}) idiom. Revisit when mypy ships PEP 728 or
we adopt a checker that supports it.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XKHgWSxDXtTmNgAebZg41U
Use object instead of Any for parameters/returns that merely store, pass
through, or compare values (path helpers, env/YAML ingest, set/update kwargs,
token, parse_indexing_order). Any is kept where it is load-bearing: the dynamic
dataclasses.replace/fields dispatch, the get() fallback overload that powers
config.get('codecs', {}).get(name), and the heterogeneous nested tree returned
by to_nested_dict/defaults/to_dict (navigated by key). object also let mypy
narrow parse_indexing_order, removing a redundant cast.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XKHgWSxDXtTmNgAebZg41U
|
I'm adding some useful error messages: 5 Traceback (most recent call last):
4 File "<stdin>", line 1, in <module>
3 File "/home/d-v-b/wt/statically-typed-config/src/zarr/core/config.py", line 483, in get
2 raise _unknown_key_error(key, current) from None
1 KeyError: "'arraya' is not a valid configuration key. Did you mean 'array'?" |
| def get(self, key: Literal["default_zarr_format"]) -> Literal[2, 3]: ... | ||
| @overload | ||
| def get(self, key: Literal["array.order"]) -> Literal["C", "F"]: ... | ||
| @overload | ||
| def get(self, key: Literal["array.write_empty_chunks"]) -> bool: ... | ||
| @overload | ||
| def get(self, key: Literal["array.read_missing_chunks"]) -> bool: ... | ||
| @overload | ||
| def get(self, key: Literal["array.target_shard_size_bytes"]) -> int | None: ... | ||
| @overload | ||
| def get(self, key: Literal["array.rectilinear_chunks"]) -> bool: ... | ||
| @overload | ||
| def get(self, key: Literal["array.sharding_coalesce_max_gap_bytes"]) -> int: ... | ||
| @overload | ||
| def get(self, key: Literal["array.sharding_coalesce_max_bytes"]) -> int: ... | ||
| @overload | ||
| def get(self, key: Literal["async.concurrency"]) -> int: ... | ||
| @overload | ||
| def get(self, key: Literal["async.timeout"]) -> float | None: ... | ||
| @overload | ||
| def get(self, key: Literal["threading.max_workers"]) -> int | None: ... | ||
| @overload | ||
| def get(self, key: Literal["json_indent"]) -> int: ... | ||
| @overload | ||
| def get(self, key: Literal["codec_pipeline.path"]) -> str: ... | ||
| @overload | ||
| def get(self, key: Literal["codec_pipeline.batch_size"]) -> int: ... | ||
| @overload | ||
| def get(self, key: Literal["buffer"]) -> str: ... | ||
| @overload | ||
| def get(self, key: Literal["ndbuffer"]) -> str: ... |
There was a problem hiding this comment.
flagging this as nice, but delicate (polite term for brittle). These overloads have to be kept synced up with the actual structure of the configuration, which is defined elsewhere. This PR adds tests that confirm that these overloaded signatures match the field name + type of the config object. I'm not sure how we can do do any better than that, if we want a string-based API like this to be statically typed.
an alternative that we can add later would be a non-string-based API, where we define an object that acts like a cursor / lens over our config.
| def test_config_codec_implementation(store: Store) -> None: | ||
| # has default value | ||
| assert fully_qualified_name(get_codec_class("blosc")) == config.defaults[0]["codecs"]["blosc"] | ||
| assert fully_qualified_name(get_codec_class("blosc")) == config.defaults["codecs"]["blosc"] |
There was a problem hiding this comment.
donfig returned defaults as a list, we just return a dict
config.get/config.set on an unknown key now raise a KeyError that names the key and suggests the most similar valid one (via difflib), e.g. 'array.0rder' -> "Did you mean 'array.order'?". Candidates are the schema's container and leaf keys plus current codecs.* entries. Still a KeyError, so existing handlers are unaffected. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01XKHgWSxDXtTmNgAebZg41U
…evel When an unknown key has no close match, list the available keys at the deepest resolvable level (capped at 10, '... (N more)' beyond). Suggestions are now scoped to the failed segment vs that level's children, which avoids misleading prefix matches (e.g. 'codecs.unknown' no longer suggests 'codecs.numcodecs.zstd'; it lists the codec names instead). Use explicit len()/!= '' checks rather than collection truthiness, and is_dataclass for narrowing. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01XKHgWSxDXtTmNgAebZg41U
The first `set`/`get` pair set `codec_pipeline.path` to the default `BatchedCodecPipeline` and asserted that default, so it passed regardless of whether `set` did anything — a no-op assertion donfig's permissiveness had masked (it also carried the original `.name` typo). The Mock pipeline block below already exercises set->get->actual-use with a non-default class, so this pair was redundant. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_018fnKiuy15kq7cPgRSKD3dr
|
I was motivated to open this PR because I think we need a more explicit data structure for our config. Donfig worked well but a morass of untyped dicts is not the UX we want if we plan to be agile with deprecations and feature additions. |
|
I really like this direction @d-v-b! Thanks for working on it. I'll give it a spin, likely on Monday. |
Summary
Replaces the donfig-based configuration with a statically-typed, dataclass-backed config in
src/zarr/core/config.py.zarr.confignow provides precise static types for both attribute access (zarr.config.array.order) and the dotted-string API (zarr.config.get("array.order")), while preserving donfig's runtime behavior.Motivation: donfig stores config as an untyped nested
dict, so every value is typedAnyand typos in keys go undetected. This models the config as frozen dataclasses (the single source of truth) and surfaces the familiar string API through hand-written@overloads that relate each dotted key to its value type.Design
ZarrConfig+ArraySettings/AsyncSettings/ThreadingSettings/CodecPipelineSettings).codecsis an openMapping[str, str]subtree so arbitrary codec names still register at runtime.@overloads map each dotted key ("array.order") to its value type (Literal["C", "F"]), plus astrfallback forcodecs.*/unknown keys. A drift-protection test asserts every structured key has a matching overload, so the two can't fall out of sync.asyncnamespace:asyncis a Python keyword, so the dataclass field isasync_while the serialized key stays"async"; the string API (config.get("async.concurrency")) is unaffected._basesnapshot plus aContextVarscope overlay.config.set(...)updates both;with config.set(...)restores both on exit. The global base is what makes a permanentsetvisible insideThreadPoolExecutorworkers (which don't copy contextvars).Backwards compatibility
The public
zarr.configsurface is preserved. Verified against donfig 0.8.1 and across all in-repo consumers (registry,sync,abc/codec,codec_pipeline,group,abc/store,array_spec, metadata v2/v3):config.get("a.b.c"[, default]), subtreeconfig.get("codecs", {}).get(key)config.set({...})(permanent) andwith config.set({...})(scoped), plus theconfig.set(key=value)kwargs formconfig.reset(),config.refresh(),config.enable_gpu(),config.defaultsZARR_FOO__BARenv-var ingestion and YAML config files (ZARR_CONFIG,~/.config/zarr)deprecationsmechanism (setraisesBadConfigErroron a removed key;gethonors the default — both donfig-faithful)Intentional, ratified behavior changes
config.set/config.getnow raise on an unknown structured key (typo detection — the point of the typed config), while env-var / YAML ingest tolerates unknown keys (skips with a warning) so a strayZARR_*var or version-skewed config file can't crashimport zarr. The opencodecs.*namespace still accepts arbitrary keys.config.defaultsnow returns a nesteddict(donfig returned a one-elementlist[dict]).Dependency change
donfigremoved (deps, lockfile, version-report).pyyaml>=6added (YAML support was previously transitive via donfig; the config module now reads YAML directly).Testing
mypy srcclean (strict).tests/test_config_typed.py: schema/path-helpers, env + YAML ingest (including the codecs-block deep-merge regression), the base+contextvar state model incl. cross-thread visibility, deprecations, the overload drift-protection test, andassert_typestatic-typing assertions.Notes for reviewers
🤖 Generated with Claude Code