Add 'From Zero to Zarr' beginner guide to the Zarr data model#4077

Draft
chuckwondo wants to merge 3 commits into zarr-developers:main from chuckwondo:docs/from-zero-to-zarr

@chuckwondo

Adds a new user-guide page (docs/user-guide/data_model.md, nav label "Understanding Zarr") that explains the Zarr data model for newcomers: why Zarr exists (its parallel-computing origin in genomics), then arrays, chunking and the chunk grid, stores as key->bytes maps, metadata (zarr.json), the specification, codecs, sharding, groups, and N-D arrays, ending with a runnable round-trip example and a cross-language note. Prose + diagrams throughout, with executable, build-verified code in the final section, and every spec detail linked to its section of the Zarr v3 spec.

Enables Mermaid diagrams via a pymdownx.superfences custom fence, and adds the page to the User Guide nav.

Closes #4056

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/user-guide/*.md
  • Changes documented as a new file in changes/
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@chuckwondo chuckwondo requested review from d-v-b and maxrjones June 17, 2026 23:14

Chunking is the key move. Each chunk can be stored, loaded, and compressed on its
own, so a program can read just the chunks it needs — that one corner your
colleague wanted — without touching the rest. (Starting with a chunk shape that

could we use an inline admonition for the partial-chunk callout? something like

Note

If each chunk has a fixed size, how can we use chunks to represent an array that isn't evenly divided by the chunk size? See #section for the answer to that question!

not sure if note is the right admonition here
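
For readers following the thread: the quoted passage claims that a program can read only the chunks it needs. A minimal sketch of that idea follows. It assumes a regular chunk grid and a 4×6 array chunked at `(2, 3)`, which is an illustrative choice. `chunks_touched` is a hypothetical helper, not a zarr-python function.

```python
from itertools import product


def chunks_touched(selection, chunk_shape):
    """Return the grid coordinates of every chunk a selection overlaps.

    `selection` holds one (start, stop) pair per axis, with stop exclusive.
    Hypothetical helper for illustration; not part of zarr-python.
    """
    per_axis = []
    for (start, stop), size in zip(selection, chunk_shape):
        first = start // size       # chunk holding the first selected index
        last = (stop - 1) // size   # chunk holding the last selected index
        per_axis.append(range(first, last + 1))
    return list(product(*per_axis))


# Top-right 2x3 corner of a 4x6 array chunked at (2, 3).
corner = chunks_touched([(0, 2), (3, 6)], chunk_shape=(2, 3))
print(corner)
```

Only the chunk at grid position `(0, 1)` overlaps that corner, so the other three chunks never need to be fetched or decompressed.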

@d-v-b

d-v-b commented Jun 24, 2026

@mkitti if you have time it would be good to get your thoughts on this

```
G11 --> K11
```

Where does a key like `c/0/1` come from? It's built by a simple, fixed rule (the

"fixed rule" implies that arrays have only one possible chunk key encoding. Maybe we can rephrase to make it clear that the rule is defined by a particular field in the array metadata (`chunk_key_encoding`).
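
To make the point concrete, here is a rough sketch of how the key depends on that metadata field. It follows the two encodings named in the Zarr v3 core spec (`default` and `v2`) as I understand them. The function `chunk_key` is a hypothetical helper for illustration, not the zarr-python API.

```python
def chunk_key(coords, encoding="default", separator=None):
    """Build a store key for the chunk at grid position `coords`.

    Hypothetical helper mirroring the `chunk_key_encoding` metadata field.
    """
    parts = [str(c) for c in coords]
    if encoding == "default":
        # "c" prefix, then the coordinates; the separator defaults to "/".
        sep = "/" if separator is None else separator
        return sep.join(["c", *parts])
    if encoding == "v2":
        # No prefix; the separator defaults to "."; a 0-D array uses "0".
        sep = "." if separator is None else separator
        return sep.join(parts) or "0"
    raise ValueError(f"unknown chunk key encoding: {encoding!r}")


print(chunk_key((0, 1)))                  # default encoding
print(chunk_key((0, 1), encoding="v2"))   # v2 encoding
```

The same chunk therefore lives under `c/0/1` with the default encoding and under `0.1` with the `v2` encoding, which is why "fixed" only holds per array.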

@codecov

codecov Bot commented Jun 24, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.50%. Comparing base (22818d9) to head (2b69319).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #4077   +/-   ##
=======================================
  Coverage   93.50%   93.50%           
=======================================
  Files          90       90           
  Lines       11979    11979           
=======================================
  Hits        11201    11201           
  Misses        778      778           

@maxrjones maxrjones left a comment

This is awesome @chuckwondo! I just have some small nits

Comment on lines +7 to +9
the *how* one idea at a time, until you understand **how Zarr stores an array**,
**why** that layout is defined by a written specification, and **how a library
turns those stored bytes back into an array you can use**.

Suggested change
the *how* one idea at a time, until you understand **how Zarr stores an array**,
**why** that layout is defined by a written specification, and **how a library
turns those stored bytes back into an array you can use**.
the *how* one idea at a time, until you understand **how Zarr stores an array**,
**why that layout is defined by a written specification**, and **how a library
turns those stored bytes back into an array you can use**.

nit about consistent use of bold text


---

## Why we need Zarr

If possible, it would be nice to have a tl;dr (maybe a note admonition) at the top of this section

extraordinary firehoses of numbers. A satellite streams images of the Earth; a
microscope captures gigapixel scans; a gene sequencer reads thousands of genomes;
a climate model writes out temperature and wind for every point on the globe, hour
after hour. In each case the result has the same shape: a vast grid of numbers —

many people have a negative reaction to em-dashes since their proliferation by AI. It would likely be worth reducing their use in this guide via more, shorter sentences.

why, it helps to understand two things the array formats of the day were already
doing.

First, **chunking**. To store an array bigger than memory, formats like HDF5 and

(the [*Anopheles gambiae* 1000 Genomes Project](https://www.malariagen.net/)) —
arrays far too big to fit in memory. His real frustration was *speed*, and to see
why, it helps to understand two things the array formats of the day were already
doing.

Suggested change
doing.
doing: chunking and compression.

If I'm reading this right, it's not totally obvious what "Second" is


So a 5×6 array chunked at `(2, 3)` quietly stores a row of "phantom" cells holding
the fill value. It's harmless, but it's a small waste — and a good reason to pick a
chunk shape that fits your array's real shape reasonably well. (For practical

Suggested change
chunk shape that fits your array's real shape reasonably well. (For practical
chunk shape that fits your array's real shape reasonably well and lean on the
[rectilinear chunk grid extension](https://github.com/zarr-developers/zarr-extensions/tree/main/chunk-grids/rectilinear) when needed. (For practical
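
The "phantom" row arithmetic in the quoted passage can be checked with a few lines. This is a sketch assuming a regular chunk grid in which every chunk has the full chunk shape. `padded_shape` is a hypothetical helper, not a zarr-python function.

```python
import math


def padded_shape(shape, chunk_shape):
    """Round each axis up to a whole number of chunks (hypothetical helper)."""
    return tuple(-(-s // c) * c for s, c in zip(shape, chunk_shape))


shape, chunks = (5, 6), (2, 3)
padded = padded_shape(shape, chunks)
phantom_cells = math.prod(padded) - math.prod(shape)
print(padded, phantom_cells)
```

For the 5×6 array chunked at `(2, 3)`, the grid covers 6×6 cells, so one extra row of 6 cells holds the fill value.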

[specification](https://zarr-specs.readthedocs.io/en/latest/v3/core/index.html#codecs)
defines three kinds of codec, applied in this order:

1. **array → array** codecs (optional, any number) — rearrange the values; e.g. a

Suggested change
1. **array → array** codecs (optional, any number) — rearrange the values; e.g. a
1. **array → array** codecs (optional, any number) — rearrange or change the values; e.g. a

I believe this change is more accurate, but would appreciate if @d-v-b confirms
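
For context on the three codec kinds, the following sketch runs one chunk through an array → array step, an array → bytes step, and a bytes → bytes step, then reverses them. It uses only the Python standard library as a stand-in: the helper functions are hypothetical, `zlib` stands in for a compression codec, and none of this is the zarr-python codec API.

```python
import struct
import zlib


def transpose(rows):
    """array -> array: rearrange the values by swapping the two axes."""
    return [list(col) for col in zip(*rows)]


def to_bytes(rows):
    """array -> bytes: serialise the values as little-endian int32."""
    flat = [v for row in rows for v in row]
    return struct.pack(f"<{len(flat)}i", *flat)


def from_bytes(data, n_rows, n_cols):
    """bytes -> array: inverse of to_bytes for a known shape."""
    flat = struct.unpack(f"<{n_rows * n_cols}i", data)
    return [list(flat[r * n_cols:(r + 1) * n_cols]) for r in range(n_rows)]


chunk = [[1, 2, 3], [4, 5, 6]]

# Encode: array -> array, then array -> bytes, then bytes -> bytes.
encoded = zlib.compress(to_bytes(transpose(chunk)))

# Decode: undo each step in the reverse order.
transposed = from_bytes(zlib.decompress(encoded), n_rows=3, n_cols=2)
decoded = transpose(transposed)
print(decoded == chunk)
```

A transpose only rearranges values. An array → array codec that rescales or casts values would change them, which is the distinction the suggested wording is trying to capture.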

simple, but it has a limit: small chunks in a very large array produce a *huge*
number of chunks, and therefore a huge number of files or objects. The spec notes
this is exactly where file systems (block sizes, inode limits) and object stores
(which dislike millions of tiny objects) start to struggle.

I think the more prevalent limitation on object stores is the cost model, where the cost of operations often scales with the number of objects
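
Either way, the object count grows quickly. A back-of-the-envelope sketch, using made-up array and chunk shapes and a hypothetical `chunk_count` helper:

```python
import math


def chunk_count(shape, chunk_shape):
    """Number of chunks in a regular grid (hypothetical helper)."""
    return math.prod(-(-s // c) for s, c in zip(shape, chunk_shape))


# Illustrative numbers only: a 100,000 x 100,000 array with 100 x 100 chunks.
n_chunks = chunk_count((100_000, 100_000), (100, 100))

# Grouping the same data into 10,000 x 10,000 shards cuts the object count.
n_shards = chunk_count((100_000, 100_000), (10_000, 10_000))
print(n_chunks, n_shards)
```

Under these assumptions, one stored object per chunk means a million objects, while one object per shard means a hundred. If per-request or per-object pricing applies, that difference shows up directly in cost.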

or more axes.

To see the generalisation concretely, picture a 3-D array as a **stack of 2-D
arrays**. Here are two copies of our 4×6 grid stacked into a `(2, 4, 6)` array —

Suggested change
arrays**. Here are two copies of our 4×6 grid stacked into a `(2, 4, 6)` array —
arrays**. Here are two versions of our 4×6 grid stacked into a `(2, 4, 6)` array —

- write it to the corresponding slice of the array,
- discard it, and move on to the next block.

Because only one block is ever in memory, the array on disk can be far larger than

Suggested change
Because only one block is ever in memory, the array on disk can be far larger than
Because the minimum amount of data ever needed in memory to be useful is a single block, the array on disk can be far larger than
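
The block-by-block loop described in the quoted list can be sketched as follows. This is an illustration under stated assumptions: a plain `dict` stands in for a Zarr store, `make_block` is a hypothetical data source, and the sizes are arbitrary. It does not use the zarr-python API.

```python
import struct
import zlib

N_ROWS, N_COLS, BLOCK_ROWS = 1_000, 64, 100
store = {}  # stand-in for a store: key -> bytes


def make_block(row_start, row_stop):
    """Produce one block of rows (hypothetical data source)."""
    return [[r * N_COLS + c for c in range(N_COLS)]
            for r in range(row_start, row_stop)]


for block_index, row_start in enumerate(range(0, N_ROWS, BLOCK_ROWS)):
    # Produce one block, encode it, and write it under its own key.
    block = make_block(row_start, min(row_start + BLOCK_ROWS, N_ROWS))
    flat = [v for row in block for v in row]
    payload = struct.pack(f"<{len(flat)}i", *flat)
    store[f"c/{block_index}/0"] = zlib.compress(payload)
    # Discard the block before producing the next one.
    del block, flat, payload

print(len(store))
```

Here the loop holds one 100-row block at a time, while the store ends up with all 1,000 rows.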

Development

Successfully merging this pull request may close these issues.

[docs] very basic zarr tutorial -- "zarr for absolute beginners"