feat: add sparse structural encoding#7628
Draft
Xuanwo wants to merge 8 commits into
Draft
Conversation
Contributor
|
Important This PR touches the Lance format specification. Substantive changes to the format specification — the If this is a meaningful format change:
|
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This draft PR adds the Lance file version 2.3 sparse structural layout for sparse nested data.
Sparse structural metadata represents nested positions and counts semantically instead of relying on dense rep/def buffers, with explicit support for empty/all/range position sets and empty/constant/explicit list counts. The writer can emit sparse layout for structural pages that exceed mini-block rep/def limits, while v2.2 remains stable on existing layouts.
The intent is to make sparse nested structures a first-class storage layout with smaller structural metadata, fewer pages, and better scan/take behavior on sparse workloads. The current draft still needs final review cleanup around protobuf tag allocation, reader-side version gating, and malformed sparse metadata validation before it should be marked ready.
Performance
100M-row S3 benchmark on
lance-bench-ec2. Lower raw size/latency is better. Each Lance table used 102 objects. Parquet is included for size only and was written as a single ZSTD-compressed Parquet file from the same Arrow batches.The tables report sparse advantage directly:
other / sparse. Values above1xmean sparse is smaller or faster. Bold ratios mark 10x+ gaps. A few HNSW random/take and Parquet-size cells are near parity or favor the other format; those are called out explicitly.Use cases:
hnsw:List<UInt32>adjacency-list shape. The first ~14.3M rows are non-empty with 32 values per row; the remaining rows are empty.uniform:List<UInt32>sparse singleton shape. One row every 10,000 rows is non-empty with 1 value; all other rows are empty.deep: deeply nested sparse shape:Struct<events: List<Struct<id: Int32, tags: List<Int32>, pair: FixedSizeList<Int32; 2>>>>. One row every 4,096 rows has 1-2 nested events, with nullable struct/list/fixed-size-list layers and mixed empty/null nested children.Size
Cold Reads
Warm Reads