
feat: add sparse structural encoding #7628

Draft
Xuanwo wants to merge 8 commits into main from xuanwo/sparse-structural-layout

Xuanwo (Collaborator) commented Jul 4, 2026

This draft PR adds the Lance file version 2.3 sparse structural layout for sparse nested data.

Sparse structural metadata represents nested positions and counts semantically instead of relying on dense repetition/definition (rep/def) buffers, with explicit support for empty/all/range position sets and empty/constant/explicit list counts. The writer can emit the sparse layout for structural pages that exceed mini-block rep/def limits, while v2.2 remains stable on existing layouts.
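To make the position-set and list-count vocabulary concrete, here is a minimal Python sketch of how such metadata could be expanded back into Arrow-style list offsets. Every class and function name below is hypothetical and chosen for illustration; none of them are the protobuf messages or Rust types in this PR. The meaning given to "empty" counts (no row carries a list value) and the half-open range convention are also assumptions.

```python
from dataclasses import dataclass
from typing import List, Union


# Hypothetical position-set variants: which rows hold a non-empty list.
@dataclass
class EmptyPositions:
    """No row holds a non-empty list."""


@dataclass
class AllPositions:
    """Every row holds a non-empty list."""


@dataclass
class RangePositions:
    """Rows in the half-open interval [start, end) hold non-empty lists."""
    start: int
    end: int


# Hypothetical list-count variants: how many items each selected row has.
@dataclass
class EmptyCounts:
    """No counts are stored (assumed valid only when no row is selected)."""


@dataclass
class ConstantCounts:
    """Every selected row has the same number of items."""
    value: int


@dataclass
class ExplicitCounts:
    """One stored count per selected row."""
    values: List[int]


PositionSet = Union[EmptyPositions, AllPositions, RangePositions]
ListCounts = Union[EmptyCounts, ConstantCounts, ExplicitCounts]


def selected_rows(num_rows: int, positions: PositionSet) -> range:
    """Return the row indices selected by a position set, validating bounds."""
    if isinstance(positions, EmptyPositions):
        return range(0)
    if isinstance(positions, AllPositions):
        return range(num_rows)
    if not 0 <= positions.start <= positions.end <= num_rows:
        raise ValueError("range position set is out of bounds")
    return range(positions.start, positions.end)


def expand_offsets(num_rows: int, positions: PositionSet, counts: ListCounts) -> List[int]:
    """Expand sparse metadata into num_rows + 1 list offsets."""
    rows = selected_rows(num_rows, positions)
    if isinstance(counts, EmptyCounts):
        if len(rows) != 0:
            raise ValueError("empty counts require an empty position set")
        per_row: List[int] = []
    elif isinstance(counts, ConstantCounts):
        per_row = [counts.value] * len(rows)
    else:
        if len(counts.values) != len(rows):
            raise ValueError("explicit counts must match the selected rows")
        per_row = list(counts.values)

    lengths = [0] * num_rows
    for row, n in zip(rows, per_row):
        if n < 0:
            raise ValueError("list counts must be non-negative")
        lengths[row] = n

    offsets = [0]
    for n in lengths:
        offsets.append(offsets[-1] + n)
    return offsets


# A miniature hnsw-like page: the first 3 of 10 rows hold 2 values each.
print(expand_offsets(10, RangePositions(0, 3), ConstantCounts(2)))
```

In this sketch the miniature page is described by three integers (start, end, constant count) regardless of how many trailing rows are empty, which is the kind of saving the sparse layout aims for. The bounds and length checks are only a rough stand-in for the malformed-metadata validation a real reader would need.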

The intent is to make sparse nested structures a first-class storage layout with smaller structural metadata, fewer pages, and better scan/take behavior on sparse workloads. The current draft still needs final review cleanup around protobuf tag allocation, reader-side version gating, and malformed sparse metadata validation before it should be marked ready.

Performance

100M-row S3 benchmark on lance-bench-ec2. Lower raw size/latency is better. Each Lance table used 102 objects. Parquet is included for size only and was written as a single ZSTD-compressed Parquet file from the same Arrow batches.

The tables report the sparse advantage directly as the ratio other / sparse, so values above 1x mean sparse is smaller or faster. Bold ratios mark gaps of 10x or more. A few HNSW take cells and one Parquet-size cell are near parity or favor the other format; those are called out explicitly.
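For readers who want to reproduce the ratio convention, a small sketch of the other / sparse calculation follows. The function names are hypothetical, the inputs in the usage lines are made-up sizes rather than benchmark measurements, and the rounding is only an approximation of how the table cells are formatted.

```python
def sparse_advantage(other: float, sparse: float) -> float:
    """Return other / sparse; values above 1.0 favor the sparse layout."""
    if other <= 0 or sparse <= 0:
        raise ValueError("sizes and latencies must be positive")
    return other / sparse


def describe(other: float, sparse: float, other_name: str = "other",
             better: str = "smaller") -> str:
    """Format a table cell, calling out cases where the other format wins."""
    ratio = sparse_advantage(other, sparse)
    if ratio >= 1.0:
        return f"{ratio:.1f}x {better}"
    # Below 1x the other format wins, so also report the inverse ratio.
    return f"{ratio:.2f}x; {other_name} is {1.0 / ratio:.1f}x {better}"


# Made-up inputs, chosen only to show both branches.
print(describe(200.0, 100.0))
print(describe(77.0, 100.0, other_name="Parquet"))
```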

Use cases:

  • hnsw: List<UInt32> adjacency-list shape. The first ~14.3M rows are non-empty with 32 values per row; the remaining rows are empty.
  • uniform: List<UInt32> sparse singleton shape. One row every 10,000 rows is non-empty with 1 value; all other rows are empty.
  • deep: deeply nested sparse shape, Struct<events: List<Struct<id: Int32, tags: List<Int32>, pair: FixedSizeList<Int32; 2>>>>. One row every 4,096 rows has 1-2 nested events, with nullable struct/list/fixed-size-list layers and mixed empty/null nested children.
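The hnsw and uniform shapes can be sketched as per-row list lengths. The generators below are an illustration only, not the benchmark harness: the function names are hypothetical, and placing the non-empty row at the start of each 10,000-row block is an assumption, since the description only gives the frequency.

```python
from typing import Dict, Iterable, Iterator


def hnsw_lengths(num_rows: int, non_empty_rows: int, degree: int = 32) -> Iterator[int]:
    """Leading rows hold `degree` values each; all remaining rows are empty."""
    for row in range(num_rows):
        yield degree if row < non_empty_rows else 0


def uniform_lengths(num_rows: int, stride: int = 10_000) -> Iterator[int]:
    """One row per `stride` rows holds a single value (assumed: the first)."""
    for row in range(num_rows):
        yield 1 if row % stride == 0 else 0


def summarize(lengths: Iterable[int]) -> Dict[str, int]:
    """Count rows, non-empty rows, and total list items."""
    rows = non_empty = values = 0
    for n in lengths:
        rows += 1
        if n > 0:
            non_empty += 1
            values += n
    return {"rows": rows, "non_empty_rows": non_empty, "values": values}


# Scaled-down versions of the two shapes.
print(summarize(uniform_lengths(100_000)))
print(summarize(hnsw_lengths(10, 3)))
```

At the benchmark's 100M rows, the same stride gives 100,000,000 / 10,000 = 10,000 non-empty rows for the uniform shape, so almost all per-row structural information describes empty lists.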

Size

| use case | sparse size | sparse vs miniblock | sparse vs fullzip | sparse vs parquet_zstd |
| --- | --- | --- | --- | --- |
| hnsw | 1528.7 MiB | 1.1x smaller | 1.5x smaller | 0.77x; Parquet is 1.3x smaller |
| uniform | 116,192 B | **26.5x smaller** | **4,277x smaller** | 4.7x smaller |
| deep | 1.1 MiB | **334x smaller** | **872x smaller** | 2.0x smaller |

Cold Reads

| use case | operation | sparse latency | sparse vs miniblock | sparse vs fullzip |
| --- | --- | --- | --- | --- |
| hnsw | full scan | 2738.3 ms | 1.2x faster | 1.8x faster |
| hnsw | random take 1024 | 535.4 ms | 1.1x faster | 0.99x; fullzip is 1.01x faster |
| uniform | full scan | 121.5 ms | 1.4x faster | **16.2x faster** |
| uniform | random take 1024 | 110.5 ms | 1.7x faster | **11.2x faster** |
| deep | full scan | 234.7 ms | **40.2x faster** | **47.9x faster** |
| deep | random take 1024 | 210.7 ms | **21.0x faster** | **20.8x faster** |

Warm Reads

| use case | operation | sparse latency | sparse vs miniblock | sparse vs fullzip |
| --- | --- | --- | --- | --- |
| hnsw | full scan | 2037.8 ms | 1.1x faster | 1.6x faster |
| hnsw | random take 1024 | 255.6 ms | 1.02x faster | 1.00x; parity |
| hnsw | empty take 1024 | 0.6 ms | 1.01x faster | 1.1x faster |
| hnsw | non-empty take 1024 | 41.9 ms | 1.2x faster | 0.94x; fullzip is 1.07x faster |
| uniform | full scan | 108.2 ms | 8.1x faster | 8.2x faster |
| uniform | random take 1024 | 3.6 ms | **41.1x faster** | **96.9x faster** |
| uniform | empty take 1024 | 0.2 ms | **404x faster** | **180x faster** |
| uniform | non-empty take 1024 | 58.0 ms | 1.8x faster | 1.3x faster |
| deep | full scan | 193.0 ms | **23.8x faster** | **23.4x faster** |
| deep | random take 1024 | 9.5 ms | **85.9x faster** | **183x faster** |
| deep | empty take 1024 | 0.4 ms | **353x faster** | **88.3x faster** |
| deep | non-empty take 1024 | 37.0 ms | 7.1x faster | 4.9x faster |

github-actions Bot added the enhancement (New feature or request) label Jul 4, 2026

github-actions Bot (Contributor) commented Jul 4, 2026

Important

This PR touches the Lance format specification.

Substantive changes to the format specification — the .proto definitions
and the spec docs under docs/src/format/ — require a PMC vote before merge.
Minor edits such as typo fixes, wording, or formatting are excluded; use your
judgment.

If this is a meaningful format change:

  • Start a vote following the Lance community voting process.
    Format specification modifications need 3 binding +1 votes (excluding the
    proposer), held on GitHub Discussions, with a minimum voting period of 1 week.
  • Once the vote passes, link the completed vote in this PR. It should not be
    merged until the vote is linked.

github-actions Bot added the A-encoding (Encoding, IO, file reader/writer), A-format (On-disk format: protos and format spec docs), and enhancement (New feature or request) labels and removed the enhancement (New feature or request) label Jul 4, 2026

codecov Bot commented Jul 4, 2026

Labels

A-encoding (Encoding, IO, file reader/writer), A-format (On-disk format: protos and format spec docs), enhancement (New feature or request)