feat: add sparse structural encoding #7628

Draft
Xuanwo wants to merge 8 commits into main from xuanwo/sparse-structural-layout
@Xuanwo (Collaborator) commented Jul 4, 2026

This draft PR adds the Lance file version 2.3 sparse structural layout for sparse nested data.

Sparse structural metadata represents nested positions and counts semantically instead of relying on dense rep/def (repetition/definition level) buffers, with explicit support for empty/all/range position sets and empty/constant/explicit list counts. The writer can emit the sparse layout for structural pages that exceed mini-block rep/def limits, while v2.2 remains stable on existing layouts.
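To make the shape of this metadata concrete, here is a minimal Python sketch. It is not the PR's Rust types or protobuf messages: the names `PositionSet`, `ListCounts`, and `expand_lengths` are hypothetical, and the meaning given to each variant (a position set says which rows hold a non-empty list, the counts say how many items each of those rows holds) is an assumption read from the description above.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class PositionSet:
    """Which rows of a page are non-empty (hypothetical model)."""
    kind: str                                   # "empty", "all", or "range"
    bounds: Optional[Tuple[int, int]] = None    # half-open [start, end), only for "range"

    def rows(self, num_rows: int) -> range:
        if self.kind == "empty":
            return range(0)
        if self.kind == "all":
            return range(num_rows)
        if self.kind == "range":
            start, end = self.bounds
            return range(start, end)
        raise ValueError(f"unknown position set kind: {self.kind}")


@dataclass
class ListCounts:
    """How many items each selected row holds (hypothetical model)."""
    kind: str                                   # "empty", "constant", or "explicit"
    constant: int = 0
    explicit: Optional[List[int]] = None

    def counts(self, num_positions: int) -> List[int]:
        if self.kind == "empty":
            return [0] * num_positions
        if self.kind == "constant":
            return [self.constant] * num_positions
        if self.kind == "explicit":
            if len(self.explicit) != num_positions:
                raise ValueError("explicit counts must match the number of positions")
            return list(self.explicit)
        raise ValueError(f"unknown list count kind: {self.kind}")


def expand_lengths(num_rows: int, positions: PositionSet, counts: ListCounts) -> List[int]:
    """Rebuild the per-row list lengths that a dense encoding would store row by row."""
    lengths = [0] * num_rows
    rows = positions.rows(num_rows)
    for row, n in zip(rows, counts.counts(len(rows))):
        lengths[row] = n
    return lengths


# An hnsw-like page: the first 3 of 10 rows hold 32 values each.
print(expand_lengths(10, PositionSet("range", (0, 3)), ListCounts("constant", constant=32)))
# -> [32, 32, 32, 0, 0, 0, 0, 0, 0, 0]
```

In this model the page is described by two small records regardless of row count, which is the property the sparse layout is aiming for.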

The intent is to make sparse nested structures a first-class storage layout with smaller structural metadata, fewer pages, and better scan/take behavior on sparse workloads. The current draft still needs final review cleanup around protobuf tag allocation, reader-side version gating, and malformed sparse metadata validation before it should be marked ready.
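As an illustration of the two reader-side items still open, the sketch below shows what version gating and malformed-metadata checks could look like. It is an assumption-laden Python sketch, not the PR's reader code: `check_sparse_page`, its parameters, and the specific checks are hypothetical, and treating versions newer than 2.3 as acceptable is also an assumption.

```python
from typing import Optional, Sequence, Tuple

# From the PR description: the sparse layout belongs to Lance file version 2.3.
SPARSE_MIN_VERSION = (2, 3)


def check_sparse_page(
    file_version: Tuple[int, int],
    num_rows: int,
    bounds: Tuple[int, int],
    explicit_counts: Optional[Sequence[int]] = None,
) -> None:
    """Raise ValueError if a sparse page should be rejected (hypothetical checks)."""
    # Version gating: older files (for example v2.2) must not carry this layout.
    if file_version < SPARSE_MIN_VERSION:
        raise ValueError("sparse structural layout requires file version 2.3 or newer")

    # A half-open position range [start, end) must stay inside the page.
    start, end = bounds
    if not (0 <= start <= end <= num_rows):
        raise ValueError("position range falls outside the page")

    # Explicit counts must line up with the selected positions and be non-negative.
    if explicit_counts is not None:
        if len(explicit_counts) != end - start:
            raise ValueError("explicit counts do not match the number of positions")
        if any(c < 0 for c in explicit_counts):
            raise ValueError("list counts must be non-negative")
```

Rejecting bad metadata up front keeps a corrupted or hostile file from driving out-of-bounds reads later in decoding.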

Performance

100M-row S3 benchmark on lance-bench-ec2. Lower is better. Each Lance table used 102 objects. Ratios are relative to sparse; bold ratios mark 10x+ gaps. Parquet is included for size only and was written as a single ZSTD-compressed Parquet file from the same Arrow batches.

Use cases:

  • hnsw: List<UInt32> adjacency-list shape. The first ~14.3M rows are non-empty with 32 values per row; the remaining rows are empty.
  • uniform: List<UInt32> sparse singleton shape. One row every 10,000 rows is non-empty with 1 value; all other rows are empty.
  • deep: deeply nested sparse shape, Struct<events: List<Struct<id: Int32, tags: List<Int32>, pair: FixedSizeList<Int32; 2>>>>. One row every 4,096 rows has 1-2 nested events, with nullable struct/list/fixed-size-list layers and mixed empty/null nested children.
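The first two shapes can be sketched as per-row list lengths. This is a scaled-down illustration in plain Python, not the benchmark's data generator: the function names are hypothetical, and which row inside each block of 10,000 is non-empty is an assumption (the sketch uses the first).

```python
from typing import Dict, List


def hnsw_lengths(num_rows: int, num_non_empty: int, degree: int = 32) -> List[int]:
    """Leading rows hold `degree` values each; the remaining rows are empty."""
    return [degree if row < num_non_empty else 0 for row in range(num_rows)]


def uniform_lengths(num_rows: int, stride: int = 10_000) -> List[int]:
    """One singleton row per `stride` rows (assumed: the first row of each block)."""
    return [1 if row % stride == 0 else 0 for row in range(num_rows)]


def summarize(lengths: List[int]) -> Dict[str, int]:
    """Count rows, non-empty rows, and total leaf values."""
    return {
        "rows": len(lengths),
        "non_empty_rows": sum(1 for n in lengths if n > 0),
        "values": sum(lengths),
    }


# Scaled down from 100M rows so the example runs instantly.
print(summarize(hnsw_lengths(1_000, 143)))
# -> {'rows': 1000, 'non_empty_rows': 143, 'values': 4576}
print(summarize(uniform_lengths(100_000)))
# -> {'rows': 100000, 'non_empty_rows': 10, 'values': 10}
```

At the full 100M rows the uniform shape has only 10,000 non-empty rows, while a dense rep/def encoding still has to account for every one of the 100M rows, which is the gap the sparse layout targets.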

Size

| use case | sparse | miniblock | fullzip |
|---|---|---|---|
| hnsw | 1528.7 MiB | 1619.0 MiB (1.1x larger) | 2236.5 MiB (1.5x larger) |
| uniform | 116,192 B | 2.9 MiB (**26.5x larger**) | 473.9 MiB (**4,277x larger**) |
| deep | 1.1 MiB | 363.2 MiB (**334x larger**) | 948.1 MiB (**872x larger**) |
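The ratios in the table are each layout's size divided by the sparse size. A small Python sketch of that arithmetic for the uniform row follows; `ratio_to_sparse` is a hypothetical helper, and the inputs are the rounded sizes printed in the table rather than exact byte counts.

```python
MIB = 1024 * 1024  # bytes per MiB


def ratio_to_sparse(other_bytes: float, sparse_bytes: float) -> float:
    """How many times larger another layout is than the sparse layout."""
    return other_bytes / sparse_bytes


sparse_uniform = 116_192           # bytes, exact value from the table
miniblock_uniform = 2.9 * MIB      # table value, rounded to 0.1 MiB
fullzip_uniform = 473.9 * MIB      # table value, rounded to 0.1 MiB

print(f"fullzip:   {ratio_to_sparse(fullzip_uniform, sparse_uniform):,.0f}x larger")
# -> fullzip:   4,277x larger
print(f"miniblock: {ratio_to_sparse(miniblock_uniform, sparse_uniform):.1f}x larger")
# -> miniblock: 26.2x larger
```

The miniblock figure recomputed this way is slightly below the reported 26.5x, presumably because the table's ratios were derived from exact byte counts before the sizes were rounded for display.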

Parquet size-only comparison:

| use case | sparse | parquet_zstd | result |
|---|---|---|---|
| hnsw | 1528.7 MiB | 1183.0 MiB | Parquet is 1.3x smaller |
| uniform | 116,192 B | 546,096 B | sparse is 4.7x smaller |
| deep | 1.1 MiB | 2.2 MiB | sparse is 2.0x smaller |

Cold Reads

| use case | operation | sparse | miniblock | fullzip |
|---|---|---|---|---|
| hnsw | full scan | 2738.3 ms | 3213.4 ms (1.2x slower) | 4845.0 ms (1.8x slower) |
| hnsw | random take 1024 | 535.4 ms | 563.4 ms (1.1x slower) | 527.9 ms (1.01x faster) |
| uniform | full scan | 121.5 ms | 166.3 ms (1.4x slower) | 1966.7 ms (**16.2x slower**) |
| uniform | random take 1024 | 110.5 ms | 187.0 ms (1.7x slower) | 1233.1 ms (**11.2x slower**) |
| deep | full scan | 234.7 ms | 9439.2 ms (**40.2x slower**) | 11232.7 ms (**47.9x slower**) |
| deep | random take 1024 | 210.7 ms | 4415.6 ms (**21.0x slower**) | 4391.0 ms (**20.8x slower**) |

Warm Reads

| use case | operation | sparse | miniblock | fullzip |
|---|---|---|---|---|
| hnsw | full scan | 2037.8 ms | 2316.9 ms (1.1x slower) | 3330.4 ms (1.6x slower) |
| hnsw | random take 1024 | 255.6 ms | 261.0 ms (1.02x slower) | 256.2 ms (1.00x slower) |
| hnsw | empty take 1024 | 0.6 ms | 0.6 ms (1.01x slower) | 0.6 ms (1.1x slower) |
| hnsw | non-empty take 1024 | 41.9 ms | 52.0 ms (1.2x slower) | 39.2 ms (1.07x faster) |
| uniform | full scan | 108.2 ms | 875.7 ms (8.1x slower) | 883.7 ms (8.2x slower) |
| uniform | random take 1024 | 3.6 ms | 147.8 ms (**41.1x slower**) | 348.1 ms (**96.9x slower**) |
| uniform | empty take 1024 | 0.2 ms | 92.6 ms (**404x slower**) | 41.3 ms (**180x slower**) |
| uniform | non-empty take 1024 | 58.0 ms | 105.6 ms (1.8x slower) | 74.6 ms (1.3x slower) |
| deep | full scan | 193.0 ms | 4584.4 ms (**23.8x slower**) | 4508.8 ms (**23.4x slower**) |
| deep | random take 1024 | 9.5 ms | 811.7 ms (**85.9x slower**) | 1727.8 ms (**183x slower**) |
| deep | empty take 1024 | 0.4 ms | 155.6 ms (**353x slower**) | 38.9 ms (**88.3x slower**) |
| deep | non-empty take 1024 | 37.0 ms | 261.8 ms (7.1x slower) | 179.9 ms (4.9x slower) |

github-actions (bot) added the enhancement (New feature or request) label on Jul 4, 2026

github-actions (bot, Contributor) commented Jul 4, 2026

Important

This PR touches the Lance format specification.

Substantive changes to the format specification — the .proto definitions
and the spec docs under docs/src/format/ — require a PMC vote before merge.
Minor edits such as typo fixes, wording, or formatting are excluded; use your
judgment.

If this is a meaningful format change:

  • Start a vote following the Lance community voting process.
    Format specification modifications need 3 binding +1 votes (excluding the
    proposer), held on GitHub Discussions, with a minimum voting period of 1 week.
  • Once the vote passes, link the completed vote in this PR. It should not be
    merged until the vote is linked.

github-actions (bot) added the A-encoding (Encoding, IO, file reader/writer), A-format (On-disk format: protos and format spec docs), and enhancement (New feature or request) labels and removed the enhancement (New feature or request) label on Jul 4, 2026
codecov (bot) commented Jul 4, 2026

