
fix(blob): read empty blobs correctly under physical projection #7599

Open
zhangyang0418 wants to merge 1 commit into lance-format:main from zhangyang0418:fix/blob-empty-value-physical-projection-pr


@zhangyang0418

Contributor

Fixes #7598.

Summary

  • Track whether v2.1 blob descriptors actually have out-of-line data instead of inferring byte assignment from rep/def values.
  • Skip zero-length blob I/O ranges in the v2.0 legacy blob decoder and re-expand empty bytes in row order.
  • Preserve empty range result slots in FileScheduler::submit_request, matching the EncodingsIo contract.
  • Add Rust and Python regression coverage for empty non-null blobs scanned as binary.
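The first bullet can be illustrated with a minimal sketch. It is not the code from this PR: `BlobSlot`, `assign_by_def`, and `assign_by_has_data` are hypothetical names, and `Vec<u8>` stands in for the real buffer type. It only shows why keying buffer assignment on the definition level shifts every later value once an empty non-null blob appears, and how an explicit `has_data` flag avoids that.

```rust
/// Hypothetical, simplified stand-in for one decoded blob descriptor.
#[derive(Debug, Clone, Copy)]
pub struct BlobSlot {
    /// Definition level; 0 means "non-null" in this sketch.
    pub def: u16,
    /// True only when the descriptor had `size > 0`.
    pub has_data: bool,
}

/// Old behavior (simplified): every non-null slot takes the next buffer.
pub fn assign_by_def(slots: &[BlobSlot], buffers: &[Vec<u8>]) -> Vec<Option<Vec<u8>>> {
    let mut next = buffers.iter();
    slots
        .iter()
        .map(|slot| {
            if slot.def == 0 {
                // An empty blob also lands here and consumes a buffer it does not own.
                Some(next.next().cloned().unwrap_or_default())
            } else {
                None // null row
            }
        })
        .collect()
}

/// Fixed behavior (simplified): only slots with out-of-line data take a buffer.
pub fn assign_by_has_data(slots: &[BlobSlot], buffers: &[Vec<u8>]) -> Vec<Option<Vec<u8>>> {
    let mut next = buffers.iter();
    slots
        .iter()
        .map(|slot| {
            if slot.def != 0 {
                None // null row
            } else if slot.has_data {
                Some(next.next().cloned().unwrap_or_default())
            } else {
                Some(Vec::new()) // empty, non-null blob: no buffer consumed
            }
        })
        .collect()
}
```

With slots `[data, empty, data]` and two buffers, `assign_by_def` hands the second buffer to the empty slot and leaves the last real blob empty, while `assign_by_has_data` keeps both buffers on the rows that own them.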

Testing

  • cargo fmt --all
  • cargo test -p lance-io scheduler::tests::test_empty_ranges_preserve_result_order
  • cargo test -p lance-io scheduler::tests::test_split_coalesce
  • cargo test -p lance-encoding test_blob_round_trip_with_empty_value
  • cargo test -p lance-encoding previous::encodings::logical::blob::tests
  • make install from python/
  • uv run pytest python/tests/test_blob.py::test_scan_blob_as_binary_with_empty_value from python/
  • uv run pytest python/tests/test_blob.py -k "scan_blob_as_binary" from python/
  • uv run make lint from python/

Not run: the root cargo clippy --all --tests --benches -- -D warnings check had not completed when this PR was opened.

An empty (non-null, zero-length) blob is encoded as a descriptor with
`position == 0, size == 0`. Reading a blob column as binary (physical
projection / BlobHandling::AllBinary) mishandled these on two independent
decode paths, causing every value after the first empty blob to be read
back empty (or the scan to error out entirely).

- v2.1 structural path
  (encodings/logical/primitive/blob.rs): byte buffers were assigned to
  blobs using `blob.def == 0`. An empty blob decodes to `rep == 0,
  def == 0`, identical to a real blob, so it stole a buffer meant for a
  following real blob and shifted every subsequent assignment. Track an
  explicit `has_data` flag (set when the descriptor had `size > 0`) and
  key byte assignment off that instead.

- v2.0 legacy path
  (previous/encodings/logical/blob.rs): one I/O range was submitted per
  row, including a degenerate `0..0` range for empty blobs. That range
  never "overlaps" anything in the scheduler's coalescing/reassembly
  step (`is_overlapping` is always false for an empty interval), so the
  scheduler dropped it and misaligned the bytes for every subsequent
  row. Submit ranges only for `size > 0` rows and splice empty `Bytes`
  back in for the skipped rows.
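The skip-and-splice approach for the v2.0 path can be sketched roughly as follows. This is an illustration under assumptions, not the decoder's actual code: `ranges_to_submit` and `reexpand` are hypothetical helpers, and `Vec<u8>` stands in for `Bytes` so the example needs only the standard library.

```rust
use std::ops::Range;

/// Keep only the ranges that actually cover bytes; `0..0` style ranges are
/// never sent to the I/O layer.
pub fn ranges_to_submit(ranges: &[Range<u64>]) -> Vec<Range<u64>> {
    ranges.iter().filter(|r| r.start < r.end).cloned().collect()
}

/// Re-expand the fetched buffers to one entry per row, in row order,
/// inserting an empty buffer for every row whose range was skipped.
pub fn reexpand(ranges: &[Range<u64>], fetched: Vec<Vec<u8>>) -> Vec<Vec<u8>> {
    let mut fetched = fetched.into_iter();
    ranges
        .iter()
        .map(|r| {
            if r.start < r.end {
                // Assumes the I/O layer returned one buffer per submitted range.
                fetched.next().expect("one fetched buffer per non-empty range")
            } else {
                Vec::new()
            }
        })
        .collect()
}
```

For rows with ranges `0..3`, `0..0`, and `3..5`, only `0..3` and `3..5` would be submitted, and the two returned buffers are placed back at rows 0 and 2 with an empty buffer at row 1.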

Existing round-trip coverage only exercised null values, never an empty
non-null value interleaved with real data. Add a Rust round-trip test
(covers v2.0 and v2.1) and a Python end-to-end scan test that reproduce
the regression and now pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@github-actions github-actions bot added the A-python (Python bindings), A-encoding (Encoding, IO, file reader/writer), and bug (Something isn't working) labels Jul 3, 2026
@codecov

codecov Bot commented Jul 3, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.



Labels

A-encoding Encoding, IO, file reader/writer A-python Python bindings bug Something isn't working


Development

Successfully merging this pull request may close these issues.

Blob physical projection misreads empty non-null values
