Ranged scan hangs forever when it starts exactly at a write-batch boundary and reads ≥ ~256 MiB from that batch #7580

Description

@AyushExel

Calling scanner(offset=..., limit=...).to_table() (or take()) on a Lance dataset with data storage version 2.1 never returns. count_rows(), full scans, and most other ranged reads on the same dataset work fine.

It happens when two conditions hold at once:

  1. The scan range starts exactly at the row where the long batch begins (offset=68 hangs; offset=67 and offset=0 work).
  2. The requested range's total data is ≥ ~256 MiB across the selected columns.
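
The two conditions can be restated as a small predicate. This is only a sketch of the observed trigger, not of Lance's internals: the helper names are hypothetical, the threshold is assumed to be exactly 256 MiB (the report only establishes "~256 MiB"), and the bytes read from the batch are approximated as rows × bytes-per-row.

```python
from itertools import accumulate

MIB = 1024 * 1024
THRESHOLD_BYTES = 256 * MIB  # assumption: the report only says "~256 MiB"


def batch_starts(batch_rows):
    """Row index at which each write batch begins."""
    return [0, *accumulate(batch_rows)][:-1]


def matches_reported_trigger(offset, limit, batch_rows, bytes_per_row):
    """True when a read fits the pattern described in this report."""
    starts = batch_starts(batch_rows)
    if offset not in starts:
        return False  # condition 1: range must start on a batch boundary
    rows_in_batch = batch_rows[starts.index(offset)]
    rows_read = min(limit, rows_in_batch)
    # condition 2: bytes read from that one batch reach the threshold
    return rows_read * bytes_per_row >= THRESHOLD_BYTES
```

With the repro's batches (68, 300) and 894,976 B/row, the predicate is true for offset=68, limit=300 and false for offset=68, limit=299, for offset=67, and for offset=0, which matches the observed results.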

Two more observations that may help locate it:

  • Passing io_buffer_size=512MiB to the scanner makes the same read complete in ~1 s. With io_buffer_size=256MiB it still hangs. The ScannerBuilder.io_buffer_size docstring says the default is 2 GiB, but the hang appears at ~256 MiB on defaults.
  • The same rows written as uniform 64-row batches read fine with the identical offset/limit.
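
Because the failing read never returns, it helps to probe each variant in a child process with a timeout. The sketch below uses only the standard library; `probe`, `SCAN`, and the 60 s default are illustrative choices, not part of Lance (the working reads in this report finish in about 1 s).

```python
import subprocess
import sys

# Snippet to probe; the dataset path matches the reproduction below.
SCAN = (
    "import lance; "
    "lance.dataset('tbl.lance').scanner(offset=68, limit=300).to_table()"
)


def probe(snippet, timeout_s=60.0):
    """Run `snippet` in a fresh interpreter; return 'ok', 'error', or 'HANG'."""
    try:
        done = subprocess.run(
            [sys.executable, "-c", snippet],
            timeout=timeout_s,       # the child is killed when this expires
            capture_output=True,
        )
    except subprocess.TimeoutExpired:
        return "HANG"
    return "ok" if done.returncode == 0 else "error"
```

Calling `probe(SCAN)` after the dataset has been written should report the hang without blocking the calling session. Adding `io_buffer_size=512 * 1024 * 1024` to the `scanner(...)` call in the snippet gives the variant that completes.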

Environment

  • pylance 7.2.0-beta.4, pyarrow 24.0.0
  • data storage version 2.1, single fragment, single data file

Reproduction

import numpy as np, pyarrow as pa, lance

DIMS = (219648, 4096)          # FixedSizeList<f32> widths → 894,976 B/row
SHORT, LONG = 68, 300
SCHEMA = pa.schema([pa.field(f'wide_{i}', pa.list_(pa.float32(), d))
                    for i, d in enumerate(DIMS)])

def batch(n, rng):
    cols = []
    for f in SCHEMA:
        dim = f.type.list_size
        flat = rng.standard_normal(n * dim).astype(np.float32)
        cols.append(pa.FixedSizeListArray.from_arrays(pa.array(flat, type=pa.float32()), dim))
    return pa.record_batch(cols, schema=SCHEMA)

rng = np.random.default_rng(0)
reader = pa.RecordBatchReader.from_batches(SCHEMA, (batch(n, rng) for n in (SHORT, LONG)))
lance.write_dataset(reader, 'tbl.lance', schema=SCHEMA)

ds = lance.dataset('tbl.lance')
ds.count_rows()                                     # 368 — fine
ds.scanner(offset=68, limit=300).to_table()         # <-- never returns

| Read | Range size | Result |
|---|---|---|
| offset=68, limit=300 | 256.06 MiB | HANG |
| offset=68, limit=299 | 255.20 MiB | ok, 0.8 s |
| offset=67, limit=300 | 256.06 MiB | ok, 1.0 s |
| offset=1, limit=300 | 256.06 MiB | ok, 1.3 s |
| offset=0, limit=300 | 256.06 MiB | ok, 1.1 s |
| full scan (368 rows) | 314 MiB | ok, 1.4 s |
| take(range(68, 368)) | 256.06 MiB | HANG |
| offset=68, limit=300, batch_size=16 | 256.06 MiB | HANG |
| offset=68, limit=300, io_buffer_size=512MiB | 256.06 MiB | ok, 0.8 s |
| offset=68, limit=300, io_buffer_size=256MiB | 256.06 MiB | HANG |
| either column alone, offset=68, limit=300 | ≤ 251 MiB | ok |
| same rows written as uniform 64-row batches, offset=68, limit=300 | 256.06 MiB | ok, 1.0 s |

Column widths only matter through bytes-per-row: dims=(219648, 4000) → 255.95 MiB → ok; (219648, 4096) → 256.06 MiB → hangs. A single-column dataset hangs too once its range alone crosses the threshold.
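
The range sizes can be approximated from the column widths alone. A minimal sketch, assuming the size is just the raw float32 payload (4 bytes per value) with no validity or padding overhead; the function names are illustrative, and the figures quoted above may include a small amount of buffer overhead, so the last digit can differ.

```python
MIB = 1024 * 1024
F32_BYTES = 4  # size of one float32 value


def bytes_per_row(dims):
    """Raw payload of one row across FixedSizeList<f32> columns of these widths."""
    return sum(dims) * F32_BYTES


def range_mib(dims, rows):
    """Approximate raw size of `rows` rows, in MiB."""
    return rows * bytes_per_row(dims) / MIB


# The two width pairs from the paragraph above, at limit=300.
hanging = range_mib((219648, 4096), 300)   # just above 256 MiB
passing = range_mib((219648, 4000), 300)   # just below 256 MiB
```

Under this approximation the two configurations fall on opposite sides of 256 MiB, roughly 0.05 MiB above and below it.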
