A scanner(offset=..., limit=...).to_table() (and take()) on a Lance v2.1 dataset never returns. count_rows(), full scans, and most other ranged reads on the same dataset work fine.
It happens when two things line up:
- The scan range starts exactly at the row where the long batch began (offset=68 hangs; offset=67 and offset=0 work).
- The requested range's total data is ≥ ~256 MiB across the selected columns.
Two more observations that may help locate it:
- Passing io_buffer_size=512MiB to the scanner makes the same read complete in ~1 s. With io_buffer_size=256MiB it still hangs. The ScannerBuilder.io_buffer_size docstring says the default is 2 GiB, but the hang appears at ~256 MiB on defaults.
- The same rows written as uniform 64-row batches read fine with the identical offset/limit.
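For anyone who needs a stopgap, the io_buffer_size observation can be wrapped in a small helper. This is only a sketch: `read_range` and `io_buffer_mib` are names made up for illustration, and it assumes `io_buffer_size` is passed to `scanner()` as a byte count.

```python
MIB = 1024 * 1024

def read_range(ds, offset, limit, io_buffer_mib=None):
    """Read rows [offset, offset + limit) via ds.scanner(), optionally
    overriding io_buffer_size (hypothetical helper, not part of lance)."""
    kwargs = {"offset": offset, "limit": limit}
    if io_buffer_mib is not None:
        # Assumption: io_buffer_size is expressed in bytes.
        kwargs["io_buffer_size"] = io_buffer_mib * MIB
    return ds.scanner(**kwargs).to_table()

# Usage against the reproduction dataset (requires pylance):
#   import lance
#   ds = lance.dataset('tbl.lance')
#   read_range(ds, 68, 300, io_buffer_mib=512)  # reported to complete in ~1 s
#   read_range(ds, 68, 300)                     # reported to hang on defaults
```

This only sidesteps the hang for this particular range. It does not explain why the threshold sits near 256 MiB rather than the documented 2 GiB.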
Environment
- pylance 7.2.0-beta.4, pyarrow 24.0.0
- data storage version 2.1, single fragment, single data file
Reproduction
```python
import numpy as np, pyarrow as pa, lance

DIMS = (219648, 4096)   # FixedSizeList<f32> widths → 894,976 B/row
SHORT, LONG = 68, 300   # rows in the first and second write batch
SCHEMA = pa.schema([pa.field(f'wide_{i}', pa.list_(pa.float32(), d))
                    for i, d in enumerate(DIMS)])

def batch(n, rng):
    # Build one n-row record batch of random f32 fixed-size lists.
    cols = []
    for f in SCHEMA:
        dim = f.type.list_size
        flat = rng.standard_normal(n * dim).astype(np.float32)
        cols.append(pa.FixedSizeListArray.from_arrays(pa.array(flat, type=pa.float32()), dim))
    return pa.record_batch(cols, schema=SCHEMA)

rng = np.random.default_rng(0)
# Write a 68-row batch followed by a 300-row batch.
reader = pa.RecordBatchReader.from_batches(SCHEMA, (batch(n, rng) for n in (SHORT, LONG)))
lance.write_dataset(reader, 'tbl.lance', schema=SCHEMA)

ds = lance.dataset('tbl.lance')
ds.count_rows()                                   # 368 — fine
ds.scanner(offset=68, limit=300).to_table()       # <-- never returns
```
| Read | Range size | Result |
| --- | --- | --- |
| offset=68, limit=300 | 256.06 MiB | HANG |
| offset=68, limit=299 | 255.20 MiB | ok, 0.8 s |
| offset=67, limit=300 | 256.06 MiB | ok, 1.0 s |
| offset=1, limit=300 | 256.06 MiB | ok, 1.3 s |
| offset=0, limit=300 | 256.06 MiB | ok, 1.1 s |
| full scan (368 rows, 314 MiB) | n/a | ok, 1.4 s |
| take(range(68, 368)) | 256.06 MiB | HANG |
| offset=68, limit=300, batch_size=16 | 256.06 MiB | HANG |
| offset=68, limit=300, io_buffer_size=512MiB | 256.06 MiB | ok, 0.8 s |
| offset=68, limit=300, io_buffer_size=256MiB | 256.06 MiB | HANG |
| either column alone, offset=68, limit=300 | ≤ 251 MiB | ok |
| same rows written as uniform 64-row batches, offset=68, limit=300 | 256.06 MiB | ok, 1.0 s |
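Because the failing reads never return, it can help to run each case in a child process with a timeout so that one hang does not block the rest of the matrix. The sketch below is one way to do that; `classify_read` and the 60 s timeout are illustrative choices, and a timeout cannot by itself distinguish a true hang from a very slow read (the successful reads here finish in about 1 s, so the gap is large).

```python
import subprocess
import sys
import time

def classify_read(snippet, timeout_s=60.0):
    """Run `snippet` in a fresh Python process.

    Returns (status, seconds) where status is "ok", "error", or "HANG"
    (the child exceeded timeout_s and was killed).
    """
    start = time.monotonic()
    try:
        proc = subprocess.run([sys.executable, "-c", snippet],
                              timeout=timeout_s, capture_output=True)
        status = "ok" if proc.returncode == 0 else "error"
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising the timeout.
        status = "HANG"
    return status, time.monotonic() - start

if __name__ == "__main__":
    # Assumes 'tbl.lance' was already written by the reproduction script.
    READ = ("import lance; "
            "lance.dataset('tbl.lance').scanner(offset={o}, limit={l}).to_table()")
    for o, l in [(68, 300), (68, 299), (67, 300)]:
        status, secs = classify_read(READ.format(o=o, l=l))
        print(f"offset={o}, limit={l}: {status} ({secs:.1f} s)")
```

Running each read in its own process also means a stuck native thread from one case cannot affect the next.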
Column widths only matter through bytes-per-row: dims=(219648, 4000) → 255.95 MiB → ok; (219648, 4096) → 256.06 MiB → hangs. A single-column dataset hangs too once its range alone crosses the threshold.
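The bytes-per-row arithmetic can be checked with a raw estimate that counts only the f32 payload. `range_mib` is a name introduced here for illustration. The estimate lands within about 0.01 MiB of the range sizes quoted in this report, which may reflect rounding or a small overhead it does not model.

```python
MIB = 1024 * 1024
F32_BYTES = 4

def bytes_per_row(dims):
    """Raw payload bytes for one row of FixedSizeList<f32> columns of these widths."""
    return sum(dims) * F32_BYTES

def range_mib(dims, rows):
    """Raw payload size, in MiB, of `rows` rows."""
    return bytes_per_row(dims) * rows / MIB

cases = [
    ((219648, 4096), 300),  # hangs in the report: just above 256 MiB
    ((219648, 4096), 299),  # ok in the report: below 256 MiB
    ((219648, 4000), 300),  # ok in the report: just below 256 MiB
]
for dims, rows in cases:
    size = range_mib(dims, rows)
    side = "above" if size >= 256 else "below"
    print(f"dims={dims}, rows={rows}: {size:.3f} MiB ({side} 256 MiB)")
```

The two 300-row cases differ by only 96 floats per row, yet they fall on opposite sides of 256 MiB, which is consistent with a byte threshold rather than anything specific to the column widths.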