Skip to content

perf(arrow): zero-copy BFloat16Array::from(Vec<bf16>) via Buffer::from_vec#7614

Open
LuciferYang wants to merge 1 commit into
lance-format:mainfrom
LuciferYang:perf/bf16-from-vec-zero-copy
Open

perf(arrow): zero-copy BFloat16Array::from(Vec<bf16>) via Buffer::from_vec#7614
LuciferYang wants to merge 1 commit into
lance-format:mainfrom
LuciferYang:perf/bf16-from-vec-zero-copy

Conversation

@LuciferYang

Copy link
Copy Markdown
Contributor

What

Follow-up to #7500 (wjones127's suggestion in #7500 (comment)). BFloat16Array::from(Vec<bf16>) still walked the input a second time, copying two bytes per element into a freshly allocated MutableBuffer. This hands the input Vec straight to Buffer::from_vec, so the existing allocation is reused as-is — no per-element copy, no MutableBuffer::extend loop.

Buffer::from_vec requires T: ArrowNativeType, which bf16 does not implement (arrow-rs registers the trait only for f16/f32/f64 among the float family; Apache Arrow has no bf16 primitive type, which is why Lance stores it as FixedSizeBinary(2)). The gap is bridged with bytemuck::cast_vec::<bf16, u16>, which reinterprets the allocation in place.

Why it is sound

  • bf16 is #[repr(transparent)] over u16 in half 2.7, so it has the same size (2) and alignment (2) as u16. cast_vec therefore hits its equal-size/equal-align path and reuses the allocation without realloc; a mismatch would panic, never reach UB.
  • half's bytemuck feature (enabled here) derives Zeroable + Pod on bf16, satisfying cast_vec's trait bounds.
  • The crate-root #[cfg(not(target_endian = "little"))] compile_error! added in docs(arrow): add SAFETY comments to lance-arrow unsafe blocks #7511 ties the reinterpretation to little-endian hosts, matching the FixedSizeBinary(2) byte order Lance writes elsewhere. Big-endian builds fail to compile, so the output is byte-identical to the previous per-element to_le_bytes() path.

Changes

  • From<Vec<bf16>>: replace the MutableBuffer copy loop with bytemuck::cast_vec + Buffer::from_vec.
  • half: add the bytemuck feature; add bytemuck (default-features = false, extern_crate_alloc) as a direct workspace dependency (already resolved transitively, now promoted).
  • Refresh python/Cargo.lock and java/lance-jni/Cargo.lock (the workspace-excluded lockfiles) to record bytemuck_derive.
  • as_slice's alignment doc/SAFETY comment previously asserted every value buffer comes from MutableBuffer (≥32-byte aligned); reworded to also cover the new Buffer::from_vec::<u16> path (2-byte aligned), both of which satisfy bf16's 2-byte requirement.

Tests

test_basics now pins the raw little-endian bytes emitted by From<Vec<bf16>> ([0x80,0x3F, 0x00,0x40, 0x40,0x40] for [1.0, 2.0, 3.0]) via FixedSizeBinaryArray::value, so a layout or byte-order regression is caught directly rather than only through Debug formatting. The prior assert_eq!(array, array2) compared two outputs of the same From impl and could not catch a regression.

cargo fmt, cargo clippy -p lance-arrow --tests --benches -- -D warnings, and cargo test -p lance-arrow (93 unit + 6 doc tests) are green.

…m_vec

Follow-up to PR lance-format#7500 (wjones127's suggestion:
lance-format#7500 (comment)).

The prior fix eliminated the per-element Vec<u8> allocation but still
walked the input a second time, copying two bytes per element into a
freshly allocated MutableBuffer. Hand the input Vec directly to
Buffer::from_vec so the allocation is reused as-is — no per-element
copy, no MutableBuffer::extend loop.

Buffer::from_vec requires `T: ArrowNativeType`, which bf16 does not
implement (arrow-rs registers the trait only for f16/f32/f64 among the
float family; Apache Arrow has no bf16 primitive type). Bridge the gap
with bytemuck::cast_vec::<bf16, u16>. Soundness comes from three facts:

1. bf16 is #[repr(transparent)] over u16 in half 2.7, so it has the same
   size, alignment, and layout as u16.
2. half's bytemuck feature (enabled here) derives Zeroable + Pod on
   bf16, and cast_vec's runtime size/align check therefore always
   passes.
3. The crate-root #[cfg(not(target_endian = "little"))] compile_error!
   (added in lance-format#7511) ties the reinterpretation to little-endian hosts,
   matching the FixedSizeBinary(2) byte order Lance writes elsewhere.
   Big-endian builds fail to compile.

Dependency additions:
- half: add "bytemuck" feature (workspace Cargo.toml).
- bytemuck: workspace dep with default-features = false + the
  "extern_crate_alloc" feature that cast_vec requires. bytemuck 1.25.0
  is already resolved transitively; this promotes it to a direct
  dependency.
- rust/lance-arrow/Cargo.toml: pull in bytemuck.
- python/Cargo.lock and java/lance-jni/Cargo.lock refreshed to record
  bytemuck_derive (the excluded lockfiles per the workspace's three-
  lockfile rule).

Verified: cargo fmt, cargo clippy -p lance-arrow --tests --benches
-- -D warnings, cargo test -p lance-arrow (93 unit + 6 doc tests).
Existing test_basics pins BFloat16Array::from(Vec) == from_iter_values,
so behavior is regression-tested.
@github-actions github-actions Bot added A-python Python bindings A-java Java bindings + JNI A-deps Dependency updates performance labels Jul 3, 2026
@codecov

codecov Bot commented Jul 3, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@LuciferYang

Copy link
Copy Markdown
Contributor Author

cc @wjones127 , ‌Refer to the previous discussion‌: #7500 (comment) , I've temporarily adopted Option 2. If you think Option 1 should be implemented, please let me know.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

A-deps Dependency updates A-java Java bindings + JNI A-python Python bindings performance

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant