feat: support multi-base tables in merge insert with target base routing by jackye1995 · Pull Request #7610 · lance-format/lance

jackye1995 · 2026-07-03T07:00:19Z

Summary

Merge insert previously hard-coded target_bases: None on all of its write paths.
This PR makes merge insert work with multi-base datasets and adds target base routing:

- MergeInsertBuilder::target_bases(Vec<u32>) and target_base_names_or_paths(Vec<String>) in Rust; MergeInsertBuilder.target_bases(List[str]) in Python, with the same name-or-path semantics as write_dataset(target_bases=...).
- New fragments are distributed round-robin across the target bases with DataFile.base_id stamped, identically to a normal write, on all three write paths: the v2 plan (FullSchemaMergeInsertExec), the legacy indexed full-schema path, and the legacy partial-schema path.
- Column patch files (partial-schema in-place updates) and deletion files always stay in primary storage, consistent with dataset.update() and dataset.delete().
- Target bases are resolved per retry attempt against the refreshed manifest, reusing validate_and_resolve_target_bases (validation, name/path lookup, per-base credentials).
- Delete-only merges write no data files but still validate requested target bases.
- Omitting target_bases keeps the existing default: new files go to primary storage.
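
The round-robin routing described above can be sketched in a few lines of plain Python. This is an illustration only, not Lance's writer code: the `DataFile` dataclass and `assign_bases` helper are hypothetical stand-ins, and using `None` to represent primary storage is an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DataFile:
    """Hypothetical stand-in for a data file record (not the Lance type)."""
    path: str
    base_id: Optional[int] = None  # None models "primary storage" in this sketch


def assign_bases(paths: List[str], target_bases: Optional[List[int]]) -> List[DataFile]:
    """Stamp each new file with a base id, rotating through target_bases."""
    if target_bases is None:
        # Default behaviour: no routing requested, everything stays in primary storage.
        return [DataFile(p) for p in paths]
    return [
        DataFile(p, base_id=target_bases[i % len(target_bases)])
        for i, p in enumerate(paths)
    ]


routed = assign_bases(["f0.lance", "f1.lance", "f2.lance", "f3.lance"], [1, 2])
print([(f.path, f.base_id) for f in routed])
```

With two target bases and four files, the files alternate between base 1 and base 2; passing `None` leaves every `base_id` unset.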

Also fixes a pre-existing panic: an empty target_bases list hit remainder-by-zero in
round-robin writer selection; it is now rejected with a descriptive error.
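
As a rough illustration of why an empty list used to fail, the sketch below contrasts an unchecked round-robin pick with a validated one. Both functions are hypothetical and written for this example; the real check lives in validate_and_resolve_target_bases, whose error type and message are not shown in this PR text.

```python
from typing import Sequence


def pick_base_unchecked(file_index: int, target_bases: Sequence[int]) -> int:
    # With an empty list, len(target_bases) is 0 and the modulo has a zero divisor.
    return target_bases[file_index % len(target_bases)]


def pick_base(file_index: int, target_bases: Sequence[int]) -> int:
    # Validate up front so the caller sees a descriptive error instead of a crash.
    if len(target_bases) == 0:
        raise ValueError("target_bases must contain at least one base")
    return target_bases[file_index % len(target_bases)]


try:
    pick_base_unchecked(0, [])
except ZeroDivisionError as exc:
    print("unchecked:", type(exc).__name__)

try:
    pick_base(0, [])
except ValueError as exc:
    print("checked:", exc)
```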

Tests cover merge insert on multi-base tables across all execution paths (with and
without routing), both base layouts (is_dataset_root true/false), round-robin
distribution across multiple files, mixed-base fragments from column patches, and
validation errors, in both Rust and Python.

- Base id 0 (PRIMARY_BASE_ID), or an entry equal to the dataset URI in the names variant, includes the dataset's primary storage in the rotation, so target_bases([0, 1, 2]) spreads new files across primary plus bases 1 and 2. This applies to normal writes, fragment writes, and merge insert.
- A target_all_bases(include_primary) convenience is available on WriteParams, MergeInsertBuilder, and the Python write_dataset/write_fragments/merge builder (include_primary defaults to True in Python). It resolves at execution time to every registered base, with primary first when included; CREATE mode includes initial_bases in the rotation.
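
The two resolution rules above can be sketched as follows. The helper names, the dictionary-based lookup, and the ordering of non-primary bases (input order) are assumptions made for illustration; only the constant name PRIMARY_BASE_ID and the "primary first when included" rule come from the description.

```python
from typing import Dict, List

PRIMARY_BASE_ID = 0  # reserved id that refers to the dataset's primary storage


def resolve_all_bases(registered_ids: List[int], include_primary: bool = True) -> List[int]:
    """Sketch of target_all_bases: every registered base, primary first if included."""
    rotation = [PRIMARY_BASE_ID] if include_primary else []
    rotation.extend(b for b in registered_ids if b != PRIMARY_BASE_ID)
    if not rotation:
        raise ValueError("no bases available to write to")
    return rotation


def resolve_names_or_paths(
    entries: List[str], dataset_uri: str, ids_by_name_or_path: Dict[str, int]
) -> List[int]:
    """Sketch of name-or-path lookup: the dataset URI itself maps to primary storage."""
    resolved = []
    for entry in entries:
        if entry == dataset_uri:
            resolved.append(PRIMARY_BASE_ID)
        elif entry in ids_by_name_or_path:
            resolved.append(ids_by_name_or_path[entry])
        else:
            raise ValueError(f"unknown target base: {entry!r}")
    return resolved


print(resolve_all_bases([1, 2]))
print(resolve_names_or_paths(["s3://bucket/ds", "cold"], "s3://bucket/ds", {"cold": 2}))
```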

Follow-ups (out of scope)

Java bindings do not expose target_bases yet. Java has no multi-base write support
at all today, so parity should come with general Java multi-base write support.

codecov · 2026-07-03T08:11:14Z

Codecov Report

❌ Patch coverage is 88.03318% with 101 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
rust/lance/src/dataset/write/merge_insert.rs	64.40%	59 Missing and 4 partials ⚠️
...lance/src/dataset/write/merge_insert/exec/write.rs	47.50%	19 Missing and 2 partials ⚠️
rust/lance/src/dataset/write.rs	97.48%	9 Missing and 5 partials ⚠️
rust/lance/src/dataset/optimize.rs	77.77%	2 Missing ⚠️
rust/lance/src/dataset/fragment/write.rs	97.91%	0 Missing and 1 partial ⚠️


Merge insert now works with multi-base datasets and can route new fragments across target bases round-robin like a normal write, via MergeInsertBuilder::target_bases / target_base_names_or_paths in Rust and MergeInsertBuilder.target_bases in Python. Column patch files and deletion files stay in primary storage. Also reject empty target base lists in validate_and_resolve_target_bases, which previously panicked.

Extend cleanup_data_fragments to delete files whose base_id matches a resolved target base via that base's object store, wire it through do_write_fragments and the legacy merge path, and add the missing post-write failure cleanup in FullSchemaMergeInsertExec.

… midway

A task failure in update_fragments previously returned early, leaving already-written new fragments (including ones routed to target bases) behind. Abort in-flight tasks and delete collected new fragments before surfacing the error.

Reserve base id 0 (PRIMARY_BASE_ID) to refer to the dataset's primary storage in target_bases, and resolve a name-or-path entry equal to the dataset URI the same way, so writes and merge inserts can round-robin across primary storage plus registered bases, e.g. target_bases([0, 1, 2]).

…mmit conflict

A RetryableCommitConflict discards the attempt and re-executes it, so the attempt's data files are provably uncommitted; delete them (including files routed to target bases, which version cleanup never scans) before retrying. Other commit errors may be ambiguous about whether the manifest was written, so files are left alone there.
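
A minimal sketch of this retry policy, assuming hypothetical callables for writing, committing, and deleting: files are deleted only when the commit reports a retryable conflict, and any other error propagates with the files left in place. The exception class here is a local stand-in that borrows the name from the commit message.

```python
class RetryableCommitConflict(Exception):
    """Local stand-in for a conflict that guarantees nothing was committed."""


def commit_with_retries(write_attempt, commit, delete_files, max_attempts=3):
    for attempt in range(max_attempts):
        files = write_attempt(attempt)  # each attempt writes fresh data files
        try:
            return commit(files)
        except RetryableCommitConflict:
            # The attempt is discarded, so its files are provably uncommitted.
            delete_files(files)
        # Any other exception propagates untouched: the manifest may or may
        # not have been written, so deleting files would be unsafe.
    raise RuntimeError("commit retries exhausted")


deleted = []
outcomes = iter([RetryableCommitConflict("concurrent writer"), None])


def fake_commit(files):
    outcome = next(outcomes)
    if outcome is not None:
        raise outcome
    return files


committed = commit_with_retries(
    lambda n: [f"attempt-{n}/data.lance"], fake_commit, deleted.extend
)
print("committed:", committed, "| deleted:", deleted)
```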

target_all_bases(include_primary) resolves at execution time to every base registered in the manifest, with the dataset's primary storage as the first rotation slot when included. Available on WriteParams (with_target_all_bases), MergeInsertBuilder, and in Python on write_dataset, write_fragments, and merge_insert (include_primary defaults to True). CREATE mode includes initial_bases in the rotation.

…dataset

FragmentCreateBuilder skipped target base resolution when only target_all_bases was set, silently writing to primary storage. Also resolve conflict-retry cleanup against the refreshed per-attempt dataset instead of the executor's original one, so bases added between attempts still resolve, and add target_all_bases to the _write_fragments stubs.

github-actions Bot added the A-python (Python bindings) and enhancement (New feature or request) labels Jul 3, 2026

BubbleCal approved these changes Jul 3, 2026


jackye1995 added 3 commits July 3, 2026 07:20

jackye1995 force-pushed the merge-insert-multi-base branch from 2142dab to b61ca17 July 3, 2026 14:26

jackye1995 added 4 commits July 3, 2026 08:06

jackye1995 merged commit 24b085f into lance-format:main Jul 3, 2026
30 checks passed

jackye1995 deleted the merge-insert-multi-base branch July 3, 2026 17:27

sezruby mentioned this pull request Jul 4, 2026

feat(delete): distributed fragment-scoped delete + execute_batch delete support sezruby/lance#7

Open

jackye1995 merged 7 commits into lance-format:main from jackye1995:merge-insert-multi-base
