feat: support multi-base tables in merge insert with target base routing #7610

Merged: jackye1995 merged 7 commits into lance-format:main from jackye1995:merge-insert-multi-base on Jul 3, 2026
@jackye1995 jackye1995 commented Jul 3, 2026

Summary

Merge insert previously hard-coded target_bases: None on all of its write paths.
This PR makes merge insert work with multi-base datasets and adds target base routing:

  • MergeInsertBuilder::target_bases(Vec<u32>) and target_base_names_or_paths(Vec<String>)
    in Rust; MergeInsertBuilder.target_bases(List[str]) in Python, with the same
    name-or-path semantics as write_dataset(target_bases=...).
  • New fragments are distributed round-robin across the target bases with
    DataFile.base_id stamped, identically to a normal write, on all three write paths:
    the v2 plan (FullSchemaMergeInsertExec), the legacy indexed full-schema path, and
    the legacy partial-schema path.
  • Column patch files (partial-schema in-place updates) and deletion files always stay
    in primary storage, consistent with dataset.update() and dataset.delete().
  • Target bases are resolved per retry attempt against the refreshed manifest, reusing
    validate_and_resolve_target_bases (validation, name/path lookup, per-base
    credentials).
  • Delete-only merges write no data files but still validate requested target bases.
  • Omitting target_bases keeps the existing default: new files go to primary storage.
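The round-robin routing in the list above can be pictured with a minimal Python sketch. It is not Lance's implementation: the `DataFile` class here is a hypothetical stand-in, `assign_bases` is an invented helper, and using `None` to mean "primary storage, no routing" is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DataFile:
    # Hypothetical stand-in for a data file record; only the fields
    # needed for the illustration are modeled.
    path: str
    base_id: Optional[int] = None  # assumption: None means primary storage


def assign_bases(paths: List[str], target_bases: Optional[List[int]]) -> List[DataFile]:
    """Stamp each new file with a base id, cycling through target_bases."""
    if target_bases is None:
        # Default behaviour from the PR description: no routing requested,
        # so every new file stays in primary storage.
        return [DataFile(p) for p in paths]
    return [
        DataFile(p, target_bases[i % len(target_bases)])
        for i, p in enumerate(paths)
    ]


routed = assign_bases(["f0.lance", "f1.lance", "f2.lance"], [1, 2])
unrouted = assign_bases(["f0.lance", "f1.lance"], None)
```

With two target bases and three new files, the third file wraps around to the first base again.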

Also fixes a pre-existing panic: an empty target_bases list hit remainder-by-zero in
round-robin writer selection; it is now rejected with a descriptive error.
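To see why an empty list is a problem for round-robin selection, here is a small Python analogue. The function names and the error message are hypothetical, chosen only for the illustration; the real check lives in Rust and its wording may differ.

```python
from typing import List


def pick_base_unchecked(file_index: int, target_bases: List[int]) -> int:
    # Round-robin selection: with an empty list the modulus is zero,
    # which is the failure mode the PR description mentions.
    return target_bases[file_index % len(target_bases)]


def validate_target_bases(target_bases: List[int]) -> List[int]:
    # Reject the empty list up front with a descriptive error
    # (message text is an assumption, not Lance's actual wording).
    if not target_bases:
        raise ValueError(
            "target_bases must not be empty; omit it to write to primary storage"
        )
    return target_bases
```

Validating before any writer is selected turns a crash deep in the write path into an error the caller can act on.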

Tests cover merge insert on multi-base tables across all execution paths (with and
without routing), both base layouts (is_dataset_root true/false), round-robin
distribution across multiple files, mixed-base fragments from column patches, and
validation errors, in both Rust and Python.

  • Base id 0 (PRIMARY_BASE_ID) — or an entry equal to the dataset URI in the names
    variant — includes the dataset's primary storage in the rotation, so
    target_bases([0, 1, 2]) spreads new files across primary plus bases 1 and 2. Applies
    to normal writes, fragment writes, and merge insert.
  • target_all_bases(include_primary) convenience (WriteParams, MergeInsertBuilder,
    Python write_dataset/write_fragments/merge builder; include_primary defaults to
    True in Python): resolves at execution time to every registered base, primary first
    when included; CREATE mode includes initial_bases in the rotation.
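The rotation that `target_all_bases(include_primary)` resolves to can be sketched as follows. This is an illustrative model only: `resolve_all_bases` is a hypothetical helper, and keeping the registered bases in the order they are passed in is an assumption, since the description only states that primary comes first when included.

```python
from typing import List

PRIMARY_BASE_ID = 0  # base id 0 refers to the dataset's primary storage


def resolve_all_bases(registered_base_ids: List[int], include_primary: bool = True) -> List[int]:
    """Build the round-robin rotation: primary first (if included), then every registered base."""
    rotation = [PRIMARY_BASE_ID] if include_primary else []
    # Skip id 0 defensively so primary never appears twice in the rotation.
    rotation.extend(b for b in registered_base_ids if b != PRIMARY_BASE_ID)
    return rotation


with_primary = resolve_all_bases([1, 2])
without_primary = resolve_all_bases([1, 2], include_primary=False)
```

Here `with_primary` matches the explicit `target_bases([0, 1, 2])` example from the description.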

Follow-ups (out of scope)

  • Java bindings do not expose target_bases yet — Java has no multi-base write support
    at all today; parity should come with general Java multi-base write support.

@github-actions (bot) added the A-python and enhancement labels Jul 3, 2026
Merge insert now works with multi-base datasets and can route new
fragments across target bases round-robin like a normal write, via
MergeInsertBuilder::target_bases / target_base_names_or_paths in Rust
and MergeInsertBuilder.target_bases in Python. Column patch files and
deletion files stay in primary storage. Also reject empty target base
lists in validate_and_resolve_target_bases, which previously panicked.
Extend cleanup_data_fragments to delete files whose base_id matches a
resolved target base via that base's object store, wire it through
do_write_fragments and the legacy merge path, and add the missing
post-write failure cleanup in FullSchemaMergeInsertExec.
… midway

A task failure in update_fragments previously returned early, leaving
already-written new fragments (including ones routed to target bases)
behind. Abort in-flight tasks and delete collected new fragments before
surfacing the error.

@jackye1995 jackye1995 force-pushed the merge-insert-multi-base branch from 2142dab to b61ca17 Compare July 3, 2026 14:26

Reserve base id 0 (PRIMARY_BASE_ID) to refer to the dataset's primary
storage in target_bases, and resolve a name-or-path entry equal to the
dataset URI the same way, so writes and merge inserts can round-robin
across primary storage plus registered bases, e.g. target_bases([0, 1, 2]).
…mmit conflict

A RetryableCommitConflict discards the attempt and re-executes it, so the
attempt's data files are provably uncommitted; delete them (including files
routed to target bases, which version cleanup never scans) before retrying.
Other commit errors may be ambiguous about whether the manifest was written,
so files are left alone there.
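
The cleanup policy in this commit message can be modeled with a short Python sketch. Everything here is a stand-in: the exception classes, `run_with_cleanup`, and the file paths are hypothetical and only mirror the rule stated above (delete the attempt's files on a retryable conflict, leave them alone on any other commit error).

```python
from typing import Callable, List


class RetryableCommitConflict(Exception):
    """Stand-in for the retryable conflict named in the text."""


class OtherCommitError(Exception):
    """Stand-in for a commit error that may or may not have written the manifest."""


def run_with_cleanup(
    attempt: Callable[[List[str]], str],
    delete_files: Callable[[List[str]], None],
    max_attempts: int = 3,
) -> str:
    for _ in range(max_attempts):
        written: List[str] = []  # files produced by this attempt only
        try:
            return attempt(written)
        except RetryableCommitConflict:
            # The attempt is discarded and re-executed, so its files are
            # known to be uncommitted and can be deleted before retrying.
            delete_files(written)
        # Any other exception propagates without cleanup, because the
        # manifest might already reference the files.
    raise RuntimeError("gave up after repeated commit conflicts")


deleted: List[str] = []
calls = {"n": 0}


def attempt(written: List[str]) -> str:
    calls["n"] += 1
    written.append(f"base1/data-{calls['n']}.lance")
    if calls["n"] == 1:
        raise RetryableCommitConflict()
    return f"committed attempt {calls['n']}"


result = run_with_cleanup(attempt, deleted.extend)
```

In this run the first attempt conflicts, its single file is deleted, and the second attempt commits.
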
target_all_bases(include_primary) resolves at execution time to every
base registered in the manifest, with the dataset's primary storage as
the first rotation slot when included. Available on WriteParams
(with_target_all_bases), MergeInsertBuilder, and in Python on
write_dataset, write_fragments, and merge_insert (include_primary
defaults to True). CREATE mode includes initial_bases in the rotation.
…dataset

FragmentCreateBuilder skipped target base resolution when only
target_all_bases was set, silently writing to primary storage. Also
resolve conflict-retry cleanup against the refreshed per-attempt dataset
instead of the executor's original one, so bases added between attempts
still resolve, and add target_all_bases to the _write_fragments stubs.
@jackye1995 jackye1995 merged commit 24b085f into lance-format:main Jul 3, 2026
30 checks passed
@jackye1995 jackye1995 deleted the merge-insert-multi-base branch July 3, 2026 17:27