Summary
Make soft-delete a first-class lifecycle in DirectoryNamespace: separate a logical DROP (mark, reversible) from a physical PURGE (reclaim storage), with a TTL between them. This gives a grace window so concurrent readers don't fail when a table is dropped, lets a dropped table be restored before its TTL, and — by relocating the delete marker — keeps list_tables O(1) at scale instead of today's O(N) per-table probing.
This issue covers the namespace layer only (lance-namespace trait + DirectoryNamespace impl). Consumers (lancedb ListingDatabase/connection layer, periodic purge schedulers) are out of scope here and tracked separately; see Out of scope below.
Motivation
DirectoryNamespace::drop_table currently hard-deletes immediately (object_store.remove_dir_all). Two problems:
- Concurrent drop + read fails. An in-flight query holding open handles to a table that's being dropped sees its files vanish mid-read. There's no grace period and no way to separate "logically gone" from "bytes reclaimed."
- No DROP/PURGE separation, no TTL, no restore. Drop is purge. There's no deferred reclamation and no undo.
Separately, listing doesn't scale. list_tables in directory (non-manifest) mode does one read_dir(root) to find *.lance dirs, then a per-table read_dir(<name>.lance/) (check_table_status) to detect the nested .lance-deregistered / .lance-reserved markers — i.e. 1 + N object-store requests for N tables. This is a real performance cliff for large directories.
There's already a partial primitive: deregister_table writes a nested .lance-deregistered marker and check_table_status reads it, but it's a dead end — one-way (no restore; register_table is unsupported in directory mode), no purge, no TTL, and drop_table ignores it entirely.
Proposal
1. Move the delete marker to the namespace root (the key change)
Replace the nested marker (<root>/<name>.lance/.lance-deregistered) with a root-level sibling, e.g. <root>/<name>.deleted, whose body is small JSON: { "deleted_at_ms": ..., "ttl_ms": ... }.
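As a rough illustration of the marker, here is a minimal sketch that assumes the two example fields above and assumes a table becomes eligible for purge once `now >= deleted_at_ms + ttl_ms`. `DeleteMarker`, `marker_path`, and `is_purgeable` are hypothetical names, not existing lance-namespace APIs, and the JSON is hand-formatted only to keep the example dependency-free.

```rust
/// Hypothetical in-memory form of the `<name>.deleted` marker body.
#[derive(Debug, Clone, Copy, PartialEq)]
struct DeleteMarker {
    deleted_at_ms: u64,
    ttl_ms: u64,
}

impl DeleteMarker {
    /// Serialize to the small JSON body proposed above.
    fn to_json(&self) -> String {
        format!(
            "{{\"deleted_at_ms\":{},\"ttl_ms\":{}}}",
            self.deleted_at_ms, self.ttl_ms
        )
    }

    /// Assumption: purge eligibility starts at deleted_at_ms + ttl_ms.
    fn is_purgeable(&self, now_ms: u64) -> bool {
        now_ms >= self.deleted_at_ms.saturating_add(self.ttl_ms)
    }
}

/// Root-level sibling path for a table's marker (illustrative naming only).
fn marker_path(root: &str, name: &str) -> String {
    format!("{}/{}.deleted", root.trim_end_matches('/'), name)
}
```

Keeping the TTL in the marker itself would let a purge job decide eligibility from the marker alone, though where TTL policy lives is still an open question below.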
Because the object store's read_dir is a single non-recursive listing that returns child "directories" (common prefixes) and direct child files, one read_dir(root) now reveals both the <name>.lance table dirs and the <name>.deleted markers. list_tables filters deleted names in-memory from that single listing → O(1) requests, eliminating the per-table probe. (Today's nested marker is invisible to the root listing, which is why the per-table probe exists.)
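The in-memory filtering step could look roughly like the sketch below. It assumes the root listing has already been reduced to a slice of entry names; `live_tables` is a hypothetical helper, and the `.deleted` suffix is the example naming from above rather than a settled choice.

```rust
use std::collections::HashSet;

/// Hypothetical helper: given the entry names from ONE non-recursive listing
/// of the namespace root, return the tables that are not soft-deleted.
fn live_tables(root_entries: &[&str]) -> Vec<String> {
    // Names that have a root-level `<name>.deleted` marker.
    let deleted: HashSet<&str> = root_entries
        .iter()
        .filter_map(|entry| entry.strip_suffix(".deleted"))
        .collect();

    // `<name>.lance` dirs whose name has no marker; unrelated entries are ignored.
    let mut live: Vec<String> = root_entries
        .iter()
        .filter_map(|entry| entry.strip_suffix(".lance"))
        .filter(|name| !deleted.contains(name))
        .map(|name| name.to_string())
        .collect();
    live.sort();
    live
}
```

No per-table request is issued here: both the table dirs and the markers come from the same listing result.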
2. DROP vs PURGE in the namespace API
drop_table becomes soft: atomically create the <name>.deleted marker (stamped with deleted_at/ttl), leave data intact. Reversible.
purge_table / purge_tables(Option<Vec<Id>>) (new): the physical remove_dir_all + marker cleanup. None = purge all currently-purgable tables; a list = exactly those. Purge only ever acts on already-soft-deleted tables — never a live one.
list_purgable_tables(deleted_before: Option<Timestamp>) (new): returns soft-deleted tables and their deleted_at. The deleted_before cutoff lets a caller apply TTL policy while the namespace owns the (now O(1)) listing/filtering mechanism.
table_status(id) (new): returns one of Exists / SoftDeleted{deleted_at} / NotFound. (Deliberately not distinguishing "purged" from "never existed" — that would require retaining tombstones after purge, with their own GC. Purged == NotFound.)
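To make the intended semantics concrete, here is a toy in-memory model of the four operations. It is a sketch under assumptions, not the proposed trait: `Namespace`, `add_live_table`, plain `String` ids, and `u64` millisecond timestamps are all stand-ins, and `purge_tables(None)` is assumed to mean "every soft-deleted table" with no TTL check.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, PartialEq)]
enum TableStatus {
    Exists,
    SoftDeleted { deleted_at_ms: u64 },
    NotFound,
}

/// Toy stand-in for the namespace root:
/// name -> None (live) or Some(deleted_at_ms) (soft-deleted).
#[derive(Default)]
struct Namespace {
    tables: BTreeMap<String, Option<u64>>,
}

impl Namespace {
    /// Test setup helper only; not part of the proposal.
    fn add_live_table(&mut self, name: &str) {
        self.tables.insert(name.to_string(), None);
    }

    fn table_status(&self, name: &str) -> TableStatus {
        match self.tables.get(name) {
            None => TableStatus::NotFound, // purged and never-existed look the same
            Some(None) => TableStatus::Exists,
            Some(Some(ts)) => TableStatus::SoftDeleted { deleted_at_ms: *ts },
        }
    }

    /// Soft drop: mark only, data stays. Returns false if the table is not live.
    fn drop_table(&mut self, name: &str, now_ms: u64) -> bool {
        match self.tables.get_mut(name) {
            Some(slot) if slot.is_none() => {
                *slot = Some(now_ms);
                true
            }
            _ => false,
        }
    }

    /// Soft-deleted tables, optionally only those deleted before the cutoff.
    fn list_purgable_tables(&self, deleted_before: Option<u64>) -> Vec<(String, u64)> {
        self.tables
            .iter()
            .filter_map(|(name, slot)| slot.map(|ts| (name.clone(), ts)))
            .filter(|(_, ts)| deleted_before.map_or(true, |cutoff| *ts < cutoff))
            .collect()
    }

    /// Physical removal. Live or unknown names are skipped, never purged.
    fn purge_tables(&mut self, ids: Option<Vec<String>>) -> Vec<String> {
        let candidates = ids.unwrap_or_else(|| {
            self.list_purgable_tables(None)
                .into_iter()
                .map(|(name, _)| name)
                .collect()
        });
        let mut purged = Vec::new();
        for name in candidates {
            if matches!(self.tables.get(&name), Some(Some(_))) {
                self.tables.remove(&name);
                purged.push(name);
            }
        }
        purged
    }
}
```

In this model a scheduler would call `list_purgable_tables(Some(now_ms - ttl_ms))` and pass the resulting names to `purge_tables`, which keeps TTL policy on the caller side.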
3. Re-create over a soft-deleted table = overwrite
create_table for a name that has a .deleted marker should clear the marker and write with overwrite semantics (preserving lineage as a new version). Today create_table ignores the marker and fails with TableAlreadyExists if data is present.
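One way to read this rule is as a small decision table over "is table data present" and "is a marker present". The sketch below is illustrative only: `CreateAction` and `plan_create` are hypothetical names, and treating a marker without data as "clear the marker and write" is an assumption the text above does not spell out.

```rust
#[derive(Debug, PartialEq)]
enum CreateAction {
    /// No data, no marker: ordinary create.
    CreateFresh,
    /// Marker present: clear it and write with overwrite semantics.
    OverwriteAndClearMarker,
    /// Live table already there: fail as today.
    AlreadyExists,
}

/// Hypothetical planning step for create_table under the proposed behavior.
fn plan_create(data_present: bool, marker_present: bool) -> CreateAction {
    match (data_present, marker_present) {
        // Assumption: a marker with no data is handled like any soft-deleted name.
        (_, true) => CreateAction::OverwriteAndClearMarker,
        (true, false) => CreateAction::AlreadyExists,
        (false, false) => CreateAction::CreateFresh,
    }
}
```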
4. Reads respect the marker
list_tables, describe_table, open/access paths treat a soft-deleted table as NotFound (clean, not an internal error).
5. Purge/revive race guard
A purge that fires at TTL must not delete data that a concurrent re-create just revived. Plain claim files are insufficient (no TTL on the claim → poisoned-lock risk; non-atomic check-then-delete). Proposal: use the marker itself as a compare-and-swap arbiter via conditional writes (PutMode::Update with etag): purge and revive contend on the single marker object; the loser aborts. This needs (a) conditional-write support confirmed across target object stores, and (b) PutMode::Update/etag plumbed through the object-store wrapper (only Create is used today). A dedicated concurrency test is a must — this is the one place a bug means data loss rather than a bounded-stale read.
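The arbitration idea can be sketched with an in-memory marker whose version counter stands in for an etag. This is a single-threaded simulation of the interleaving, not object-store code: `MarkerObject`, `update_if_match`, `try_begin_purge`, and `try_revive` are hypothetical, and the `Purging` / `Revived` states are an assumed encoding of "who won".

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum MarkerState {
    SoftDeleted,
    Purging,
    Revived,
}

#[derive(Debug, PartialEq)]
enum CasError {
    NotFound,
    PreconditionFailed,
}

/// Stand-in for the single marker object; `u64` plays the role of an etag.
#[derive(Default)]
struct MarkerObject {
    current: Option<(u64, MarkerState)>,
    next_etag: u64,
}

impl MarkerObject {
    /// Create-if-absent, analogous to a create-mode put.
    fn create(&mut self, state: MarkerState) -> Result<u64, CasError> {
        if self.current.is_some() {
            return Err(CasError::PreconditionFailed);
        }
        self.next_etag += 1;
        self.current = Some((self.next_etag, state));
        Ok(self.next_etag)
    }

    fn head(&self) -> Option<(u64, MarkerState)> {
        self.current
    }

    /// Conditional write: succeeds only if the stored etag still matches.
    fn update_if_match(&mut self, expected: u64, state: MarkerState) -> Result<u64, CasError> {
        match self.current {
            None => Err(CasError::NotFound),
            Some((etag, _)) if etag != expected => Err(CasError::PreconditionFailed),
            Some(_) => {
                self.next_etag += 1;
                self.current = Some((self.next_etag, state));
                Ok(self.next_etag)
            }
        }
    }
}

/// Purge side: data may be removed only after this claim succeeds.
fn try_begin_purge(marker: &mut MarkerObject, seen_etag: u64) -> Result<u64, CasError> {
    marker.update_if_match(seen_etag, MarkerState::Purging)
}

/// Revive side: the re-create may proceed only after this claim succeeds.
fn try_revive(marker: &mut MarkerObject, seen_etag: u64) -> Result<u64, CasError> {
    marker.update_if_match(seen_etag, MarkerState::Revived)
}
```

If purge and revive both read the same etag, whichever conditional write lands first bumps the etag, so the other side gets `PreconditionFailed` and aborts before touching data. Whether every target store gives this guarantee is exactly item (a) above.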
Open questions (would love your take, @jackye1995)
- Marker naming/location. <root>/<name>.deleted vs a reserved prefix vs encoding into the dir name. Anything that keeps the single-listing property.
- API shape. purge_tables(Option<list>) (None = all) vs separate purge_table + purge_all_purgable. Where TTL policy lives (caller passes deleted_before vs namespace owns a TTL config). table_status on the trait: worth adding as a first-class method, or keep it internal?
- Conditional-write reliability for the race guard across object stores (incl. self-hosted/MinIO). Fallback when CAS isn't available?
- Manifest mode. This issue is about directory (V1) mode. How should the same lifecycle look in manifest mode (where deregister is a manifest-row delete)? Keep them separate, or unify the surface?
- Backward compatibility with any existing nested .lance-deregistered markers: migrate, or read both during a transition?
Out of scope (tracked separately)
- lancedb consumer changes: routing the ListingDatabase root path through the namespace (so root tables get this behavior and the duplicated native scan/remove_dir_all codepath is removed), resolving clone_table (unsupported in the namespace wrapper today), and warm-handle detection in the read-consistency wrapper.
- Purge scheduling (the periodic list_purgable_tables(now − ttl) → purge_tables job) and any deployment integration.
Non-goals
Strong cross-process consistency. After a soft-delete, fresh opens and listings are immediately correct, but already-open/cached handles elsewhere may keep serving the table until they refresh — a bounded, eventual-consistency window consistent with the existing model. The TTL grace window is precisely what prevents in-flight reads from erroring; making "drop" instantly visible everywhere is not a goal here.