Soft-delete for DirectoryNamespace: separate DROP/PURGE, add TTL, and make listing O(1) #7539

@wjones127

Summary

Make soft-delete a first-class lifecycle in DirectoryNamespace: separate a logical DROP (mark, reversible) from a physical PURGE (reclaim storage), with a TTL between them. This gives a grace window so concurrent readers don't fail when a table is dropped, lets a dropped table be restored before its TTL, and — by relocating the delete marker — keeps list_tables O(1) at scale instead of today's O(N) per-table probing.

This issue covers the namespace layer only (lance-namespace trait + DirectoryNamespace impl). Consumers (lancedb ListingDatabase/connection layer, periodic purge schedulers) are out of scope here and tracked separately; see Out of scope below.

Motivation

DirectoryNamespace::drop_table currently hard-deletes immediately (object_store.remove_dir_all). Two problems:

  1. Concurrent drop + read fails. An in-flight query holding open handles to a table that's being dropped sees its files vanish mid-read. There's no grace period and no way to separate "logically gone" from "bytes reclaimed."
  2. No DROP/PURGE separation, no TTL, no restore. Drop is purge. There's no deferred reclamation and no undo.

Separately, listing doesn't scale. list_tables in directory (non-manifest) mode does one read_dir(root) to find *.lance dirs, then a per-table read_dir(<name>.lance/) (check_table_status) to detect the nested .lance-deregistered / .lance-reserved markers — i.e. 1 + N object-store requests for N tables. This is a real performance cliff for large directories.

There's already a partial primitive: deregister_table writes a nested .lance-deregistered marker and check_table_status reads it, but it's a dead end — one-way (no restore; register_table is unsupported in directory mode), no purge, no TTL, and drop_table ignores it entirely.

Proposal

1. Move the delete marker to the namespace root (the key change)

Replace the nested marker (<root>/<name>.lance/.lance-deregistered) with a root-level sibling, e.g. <root>/<name>.deleted, whose body is small JSON: { "deleted_at_ms": ..., "ttl_ms": ... }.
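A minimal sketch of what the marker body and its TTL check could look like. The struct name, method names, and the "purgeable once `now >= deleted_at_ms + ttl_ms`" boundary are assumptions for illustration; only the two JSON keys come from the proposal above.

```rust
/// Hypothetical in-memory form of the `<name>.deleted` marker body.
/// Only the two field names come from the proposal; the rest is illustrative.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct DeleteMarker {
    pub deleted_at_ms: u64,
    pub ttl_ms: u64,
}

impl DeleteMarker {
    /// Serialize to the small JSON body. Both fields are integers,
    /// so no string escaping is needed.
    pub fn to_json(&self) -> String {
        format!(
            "{{\"deleted_at_ms\":{},\"ttl_ms\":{}}}",
            self.deleted_at_ms, self.ttl_ms
        )
    }

    /// Earliest time (ms since epoch) at which a purge may reclaim the data.
    /// Saturating add avoids overflow for very large TTLs.
    pub fn purge_after_ms(&self) -> u64 {
        self.deleted_at_ms.saturating_add(self.ttl_ms)
    }

    /// Assumed policy: the table is purgeable once the grace window has elapsed.
    pub fn is_purgeable(&self, now_ms: u64) -> bool {
        now_ms >= self.purge_after_ms()
    }
}
```

A real implementation would presumably use a JSON library for parsing; hand-formatting is used here only to keep the sketch dependency-free.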

Because the object store's read_dir is a single non-recursive listing that returns child "directories" (common prefixes) and direct child files, one read_dir(root) now reveals both the <name>.lance table dirs and the <name>.deleted markers. list_tables filters deleted names in memory from that single listing, so the request count no longer includes one probe per table: O(1) listing calls (modulo any pagination the store applies to a very large root listing) instead of 1 + N. (Today's nested marker is invisible to the root listing, which is why the per-table probe exists.)
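The in-memory filtering step can be sketched as follows. This assumes the root listing has already been fetched and reduced to plain entry names; the function name `live_tables` is hypothetical, and the `.lance` / `.deleted` suffixes follow the example naming in this proposal.

```rust
use std::collections::HashSet;

/// Given the entry names returned by ONE non-recursive listing of the
/// namespace root, return the names of tables that are not soft-deleted.
/// No further object-store requests are made.
pub fn live_tables(root_entries: &[&str]) -> Vec<String> {
    // Names that have a root-level `<name>.deleted` marker.
    let deleted: HashSet<&str> = root_entries
        .iter()
        .filter_map(|e| e.strip_suffix(".deleted"))
        .collect();

    // Table dirs are `<name>.lance`; drop any whose name is marked deleted.
    let mut live: Vec<String> = root_entries
        .iter()
        .filter_map(|e| e.strip_suffix(".lance"))
        .filter(|name| !deleted.contains(name))
        .map(|name| name.to_string())
        .collect();
    live.sort();
    live
}
```

Entries that match neither suffix are ignored, and a marker without a matching table dir is harmless to the listing.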

2. DROP vs PURGE in the namespace API

  • drop_table becomes soft: atomically create the <name>.deleted marker (stamped with deleted_at/ttl), leave data intact. Reversible.
  • purge_table / purge_tables(Option<Vec<Id>>) (new): the physical remove_dir_all + marker cleanup. None = purge all currently-purgable tables; a list = exactly those. Purge only ever acts on already-soft-deleted tables — never a live one.
  • list_purgable_tables(deleted_before: Option<Timestamp>) (new): returns soft-deleted tables and their deleted_at. The deleted_before cutoff lets a caller apply TTL policy while the namespace owns the (now O(1)) listing/filtering mechanism.
  • table_status(id) (new): returns one of Exists / SoftDeleted{deleted_at} / NotFound. (Deliberately not distinguishing "purged" from "never existed" — that would require retaining tombstones after purge, with their own GC. Purged == NotFound.)
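To make the intended semantics of these four operations concrete, here is a toy in-memory model. It is a sketch under stated assumptions, not the lance-namespace trait: `ToyNamespace`, the `u64` millisecond timestamps, and the exact signatures are hypothetical, and "strictly earlier than the cutoff" is an assumed reading of deleted_before.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TableStatus {
    Exists,
    SoftDeleted { deleted_at_ms: u64 },
    NotFound,
}

/// Toy namespace: name -> None if live, Some(deleted_at_ms) if soft-deleted.
#[derive(Default)]
pub struct ToyNamespace {
    tables: BTreeMap<String, Option<u64>>,
}

impl ToyNamespace {
    pub fn create_table(&mut self, name: &str) {
        self.tables.insert(name.to_string(), None);
    }

    /// Soft drop: mark only, data stays. Returns false if the table is not live.
    pub fn drop_table(&mut self, name: &str, now_ms: u64) -> bool {
        match self.tables.get_mut(name) {
            Some(slot) if slot.is_none() => {
                *slot = Some(now_ms);
                true
            }
            _ => false,
        }
    }

    /// Purged tables are indistinguishable from ones that never existed.
    pub fn table_status(&self, name: &str) -> TableStatus {
        match self.tables.get(name) {
            Some(None) => TableStatus::Exists,
            Some(Some(t)) => TableStatus::SoftDeleted { deleted_at_ms: *t },
            None => TableStatus::NotFound,
        }
    }

    /// Soft-deleted tables, optionally only those deleted before the cutoff.
    pub fn list_purgable_tables(&self, deleted_before_ms: Option<u64>) -> Vec<(String, u64)> {
        self.tables
            .iter()
            .filter_map(|(name, state)| state.map(|t| (name.clone(), t)))
            .filter(|(_, t)| deleted_before_ms.map_or(true, |cutoff| *t < cutoff))
            .collect()
    }

    /// Physical purge. `None` purges every soft-deleted table; a list purges
    /// exactly the listed ones. Live tables are never touched.
    /// Returns the names actually purged.
    pub fn purge_tables(&mut self, ids: Option<&[&str]>) -> Vec<String> {
        let targets: Vec<String> = self
            .tables
            .iter()
            .filter(|(name, state)| {
                state.is_some() && ids.map_or(true, |l| l.contains(&name.as_str()))
            })
            .map(|(name, _)| name.clone())
            .collect();
        for name in &targets {
            self.tables.remove(name);
        }
        targets
    }
}
```

In this model a caller applying TTL policy would pass `Some(now_ms - ttl_ms)` as the cutoff and hand the resulting names to `purge_tables`.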

3. Re-create over a soft-deleted table = overwrite

create_table for a name that has a .deleted marker should clear the marker and write with overwrite semantics (preserving lineage as a new version). Today create_table ignores the marker and fails with TableAlreadyExists if data is present.
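The proposed create-over-soft-deleted behavior can be illustrated with a small in-memory model. All names here (`ToyRoot`, `Table`, `soft_drop`, the `version` counter) are hypothetical stand-ins; the sketch only shows the three outcomes: fresh create, revive-as-new-version, and TableAlreadyExists for a live table.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq, Eq)]
pub enum CreateError {
    TableAlreadyExists,
}

/// Stand-in for a table dir: a version counter plus the soft-delete marker.
#[derive(Debug)]
pub struct Table {
    pub version: u64,
    pub soft_deleted: bool,
}

#[derive(Default)]
pub struct ToyRoot {
    tables: HashMap<String, Table>,
}

impl ToyRoot {
    /// Soft drop: set the marker, keep the data (here, the version history).
    pub fn soft_drop(&mut self, name: &str) {
        if let Some(t) = self.tables.get_mut(name) {
            t.soft_deleted = true;
        }
    }

    /// Returns the version that was written.
    pub fn create_table(&mut self, name: &str) -> Result<u64, CreateError> {
        match self.tables.get_mut(name) {
            // Soft-deleted: clear the marker and overwrite as a new version.
            Some(t) if t.soft_deleted => {
                t.soft_deleted = false;
                t.version += 1;
                Ok(t.version)
            }
            // Live table: unchanged behavior.
            Some(_) => Err(CreateError::TableAlreadyExists),
            // Never existed (or already purged): fresh create at version 1.
            None => {
                let t = Table { version: 1, soft_deleted: false };
                self.tables.insert(name.to_string(), t);
                Ok(1)
            }
        }
    }
}
```

The model is single-threaded, so it sidesteps the ordering question between clearing the marker and writing data; that race is the subject of the guard discussed in the next section.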

4. Reads respect the marker

list_tables, describe_table, and the open/access paths treat a soft-deleted table as NotFound (a clean not-found result, not an internal error).

5. Purge/revive race guard

A purge that fires at TTL must not delete data that a concurrent re-create just revived. Plain claim files are insufficient (no TTL on the claim → poisoned-lock risk; non-atomic check-then-delete). Proposal: use the marker itself as a compare-and-swap arbiter via conditional writes (PutMode::Update with etag): purge and revive contend on the single marker object; the loser aborts. This needs (a) conditional-write support confirmed across target object stores, and (b) PutMode::Update/etag plumbed through the object-store wrapper (only Create is used today). A dedicated concurrency test is a must — this is the one place a bug means data loss rather than a bounded-stale read.
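The arbiter idea can be sketched with an in-memory stand-in for a single marker object that supports a conditional update. This is not the object-store API: `MarkerObject`, `put_if_match`, the integer etag, and the `MarkerState` values are all hypothetical, and a `Mutex` plays the role of the store's atomic conditional write.

```rust
use std::sync::Mutex;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MarkerState {
    Deleted,
    Purging,
    Revived,
}

#[derive(Debug, PartialEq, Eq)]
pub struct PreconditionFailed;

/// Stand-in for one marker object: (etag, state) updated atomically.
pub struct MarkerObject {
    inner: Mutex<(u64, MarkerState)>,
}

impl MarkerObject {
    pub fn new() -> Self {
        MarkerObject { inner: Mutex::new((1, MarkerState::Deleted)) }
    }

    /// Read the current etag and state.
    pub fn get(&self) -> (u64, MarkerState) {
        *self.inner.lock().unwrap()
    }

    /// Conditional update: succeeds only if the etag is still the one the
    /// caller observed. On success the etag changes, so any other writer
    /// holding the old etag will fail.
    pub fn put_if_match(
        &self,
        expected_etag: u64,
        new_state: MarkerState,
    ) -> Result<u64, PreconditionFailed> {
        let mut guard = self.inner.lock().unwrap();
        if guard.0 != expected_etag {
            return Err(PreconditionFailed);
        }
        guard.0 += 1;
        guard.1 = new_state;
        Ok(guard.0)
    }
}

/// Purger: may proceed to remove data only if this returns true.
pub fn try_purge(marker: &MarkerObject, seen_etag: u64) -> bool {
    marker.put_if_match(seen_etag, MarkerState::Purging).is_ok()
}

/// Re-create path: may proceed to write only if this returns true.
pub fn try_revive(marker: &MarkerObject, seen_etag: u64) -> bool {
    marker.put_if_match(seen_etag, MarkerState::Revived).is_ok()
}
```

If a purger and a reviver both read the same etag, exactly one conditional update succeeds and the other sees a precondition failure and aborts. Whether real stores give the same guarantee is the conditional-write question raised in (a) above.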

Open questions (would love your take, @jackye1995)

  1. Marker naming/location. <root>/<name>.deleted vs a reserved prefix vs encoding into the dir name. Anything that keeps the single-listing property.
  2. API shape. purge_tables(Option<list>) (None = all) vs separate purge_table + purge_all_purgable. Where TTL policy lives (caller passes deleted_before vs namespace owns a TTL config).
  3. table_status on the trait — worth adding as a first-class method, or keep it internal?
  4. Conditional-write reliability for the race guard across object stores (incl. self-hosted/MinIO). Fallback when CAS isn't available?
  5. Manifest mode. This issue is about directory (V1) mode. How should the same lifecycle look in manifest mode (where deregister is a manifest-row delete)? Keep them separate, or unify the surface?
  6. Backward compatibility with any existing nested .lance-deregistered markers — migrate, or read both during a transition?

Out of scope (tracked separately)

  • lancedb consumer changes: routing the ListingDatabase root path through the namespace (so root tables get this behavior and the duplicated native scan/remove_dir_all codepath is removed), resolving clone_table (unsupported in the namespace wrapper today), and warm-handle detection in the read-consistency wrapper.
  • Purge scheduling (the periodic list_purgable_tables(now − ttl) → purge_tables job) and any deployment integration.

Non-goals

Strong cross-process consistency. After a soft-delete, fresh opens and listings are immediately correct, but already-open/cached handles elsewhere may keep serving the table until they refresh — a bounded, eventual-consistency window consistent with the existing model. The TTL grace window is precisely what prevents in-flight reads from erroring; making "drop" instantly visible everywhere is not a goal here.
