Skip to content

fix(goosefs): decouple OpenDAL root from URL path to avoid ObjectStore cache-key collision#7612

Open
XuQianJin-Stars wants to merge 1 commit into
lance-format:mainfrom
XuQianJin-Stars:fix/goosefs-object-store-cache-key-collision
Open

fix(goosefs): decouple OpenDAL root from URL path to avoid ObjectStore cache-key collision#7612
XuQianJin-Stars wants to merge 1 commit into
lance-format:mainfrom
XuQianJin-Stars:fix/goosefs-object-store-cache-key-collision

Conversation

@XuQianJin-Stars

Copy link
Copy Markdown
Contributor

Summary

The GooseFS ObjectStore provider used to bake the URL path into OpenDAL's
root and always return an empty extract_path. Because ObjectStoreRegistry
keys its cache on the provider prefix (goosefs$host:port), two different
dataset URLs under the same master resolved to the same cached
ObjectStore — with a root locked to whichever URL opened it first. The
second open reused that store and, since extract_path was empty, the read
went to the first dataset's bytes and schema.

This PR aligns the GooseFS provider with the S3/GCS/OSS model.

Reproducer (before this fix)

COPY (SELECT range::BIGINT AS col_a FROM range(3))
  TO 'goosefs://host:9200/repro/a.lance' (FORMAT lance);

COPY (SELECT range::BIGINT AS col_b1,
             ('row_' || range)::VARCHAR AS col_b2 FROM range(5))
  TO 'goosefs://host:9200/repro/b.lance' (FORMAT lance);

SELECT * FROM 'goosefs://host:9200/repro/a.lance';   -- OK: 3 rows, col_a
SELECT * FROM 'goosefs://host:9200/repro/b.lance';
-- BUG: returns 3 rows of col_a (a.lance's data & schema), instead of
--      5 rows of (col_b1, col_b2).

Root cause

rust/lance-io/src/object_store/providers/goosefs.rs had three coupled
choices that together broke the registry cache invariants:

  1. new_store set root = base_path.path() — a per-URL root.
  2. extract_path unconditionally returned Path::from(""), "trusting"
    the root to fully identify the object.
  3. calculate_object_store_prefix returned only goosefs$host:port,
    so the URL path did not participate in the cache key.

ObjectStoreRegistry::get_store (rust/lance-io/src/object_store/providers.rs)
keys active_stores on (prefix, params) and holds Weak<ObjectStore>.
As long as any live Arc<ObjectStore> for a.lance exists, opening
b.lance is a cache hit that returns the store rooted at a.lance,
and the empty key from extract_path reads the wrong dataset.

Fix (S3-style path model)

  • root is cluster-wide, not per-URL. Defaults to /, overridable
    via storage option goosefs_root or env var GOOSEFS_ROOT. All
    datasets under the same master share a single OpenDAL Operator.
  • extract_path returns the URL path (percent-decoded via
    Path::from_url_path) as the per-request object key. OpenDAL's
    build_rooted_abs_path(root, key) reconstructs the absolute GooseFS
    path exactly as before.
  • calculate_object_store_prefix stays scoped to goosefs$host:port
    in the default case, and folds a custom root in as goosefs$host:port#/root
    so different roots remain isolated in the cache.

Net effect: a.lance and b.lance share one cached Operator (the
same reuse benefit S3 already gets), and correctness comes from the
distinct object keys on each request.

Semantic parity with other providers

goosefs://host:9200/repro/a.lance
├── prefix        "goosefs$host:9200"        ← cache key (shared)
├── operator root "/"                        ← stable across datasets
└── extract_path  "repro/a.lance"            ← per-request key

This mirrors aws.rs / gcp.rs / oss.rs, which use the default trait
implementations of extract_path and calculate_object_store_prefix.

Tests

Added focused regression tests in the provider (all passing):

  • test_prefix_shared_across_datasets_same_master — same master, two
    dataset URLs share one cache prefix but yield distinct object
    keys. This is the direct counter-assertion to the bug.
  • test_prefix_isolated_across_masters — different masters must not
    share a cache entry.
  • test_prefix_includes_custom_root — a custom goosefs_root is
    reflected in the prefix.
  • test_goosefs_extract_path_percent_decoded — percent-encoded segments
    (e.g. %20) are decoded exactly once, guarding against double-encoding
    when the object store client re-encodes for transport.

Verification

  • cargo fmt -p lance-io -- --check → clean
  • cargo clippy -p lance-io --features goosefs --all-targets -- -D warnings → 0 warnings
  • cargo test -p lance-io --features goosefs --lib197 passed / 0 failed

Compatibility

  • URL scheme, storage options, and the on-wire GooseFS paths are unchanged.
  • Users who previously relied on the (undocumented) side effect of the
    URL path becoming the OpenDAL root need no change — the same absolute
    path is still resolved via root="/" + relative key.
  • New optional knob: goosefs_root / GOOSEFS_ROOT for tenants that want
    to sandbox an Operator to a subtree. Different values get isolated
    cache entries.

@github-actions github-actions Bot added A-encoding Encoding, IO, file reader/writer bug Something isn't working labels Jul 3, 2026
…e cache key collision

Previously GooseFsStoreProvider baked the URL path into the OpenDAL
Operator root and also into the ObjectStoreRegistry cache prefix. That
made two datasets under the same master (e.g. goosefs://host:9200/a.lance
and goosefs://host:9200/b.lance) collide in the registry cache: the
second open reused the first Operator whose root still pointed at the
first dataset, so queries on B silently returned A's data.

Rework the provider to follow the S3-style model already used by tos/oss:

- Operator root is cluster-wide (default "/", overridable via
  goosefs_root storage option or GOOSEFS_ROOT env var), never per-URL.
- extract_path returns the URL path (percent-decoded) as the per-request
  object key, so a single shared Operator can route requests to the
  correct dataset.
- calculate_object_store_prefix keys the registry on scheme+authority
  (plus the custom root when non-default), so datasets under the same
  master intentionally share one Operator without cross-contamination.

Add regression tests covering:
- extract_path for basic, root, deep and percent-encoded URLs;
- prefix sharing across datasets on the same master;
- prefix isolation across masters;
- prefix inclusion of a custom goosefs_root;
- master-address resolution from URL, default port and storage options;
- root resolution defaults and overrides.
@XuQianJin-Stars XuQianJin-Stars force-pushed the fix/goosefs-object-store-cache-key-collision branch from 1e76390 to 0c59f54 Compare July 3, 2026 09:25
@codecov

codecov Bot commented Jul 3, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 95.89041% with 3 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
...ust/lance-io/src/object_store/providers/goosefs.rs 95.89% 2 Missing and 1 partial ⚠️

📢 Thoughts on this report? Let us know!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

A-encoding Encoding, IO, file reader/writer bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant