fix(spark): propagate per-read .option() storage credentials into path-based loadTable #666

Open: sezruby wants to merge 2 commits into lance-format:main from sezruby:fix/abfss-sas-option-propagation

@sezruby sezruby commented Jun 30, 2026

Problem

Path-based reads that pass storage credentials via the DataFrame .option(...) API
(e.g. an Azure SAS token) fail with an Azure Managed Identity (IMDS 169.254.169.254)
token error. Fixes #665.

LanceDataSource implements SupportsCatalogOptions, so Spark calls
extractIdentifier(options) → catalog.loadTable(identifier) and forwards only the
Identifier. LanceIdentifier kept just the location, so the path-based opens rebuilt
storage options from catalogConfig alone — empty for per-read credentials — and the
native open fell back to object_store's default Azure credential chain (MSI).

Change

  • LanceIdentifier gains an optional immutable options map (back-compat 1-arg
    constructor retained). On the load path these options override catalog-level
    storage.* defaults on conflict; catalog options still fill gaps.
  • LanceDataSource.extractIdentifier captures the reader options onto the identifier.
  • Utils.createReadOptions gets an overload taking pathOptions, seeded before
    withCatalogDefaults to get that precedence. The dataset-URI key (path) is stripped
    from the seed so it does not leak into the native storage_options map (related to
    #520, "fix(spark): strip recognized typed options from Rust storage_options map",
    which removes the other recognized typed keys).
  • Both path-based opens are covered, via a shared getPathOptions(Identifier) helper:
    • loadTableFromPath (the time-travel resolution open and the main open), and
    • tableExistsAtPath, which had the identical drop → MSI bug and is on the
      CREATE TABLE IF NOT EXISTS / overwrite path.
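
As a rough illustration of the first bullet, an identifier that carries an immutable options map could look like the sketch below. PathIdentifierSketch, its accessors, and the "sas_token" key used in the usage note are hypothetical stand-ins, not the actual LanceIdentifier (which implements Spark's Identifier interface); the sketch only shows the null → empty and defensive-copy behavior described above.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for LanceIdentifier (assumption, not the PR's code).
final class PathIdentifierSketch {
    private final String location;
    private final Map<String, String> options;

    // Back-compat constructor: location only, no per-read options.
    PathIdentifierSketch(String location) {
        this(location, null);
    }

    PathIdentifierSketch(String location, Map<String, String> options) {
        this.location = location;
        // null becomes an empty map; otherwise take a defensive copy and
        // wrap it so neither the caller nor a consumer can mutate it later.
        this.options = (options == null)
            ? Collections.<String, String>emptyMap()
            : Collections.unmodifiableMap(new HashMap<String, String>(options));
    }

    String location() {
        return location;
    }

    Map<String, String> options() {
        return options;
    }
}
```

With this shape, a caller that later mutates the map it passed in does not affect the identifier, and options().put(...) throws UnsupportedOperationException.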

Credential precedence on path-based access: reader .option(...) > catalog storage.* defaults.
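
The precedence rule can be sketched with plain maps. This is a minimal illustration under stated assumptions, not the real Utils.createReadOptions: ReadOptionsSketch and mergeStorageOptions are hypothetical names, the catalog defaults are assumed to be already extracted into a flat map, and the keys in the usage note ("sas_token", "account_name") are illustrative only.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper illustrating "per-read first, catalog fills gaps" (assumption).
final class ReadOptionsSketch {
    // Dataset-URI key named in the PR description; it must not reach storage options.
    static final String PATH_KEY = "path";

    static Map<String, String> mergeStorageOptions(
            Map<String, String> pathOptions,
            Map<String, String> catalogStorageDefaults) {
        Map<String, String> merged = new HashMap<String, String>();
        // Seed with the per-read options, skipping the dataset-URI key.
        if (pathOptions != null) {
            for (Map.Entry<String, String> e : pathOptions.entrySet()) {
                if (!PATH_KEY.equals(e.getKey())) {
                    merged.put(e.getKey(), e.getValue());
                }
            }
        }
        // Catalog defaults only fill keys the per-read options did not set.
        if (catalogStorageDefaults != null) {
            for (Map.Entry<String, String> e : catalogStorageDefaults.entrySet()) {
                merged.putIfAbsent(e.getKey(), e.getValue());
            }
        }
        return Collections.unmodifiableMap(merged);
    }
}
```

For example, per-read options {path, sas_token=per-read} merged with catalog defaults {sas_token=catalog, account_name=acct} yield sas_token=per-read and account_name=acct, with no path entry.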

Tests

  • LanceIdentifierOptionsTest (unit, base module): identifier carries options
    immutably; null → empty; per-read options reach getStorageOptions(); per-read
    overrides catalog default while catalog default still fills gaps; the path key is
    not present in the storage options.
  • AbfssSasReadIntegrationTest (3.5 module, @EnabledIfEnvironmentVariable-gated so CI
    skips it without creds): reads a real abfss:// dataset with the SAS supplied only via
    .option(...). Verified locally: fails on main with the MSI error
    (GET http://169.254.169.254/.../oauth2/token), passes with this change. The test
    reads the SAS from a neutral env var (LANCE_IT_SAS) so lance core's AZURE_STORAGE_*
    env fallback can't mask the connector behavior. Relates to #241 (Run test against Azure).

Notes

🤖 Generated with Claude Code

…h-based loadTable

Path-based reads that pass storage credentials via the DataFrame .option(...)
API (e.g. an Azure SAS token) failed with an Azure Managed Identity (IMDS
169.254.169.254) token error.

LanceDataSource implements SupportsCatalogOptions, so Spark calls
extractIdentifier(options) -> catalog.loadTable(identifier) and forwards only
the Identifier. LanceIdentifier kept just the location, so loadTableFromPath
rebuilt storage options from catalogConfig alone -- empty for per-read
credentials -- and the native open fell back to object_store's default Azure
credential chain (MSI).

Carry the per-read options on LanceIdentifier and thread them through
loadTableFromPath via a new Utils.createReadOptions overload, seeded before
withCatalogDefaults so per-read options override catalog storage options.

Fixes lance-format#665

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
github-actions bot added the bug (Something isn't working) label Jun 30, 2026
…ath key

Address review feedback:
- Apply the same per-read .option() credential propagation to tableExistsAtPath,
  which had the identical drop -> Azure MSI fallback bug as loadTableFromPath.
  Extract a shared getPathOptions(Identifier) helper used by both.
- Strip the dataset-URI key ("path") from per-read options before seeding the
  read options, so it does not leak into the native storage_options map
  (related to lance-format#520).
- Document the per-read-overrides-catalog precedence on the LanceIdentifier
  constructor; assert the "path" key is not present in storage options; warn in
  the integration test against reverting to AZURE_STORAGE_SAS_TOKEN (would defeat
  the regression guard, see lance-format#665).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
sezruby marked this pull request as ready for review June 30, 2026 18:59

Development

Successfully merging this pull request may close these issues.

Per-read .option() storage credentials are dropped on path-based reads, causing Azure MSI fallback