Per-read .option() storage credentials are dropped on path-based reads, causing Azure MSI fallback #665

@sezruby

Summary

When reading a Lance dataset by path with storage credentials supplied via the
DataFrame reader .option(...) API, the credentials are silently dropped before the
dataset is opened. The native object store then falls back to its default credential
chain — for Azure, the IMDS / Managed Identity endpoint — and the read fails on any
environment without a managed identity.

Repro

spark.read.format("lance")
  .option("azure_storage_sas_token", "<sas>")
  .option("azure_storage_account_name", "<account>")
  .load("abfss://<fs>@<account>.dfs.core.windows.net/path/to/ds.lance")
  .show()

No spark.sql.catalog.* is configured, and the SAS token is not set as an OS environment variable.

Observed

LanceError(IO): Generic MicrosoftAzure error: Error performing token request:
Error performing GET http://169.254.169.254/metadata/identity/oauth2/token
  ?api-version=2019-08-01&resource=https%3A%2F%2Fstorage.azure.com ... after 3 retries
    at org.lance.Dataset.openNative(Native Method)
    at org.lance.spark.utils.Utils$OpenDatasetBuilder.build(Utils.java:...)
    at org.lance.spark.BaseLanceNamespaceSparkCatalog.loadTableFromPath(...)

Root cause

LanceDataSource implements SupportsCatalogOptions. On a path-based read, Spark calls
extractIdentifier(options) and then catalog.loadTable(identifier) — and only the
Identifier is forwarded; the per-read option map is not.

  • extractIdentifier builds new LanceIdentifier(readOptions.getDatasetUri()), keeping
    only the location and discarding the options.
  • loadTableFromPath therefore rebuilds storage options from catalogConfig alone,
    which is empty when credentials were supplied per-read.
  • The resulting credential-less Dataset.open falls through to object_store's default
    Azure credential chain → IMDS/MSI → the error above.
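
The drop can be modeled in isolation. The Java sketch below is not lance-spark code: PathIdentifier, extractIdentifier, and storageOptionsForOpen are hypothetical stand-ins for LanceIdentifier, extractIdentifier, and the option rebuild in loadTableFromPath, and the "path" key is an assumption about how the location arrives in the option map. It only illustrates why an identifier that carries just the location leaves the open with no credentials.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for LanceIdentifier: it carries only the dataset location.
final class PathIdentifier {
    final String location;

    PathIdentifier(String location) {
        this.location = location;
    }
}

final class OptionDropSketch {
    // Models extractIdentifier: the "path" entry survives, every other
    // per-read option is discarded along with the map.
    static PathIdentifier extractIdentifier(Map<String, String> readOptions) {
        return new PathIdentifier(readOptions.get("path"));
    }

    // Models the rebuild in loadTableFromPath: the identifier has nothing but a
    // location to offer, so only catalog-level config is visible here.
    static Map<String, String> storageOptionsForOpen(
            PathIdentifier id, Map<String, String> catalogConfig) {
        return new HashMap<>(catalogConfig);
    }

    public static void main(String[] args) {
        Map<String, String> readOptions = new HashMap<>();
        readOptions.put("path", "abfss://<fs>@<account>.dfs.core.windows.net/path/to/ds.lance");
        readOptions.put("azure_storage_sas_token", "<sas>");
        readOptions.put("azure_storage_account_name", "<account>");

        PathIdentifier id = extractIdentifier(readOptions);
        // No catalog is configured, so catalogConfig is empty.
        Map<String, String> opened = storageOptionsForOpen(id, new HashMap<>());

        System.out.println("storage options at open: " + opened);
    }
}
```

With an empty catalog config, the map that reaches the open is empty even though the reader was given a SAS token and account name, which matches the fallback to IMDS seen in the trace.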

This is not Azure-specific — the same drop affects any per-.option() storage credential
(S3 keys, GCS, etc.) on the path-based read. It is masked if the credential also happens
to be present as an OS environment variable, since lance core reads those as a fallback.

Scope / not covered by existing PRs

Proposed fix

Carry the per-read options on LanceIdentifier and thread them through
loadTableFromPath so they reach the native open, with per-read options overriding
catalog-level storage options. PR to follow.
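
A minimal sketch of the intended precedence, assuming both option sets are plain string maps. mergeStorageOptions is a hypothetical helper, not an existing lance-spark method, and the placeholder values are illustrative; the eventual PR may structure this differently.

```java
import java.util.HashMap;
import java.util.Map;

final class StorageOptionMerge {
    // Hypothetical helper: start from catalog-level options, then let
    // per-read options overwrite any key they also define.
    static Map<String, String> mergeStorageOptions(
            Map<String, String> catalogOptions, Map<String, String> perReadOptions) {
        Map<String, String> merged = new HashMap<>(catalogOptions);
        merged.putAll(perReadOptions);
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> catalog = new HashMap<>();
        catalog.put("azure_storage_account_name", "<catalog-account>");

        Map<String, String> perRead = new HashMap<>();
        perRead.put("azure_storage_account_name", "<account>");
        perRead.put("azure_storage_sas_token", "<sas>");

        Map<String, String> merged = mergeStorageOptions(catalog, perRead);
        // The per-read account name wins over the catalog-level one.
        System.out.println(merged.get("azure_storage_account_name"));
        // The per-read SAS token is now present for the native open.
        System.out.println(merged.containsKey("azure_storage_sas_token"));
    }
}
```

Copying the catalog map before calling putAll keeps the catalog-level configuration unmodified, so one read's credentials do not leak into later reads against the same catalog.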

Environment

  • lance-spark (lance-spark-base), reproduced on current main
  • Spark 3.5, Scala 2.12
