fix(spark): propagate per-read .option() storage credentials into path-based loadTable#666
Open
sezruby wants to merge 2 commits into
Open
fix(spark): propagate per-read .option() storage credentials into path-based loadTable#666sezruby wants to merge 2 commits into
sezruby wants to merge 2 commits into
Conversation
…h-based loadTable Path-based reads that pass storage credentials via the DataFrame .option(...) API (e.g. an Azure SAS token) failed with an Azure Managed Identity (IMDS 169.254.169.254) token error. LanceDataSource implements SupportsCatalogOptions, so Spark calls extractIdentifier(options) -> catalog.loadTable(identifier) and forwards only the Identifier. LanceIdentifier kept just the location, so loadTableFromPath rebuilt storage options from catalogConfig alone -- empty for per-read credentials -- and the native open fell back to object_store's default Azure credential chain (MSI). Carry the per-read options on LanceIdentifier and thread them through loadTableFromPath via a new Utils.createReadOptions overload, seeded before withCatalogDefaults so per-read options override catalog storage options. Fixes lance-format#665 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ath key
Address review feedback:
- Apply the same per-read .option() credential propagation to tableExistsAtPath,
which had the identical drop -> Azure MSI fallback bug as loadTableFromPath.
Extract a shared getPathOptions(Identifier) helper used by both.
- Strip the dataset-URI key ("path") from per-read options before seeding the
read options, so it does not leak into the native storage_options map
(related to lance-format#520).
- Document the per-read-overrides-catalog precedence on the LanceIdentifier
constructor; assert the "path" key is not present in storage options; warn in
the integration test against reverting to AZURE_STORAGE_SAS_TOKEN (would defeat
the regression guard, see lance-format#665).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
Path-based reads that pass storage credentials via the DataFrame
.option(...)API(e.g. an Azure SAS token) fail with an Azure Managed Identity (IMDS
169.254.169.254)token error. Fixes #665.
LanceDataSourceimplementsSupportsCatalogOptions, so Spark callsextractIdentifier(options)→catalog.loadTable(identifier)and forwards only theIdentifier.LanceIdentifierkept just the location, so the path-based opens rebuiltstorage options from
catalogConfigalone — empty for per-read credentials — and thenative open fell back to object_store's default Azure credential chain (MSI).
Change
LanceIdentifiergains an optional immutableoptionsmap (back-compat 1-argconstructor retained). On the load path these options override catalog-level
storage.*defaults on conflict; catalog options still fill gaps.LanceDataSource.extractIdentifiercaptures the reader options onto the identifier.Utils.createReadOptionsgets an overload takingpathOptions, seeded beforewithCatalogDefaultsto get that precedence. The dataset-URI key (path) is strippedfrom the seed so it does not leak into the native
storage_optionsmap (related tofix(spark): strip recognized typed options from Rust storage_options map #520, which removes the other recognized typed keys).
getPathOptions(Identifier)helper:loadTableFromPath(the time-travel resolution open and the main open), andtableExistsAtPath, which had the identical drop → MSI bug and is on theCREATE TABLE IF NOT EXISTS/ overwrite path.Credential precedence on path-based access: reader
.option(...)> catalogstorage.*defaults.Tests
LanceIdentifierOptionsTest(unit, base module): identifier carries optionsimmutably; null → empty; per-read options reach
getStorageOptions(); per-readoverrides catalog default while catalog default still fills gaps; the
pathkey isnot present in the storage options.
AbfssSasReadIntegrationTest(3.5 module,@EnabledIfEnvironmentVariable-gated so CIskips it without creds): reads a real
abfss://dataset with the SAS supplied only via.option(...). Verified locally — fails onmainwith the MSI error(
GET http://169.254.169.254/.../oauth2/token), passes with this change. The testreads the SAS from a neutral env var (
LANCE_IT_SAS) so lance core'sAZURE_STORAGE_*env fallback can't mask the connector behavior. Relates to Run test against Azure #241.
Notes
functional overlap. If fix(catalog): propagate namespace storage options into read options #522 merges first, the path-based call sites become
createPathBasedReadOptions(...); this patch rebases by adding thepathOptionsoverload there instead — trivial.
🤖 Generated with Claude Code