Fix/parquet opener page index policy#19890
Merged
alamb merged 9 commits intoapache:mainfrom Jan 27, 2026
Merged
Conversation
Contributor
Author
|
Resolved the issues! @kumarUjjawal |
91e0832 to
faffff0
Compare
martin-g
reviewed
Jan 20, 2026
As requested in PR feedback, the regression test for issue apache#19839 has been moved from a dedicated file to the existing page_pruning.rs test file to keep related tests together.
alamb
reviewed
Jan 22, 2026
Contributor
alamb
left a comment
There was a problem hiding this comment.
Thanks @aviralgarg05 , @kumarUjjawal and @martin-g
This is looking good -- I think we just need to fix the parquet-testing pin and it will be good to go
|
|
||
| // Write parquet WITHOUT page index | ||
| // The default WriterProperties does not write page index, but we set it explicitly | ||
| // to be robust against future changes in defaults as requested by reviewers. |
5b1d1c6 to
b8410d2
Compare
alamb
approved these changes
Jan 23, 2026
Contributor
alamb
left a comment
There was a problem hiding this comment.
Looks good to me -- thank you @martin-g and @kumarUjjawal
Contributor
|
I took the liberty of fixing a clippy error and pushing to your branch to get a clean CI run |
Contributor
|
Thanks again @aviralgarg05 |
Contributor
|
Thanks all! |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Which issue does this PR close?
ParquetOpenerfails on files withoutPageIndexmetadata #19839.Rationale for this change
The ParquetOpener was using
ArrowReaderOptions::with_page_index(true), which internally setsPageIndexPolicy::Required. This caused sparse column chunk reads with row selection masks to fail with errors like "Invalid offset in sparse column chunk data" when reading Parquet files that lack page index metadata.Relaxing this policy to
PageIndexPolicy::Optionalallows DataFusion to gracefully handle files both with and without page index metadata while still leveraging the index when it exists.What changes are included in this PR?
PageIndexPolicy::Optionalinstead ofRequired.Are these changes tested?
Yes. I have added a dedicated regression test case:
This test writes a Parquet file specifically without page index metadata and verifies that ParquetOpener can read it successfully when
parquet_page_index_pruningis enabled.Are there any user-facing changes?
No. This is a bug fix that improves the robustness of the Parquet reader.