feat: parallelize parquet load by row group#24808
Conversation
Qodo reviews are paused for this user.Troubleshooting steps vary by plan Learn more → On a Teams plan? Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center? |
fixed |
Plan Parquet LOAD around file and row-group fanout, carry shard metadata through ExternalScan, and keep S3 prefetch behavior bounded for sharded readers. Harden unsupported option handling, add Parquet profile stats, and cover compile/runtime/BVT regressions for schema, conversion, rollback, and parallel-load paths.
Add the missing DATE32 Parquet resource used by load_data_parquet.sql so the 4.0-dev BVT can load the cherry-picked date32-to-DATETIME case instead of failing on a missing file.
## What type of PR is this? - [ ] API-change - [x] BUG - [ ] Improvement - [ ] Documentation - [ ] Feature - [ ] Test and CI - [ ] Code Refactoring ## Which issue(s) this PR fixes: issue matrixorigin#24846 ## What this PR does / why we need it: This fixes LOAD external scan stats so large LOAD jobs keep row/cardinality semantics for `Cost`, `Outcnt`, `TableCnt`, and `BlockNum`, while preserving `Cost * Rowsize` as the input-size hint used by external scan parallel sizing. Previously the LOAD stats used input bytes as `Cost` with `Rowsize=1` and forced `BlockNum=1` / `TableCnt=1`. That can make large CSV/TBL LOAD choose `AP_ONECN` instead of the expected multi-CN AP path. Tests: ```sh CGO_CFLAGS="-I$(pwd)/cgo -I$(pwd)/thirdparties/install/include" go test ./pkg/sql/plan -count=1 ``` (cherry picked from commit 4426ca7)
XuPeng-SH
left a comment
There was a problem hiding this comment.
I re-checked the latest head. The earlier blockers around row-group planning and the small-file ParallelLoadRequested regression look addressed, but there is still one remaining high-severity issue in the empty-file path.
In the new row-group fanout path, empty parquet files are effectively dropped from execution:
readLoadParquetRowGroupMetadata()only emits metadata rows for actual row groupssplitParquetRowGroupShards()only carries files into a shard when they own at least one row groupcompileExternScanParquetRowGroupFanout()then executes only those sharded files
That means an empty parquet file matched by the load pattern never reaches newParquetHandler() once the plan has switched to row-group fanout.
But the empty-file compatibility check still lives in pkg/sql/colexec/external/parquet.go:63-81: that is where an empty parquet with the wrong column count is supposed to fail. So with the current fanout logic, a wildcard load that matches {empty_bad_schema.parquet, data.parquet(with row groups)} can silently skip the empty file and lose the existing schema/column-count validation instead of rejecting the load.
The current tests actually show this behavior clearly:
pkg/sql/compile/scope_test.go:885-950asserts the fanout scopes contain only the non-empty filepkg/sql/colexec/external/parquet_test.go:3055-3093shows the mismatch is only caught when the empty file is actually opened bynewParquetHandler()
Concrete suggestions
- Validate empty parquet files during footer enumeration before discarding them from row-group metadata.
- Or, carry matched empty files into at least one execution shard so
newParquetHandler()still opens them and applies the existing empty-file checks. - Add one regression test for the mixed case explicitly: load pattern matches both
- an empty parquet with schema/column-count mismatch, and
- a non-empty multi-row-group parquet,
and the whole load must still fail.
Non-blocking follow-up only: I still think an end-to-end case that actually forces the row-group-parallel execution path on a threshold-crossing parquet file would be a useful extra proof after this semantic hole is fixed.
fixed |
XuPeng-SH
left a comment
There was a problem hiding this comment.
Re-checked the latest head. The previous blockers around row-group planning and the small-file ParallelLoadRequested regression are fixed, and the remaining empty-parquet semantic hole is now closed as well: empty parquet files are validated during footer enumeration before being dropped from row-group fanout, with a mixed empty+non-empty regression test to prove it. I do not see a remaining correctness issue that should keep this blocked.
|
Queued — the merge queue status continues in this comment ↓. |
Merge Queue Status
This pull request spent 4 hours 1 minute 7 seconds in the queue, with no time running CI. Waiting for any of
All conditions
ReasonThe merge conditions cannot be satisfied: the checks did not pass within the checks timeout of 4 hours HintYou may have to fix your CI before adding the pull request to the queue again. If you update this pull request, to fix the CI, it will automatically be requeued once the queue conditions match again. Tick the box to put this pull request back in the merge queue (same as
|
What type of PR is this?
Which issue(s) this PR fixes:
issue #24254
What this PR does / why we need it:
Plan Parquet LOAD around file and row-group fanout, carry shard metadata through ExternalScan, and keep S3 prefetch behavior bounded for sharded readers.
Harden unsupported option handling, add Parquet profile stats, and cover compile/runtime/BVT regressions for schema, conversion, rollback, and parallel-load paths.