fix: cross-CN shuffle join INSERT...SELECT hang on multi-CN cluster (#24919) #25158
Open
ck89119 wants to merge 2 commits into
…atrixorigin#24919)

Root cause: newShuffleJoinScopeList leaves a CN's dop join buckets in separate RemoteRun trees while the shuffle dispatch attaches to only the first one. In compileMultiUpdate (used by large INSERT...SELECT), newMergeScope sends each bucket individually -> checkPipeline detects dispatch.LocalRegs pointing to out-of-tree sibling buckets -> converts to local on coordinator -> dispatch runs on wrong CN, mispaired with compile-time cross-CN receiver FromAddr -> remote GetProcByUuid spins / merge WaitingEnd waits forever -> 5M+ rows hang.

Fix: add groupShuffleBucketsByCNIfNeeded (gating: multi-CN + cross-CN dispatch present), which merges same-CN shuffle buckets and their nested dispatch into one per-CN send unit via mergeScopesByCN. When the whole group is sent as one RemoteRun unit to its target CN, the dispatch runs on the correct CN, checkPipeline returns true, and the cross-CN receiver handshake completes normally. No-op for single-CN / non-shuffle inserts. Wired into compileInsert and compileMultiUpdate (toWriteS3 + !toWriteS3, 4 call sites).

Verified on 2-CN docker cluster: 5M x 2 consecutive runs complete in ~17s with count(*)=5,000,000 and correct data; previously hung forever.

Co-Authored-By: Claude <noreply@anthropic.com>
- document scopeTreeHasCrossCNDispatch precondition (dispatch is always RootOp)
- document that grouping preserves the caller's existing root operator via AppendChild
- extend grouping to the compileInsert S3 sink-scan path: dataScope.MergeRun still sends each bucket individually, so a cross-CN shuffle dispatch there would hit the same convert-to-local hang; group same-CN buckets first

Co-Authored-By: Claude <noreply@anthropic.com>
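The precondition in the first bullet can be pictured with a small model. The sketch below is not the MatrixOne implementation: `Op`, `Scope`, `PreScopes`, `RemoteAddrs`, and `hasCrossCNDispatch` are hypothetical names chosen for illustration. It only shows why a tree walk that inspects each scope's root operator is sufficient when a dispatch can only ever be the root operator.

```go
package main

import "fmt"

// Op and Scope are toy stand-ins (assumptions), not MatrixOne types.
type Op struct {
	Kind        string   // e.g. "dispatch", "join", "scan"
	RemoteAddrs []string // CN addresses of this operator's receivers
}

type Scope struct {
	Addr      string   // CN this scope is planned to run on
	RootOp    Op       // last operator of the pipeline
	PreScopes []*Scope // child scopes feeding this one
}

// hasCrossCNDispatch reports whether any scope in the tree ends in a
// dispatch with at least one receiver on a different CN. It looks only at
// RootOp, relying on the precondition that a dispatch is always the root
// operator of its scope.
func hasCrossCNDispatch(s *Scope) bool {
	if s == nil {
		return false
	}
	if s.RootOp.Kind == "dispatch" {
		for _, addr := range s.RootOp.RemoteAddrs {
			if addr != s.Addr {
				return true
			}
		}
	}
	for _, p := range s.PreScopes {
		if hasCrossCNDispatch(p) {
			return true
		}
	}
	return false
}

func main() {
	// A dispatch whose only receiver is on the same CN.
	local := &Scope{Addr: "cn1", RootOp: Op{Kind: "dispatch", RemoteAddrs: []string{"cn1"}}}

	// A join scope fed by a dispatch that sends to another CN.
	cross := &Scope{
		Addr:   "cn1",
		RootOp: Op{Kind: "join"},
		PreScopes: []*Scope{
			{Addr: "cn1", RootOp: Op{Kind: "dispatch", RemoteAddrs: []string{"cn2"}}},
		},
	}

	fmt.Println(hasCrossCNDispatch(local), hasCrossCNDispatch(cross))
}
```

If a dispatch could appear in the middle of a pipeline, a walk like this would miss it, which is presumably why the precondition is worth documenting.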
What type of PR is this?
Which issue(s) this PR fixes:
issue #24919
What this PR does / why we need it:
Fixes a silent deadlock where `INSERT ... SELECT` with cross-CN shuffle hash join (`shuffle: range(...)`) over large tables (>=5M rows/table) hangs forever on multi-CN clusters. The same statement completes in ~3-6s on a single-node deployment.

Root cause

`newShuffleJoinScopeList` leaves a CN's `dop` join buckets as independent `RemoteRun` trees, while the shuffle dispatch attaches to only the first bucket. When `compileMultiUpdate` (used by large `INSERT...SELECT`) sends each bucket individually via `newMergeScope`:

1. `checkPipelineStandaloneExecutableAtRemote` detects the dispatch's `LocalRegs` pointing to out-of-tree sibling buckets → correctly returns false
2. `RemoteRun` converts the pipeline to local on the coordinator
3. The dispatch then runs on the wrong CN, mispaired with the compile-time cross-CN receiver `FromAddr`
4. Remote `GetProcByUuid` busy-spins endlessly (300s timeout) → idle `WaitingEnd` deadlock (uncancellable `context.TODO()`)

Fix
Add `groupShuffleBucketsByCNIfNeeded` (gated: multi-CN + cross-CN dispatch subtree present), which reuses `mergeScopesByCN` to group same-CN shuffle buckets, together with their nested dispatch, into one per-CN send unit. When the whole group is sent via a single `RemoteRun` to its target CN:

- `checkPipelineStandaloneExecutableAtRemote` returns `true` (all dispatch `LocalRegs` are within the same tree)
- The dispatch runs on the correct CN and the cross-CN receiver handshake completes normally
- No changes to the `RemoteRun` protocol, dispatch operator, or `checkPipeline` logic

The grouping is a no-op for single-CN / non-shuffle inserts (confirmed by gating UT).
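The effect of the grouping on the standalone check can be illustrated with a toy model. This is a simplified sketch under stated assumptions, not the code in `compile.go`: the `Scope` type, `standalone`, and `groupByCN` below are hypothetical stand-ins for the real scope structure, `checkPipelineStandaloneExecutableAtRemote`, and `mergeScopesByCN`.

```go
package main

import "fmt"

// Scope is a toy stand-in (assumption), not the MatrixOne type.
type Scope struct {
	Name      string
	Addr      string   // CN the scope should run on
	Children  []*Scope // nested scopes sent together with this one
	LocalRegs []*Scope // scopes this scope's dispatch feeds locally
}

// contains reports whether target lives in the tree rooted at root.
func contains(root, target *Scope) bool {
	if root == target {
		return true
	}
	for _, c := range root.Children {
		if contains(c, target) {
			return true
		}
	}
	return false
}

// standalone reports whether every local dispatch target found in the
// tree is itself inside that tree (simplified model of the remote check).
func standalone(root *Scope) bool {
	var walk func(s *Scope) bool
	walk = func(s *Scope) bool {
		for _, r := range s.LocalRegs {
			if !contains(root, r) {
				return false
			}
		}
		for _, c := range s.Children {
			if !walk(c) {
				return false
			}
		}
		return true
	}
	return walk(root)
}

// groupByCN wraps all buckets sharing an Addr under one container scope,
// keeping first-seen CN order, so each CN receives a single send unit.
func groupByCN(buckets []*Scope) []*Scope {
	byAddr := map[string]*Scope{}
	var out []*Scope
	for _, b := range buckets {
		g, ok := byAddr[b.Addr]
		if !ok {
			g = &Scope{Name: "group@" + b.Addr, Addr: b.Addr}
			byAddr[b.Addr] = g
			out = append(out, g)
		}
		g.Children = append(g.Children, b)
	}
	return out
}

func main() {
	// Two join buckets on cn1; the dispatch hangs off the first bucket
	// only but feeds both buckets.
	b0 := &Scope{Name: "bucket0", Addr: "cn1"}
	b1 := &Scope{Name: "bucket1", Addr: "cn1"}
	dispatch := &Scope{Name: "dispatch", Addr: "cn1", LocalRegs: []*Scope{b0, b1}}
	b0.Children = []*Scope{dispatch}

	fmt.Println(standalone(b0)) // bucket1 is outside bucket0's tree

	groups := groupByCN([]*Scope{b0, b1})
	fmt.Println(len(groups), standalone(groups[0]))
}
```

In this model, sending `bucket0` alone fails the check because its dispatch references `bucket1`, which is outside the tree. Once both buckets sit under one per-CN container, every `LocalRegs` target is inside the tree and the check passes.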
Changes (2 files, +171 lines)
- `pkg/sql/compile/compile.go` (+62): `groupShuffleBucketsByCNIfNeeded` + `scopeTreeHasCrossCNDispatch` helpers, wired into `compileInsert` and `compileMultiUpdate` (4 call sites: `toWriteS3` and `!toWriteS3` paths)
- `pkg/sql/compile/remoterun_test.go` (+109): UT reproducing the `checkPipeline=false` pre-fix + verifying per-CN containers return `true` post-fix + gating test (no regressions for single-CN / non-shuffle)

Verification

- 2-CN docker cluster (`etc/docker-multi-cn-local-disk`): 5M rows × 2 consecutive runs complete in ~17s each, `count(*)=5,000,000`, data fully correct (`distinct_id=5,000,000`, `bad_pad=0`, `bad_k=0`). Previously hung forever.
- `-race` pass; new helper coverage 100%; `go vet` clean; `go build ./pkg/sql/...` passes

🤖 Generated with Claude Code