fix: cross-CN shuffle join INSERT...SELECT hang on multi-CN cluster (#24919) #25158

Open: ck89119 wants to merge 2 commits into matrixorigin:3.0-dev from ck89119:fix-cross-cn-shuffle-hang-3.0-dev

ck89119 commented Jun 25, 2026

Contributor

What type of PR is this?

  • API-change
  • BUG
  • Improvement
  • Documentation
  • Feature
  • Test and CI
  • Code Refactoring

Which issue(s) this PR fixes:

issue #24919

What this PR does / why we need it:

Fixes a silent deadlock where INSERT ... SELECT with cross-CN shuffle hash join (shuffle: range(...)) over large tables (>=5M rows/table) hangs forever on multi-CN clusters. The same statement completes in ~3-6s on a single-node deployment.

Root cause

newShuffleJoinScopeList leaves a CN's dop join buckets as independent RemoteRun trees, while the shuffle dispatch attaches to only the first bucket. When compileMultiUpdate (used by large INSERT...SELECT) sends each bucket individually via newMergeScope:

  1. checkPipelineStandaloneExecutableAtRemote detects the dispatch's LocalRegs pointing to out-of-tree sibling buckets → correctly returns false
  2. RemoteRun converts the pipeline to local on the coordinator
  3. The dispatch runs on the coordinator instead of its compile-time CN, mispaired with the cross-CN receiver's FromAddr
  4. On the remote receivers, GetProcByUuid busy-spins (300s timeout) and the merge then sits idle in WaitingEnd. Because that wait runs under context.TODO(), it cannot be cancelled, so the statement deadlocks
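
Step 1 above can be illustrated with a toy model. The sketch below is not MatrixOne code: `Scope`, `DispatchTo`, and `standaloneExecutable` are hypothetical names invented for illustration, and the real checkPipelineStandaloneExecutableAtRemote may differ in detail. It only shows the general idea that a tree is safe to ship to a remote node when every dispatch target lies inside that tree.

```go
package main

import "fmt"

// Scope is a toy stand-in for a compiled pipeline scope.
// It is NOT MatrixOne's Scope type; all names here are hypothetical.
type Scope struct {
	Name     string
	Children []*Scope
	// DispatchTo lists the scopes whose local registers this scope's
	// dispatch writes to (empty when the scope has no dispatch).
	DispatchTo []*Scope
}

// collect records every scope reachable from root.
func collect(root *Scope, seen map[*Scope]bool) {
	if root == nil || seen[root] {
		return
	}
	seen[root] = true
	for _, c := range root.Children {
		collect(c, seen)
	}
}

// standaloneExecutable reports whether every dispatch target in the tree
// rooted at root is itself part of that tree.
func standaloneExecutable(root *Scope) bool {
	seen := map[*Scope]bool{}
	collect(root, seen)
	for s := range seen {
		for _, target := range s.DispatchTo {
			if !seen[target] {
				return false // target is an out-of-tree sibling
			}
		}
	}
	return true
}

func main() {
	// Two join buckets on the same CN; the dispatch hangs under bucket0 only
	// but feeds both buckets.
	b0 := &Scope{Name: "bucket0"}
	b1 := &Scope{Name: "bucket1"}
	dispatch := &Scope{Name: "dispatch", DispatchTo: []*Scope{b0, b1}}
	b0.Children = []*Scope{dispatch}

	// Sending bucket0 alone: bucket1 is outside the tree.
	fmt.Println(standaloneExecutable(b0)) // false

	// Sending both buckets under one container: all targets are in-tree.
	group := &Scope{Name: "cn-group", Children: []*Scope{b0, b1}}
	fmt.Println(standaloneExecutable(group)) // true
}
```

In this model, sending `bucket0` on its own fails the check because its dispatch also feeds `bucket1`, which mirrors the situation that causes RemoteRun to fall back to local execution.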

Fix

Add groupShuffleBucketsByCNIfNeeded (gated: multi-CN + cross-CN dispatch subtree present), which reuses mergeScopesByCN to group same-CN shuffle buckets, together with their nested dispatch, into one per-CN send unit. When the whole group is sent via a single RemoteRun to its target CN:

  • checkPipelineStandaloneExecutableAtRemote returns true (all dispatch LocalRegs are within the same tree)
  • The dispatch runs on the correct CN, completing the cross-CN receiver handshake
  • No changes to the RemoteRun protocol, dispatch operator, or checkPipeline logic
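
The grouping step can be sketched as follows. This is a simplified illustration, not the PR's implementation: `Bucket`, `SendUnit`, `groupByCN`, and the CN addresses are hypothetical, and the real mergeScopesByCN operates on compiled scopes rather than plain structs.

```go
package main

import "fmt"

// Bucket is a hypothetical stand-in for one shuffle join bucket scope.
type Bucket struct {
	ID     int
	CNAddr string // address of the CN the bucket was compiled for
}

// SendUnit is a hypothetical per-CN container that is sent as one unit.
type SendUnit struct {
	CNAddr  string
	Buckets []Bucket
}

// groupByCN groups buckets by CN address, keeping CNs in first-seen order
// and buckets in their original order within each CN.
func groupByCN(buckets []Bucket) []SendUnit {
	index := map[string]int{} // CN address -> position in units
	var units []SendUnit
	for _, b := range buckets {
		i, ok := index[b.CNAddr]
		if !ok {
			i = len(units)
			index[b.CNAddr] = i
			units = append(units, SendUnit{CNAddr: b.CNAddr})
		}
		units[i].Buckets = append(units[i].Buckets, b)
	}
	return units
}

func main() {
	// Made-up addresses for a 2-CN cluster with two buckets per CN.
	buckets := []Bucket{
		{ID: 0, CNAddr: "cn1:6002"},
		{ID: 1, CNAddr: "cn1:6002"},
		{ID: 2, CNAddr: "cn2:6002"},
		{ID: 3, CNAddr: "cn2:6002"},
	}
	for _, u := range groupByCN(buckets) {
		fmt.Println(u.CNAddr, len(u.Buckets))
	}
}
```

With four buckets spread over two CNs, the sketch produces two send units instead of four individual sends, so each CN receives its buckets and their dispatch together.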

The grouping is a no-op for single-CN / non-shuffle inserts (confirmed by gating UT).
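
A rough sketch of the gating condition, under stated assumptions: `Node`, `RemoteTargets`, `hasCrossCNDispatch`, and `needsGrouping` are hypothetical names, and the toy walks the whole tree rather than relying on the position of the dispatch operator as the real scopeTreeHasCrossCNDispatch may do. It shows only that grouping applies when the cluster has more than one CN and some subtree dispatches to a different CN.

```go
package main

import "fmt"

// Node is a toy scope node; it is not MatrixOne's Scope type.
type Node struct {
	CNAddr string
	// RemoteTargets holds the CN addresses this node's dispatch sends to
	// (empty when the node has no dispatch).
	RemoteTargets []string
	Children      []*Node
}

// hasCrossCNDispatch reports whether any node in the tree dispatches to a
// CN other than the one it runs on.
func hasCrossCNDispatch(n *Node) bool {
	if n == nil {
		return false
	}
	for _, t := range n.RemoteTargets {
		if t != n.CNAddr {
			return true
		}
	}
	for _, c := range n.Children {
		if hasCrossCNDispatch(c) {
			return true
		}
	}
	return false
}

// needsGrouping is the gate: multi-CN and at least one cross-CN dispatch.
func needsGrouping(cnCount int, roots []*Node) bool {
	if cnCount <= 1 {
		return false // single-CN: grouping is a no-op
	}
	for _, r := range roots {
		if hasCrossCNDispatch(r) {
			return true
		}
	}
	return false
}

func main() {
	plain := &Node{CNAddr: "cn1"}
	shuffled := &Node{
		CNAddr:   "cn1",
		Children: []*Node{{CNAddr: "cn1", RemoteTargets: []string{"cn1", "cn2"}}},
	}

	fmt.Println(needsGrouping(1, []*Node{shuffled})) // false: single CN
	fmt.Println(needsGrouping(2, []*Node{plain}))    // false: no cross-CN dispatch
	fmt.Println(needsGrouping(2, []*Node{shuffled})) // true
}
```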

Changes (2 files, +171 lines)

  • pkg/sql/compile/compile.go (+62): groupShuffleBucketsByCNIfNeeded + scopeTreeHasCrossCNDispatch helpers, wired into compileInsert and compileMultiUpdate (4 call sites: toWriteS3 and !toWriteS3 paths)
  • pkg/sql/compile/remoterun_test.go (+109): unit tests that reproduce checkPipeline returning false before the fix, verify that per-CN containers return true after the fix, and confirm the gating (no regressions for single-CN / non-shuffle)

Verification

  • 2-CN docker cluster (etc/docker-multi-cn-local-disk): 5M rows × 2 consecutive runs complete in ~17s each, count(*)=5,000,000, data fully correct (distinct_id=5,000,000, bad_pad=0, bad_k=0). Previously hung forever.
  • Unit tests: -race pass; new helper coverage 100%; go vet clean; go build ./pkg/sql/... passes
  • Gating: single-CN and non-shuffle paths unaffected (UT confirms no-op when no cross-CN dispatch)

🤖 Generated with Claude Code

…atrixorigin#24919)

Root cause: newShuffleJoinScopeList leaves a CN's dop join buckets in
separate RemoteRun trees while the shuffle dispatch attaches to only the
first one. In compileMultiUpdate (used by large INSERT...SELECT),
newMergeScope sends each bucket individually -> checkPipeline detects
dispatch.LocalRegs pointing to out-of-tree sibling buckets -> converts to
local on coordinator -> dispatch runs on wrong CN, mispaired with
compile-time cross-CN receiver FromAddr -> remote GetProcByUuid spins /
merge WaitingEnd waits forever -> 5M+ rows hang.

Fix: add groupShuffleBucketsByCNIfNeeded (gating: multi-CN + cross-CN
dispatch present) which merges same-CN shuffle buckets and their nested
dispatch into one per-CN send unit via mergeScopesByCN. When the whole
group is sent as one RemoteRun unit to its target CN, the dispatch runs
on the correct CN, checkPipeline returns true, and the cross-CN receiver
handshake completes normally. Noop for single-CN / non-shuffle inserts.

Wired into compileInsert and compileMultiUpdate (toWriteS3 + !toWriteS3,
4 call sites). Verified on 2-CN docker cluster: 5Mx2 consecutive runs
complete in ~17s with count(*)=5,000,000 and correct data; previously
hung forever.

Co-Authored-By: Claude <noreply@anthropic.com>

- document scopeTreeHasCrossCNDispatch precondition (dispatch is always RootOp)
- document that grouping preserves the caller's existing root operator via AppendChild
- extend grouping to the compileInsert S3 sink-scan path: dataScope.MergeRun
  still sends each bucket individually, so a cross-CN shuffle dispatch there
  would hit the same convert-to-local hang; group same-CN buckets first

Co-Authored-By: Claude <noreply@anthropic.com>

Labels

kind/bug (Something isn't working), size/M (Denotes a PR that changes [100,499] lines)
