
fix(turbomind): fix dimension mismatch in ApplyTokenBitmaskInplace#4456

Open
windreamer wants to merge 1 commit into InternLM:main from windreamer:fix/guided-decoding-dimension-mismatch

Conversation

@windreamer
Collaborator

The bug occurs when the batch contains requests that have finished generation but are still waiting in the batch for synchronization. In this case, generation_size (the number of requests still generating) is less than matchers.size() (the total number of requests in the batch).

Root cause:

  • In generation.cc Forward(), logits is sliced to match generation_size: env.produce("logits", logits.slice(0, gs));
  • But in guided_decoding.cc ApplyMask(), bitmask was sliced using d.matchers.size() instead of the actual logits batch size.

This causes the TM_CHECK(logits_shape.first == bitmask_shape.first) assertion in apply_token_bitmask_inplace_cuda.cu to fail whenever the two batch dimensions differ.

fix: #4453

Contributor

Copilot AI left a comment


Pull request overview

This PR fixes a Turbomind guided-decoding crash caused by a batch-dimension mismatch between sliced logits (only active/generating requests) and the guided-decoding bitmask (previously sliced using the full batch request count).

Changes:

  • Update GuidedDecoding::ApplyMask() to slice the bitmask using logits.shape(0) rather than d.matchers.size().
  • Add clarifying comments explaining why logits.shape(0) is the correct dimension to use when some requests in the batch are no longer generating.


@windreamer windreamer changed the title fix(turbomind): fix dimension mismatch in ApplyTokenBitmaskInplace [DO NOT MERGE] fix(turbomind): fix dimension mismatch in ApplyTokenBitmaskInplace Mar 26, 2026
@windreamer windreamer force-pushed the fix/guided-decoding-dimension-mismatch branch 5 times, most recently from 8fe41f0 to 20de205 Compare March 26, 2026 06:31
@windreamer windreamer changed the title [DO NOT MERGE] fix(turbomind): fix dimension mismatch in ApplyTokenBitmaskInplace fix(turbomind): fix dimension mismatch in ApplyTokenBitmaskInplace Mar 26, 2026
Fix: Use logits.shape(0) instead of d.matchers.size() to ensure the
bitmask has the same batch dimension as logits.

Co-authored-by: openhands <openhands@all-hands.dev>
@windreamer windreamer force-pushed the fix/guided-decoding-dimension-mismatch branch from 20de205 to 270e541 Compare March 26, 2026 10:08

Development

Successfully merging this pull request may close these issues.

[Bug] Turbomind random crash when handle batch request
