mirror: [model] fix: apply attention_value_scale to V in MiMo-V2-Flash by ko3n1g · Pull Request #4299 · NVIDIA-NeMo/Megatron-Bridge

ko3n1g · 2026-06-11T06:36:24Z

Claude summary

Mirror of #4155 by @Eisenhower — copied into the upstream repo so the full CI pipeline runs natively (cross-fork PRs cannot trigger it).

Original PR: [model] fix: apply attention_value_scale to V in MiMo-V2-Flash #4155
Author: @Eisenhower
Source: Eisenhower/Megatron-Bridge-MiMo:fix/mimo-v2-flash-value-scale

Commits are copied verbatim with authorship preserved. Review and discussion remain on #4155.

The attention_value_scale field is read from the HF config and stored on MiMoV2FlashTEDotProductAttention as `self._attention_value_scale`, but never applied to the value tensor in the forward path. As a result the factor (0.707 in the public XiaomiMiMo/MiMo-V2-Flash checkpoint) is silently dropped, leaving the attention output scaled by ~1.414x relative to the reference implementation. The scale cannot be folded into `softmax_scale` (softmax is non-linear in its argument) and the linear_proj weight is mapped 1:1 from HF `o_proj.weight`, so the only correct place to apply it is on V (or equivalently on the attention output). This mirrors the HF reference modeling code and vLLM's MiMo-V2 model runner, both of which multiply V by `attention_value_scale` before the attention kernel. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Leo <hitler594588@163.com>

Adds unit tests for the MiMoV2FlashTEDotProductAttention.forward override that applies attention_value_scale to V before the attention kernel. Tests verify: - Scale is applied (V passed to super() equals input * scale) - None scale leaves V passing through unchanged (same object) - Caller's V buffer is not mutated in place - Q and K are forwarded untouched - Extra kwargs are forwarded to super() - super()'s return value is propagated The tests bypass TransformerEngine entirely by using object.__new__ to build an instance without invoking __init__, then patching the parent TEDotProductAttention.forward to capture what gets forwarded. This keeps the tests CPU-only and dependency-free. Closes the codecov/patch coverage gap on the two-line fix in src/megatron/bridge/models/mimo_v2_flash/modeling_mimo_v2_flash.py. Signed-off-by: Leo <hitler594588@163.com>

copy-pr-bot · 2026-06-11T06:36:28Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

claude · 2026-06-11T06:38:16Z

Light Code Review

Fix (modeling_mimo_v2_flash.py): Clean and correct. The _attention_value_scale was already read from config in init (line 256) but never applied in forward() — this 2-line fix closes that gap. value = value * self._attention_value_scale correctly creates a new tensor (no in-place mutation), which is safe for autograd. The MiMoV2FlashMTPTEDotProductAttention subclass inherits this fix automatically since it does not override forward().

Tests (test_modeling_mimo_v2_flash.py): Thorough unit test suite that covers the key cases: scale applied, various scale values, None passthrough (identity check via is), no in-place mutation, Q/K untouched, kwargs forwarding, and return value propagation. The object.new + patched super approach is a clean way to avoid the TE/CUDA dependency in CPU-only tests.

No issues found — LGTM.

Suggested test cases: No perf tests impacted.

ko3n1g · 2026-06-11T06:46:17Z

/ok to test c797336

Replace the CPU-only mock-based tests with proper GPU tests that construct a real MiMoV2FlashTEDotProductAttention instance and run actual TE attention kernels. The previous tests failed in CI because autospec=True enforced the parent TEDotProductAttention.forward signature (which lacks **kwargs), rejecting the extra_kwarg test argument. The new tests: - Build a real module on GPU with TransformerConfig + TE - Intercept the value tensor reaching the parent forward to verify scaling - Cover: scale applied, parametrized scales, None passthrough, no in-place mutation, Q/K unchanged, output shape, scale sensitivity Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Eisenhower <eisenhower@users.noreply.github.com>

yaoyu-33 · 2026-06-18T03:30:18Z

/ok to test 650e37b

Eisenhower and others added 3 commits June 4, 2026 20:42

Merge branch 'main' into fix/mimo-v2-flash-value-scale

c797336

copy-pr-bot Bot temporarily deployed to public June 11, 2026 06:46 Inactive

copy-pr-bot Bot temporarily deployed to test June 11, 2026 06:47 Inactive

copy-pr-bot Bot temporarily deployed to public June 11, 2026 06:56 Inactive

yaoyu-33 added area:model Model implementations and HF bridge logic bug Something isn't working community-request needs-review PR is ready for code review and waiting on a reviewer labels Jun 11, 2026

copy-pr-bot Bot temporarily deployed to public June 11, 2026 07:18 Inactive

github-actions Bot removed the community-request label Jun 18, 2026

copy-pr-bot Bot temporarily deployed to public June 18, 2026 03:30 Inactive

copy-pr-bot Bot temporarily deployed to public June 18, 2026 03:40 Inactive

copy-pr-bot Bot temporarily deployed to public June 18, 2026 04:00 Inactive

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

mirror: [model] fix: apply attention_value_scale to V in MiMo-V2-Flash#4299

mirror: [model] fix: apply attention_value_scale to V in MiMo-V2-Flash#4299
ko3n1g wants to merge 4 commits into
mainfrom
ko3n1g/mirror/pr-4155

ko3n1g commented Jun 11, 2026

Uh oh!

copy-pr-bot Bot commented Jun 11, 2026

Uh oh!

claude Bot commented Jun 11, 2026

Uh oh!

ko3n1g commented Jun 11, 2026

Uh oh!

yaoyu-33 commented Jun 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

ko3n1g commented Jun 11, 2026

Uh oh!

copy-pr-bot Bot commented Jun 11, 2026

Uh oh!

claude Bot commented Jun 11, 2026

Light Code Review

Uh oh!

ko3n1g commented Jun 11, 2026

Uh oh!

yaoyu-33 commented Jun 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants