Skip to content

mirror: [model] fix: apply attention_value_scale to V in MiMo-V2-Flash#4299

Open
ko3n1g wants to merge 4 commits into
mainfrom
ko3n1g/mirror/pr-4155
Open

mirror: [model] fix: apply attention_value_scale to V in MiMo-V2-Flash#4299
ko3n1g wants to merge 4 commits into
mainfrom
ko3n1g/mirror/pr-4155

Conversation

@ko3n1g

@ko3n1g ko3n1g commented Jun 11, 2026

Copy link
Copy Markdown
Contributor
Claude summary

Mirror of #4155 by @Eisenhower — copied into the upstream repo so the full CI pipeline runs natively (cross-fork PRs cannot trigger it).

Commits are copied verbatim with authorship preserved. Review and discussion remain on #4155.

Eisenhower and others added 3 commits June 4, 2026 20:42
The attention_value_scale field is read from the HF config and stored on
MiMoV2FlashTEDotProductAttention as `self._attention_value_scale`, but
never applied to the value tensor in the forward path. As a result the
factor (0.707 in the public XiaomiMiMo/MiMo-V2-Flash checkpoint) is
silently dropped, leaving the attention output scaled by ~1.414x
relative to the reference implementation.

The scale cannot be folded into `softmax_scale` (softmax is non-linear
in its argument) and the linear_proj weight is mapped 1:1 from HF
`o_proj.weight`, so the only correct place to apply it is on V (or
equivalently on the attention output).

This mirrors the HF reference modeling code and vLLM's MiMo-V2 model
runner, both of which multiply V by `attention_value_scale` before the
attention kernel.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Leo <hitler594588@163.com>
Adds unit tests for the MiMoV2FlashTEDotProductAttention.forward
override that applies attention_value_scale to V before the attention
kernel. Tests verify:

- Scale is applied (V passed to super() equals input * scale)
- None scale leaves V passing through unchanged (same object)
- Caller's V buffer is not mutated in place
- Q and K are forwarded untouched
- Extra kwargs are forwarded to super()
- super()'s return value is propagated

The tests bypass TransformerEngine entirely by using object.__new__ to
build an instance without invoking __init__, then patching the parent
TEDotProductAttention.forward to capture what gets forwarded. This
keeps the tests CPU-only and dependency-free.

Closes the codecov/patch coverage gap on the two-line fix in
src/megatron/bridge/models/mimo_v2_flash/modeling_mimo_v2_flash.py.

Signed-off-by: Leo <hitler594588@163.com>
@copy-pr-bot

copy-pr-bot Bot commented Jun 11, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@claude

claude Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Light Code Review

Fix (modeling_mimo_v2_flash.py): Clean and correct. The _attention_value_scale was already read from config in init (line 256) but never applied in forward() — this 2-line fix closes that gap. value = value * self._attention_value_scale correctly creates a new tensor (no in-place mutation), which is safe for autograd. The MiMoV2FlashMTPTEDotProductAttention subclass inherits this fix automatically since it does not override forward().

Tests (test_modeling_mimo_v2_flash.py): Thorough unit test suite that covers the key cases: scale applied, various scale values, None passthrough (identity check via is), no in-place mutation, Q/K untouched, kwargs forwarding, and return value propagation. The object.new + patched super approach is a clean way to avoid the TE/CUDA dependency in CPU-only tests.

No issues found — LGTM.

Suggested test cases: No perf tests impacted.

@ko3n1g

ko3n1g commented Jun 11, 2026

Copy link
Copy Markdown
Contributor Author

/ok to test c797336

@yaoyu-33 yaoyu-33 added area:model Model implementations and HF bridge logic bug Something isn't working community-request needs-review PR is ready for code review and waiting on a reviewer labels Jun 11, 2026
Replace the CPU-only mock-based tests with proper GPU tests that
construct a real MiMoV2FlashTEDotProductAttention instance and run
actual TE attention kernels.

The previous tests failed in CI because autospec=True enforced the
parent TEDotProductAttention.forward signature (which lacks **kwargs),
rejecting the extra_kwarg test argument.

The new tests:
- Build a real module on GPU with TransformerConfig + TE
- Intercept the value tensor reaching the parent forward to verify scaling
- Cover: scale applied, parametrized scales, None passthrough,
  no in-place mutation, Q/K unchanged, output shape, scale sensitivity

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Eisenhower <eisenhower@users.noreply.github.com>
@yaoyu-33

Copy link
Copy Markdown
Contributor

/ok to test 650e37b

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:model Model implementations and HF bridge logic bug Something isn't working needs-review PR is ready for code review and waiting on a reviewer

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants