mirror: [model] fix: apply attention_value_scale to V in MiMo-V2-Flash#4299
mirror: [model] fix: apply attention_value_scale to V in MiMo-V2-Flash#4299ko3n1g wants to merge 4 commits into
Conversation
The attention_value_scale field is read from the HF config and stored on MiMoV2FlashTEDotProductAttention as `self._attention_value_scale`, but never applied to the value tensor in the forward path. As a result the factor (0.707 in the public XiaomiMiMo/MiMo-V2-Flash checkpoint) is silently dropped, leaving the attention output scaled by ~1.414x relative to the reference implementation. The scale cannot be folded into `softmax_scale` (softmax is non-linear in its argument) and the linear_proj weight is mapped 1:1 from HF `o_proj.weight`, so the only correct place to apply it is on V (or equivalently on the attention output). This mirrors the HF reference modeling code and vLLM's MiMo-V2 model runner, both of which multiply V by `attention_value_scale` before the attention kernel. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Leo <hitler594588@163.com>
Adds unit tests for the MiMoV2FlashTEDotProductAttention.forward override that applies attention_value_scale to V before the attention kernel. Tests verify: - Scale is applied (V passed to super() equals input * scale) - None scale leaves V passing through unchanged (same object) - Caller's V buffer is not mutated in place - Q and K are forwarded untouched - Extra kwargs are forwarded to super() - super()'s return value is propagated The tests bypass TransformerEngine entirely by using object.__new__ to build an instance without invoking __init__, then patching the parent TEDotProductAttention.forward to capture what gets forwarded. This keeps the tests CPU-only and dependency-free. Closes the codecov/patch coverage gap on the two-line fix in src/megatron/bridge/models/mimo_v2_flash/modeling_mimo_v2_flash.py. Signed-off-by: Leo <hitler594588@163.com>
Light Code ReviewFix (modeling_mimo_v2_flash.py): Clean and correct. The _attention_value_scale was already read from config in init (line 256) but never applied in forward() — this 2-line fix closes that gap. value = value * self._attention_value_scale correctly creates a new tensor (no in-place mutation), which is safe for autograd. The MiMoV2FlashMTPTEDotProductAttention subclass inherits this fix automatically since it does not override forward(). Tests (test_modeling_mimo_v2_flash.py): Thorough unit test suite that covers the key cases: scale applied, various scale values, None passthrough (identity check via is), no in-place mutation, Q/K untouched, kwargs forwarding, and return value propagation. The object.new + patched super approach is a clean way to avoid the TE/CUDA dependency in CPU-only tests. No issues found — LGTM. Suggested test cases: No perf tests impacted. |
|
/ok to test c797336 |
Replace the CPU-only mock-based tests with proper GPU tests that construct a real MiMoV2FlashTEDotProductAttention instance and run actual TE attention kernels. The previous tests failed in CI because autospec=True enforced the parent TEDotProductAttention.forward signature (which lacks **kwargs), rejecting the extra_kwarg test argument. The new tests: - Build a real module on GPU with TransformerConfig + TE - Intercept the value tensor reaching the parent forward to verify scaling - Cover: scale applied, parametrized scales, None passthrough, no in-place mutation, Q/K unchanged, output shape, scale sensitivity Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Eisenhower <eisenhower@users.noreply.github.com>
|
/ok to test 650e37b |
Claude summary
Mirror of #4155 by @Eisenhower — copied into the upstream repo so the full CI pipeline runs natively (cross-fork PRs cannot trigger it).
Eisenhower/Megatron-Bridge-MiMo:fix/mimo-v2-flash-value-scaleCommits are copied verbatim with authorship preserved. Review and discussion remain on #4155.