[Release/3.4][Operator] cherry-pick mm out_dtype dynamic path#79285
addmm incorrectly cast alpha/beta to the tensor dtype (bf16/fp16) before passing them to cuBLAS, causing significant scalar precision loss (e.g. alpha=2.9270 → bf16(2.921875), a relative error of about 0.17%). Use the MPTypeTrait<T>::Type pattern (same as baddbmm) to keep scalars in float32 for half-precision types, matching PyTorch's opmath_type behavior.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
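To make the rounding figure in the commit message concrete, here is a minimal sketch in plain Python (not Paddle or cuBLAS code) that rounds a float32 scalar to bfloat16 using round-to-nearest-even. The helper name `round_to_bf16` is hypothetical and exists only for this illustration; NaN and infinity handling is deliberately left out.

```python
import struct


def round_to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (ties to even).

    Hypothetical helper for illustration only; NaN/inf are not handled.
    """
    # Reinterpret the float32 encoding of x as a 32-bit unsigned integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bfloat16 keeps the top 16 bits. Adding 0x7FFF plus the lowest kept bit
    # implements round-to-nearest-even on the 16 bits that are dropped.
    bias = 0x7FFF + ((bits >> 16) & 1)
    kept = (((bits + bias) & 0xFFFFFFFF) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", kept))[0]


alpha = 2.9270
alpha_bf16 = round_to_bf16(alpha)
relative_error = abs(alpha - alpha_bf16) / alpha

print(alpha_bf16)  # 2.921875
print(f"relative error: {relative_error:.4%}")
```

Keeping alpha and beta in float32 avoids this rounding step entirely, which is the point of the change described above.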
Codecov Report
❌ Patch coverage is 32.35%, which is below the target coverage of 90.00%, so the patch status has failed. You can increase the patch coverage or adjust the target coverage.
Additional details and impacted files
@@ Coverage Diff @@
## release/3.4 #79285 +/- ##
==============================================
Coverage ? 32.35%
==============================================
Files ? 2
Lines ? 34
Branches ? 0
==============================================
Hits ? 11
Misses ? 23
Partials ? 0
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review |
2026-06-10 15:29:50
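📋 Review Summary
PR overview: Adds minimal out_dtype support to dynamic-graph paddle.mm for CUDA BF16 input with FP32 output, while keeping static graph/PIR fail closed.
Scope of changes: Phi InferMeta, the CUDA matmul kernel/BLAS path, dygraph YAML, the Python API, and legacy unit tests.
Impact tags: [Operator Mechanism] [User Experience] [Performance Optimization]
Issues
No blocking issues found. PR convention issues are reported in the section below and are not repeated here.
📝 PR Convention Check
Compliant. The title contains a valid Tag, the description includes a link to the develop PR being cherry-picked to the release branch, and the precision-change field is filled in as "No".
Overall Assessment
This round reviewed, in risk-priority order, the Python entry point of the new mm_out_dtype, the dygraph YAML, InferMeta, CUDA kernel registration, the BF16 GEMM call chain, the out= path, and the new tests. No blocking findings have emerged so far. Because of the time limit, the PIR/codegen generated artifacts and a wider hardware matrix were not examined further and are left for a later, deeper review.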
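The fail-closed behavior mentioned in the review can be sketched as below. This is an illustrative plain-Python model, not Paddle's implementation: `MmCall`, `resolve_out_dtype`, the string dtype names, and the choice of `NotImplementedError` are all assumptions made for the example. The only facts taken from the PR are the supported combination (dynamic graph, CUDA, 2-D BF16 x BF16 -> FP32) and that everything else is rejected.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MmCall:
    """Hypothetical description of one mm call; not a Paddle type."""
    dynamic_mode: bool
    device: str
    x_dtype: str
    y_dtype: str
    x_ndim: int
    y_ndim: int
    out_dtype: Optional[str] = None


def resolve_out_dtype(call: MmCall) -> str:
    """Return the output dtype, rejecting every unsupported out_dtype request."""
    if call.out_dtype is None:
        # No out_dtype requested: the pre-existing behavior is untouched.
        return call.x_dtype
    supported = (
        call.dynamic_mode
        and call.device == "cuda"
        and call.x_dtype == "bfloat16"
        and call.y_dtype == "bfloat16"
        and call.x_ndim == 2
        and call.y_ndim == 2
        and call.out_dtype == "float32"
    )
    if not supported:
        # Fail closed: raise instead of silently ignoring out_dtype.
        raise NotImplementedError(
            "out_dtype is only supported for dynamic-graph CUDA 2-D "
            "bfloat16 x bfloat16 -> float32"
        )
    return "float32"


ok = MmCall(True, "cuda", "bfloat16", "bfloat16", 2, 2, "float32")
print(resolve_out_dtype(ok))  # float32
```

The design choice worth noting is that the allow-list is a single conjunction: any new dimension of variation (device, rank, dtype, graph mode) is rejected by default until it is added explicitly.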
PR Category
Operator Mechanism
PR Types
Improvements
Description
This PR is a complete manual cherry-pick, onto release/3.4, of the paddle.mm dynamic-only out_dtype change from develop PR #79252. It replaces the automatic cherry-pick PR #79282, whose CI run exposed problems. Main changes:
- paddle.mm adds temporary, minimal out_dtype support, limited to CUDA 2-D BF16 x BF16 -> FP32.
- out_dtype fails closed.
- mm_out_dtype dygraph op, InferMeta, CUDA PHI kernel, and the BF16-input, FP32-output cuBLAS GEMM path.
- convert_nptype_to_datatype_or_vartype export, fixing the NameError that test_static_out_dtype_fails_closed hit in the automatic cherry-pick.
Verification notes:
- git diff --check.
devPR: #79252 (comment)
pcard-91067
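As a rough illustration of why a BF16 x BF16 -> FP32 path can matter, the sketch below emulates a tiny 2-D matmul in plain Python. It is not the Paddle kernel or the cuBLAS path: `mm_emulated`, `to_fp32`, and `to_bf16` are hypothetical helpers, accumulation is assumed to happen in float32, and the only difference between the two modes is whether the final result is rounded to bfloat16.

```python
import struct


def to_fp32(v: float) -> float:
    """Round a Python float to float32 precision."""
    return struct.unpack("<f", struct.pack("<f", v))[0]


def to_bf16(v: float) -> float:
    """Round to bfloat16 (ties to even); NaN/inf are not handled."""
    bits = struct.unpack("<I", struct.pack("<f", v))[0]
    bias = 0x7FFF + ((bits >> 16) & 1)
    kept = (((bits + bias) & 0xFFFFFFFF) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", kept))[0]


def mm_emulated(x, y, out_dtype="bfloat16"):
    """2-D matmul over nested lists with float32 accumulation (assumed)."""
    rows, inner, cols = len(x), len(y), len(y[0])
    out = []
    for i in range(rows):
        row = []
        for j in range(cols):
            acc = 0.0
            for k in range(inner):
                acc = to_fp32(acc + x[i][k] * y[k][j])
            # Only the output rounding differs between the two modes.
            row.append(acc if out_dtype == "float32" else to_bf16(acc))
        out.append(row)
    return out


# Every input value is exactly representable in bfloat16.
x = [[1.0, 0.00390625]]
y = [[1.0], [0.5]]

print(mm_emulated(x, y, out_dtype="float32"))   # [[1.001953125]]
print(mm_emulated(x, y, out_dtype="bfloat16"))  # [[1.0]]
```

In this toy case the small contribution 2^-9 survives in the float32 output but is rounded away when the result is stored as bfloat16, whose spacing near 1.0 is 2^-7.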
Does this cause a precision change?
No