[Operator Mechanism] Support mm out_dtype for BF16 CUDA #79252
Add a narrow CUDA BF16 x BF16 -> FP32 path for paddle.mm(out_dtype=paddle.float32), including schema, infermeta, stride dispatch, fused cuBLAS GEMM, and focused tests. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
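CI report generated from the following code (refreshed every 30 minutes):
1 Required tasks: 41/48 passed
2 Failure details
🔴 Static-Check / Test: PR issue (confidence: high)
Error type: PR issue | Confidence: high
Key logs:
Suggested fix:
Related changes:
🔴 Coverage test: PR issue (confidence: high)
Error type: PR issue | Confidence: high
Key logs:
Suggested fix:
Related changes: added the mm BF16 -> FP32 out_dtype schema, infermeta, CUDA kernel, and Python API branch.
🔴 Check approval: approval required (confidence: high)
Error type: approval required | Confidence: high
Key logs:
Suggested fix:
Related changes: none.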
Use the canonical matmul path for static mm out_dtype handling and keep legacy compatibility attrs limited to supported types. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Keep matmul compatible with unknown symbolic dimensions and legacy matmul_v2 to PIR translation when out_dtype is unset. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Preserve the legacy static mm path when out_dtype is unset and avoid rejecting unknown symbolic matmul dimensions during InferMeta. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
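The idea of not rejecting unknown symbolic dimensions during shape inference can be sketched in plain Python. This is only an illustration, not the InferMeta code from the PR: the function name `infer_mm_shape` is hypothetical, and the sketch assumes unknown dimensions are encoded as -1 and that both inputs are 2-D.

```python
UNKNOWN = -1  # assumption: an unknown/symbolic dimension is encoded as -1


def infer_mm_shape(x_shape, y_shape):
    """Infer the 2-D output shape of x @ y, tolerating unknown dimensions."""
    if len(x_shape) != 2 or len(y_shape) != 2:
        raise ValueError("this sketch only handles 2-D inputs")
    m, kx = x_shape
    ky, n = y_shape
    # Reject a mismatch only when both contraction dimensions are known.
    if kx != UNKNOWN and ky != UNKNOWN and kx != ky:
        raise ValueError(f"contraction dims differ: {kx} vs {ky}")
    return [m, n]


print(infer_mm_shape([4, 8], [8, 16]))
print(infer_mm_shape([UNKNOWN, 8], [UNKNOWN, 16]))
```

The check is deferred rather than dropped: a mismatch between two known contraction dimensions still raises, while an unknown dimension is passed through to the output shape.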
Allow the explicit static out_dtype path to pass BF16 variables through Python validation and feed BF16 static test data using the existing uint16 encoding helper. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
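For readers unfamiliar with feeding BF16 data as uint16, the following sketch shows one common encoding: keep the upper 16 bits of the float32 bit pattern. The function names are hypothetical, and the sketch uses simple truncation; the existing helper referenced in the commit may differ in rounding or in how it handles arrays.

```python
import struct


def float_to_bf16_bits(value):
    """Return the upper 16 bits of the float32 encoding of value (truncation)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    return bits >> 16


def bf16_bits_to_float(bits16):
    """Expand a 16-bit BF16 pattern back to a Python float."""
    (value,) = struct.unpack("<f", struct.pack("<I", (bits16 & 0xFFFF) << 16))
    return value


encoded = float_to_bf16_bits(3.141592653589793)
print(hex(encoded), bf16_bits_to_float(encoded))
```

Because BF16 keeps the float32 exponent and only shortens the mantissa, the round trip preserves magnitude but drops low-order precision.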
Add missing default/propagated out_dtype handling for legacy matmul translation, PIR serialization compatibility, and handwritten PIR/DRR matmul rewrites. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Avoid fusing explicit matmul out_dtype paths in PIR rewrite passes, document BF16 GEMM lda/ldb narrowing safety, and add a legacy matmul_v2 translator regression. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Route static mm out_dtype through matmul_v2 so it reaches the phi matmul kernel, preserve user-provided out tensors, and let legacy matmul_v2 fusion pass compatibility accept only missing/default out_dtype while rejecting explicit output dtype paths. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
/re-run all-failed
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
PaddlePaddle-bot left a comment
🤖 Paddle-CI-Agent | pr_review |
2026-06-08 22:39:26
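📋 Review summary
PR overview: adds a CUDA BF16 x BF16 -> FP32 forward path for paddle.mm(out_dtype=paddle.float32).
Scope of changes: paddle.mm Python API, dygraph op YAML, InferMeta, GPU matmul kernel, cuBLAS BF16 GEMM, legacy unit tests.
Impact tags: [Operator Mechanism] [User Experience]
Issues
| Severity | File | Summary |
|---|---|---|
| 🔴 Bug | paddle/phi/ops/yaml/inconsistent/dygraph_ops.yaml:297 | The paddle.mm(out_dtype=...) branch does not generate a backward, so gradients are lost in training scenarios |
Status of previous findings
| Finding | Issue | Status |
|---|---|---|
| F1 | paddle.mm(..., out_dtype=paddle.float32) fell through to matmul_v2 in the legacy static graph | ✅ Fixed |
📝 PR convention check
Compliant. The PR Category, PR Types, Description, and precision-change fields are all filled in.
Overall assessment
The forward CUDA BF16 GEMM path, InferMeta, and the registration chain are largely consistent, and the earlier problem of legacy static graph calls falling through to matmul_v2 has been closed off with an explicit NotImplementedError. However, the current public API branch loses mm's existing first-order gradient capability. Before merging, either add the backward or explicitly reject gradient-requiring calls at the Python entry point.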
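The second option in the assessment, rejecting gradient-requiring calls at the Python entry point, might look roughly like the guard below. This is a sketch under assumptions, not code from the PR: the function name `reject_grad_with_out_dtype` is hypothetical, and the only tensor property it relies on is the `stop_gradient` attribute. `SimpleNamespace` objects stand in for tensors so the example runs without Paddle.

```python
from types import SimpleNamespace


def reject_grad_with_out_dtype(x, y, out_dtype):
    """Hypothetical guard: refuse out_dtype when an input takes part in autograd."""
    if out_dtype is None:
        return  # default path keeps its normal backward
    for name, tensor in (("x", x), ("y", y)):
        # stop_gradient == False means the tensor requires a gradient.
        if not getattr(tensor, "stop_gradient", True):
            raise NotImplementedError(
                f"mm with out_dtype set has no backward; input '{name}' requires grad"
            )


frozen = SimpleNamespace(stop_gradient=True)    # stand-in for a tensor without grad
trainable = SimpleNamespace(stop_gradient=False)  # stand-in for a trainable tensor

reject_grad_with_out_dtype(frozen, frozen, "float32")  # accepted
try:
    reject_grad_with_out_dtype(trainable, frozen, "float32")
except NotImplementedError as err:
    print(err)
```

A real check would probably also need to consider whether gradient recording is globally disabled at call time, which this sketch leaves out.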
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## develop #79252 +/- ##
==========================================
Coverage ? 32.35%
==========================================
Files ? 2
Lines ? 34
Branches ? 0
==========================================
Hits ? 11
Misses ? 23
Partials ? 0
✅ Cherry-pick successful! Created PR: #79282
PR Category
Operator Mechanism
PR Types
New features
Description
Temporarily add a narrow CUDA BF16 x BF16 -> FP32 path for paddle.mm(out_dtype=paddle.float32), including schema, infermeta, stride dispatch, fused cuBLAS GEMM, and focused tests.
pcard-91067
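To illustrate why a BF16 x BF16 -> FP32 output can matter numerically, here is a small pure-Python emulation. It is a sketch only and does not model the CUDA kernel or cuBLAS accumulation order from this PR; the helper names `to_bf16`, `to_f32`, and `mm_bf16` are hypothetical. Inputs are rounded to BF16, accumulation is done in emulated float32, and the result is either kept in float32 or rounded back to BF16.

```python
import struct


def to_f32(value):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


def to_bf16(value):
    """Round to the nearest BF16 value (ties to even); finite inputs only."""
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias before dropping 16 bits
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def mm_bf16(a, b, out_bf16=False):
    """Multiply BF16-rounded matrices with float32 accumulation."""
    out = []
    for i in range(len(a)):
        row = []
        for j in range(len(b[0])):
            acc = 0.0
            for k in range(len(b)):
                acc = to_f32(acc + to_bf16(a[i][k]) * to_bf16(b[k][j]))
            # Optionally narrow the accumulated result back to BF16.
            row.append(to_bf16(acc) if out_bf16 else acc)
        out.append(row)
    return out


a = [[1.0, 0.001953125]]  # 1 and 2**-9, both exactly representable in BF16
b = [[1.0], [1.0]]
print(mm_bf16(a, b, out_bf16=False))  # float32 output keeps the small term
print(mm_bf16(a, b, out_bf16=True))   # BF16 output rounds it away
```

With these inputs the float32 result is 1.001953125, while narrowing the output to BF16 gives 1.0, because BF16 has only 7 explicit mantissa bits.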
Does this cause a precision change?
No