[Operator Mechanism] Support mm out_dtype for BF16 CUDA by A-nnonymous · Pull Request #79252 · PaddlePaddle/Paddle

A-nnonymous · 2026-06-04T15:10:30Z

PR Category

Operator Mechanism

PR Types

New features

Description

Temporarily add a narrow CUDA BF16 x BF16 -> FP32 path for paddle.mm(out_dtype=paddle.float32), including schema, infermeta, stride dispatch, fused cuBLAS GEMM, and focused tests.
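To make the numerical intent of the BF16 x BF16 -> FP32 path concrete, here is a minimal NumPy emulation. It is only a sketch of the arithmetic, not the PR's cuBLAS kernel: the helper names `to_bf16`, `mm_bf16_to_fp32`, and `mm_bf16_to_bf16` are hypothetical, BF16 is emulated by rounding finite FP32 values to the nearest BF16 value, and the accumulation order will differ from a real GEMM.

```python
import numpy as np


def to_bf16(x):
    """Round finite float32 values to the nearest bfloat16 (ties to even).

    The result is returned as float32 so NumPy can compute with it.
    """
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    # Bias so that clearing the low 16 bits rounds to nearest, ties to even.
    bias = ((bits >> 16) & 1) + np.uint32(0x7FFF)
    rounded = (bits + bias) & np.uint32(0xFFFF0000)
    return rounded.view(np.float32)


def mm_bf16_to_fp32(x, y):
    """BF16 inputs, FP32 accumulation, FP32 output (the out_dtype=float32 idea)."""
    return to_bf16(x) @ to_bf16(y)


def mm_bf16_to_bf16(x, y):
    """BF16 inputs, FP32 accumulation, output rounded back to BF16."""
    return to_bf16(to_bf16(x) @ to_bf16(y))


# Both inputs are exactly representable in BF16.
x = np.array([[1.0, 1.0]], dtype=np.float32)
y = np.array([[1.0], [2.0 ** -10]], dtype=np.float32)

fp32_out = mm_bf16_to_fp32(x, y)
bf16_out = mm_bf16_to_bf16(x, y)
print(fp32_out, bf16_out)
```

In this example the FP32 output keeps the exact sum 1 + 2^-10, while rounding the output back to BF16 collapses it to 1.0, which is the precision an FP32 output dtype preserves.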

pcard-91067

Does this cause precision changes?

No

Add a narrow CUDA BF16 x BF16 -> FP32 path for paddle.mm(out_dtype=paddle.float32), including schema, infermeta, stride dispatch, fused cuBLAS GEMM, and focused tests. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

PaddlePaddle-bot · 2026-06-04T20:25:42Z

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-06-09 02:53:51 UTC+08:00

CI report generated from the following code (updated every 30 minutes):
PR commit: 11579f8 | Merge base: 8df3d8a (branch: develop)

1 Required jobs: 41/48 passed

Total runs (reruns)	Total jobs	✅ Passed	❌ Failed	⏳ Running	⏸️ Pending	Skipped
89(9)	80	73	3	1	0	0

Job	Error type	Confidence	Log
`Static-Check / Test`	PR issue: API approval not updated	High	Job
`Coverage test`	PR issue: insufficient coverage of new code	High	Job
`Check approval`	Approval required	High	Job

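2 Failure details

🔴 Static-Check / Test: PR issue (confidence: high)

Error type: PR issue | Confidence: High
Analyzer: generic analysis (fallback)
Failed case: API approval check

Case	Error summary
`tools/check_api_approvals.sh`	The public signature of `paddle.Tensor.mm` gained `out_dtype`, but the API approval was not updated

Key log: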

API Difference is:
- paddle.Tensor.mm(... kwonlyargs=['out'] ...)
+ paddle.Tensor.mm(... kwonlyargs=['out_dtype', 'out'] ...)
##[error]Process completed with exit code 6.

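Root cause summary: the paddle.Tensor.mm API signature change has not been approved
The PR adds an out_dtype keyword to mm in python/paddle/tensor/math.py, which changes the API signature hash. The static check requires the API approval to be updated together with the change; otherwise the job fails as an unapproved public API change.

Suggested fix:

Update the corresponding approval baseline following the Paddle API approval process, or, if the public signature change does not need to be kept, adjust the implementation.
Related file: python/paddle/tensor/math.py.

Related change: python/paddle/tensor/math.py adds the out_dtype parameter.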
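The signature difference in the log above can be reproduced with the standard `inspect` module. The sketch below is illustrative only: `mm_before` and `mm_after` are hypothetical stand-ins (their positional parameter names are assumptions, since the log elides them), and `signature_fingerprint` is a made-up helper, not the logic of `tools/check_api_approvals.sh`.

```python
import hashlib
import inspect


def mm_before(input, mat2, *, out=None):
    """Hypothetical stand-in for the signature before the PR."""


def mm_after(input, mat2, *, out_dtype=None, out=None):
    """Hypothetical stand-in for the signature after the PR."""


def signature_fingerprint(func):
    """Hash the argument spec so that any signature change alters the digest."""
    spec = inspect.getfullargspec(func)
    text = (
        f"args={spec.args} "
        f"kwonlyargs={spec.kwonlyargs} "
        f"kwonlydefaults={spec.kwonlydefaults}"
    )
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


before_kwonly = inspect.getfullargspec(mm_before).kwonlyargs
after_kwonly = inspect.getfullargspec(mm_after).kwonlyargs
changed = signature_fingerprint(mm_before) != signature_fingerprint(mm_after)
print(before_kwonly, after_kwonly, changed)
```

Adding a keyword-only parameter is backward compatible for callers, but it still changes the recorded signature, which is why a check of this kind asks for an explicit approval.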

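🔴 Coverage test: PR issue (confidence: high)

Error type: PR issue | Confidence: High
Analyzer: generic analysis (fallback)
Failed case: coverage gate

Case	Error summary
`Assert Diff Coverage`, `Assert Python Diff Coverage`	C++ diff coverage is 0.0% and Python diff coverage is 64.7%, both below the 90% threshold

Key log: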

Assert Diff Coverage
expected >= 90.0 %, actual 0.0 %, failed
Assert Python Diff Coverage
expected >= 90.0 %, actual 64.7 %, failed
Coverage check failed, unit tests have all passed, please do not rerun

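Root cause summary: insufficient coverage of the new mm out_dtype branches
The PR adds MmOutDtypeInferMeta, the mm_out_dtype CUDA kernel path, and the out_dtype branch in python/paddle/tensor/math.py, but the coverage environment does not exercise these new lines. The test_enable_cinn_kernel_cache case in the log failed at first and passed on rerun; the final failure is the coverage diff gate, not that CINN case.

Suggested fix:

Add coverage cases for the out_dtype parameter validation branches in python/paddle/tensor/math.py that do not depend on the SM80 BF16 success path.
Add coverage that can run in the coverage environment for the new C++ logic in paddle/phi/infermeta/binary.cc, such as MmOutDtypeInferMeta.
Related files: paddle/phi/infermeta/binary.cc, python/paddle/tensor/math.py, test/legacy_test/test_mm_out.py.

Related change: adds the mm BF16 -> FP32 out_dtype schema, infermeta, CUDA kernel, and Python API branch.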
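The first suggestion, covering validation branches without needing an SM80 GPU, could look roughly like the following. This is a sketch under stated assumptions: `check_out_dtype` is a hypothetical pure-Python validator standing in for the real checks in python/paddle/tensor/math.py, dtypes are represented as plain strings, and raising `ValueError` is an assumed error type rather than the PR's actual behavior.

```python
import unittest

# Assumed: the only supported explicit combination is BF16 x BF16 -> FP32.
SUPPORTED = {("bfloat16", "bfloat16", "float32")}


def check_out_dtype(x_dtype, y_dtype, out_dtype):
    """Hypothetical validator; returns the output dtype to use, or None."""
    if out_dtype is None:
        # out_dtype unset: keep the existing mm path untouched.
        return None
    if (x_dtype, y_dtype, out_dtype) not in SUPPORTED:
        raise ValueError(
            f"unsupported out_dtype combination: "
            f"{x_dtype} x {y_dtype} -> {out_dtype}"
        )
    return out_dtype


class TestOutDtypeValidation(unittest.TestCase):
    """Exercises only argument validation, so no GPU is required."""

    def test_unset_out_dtype_keeps_legacy_path(self):
        self.assertIsNone(check_out_dtype("float32", "float32", None))

    def test_supported_combination_is_accepted(self):
        self.assertEqual(
            check_out_dtype("bfloat16", "bfloat16", "float32"), "float32"
        )

    def test_non_bf16_inputs_are_rejected(self):
        with self.assertRaises(ValueError):
            check_out_dtype("float16", "float16", "float32")

    def test_unsupported_output_dtype_is_rejected(self):
        with self.assertRaises(ValueError):
            check_out_dtype("bfloat16", "bfloat16", "float16")
```

Tests of this shape run on any coverage machine because they never launch a kernel; they can be collected with `python -m unittest`.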

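🔴 Check approval: approval required (confidence: high)

Error type: approval required | Confidence: High
Analyzer: builtin
Failed case: approval check

Case	Error summary
`Approval`	This job requires manual approval; CI continues only after the approval is granted

Key log: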

Process completed with exit code 6.

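Root cause summary: CI requires manual approval
This job is blocked on approval; it is not a compilation or test failure.

Suggested fix:

Obtain the manual approval.

Related change: none.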

Use the canonical matmul path for static mm out_dtype handling and keep legacy compatibility attrs limited to supported types. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Keep matmul compatible with unknown symbolic dimensions and legacy matmul_v2 to PIR translation when out_dtype is unset. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Preserve the legacy static mm path when out_dtype is unset and avoid rejecting unknown symbolic matmul dimensions during InferMeta. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Allow the explicit static out_dtype path to pass BF16 variables through Python validation and feed BF16 static test data using the existing uint16 encoding helper. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Add missing default/propagated out_dtype handling for legacy matmul translation, PIR serialization compatibility, and handwritten PIR/DRR matmul rewrites. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Avoid fusing explicit matmul out_dtype paths in PIR rewrite passes, document BF16 GEMM lda/ldb narrowing safety, and add a legacy matmul_v2 translator regression. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Route static mm out_dtype through matmul_v2 so it reaches the phi matmul kernel, preserve user-provided out tensors, and let legacy matmul_v2 fusion pass compatibility accept only missing/default out_dtype while rejecting explicit output dtype paths. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

A-nnonymous · 2026-06-08T05:36:03Z

/re-run all-failed

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

PaddlePaddle-bot

🤖 Paddle-CI-Agent | pr_review | 2026-06-08 22:39:26

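📋 Review summary

PR overview: adds a CUDA BF16 x BF16 -> FP32 forward path for paddle.mm(out_dtype=paddle.float32).
Scope of changes: paddle.mm Python API, dygraph op YAML, InferMeta, GPU matmul kernel, cuBLAS BF16 GEMM, legacy unit tests.
Impact tags: [Operator Mechanism] [User Experience]

Issues

Severity	File	Summary
🔴 Bug	`paddle/phi/ops/yaml/inconsistent/dygraph_ops.yaml:297`	The `paddle.mm(out_dtype=...)` branch does not generate a backward op, so gradients are lost in training scenarios

Status of previous findings

Finding	Issue	Status
F1	`paddle.mm(..., out_dtype=paddle.float32)` fell through to `matmul_v2` in the legacy static graph	✅ Fixed

📝 PR convention check

Compliant. The PR Category, PR Types, Description, and precision-change fields are all filled in.

Overall assessment

The forward CUDA BF16 GEMM path, InferMeta, and registration chain are largely consistent, and the earlier issue of the static graph wrongly falling through to matmul_v2 has been closed with an explicit NotImplementedError. However, the current public API branch loses the first-order gradient support that mm already had. Before merging, either add the backward or explicitly reject calls that require gradients at the Python entry point.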
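For reference, the first-order gradients of a plain matrix product, which any backward for this path would need to reproduce, are grad_x = grad_out @ y^T and grad_y = x^T @ grad_out. The NumPy sketch below checks those formulas against finite differences in float64. It is not the PR's backward: `mm_backward` and `numeric_grad` are hypothetical helpers, and the question of which dtype the gradients should use in the BF16 -> FP32 case is left open, since the source does not state it.

```python
import numpy as np


def mm_backward(x, y, grad_out):
    """Analytic gradients of out = x @ y with respect to x and y."""
    grad_x = grad_out @ y.T
    grad_y = x.T @ grad_out
    return grad_x, grad_y


def numeric_grad(loss_fn, a, eps=1e-6):
    """Central-difference gradient of loss_fn() with respect to array a."""
    grad = np.zeros_like(a)
    for idx in np.ndindex(a.shape):
        original = a[idx]
        a[idx] = original + eps
        loss_hi = loss_fn()
        a[idx] = original - eps
        loss_lo = loss_fn()
        a[idx] = original  # restore the entry before moving on
        grad[idx] = (loss_hi - loss_lo) / (2 * eps)
    return grad


rng = np.random.default_rng(0)
x = rng.standard_normal((3, 4))
y = rng.standard_normal((4, 2))
grad_out = rng.standard_normal((3, 2))


def loss():
    # Scalar loss whose gradient with respect to (x @ y) is exactly grad_out.
    return float(np.sum(grad_out * (x @ y)))


grad_x, grad_y = mm_backward(x, y, grad_out)
num_grad_x = numeric_grad(loss, x)
num_grad_y = numeric_grad(loss, y)
```

The analytic and numeric gradients agree to within floating-point error, which is the property a gradient test for the new branch would assert.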

codecov-commenter · 2026-06-08T16:21:38Z

Codecov Report

❌ Patch coverage is 32.35294% with 23 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@8df3d8a). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
paddle/phi/infermeta/binary.cc	0.00%	17 Missing ⚠️
python/paddle/tensor/math.py	64.70%	6 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             develop   #79252   +/-   ##
==========================================
  Coverage           ?   32.35%           
==========================================
  Files              ?        2           
  Lines              ?       34           
  Branches           ?        0           
==========================================
  Hits               ?       11           
  Misses             ?       23           
  Partials           ?        0
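The figures in this report are mutually consistent, which a few lines of arithmetic confirm. The per-file hit counts below are derived from the report (34 lines in total, 23 missing, 17 of them in binary.cc at 0%), and the helper name `patch_coverage` is hypothetical.

```python
def patch_coverage(hits, missing):
    """Percentage of changed lines that were executed by tests."""
    return 100.0 * hits / (hits + missing)


# Counts derived from the report: binary.cc is fully uncovered (17 lines),
# so math.py accounts for the remaining 17 lines, 6 of which are missing.
files = {
    "paddle/phi/infermeta/binary.cc": {"hits": 0, "missing": 17},
    "python/paddle/tensor/math.py": {"hits": 11, "missing": 6},
}

total_hits = sum(f["hits"] for f in files.values())
total_missing = sum(f["missing"] for f in files.values())

overall = patch_coverage(total_hits, total_missing)
per_file = {
    name: patch_coverage(f["hits"], f["missing"]) for name, f in files.items()
}
print(overall, per_file)
```

The overall value matches the reported 32.35294% patch coverage, and both per-file values are below the 90% threshold enforced by the coverage job.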


XiaoguangHu01

LGTM

wanghuancoder

LGTM

sneaxiy

LGTM

risemeup1111 · 2026-06-09T06:24:42Z

✅ Cherry-pick successful! Created PR: #79282

[Operator Mechanism] Support mm out_dtype for BF16 CUDA

47c173e

Add a narrow CUDA BF16 x BF16 -> FP32 path for paddle.mm(out_dtype=paddle.float32), including schema, infermeta, stride dispatch, fused cuBLAS GEMM, and focused tests. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

A-nnonymous requested review from ForFishes, From00, LiYuRio and zhangbo9674 as code owners June 4, 2026 15:10