[Performance] Avoid repeat_interleave GPU sync when output_size is provided #79281
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review |
2026-06-08 23:15:08
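📋 Review Summary
PR overview: Optimizes `repeat_interleave(output_size=...)` for GPU tensor repeats so that the forward and backward passes no longer compute the index length through a D2H copy.
Scope of changes: paddle/phi/kernels/funcs/, GPU repeat_interleave forward and backward kernels, CUDA Graph regression test.
Impact tags: OP GPU
Issues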
| Level | File | Summary |
|---|---|---|
| 🔴 Bug | paddle/phi/kernels/funcs/repeat_tensor2index_tensor.cu:98 | With empty repeats, the newly added device-side validation is skipped, so an incorrect positive output_size is silently accepted |
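📝 PR Convention Check
Compliant. The title uses a valid tag, the description contains the four required sections, and the precision-change field is "No".
Overall Assessment
The optimization direction is sound, but the new validation does not cover the `num_reps == 0` early-return path, so the GPU path behaves differently from the CPU and old GPU paths when `output_size` is invalid. This should be fixed before merging.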
    cudaStreamSynchronize(stream);
    int64_t total_size =
        static_cast<int64_t>(last_prefix) + static_cast<int64_t>(last_repeat);
    int64_t total_size = output_size;
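🔴 Bug: A positive `output_size` still needs to be validated when `num_reps == 0`.
In the `if (num_reps == 0)` branch above, this functor calls `Resize({0})` and returns immediately, so the newly added `ValidateOutputSizeKernel` never runs on that path. The forward GPU wrapper in this PR also removes the original host-side `PADDLE_ENFORCE_EQ(output_size, index.dims()[0])`. As a result, `paddle.repeat_interleave(paddle.empty([0]).cuda(), paddle.empty([0], dtype='int32').cuda(), axis=0, output_size=1)` silently returns an empty output, whereas the CPU and old GPU paths raise an error because `output_size != sum(repeats)`.
Suggested fix: keep the consistency check before the `num_reps == 0` early return, for example by raising InvalidArgument directly when `output_size > 0`. Also add a regression test that covers the GPU case with empty input, empty repeats, and an incorrect positive `output_size`.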
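The control flow behind the suggested fix can be sketched with a small host-side model. This is an illustrative Python sketch only, not Paddle's kernel code: `resolve_index_length` and its parameters are hypothetical names, and `ValueError` stands in for the InvalidArgument error mentioned above. It shows the consistency check running before the early return for empty repeats.

```python
from typing import Sequence


def resolve_index_length(repeats: Sequence[int], output_size: int = -1) -> int:
    """Return the expanded index length for the given repeats.

    Hypothetical model of the reviewed control flow; output_size == -1
    means "not provided by the caller".
    """
    num_reps = len(repeats)

    if num_reps == 0:
        # Early-return path: validate first, so a wrong positive
        # output_size is rejected instead of being silently accepted.
        if output_size > 0:
            raise ValueError(
                f"output_size ({output_size}) must equal sum(repeats) (0)"
            )
        return 0

    total = sum(repeats)
    # Non-empty path: a provided output_size must match sum(repeats).
    if output_size != -1 and output_size != total:
        raise ValueError(
            f"output_size ({output_size}) must equal sum(repeats) ({total})"
        )
    return total
```

In this model, `resolve_index_length([], 1)` raises, which matches the behavior the review expects from the CPU and old GPU paths.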
@PaddlePaddle-bot fixed. The GPU Updated in commit
PR Category
Performance Optimization
PR Types
Performance
Description
This PR optimizes the tensor-repeats GPU path of `repeat_interleave` when `output_size` is already provided.
Previously, the GPU implementation still materialized the expanded index length by copying the last prefix-sum value back to host and synchronizing the stream, even when `output_size` had already been supplied by the caller. This introduced unnecessary GPU-to-CPU synchronization in both forward and backward, and also made the `output_size` fast path less CUDA-Graph-friendly than expected.
This PR updates the related helper and GPU kernels so that:
- when `output_size` is provided, the expanded index tensor is allocated directly from that size
- the `sum(repeats) == output_size` check is performed on device
- […] `output_size == -1` path
In addition, a CUDA Graph regression test is added for `repeat_interleave(..., output_size=...)` with tensor-typed `repeats`.
Benchmark
The benchmark was run under the same containerized build environment, comparing the patched implementation against the pre-change implementation rebuilt from the same workspace. This is not a `main/develop` branch comparison.
For `repeat_interleave(repeats=<Tensor>, output_size=...)`:
- `[128, 4096]`: forward `0.0360 ms -> 0.0250 ms` (`30.6%` faster), backward `0.1762 ms -> 0.1121 ms` (`36.4%` faster)
- `[512, 4096]`: forward `0.0612 ms -> 0.0483 ms` (`21.1%` faster), backward `0.3734 ms -> 0.3399 ms` (`9.0%` faster)
For the same benchmark setup, CUDA Graph capture succeeds after this change, while the pre-change implementation fails on the same `output_size` path.
Does this cause a precision change?
No
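As a rough illustration of the `output_size` fast path described in the Description, the following pure-Python sketch expands a repeats vector into a gather index using a prefix sum. It is a simplified host-side model under stated assumptions, not the actual GPU kernel: `repeats_to_index` is a hypothetical name, repeats are assumed non-negative, and the equality check that the PR performs on device is modeled here as an ordinary comparison.

```python
from itertools import accumulate
from typing import List, Sequence


def repeats_to_index(repeats: Sequence[int], output_size: int = -1) -> List[int]:
    """Expand repeats into a gather index, e.g. [2, 0, 3] -> [0, 0, 2, 2, 2]."""
    # Inclusive prefix sum: prefix[i] is the end offset of element i's run.
    prefix = list(accumulate(repeats))
    total = prefix[-1] if prefix else 0

    if output_size == -1:
        # No size supplied: the length has to be read from the prefix sum.
        # On a GPU this readback is the device-to-host copy plus sync.
        size = total
    else:
        # Size supplied: the index buffer is sized from output_size alone.
        # The comparison below models the check that runs on device.
        if output_size != total:
            raise ValueError(
                f"output_size ({output_size}) must equal sum(repeats) ({total})"
            )
        size = output_size

    index = [0] * size
    start = 0
    for i, end in enumerate(prefix):
        # Positions [start, end) all gather from source element i.
        for pos in range(start, end):
            index[pos] = i
        start = end
    return index
```

The point of the model is that, when `output_size` is given, the buffer size no longer depends on a value computed from `repeats`, which is what removes the need for a host readback in the real implementation.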