[Performance] Avoid repeat_interleave GPU sync when output_size is provided #79281

Open
Sebastian-dong wants to merge 2 commits into PaddlePaddle:develop from Sebastian-dong:perf/repeat-interleave-output-size-no-sync

@Sebastian-dong

PR Category

Performance Optimization

PR Types

Performance

Description

This PR optimizes the tensor-repeats GPU path of repeat_interleave when output_size is already provided.

Previously, the GPU implementation still materialized the expanded index length by copying the last prefix-sum value back to host and synchronizing the stream, even when output_size had already been supplied by the caller. This introduced unnecessary GPU-to-CPU synchronization in both forward and backward, and also made the output_size fast path less CUDA-Graph-friendly than expected.

This PR updates the related helper and GPU kernels so that:

  • when output_size is provided, the expanded index tensor is allocated directly from that size
  • the sum(repeats) == output_size check is performed on device
  • the old host-sync fallback is kept only for the output_size == -1 path
  • the same optimization is wired into both forward and backward GPU paths
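
The two paths in the list above can be modeled on the host in plain Python. This is only a sketch of the logic, not the Paddle kernel: `expand_index` is a hypothetical name, the prefix sum is computed with `itertools.accumulate`, and the `sum(repeats) == output_size` check runs eagerly here, whereas the PR performs it on device.

```python
from itertools import accumulate


def expand_index(repeats, output_size=-1):
    """Host-side model of expanding `repeats` into a gather index.

    Hypothetical helper for illustration; not Paddle's implementation.
    """
    # Exclusive prefix sum: prefix[i] is where the copies of element i start.
    inclusive = list(accumulate(repeats))
    prefix = [0] + inclusive[:-1] if repeats else []

    if output_size == -1:
        # Fallback path: the length has to be read back from the data.
        # On GPU this is the step that needs a device-to-host copy.
        total = prefix[-1] + repeats[-1] if repeats else 0
    else:
        # Fast path: the caller supplied the length, so allocate from it
        # and only verify that it is consistent with repeats.
        if sum(repeats) != output_size:
            raise ValueError("output_size must equal sum(repeats)")
        total = output_size

    index = [0] * total
    for i, (start, count) in enumerate(zip(prefix, repeats)):
        for j in range(count):
            index[start + j] = i
    return index
```

In this model, `expand_index([2, 0, 3])` and `expand_index([2, 0, 3], output_size=5)` produce the same index; the only difference is where the length comes from.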

In addition, a CUDA Graph regression test is added for repeat_interleave(..., output_size=...) with tensor-typed repeats.

Benchmark

Benchmark was run under the same containerized build environment, comparing the patched implementation against the pre-change implementation rebuilt from the same workspace. This is not a main/develop branch comparison.

For repeat_interleave(repeats=<Tensor>, output_size=...):

  • shape [128, 4096]: forward 0.0360 ms -> 0.0250 ms (30.6% faster), backward 0.1762 ms -> 0.1121 ms (36.4% faster)
  • shape [512, 4096]: forward 0.0612 ms -> 0.0483 ms (21.1% faster), backward 0.3734 ms -> 0.3399 ms (9.0% faster)

For the same benchmark setup, CUDA Graph capture succeeds after this change, while the pre-change implementation fails on the same output_size path.
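
The "% faster" figures above appear to be latency reductions, that is, (before - after) / before, rather than throughput speedups. A short check under that assumption, using only the timings quoted in the list (`latency_reduction_pct` is a name introduced here for illustration):

```python
def latency_reduction_pct(before_ms, after_ms):
    """Percent reduction in latency: (before - after) / before."""
    return round(100.0 * (before_ms - after_ms) / before_ms, 1)


# (before, after) timings in ms, copied from the benchmark list.
timings = {
    "[128, 4096] forward": (0.0360, 0.0250),
    "[128, 4096] backward": (0.1762, 0.1121),
    "[512, 4096] forward": (0.0612, 0.0483),
    "[512, 4096] backward": (0.3734, 0.3399),
}

reductions = {
    name: latency_reduction_pct(before, after)
    for name, (before, after) in timings.items()
}
```

Under this definition the computed values agree with the quoted 30.6%, 36.4%, 21.1%, and 9.0%.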

Does this cause precision changes?

@CLAassistant

CLAassistant commented Jun 8, 2026

CLA assistant check
All committers have signed the CLA.

@PaddlePaddle-bot PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-06-08 23:15:08

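📋 Review summary

PR overview: Optimizes repeat_interleave(output_size=...) for GPU tensor repeats so that forward and backward no longer compute the index length through a D2H copy.
Scope of changes: paddle/phi/kernels/funcs/, GPU repeat_interleave forward and backward, CUDA Graph regression test.
Impact tags: OP, GPU

Issues

Severity | File | Summary
🔴 Bug | paddle/phi/kernels/funcs/repeat_tensor2index_tensor.cu:98 | With empty repeats the new device-side validation is skipped, so an incorrect positive output_size is silently accepted

📝 PR convention check

Compliant. The title uses a valid tag, the description contains the four required sections, and the precision-change field is "No".

Overall assessment

The direction of the optimization is sound, but the new validation does not cover the num_reps == 0 early-return path, so the GPU path behaves differently from the CPU and old GPU paths on an invalid output_size. Recommend fixing this before merging.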

cudaStreamSynchronize(stream);
int64_t total_size =
static_cast<int64_t>(last_prefix) + static_cast<int64_t>(last_repeat);
int64_t total_size = output_size;

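🔴 Bug: a positive output_size still needs to be validated when num_reps == 0

This functor takes the if (num_reps == 0) branch above, calls Resize({0}), and returns immediately, so the newly added ValidateOutputSizeKernel never runs. This PR also removes the original host-side PADDLE_ENFORCE_EQ(output_size, index.dims()[0]) from the forward GPU wrapper. As a result, paddle.repeat_interleave(paddle.empty([0]).cuda(), paddle.empty([0], dtype='int32').cuda(), axis=0, output_size=1) silently returns an empty output, whereas the CPU and old GPU paths raise an error because output_size != sum(repeats).

Suggested fix: keep a consistency check before the num_reps == 0 early return, for example by raising InvalidArgument directly when output_size > 0. Also add a regression test for the GPU case with empty input, empty repeats, and an incorrect positive output_size.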
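
The control-flow problem can be illustrated with a small host-side Python model. Both functions below are hypothetical stand-ins for the functor's length computation, not the actual Paddle code; `ValueError` stands in for InvalidArgument, and the model assumes output_size is either -1 (not provided) or non-negative.

```python
def index_length_early_return(repeats, output_size=-1):
    """Models the reported behavior: the empty case returns before any check."""
    if len(repeats) == 0:
        return 0  # validation below is never reached
    if output_size != -1 and sum(repeats) != output_size:
        raise ValueError("output_size must equal sum(repeats)")
    return sum(repeats) if output_size == -1 else output_size


def index_length_checked(repeats, output_size=-1):
    """Models the suggested fix: validate before the empty early return."""
    if len(repeats) == 0:
        if output_size > 0:
            # sum of an empty repeats is 0, so any positive size is wrong.
            raise ValueError("output_size must equal sum(repeats)")
        return 0
    if output_size != -1 and sum(repeats) != output_size:
        raise ValueError("output_size must equal sum(repeats)")
    return sum(repeats) if output_size == -1 else output_size
```

With empty repeats and output_size=1, the first function returns 0 without complaint, while the second raises, which matches the CPU behavior described above.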

@Sebastian-dong

Author

@PaddlePaddle-bot fixed.

The GPU num_reps == 0 early-return path now validates positive output_size consistently, and a CUDA regression test for the zero-repeats invalid output_size case has been added.

Updated in commit 8949f8e384.

@paddle-bot paddle-bot Bot added the contributor External developers label Jun 8, 2026