[Performance] Avoid repeat_interleave GPU sync when output_size is provided by Sebastian-dong · Pull Request #79281 · PaddlePaddle/Paddle

Sebastian-dong · 2026-06-08T13:39:49Z

PR Category

Performance Optimization

PR Types

Performance

Description

This PR optimizes the tensor-repeats GPU path of repeat_interleave when output_size is already provided.

Previously, the GPU implementation still computed the expanded index length by copying the last prefix-sum value back to the host and synchronizing the stream, even when the caller had already supplied output_size. This introduced an unnecessary GPU-to-CPU synchronization in both the forward and backward passes, and it also made the output_size fast path less CUDA-Graph-friendly than expected.

This PR updates the related helper and GPU kernels so that:

when output_size is provided, the expanded index tensor is allocated directly from that size
the sum(repeats) == output_size check is performed on device
the old host-sync fallback is kept only for the output_size == -1 path
the same optimization is wired into both forward and backward GPU paths
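
As a rough illustration of the two paths listed above, here is a minimal pure-Python model (not the Paddle kernel). The function name expand_index and the use of ValueError are hypothetical and exist only for this sketch; in the PR the sum(repeats) == output_size check runs on the device rather than as a host-side exception.

```python
from itertools import accumulate


def expand_index(repeats, output_size=-1):
    """Model of building the expanded index for tensor-typed repeats.

    Hypothetical helper for illustration only; it mirrors the control
    flow described in the PR, not the actual CUDA implementation.
    """
    if any(r < 0 for r in repeats):
        raise ValueError("repeats must be non-negative")

    # Exclusive prefix sum: the output offset where each source row starts.
    inclusive = list(accumulate(repeats))
    offsets = [0] + inclusive[:-1] if inclusive else []

    if output_size == -1:
        # Fallback path: the length is only known after reading
        # last_prefix + last_repeat, which on the GPU needs a
        # device-to-host copy and a stream synchronization.
        total = offsets[-1] + repeats[-1] if repeats else 0
    else:
        # Fast path: the caller already told us the length, so the
        # index buffer can be sized without reading anything back.
        total = output_size
        # Modeled here as a host check; the PR performs it on device.
        if sum(repeats) != output_size:
            raise ValueError("output_size must equal sum(repeats)")

    # Scatter each source row id into its slice of the index buffer.
    index = [0] * total
    for src, (start, count) in enumerate(zip(offsets, repeats)):
        for k in range(count):
            index[start + k] = src
    return index
```

In this model, expand_index([2, 0, 3], output_size=5) and expand_index([2, 0, 3]) both return [0, 0, 2, 2, 2]; only the second call takes the branch that corresponds to the host synchronization.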

In addition, a CUDA Graph regression test is added for repeat_interleave(..., output_size=...) with tensor-typed repeats.

Benchmark

The benchmark was run in a single containerized build environment, comparing the patched implementation against the pre-change implementation rebuilt from the same workspace. It is not a comparison against the main/develop branch.

For repeat_interleave(repeats=<Tensor>, output_size=...):

shape [128, 4096]: forward 0.0360 ms -> 0.0250 ms (30.6% faster), backward 0.1762 ms -> 0.1121 ms (36.4% faster)
shape [512, 4096]: forward 0.0612 ms -> 0.0483 ms (21.1% faster), backward 0.3734 ms -> 0.3399 ms (9.0% faster)
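
The percentages above are consistent with a relative reduction in latency, (before - after) / before. The short check below recomputes them from the reported timings; the helper name percent_faster is an assumption made for this sketch and is not part of the PR's benchmark script.

```python
def percent_faster(before_ms, after_ms):
    """Relative latency reduction, expressed as a percentage."""
    return (before_ms - after_ms) / before_ms * 100.0


# (before, after) timings in milliseconds, copied from the figures above.
timings = {
    ("[128, 4096]", "forward"): (0.0360, 0.0250),
    ("[128, 4096]", "backward"): (0.1762, 0.1121),
    ("[512, 4096]", "forward"): (0.0612, 0.0483),
    ("[512, 4096]", "backward"): (0.3734, 0.3399),
}

for (shape, direction), (before, after) in timings.items():
    gain = percent_faster(before, after)
    print(f"shape {shape} {direction}: {gain:.1f}% faster")
```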

For the same benchmark setup, CUDA Graph capture succeeds after this change, while the pre-change implementation fails on the same output_size path.

Does this change affect numerical precision?

No

CLAassistant · 2026-06-08T13:40:01Z

All committers have signed the CLA.

PaddlePaddle-bot

🤖 Paddle-CI-Agent | pr_review | 2026-06-08 23:15:08

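📋 Review Summary

PR overview: Optimizes repeat_interleave(output_size=...) for GPU tensor repeats so that the forward and backward passes no longer compute the index length through a D2H copy.
Scope of changes: paddle/phi/kernels/funcs/, the GPU repeat_interleave forward and backward kernels, and a CUDA Graph regression test.
Impact tags: OP GPU

Issues

Severity	File	Summary
🔴 Bug	`paddle/phi/kernels/funcs/repeat_tensor2index_tensor.cu:98`	With empty repeats, the new device-side validation is skipped, so an incorrect positive output_size is silently accepted

📝 PR Convention Check

Compliant. The title uses a valid tag, the description contains the four required sections, and the precision-change field is "No".

Overall Assessment

The optimization direction is sound, but the new validation does not cover the num_reps == 0 early-return path, so the GPU behaves differently from the CPU and the old GPU path when output_size is invalid. Recommend fixing this before merging.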

PaddlePaddle-bot · 2026-06-08T15:16:52Z

-  cudaStreamSynchronize(stream);
-  int64_t total_size =
-      static_cast<int64_t>(last_prefix) + static_cast<int64_t>(last_repeat);
+  int64_t total_size = output_size;


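🔴 Bug: a positive output_size still needs to be validated when num_reps == 0.

This functor hits the if (num_reps == 0) branch above, calls Resize({0}), and returns immediately, so the newly added ValidateOutputSizeKernel never runs. This PR also removes the original host-side PADDLE_ENFORCE_EQ(output_size, index.dims()[0]) from the forward GPU wrapper. As a result, paddle.repeat_interleave(paddle.empty([0]).cuda(), paddle.empty([0], dtype='int32').cuda(), axis=0, output_size=1) silently produces an empty output, whereas the CPU and old GPU paths raise an error because output_size != sum(repeats).

Suggested fix: keep a consistency check before the num_reps == 0 early return, for example by raising InvalidArgument directly when output_size > 0. Also add a regression test covering empty GPU input, empty repeats, and an incorrect positive output_size.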
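
The control flow behind this review comment can be sketched in plain Python. This is only a model of the suggested fix, under the assumption that an empty repeats tensor implies sum(repeats) == 0; the function name check_output_size and the ValueError are hypothetical stand-ins for the functor's early return and Paddle's InvalidArgument error.

```python
def check_output_size(repeats, output_size):
    """Return the expanded length, rejecting an inconsistent output_size.

    Hypothetical model of the validation order suggested in the review,
    not the code from the PR.
    """
    num_reps = len(repeats)
    if num_reps == 0:
        # Early-return path: nothing is launched on the device here, so a
        # device-side check placed later would never see this case. With no
        # repeats the true length is 0, and a positive output_size is wrong.
        if output_size > 0:
            raise ValueError(
                "output_size must equal sum(repeats), which is 0 here"
            )
        return 0

    total = sum(repeats)
    # output_size == -1 means "not provided", so there is nothing to compare.
    if output_size != -1 and output_size != total:
        raise ValueError("output_size must equal sum(repeats)")
    return total
```

With this ordering, check_output_size([], 1) raises instead of quietly returning an empty result, which is the behavior the review asks the GPU path to share with the CPU path.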

Sebastian-dong · 2026-06-08T15:52:41Z

@PaddlePaddle-bot fixed.

The GPU num_reps == 0 early-return path now validates positive output_size consistently, and a CUDA regression test for the zero-repeats invalid output_size case has been added.

Updated in commit 8949f8e384.

[Performance] Avoid repeat_interleave GPU sync when output_size is pr…

a1a5c0c

PaddlePaddle-bot suggested changes Jun 8, 2026

[Performance] Validate zero-repeat output_size on GPU

8949f8e

paddle-bot Bot added the contributor External developers label Jun 8, 2026

Sebastian-dong wants to merge 2 commits into PaddlePaddle:develop from Sebastian-dong:perf/repeat-interleave-output-size-no-sync
