[L0] share one immediate CL per hardware queue to eliminate dispatch overhead #1293

Draft
pvelesko wants to merge 1 commit into main from fix-l0-event-hotpath
@pvelesko

Reduce Level Zero kernel dispatch overhead by sharing a single immediate command list per hardware queue instead of allocating one per submission.

@pvelesko pvelesko force-pushed the fix-l0-event-hotpath branch 2 times, most recently from 1903fa8 to 7b713bf Compare June 11, 2026 15:16
…overhead

Intel Arc B570 (numQueues=1): zeCommandListAppendLaunchKernel takes ~0.45ms
when switching between different immediate CL handles, but only ~0.013ms when
reusing the same handle. With N HIP streams each owning a private CL, launching
N kernels costs N×0.45ms = 14.4ms at N=32, far above OCL's 3.3ms.
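
As a rough check of the arithmetic above, the dispatch cost can be modeled as launches times per-launch latency. This is only a sketch: the constant and function names are hypothetical, and the two latencies are the approximate B570 figures quoted in this message, not values read from the driver.

```cpp
#include <cassert>
#include <cmath>

// Approximate per-launch latencies quoted above for Intel Arc B570 (ms).
constexpr double kSwitchLaunchMs = 0.45;  // different immediate CL handle each launch
constexpr double kReuseLaunchMs = 0.013;  // same immediate CL handle reused

// Hypothetical linear model: total dispatch time for NumLaunches kernels.
inline double dispatchCostMs(int NumLaunches, double PerLaunchMs) {
  assert(NumLaunches >= 0);
  return NumLaunches * PerLaunchMs;
}
```

With 32 launches the model gives 14.4ms when every launch switches handles and about 0.42ms when one handle is reused.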

Fix: CHIPDeviceLevel0 now maintains a SharedImmCLs_ pool keyed by
(ordinal<<32|index). All streams that map to the same hardware queue share one
CL handle via getOrCreateSharedImmCL(). On B570 (numQueues=1) all streams share
one CL, so dispatch falls to 32×0.013ms ≈ 0.4ms and perlin-hip at N=32 drops to
3.13ms (was 13.97ms), in line with OCL at 3.28ms.
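
The keyed pool could look roughly like the sketch below. It is not the code in this PR: `FakeCmdList`, `SharedImmCLPool`, and `makeQueueKey` are hypothetical names, and the fake handle stands in for a Level Zero command list so the example compiles without the Level Zero headers. Only the (ordinal<<32|index) key and the get-or-create behavior are taken from the description.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <unordered_map>

// Hypothetical stand-in for a Level Zero immediate command list handle.
struct FakeCmdList {
  uint32_t Ordinal;
  uint32_t Index;
};

// Pack the queue-group ordinal and queue index into one 64-bit key.
inline uint64_t makeQueueKey(uint32_t Ordinal, uint32_t Index) {
  return (static_cast<uint64_t>(Ordinal) << 32) | Index;
}

class SharedImmCLPool {
public:
  // Return the list for this hardware queue, creating it on first use.
  FakeCmdList *getOrCreate(uint32_t Ordinal, uint32_t Index) {
    std::lock_guard<std::mutex> Lock(Mtx_);
    auto Result =
        Pool_.emplace(makeQueueKey(Ordinal, Index), FakeCmdList{Ordinal, Index});
    if (Result.second)
      ++NumCreated_;
    FakeCmdList &CL = Result.first->second;
    // Distinct (ordinal, index) pairs never collide in the packed key.
    assert(CL.Ordinal == Ordinal && CL.Index == Index);
    return &CL; // element addresses stay valid when the map rehashes
  }

  std::size_t numCreated() const { return NumCreated_; }

private:
  std::mutex Mtx_;
  std::unordered_map<uint64_t, FakeCmdList> Pool_;
  std::size_t NumCreated_ = 0;
};
```

Two streams that ask for the same (ordinal, index) get the same pointer back, so only one list is created per hardware queue.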

On multi-queue devices (e.g. PVC numQueues=4) streams get distinct shared CLs
per slot, preserving current parallelism behavior.
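
The message does not say how streams are assigned to queue slots. The sketch below assumes a simple round-robin mapping purely for illustration; `slotForStream` is a hypothetical helper, not a function from this PR.

```cpp
#include <cassert>
#include <cstdint>

// ASSUMPTION: streams are spread across hardware queue slots round-robin.
// The real assignment policy is not stated in this PR.
inline uint32_t slotForStream(uint32_t StreamId, uint32_t NumQueues) {
  assert(NumQueues > 0);
  return StreamId % NumQueues;
}
```

Under this assumption every stream lands on slot 0 when numQueues=1, while with numQueues=4 consecutive streams cycle through slots 0 to 3 and each slot has its own shared CL.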

Also fix:
- finish() two-flag skip (IsEmptyQueue_ && CmdListInitialized_) to avoid the
  ~0.4ms zeCommandListHostSynchronize overhead on never-synced CLs
- Lazy checkEvents() in getEventFromPool(): only call when pool is exhausted,
  not on every kernel dispatch
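
The lazy reclamation in the second item can be sketched as follows. This is an illustrative model, not the PR's implementation: `FakeEvent` and `LazyEventPool` are hypothetical, and the `Done` flag stands in for querying a real event's completion status.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical event: Done models "the device has signaled this event".
struct FakeEvent {
  bool Done = false;
  bool InUse = false;
};

class LazyEventPool {
public:
  explicit LazyEventPool(std::size_t Capacity) : Events_(Capacity) {
    for (std::size_t I = 0; I < Capacity; ++I)
      Free_.push_back(I);
  }

  // Hand out a free event. Completed events are scanned for only when the
  // free list is empty, not on every call. Returns nullptr if none is free.
  FakeEvent *acquire() {
    if (Free_.empty())
      reclaimCompleted();
    if (Free_.empty())
      return nullptr;
    std::size_t Idx = Free_.back();
    Free_.pop_back();
    assert(!Events_[Idx].InUse);
    Events_[Idx].InUse = true;
    Events_[Idx].Done = false;
    return &Events_[Idx];
  }

  std::size_t numScans() const { return NumScans_; }

private:
  // Return every in-use event that has completed to the free list.
  void reclaimCompleted() {
    ++NumScans_;
    for (std::size_t I = 0; I < Events_.size(); ++I) {
      if (Events_[I].InUse && Events_[I].Done) {
        Events_[I].InUse = false;
        Free_.push_back(I);
      }
    }
  }

  std::vector<FakeEvent> Events_; // fixed size, so element pointers stay valid
  std::vector<std::size_t> Free_;
  std::size_t NumScans_ = 0;
};
```

As long as free events remain, `acquire()` never scans; the scan cost is paid only once the pool runs dry.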
@pvelesko pvelesko force-pushed the fix-l0-event-hotpath branch from 7b713bf to 3cae92f Compare June 13, 2026 15:37
@pvelesko

/run-aurora-ci
