Lower device globals to kernel arguments (no program-scope globals)#1289
Draft
pvelesko wants to merge 7 commits into
Draft
Lower device globals to kernel arguments (no program-scope globals)#1289pvelesko wants to merge 7 commits into
pvelesko wants to merge 7 commits into
Conversation
1be6e29 to
2cfa4be
Compare
Collaborator
Author
|
/run-aurora-ci |
…lobals OFF)
Drivers like rusticl/radeonsi cannot consume program-scope CrossWorkgroup
globals, so chipStar's default `__chip_var_<name>` address-holder lowering
fails there. When CHIP_ENABLE_DEVICE_PROGRAM_SCOPE_GLOBALS is OFF, a new
post-transform in HipGlobalVariables replaces each such global with an implicit
trailing kernel pointer argument carrying the global's device address:
* kernels that load `__chip_var_<G>` get a trailing pointer param per global;
the load becomes ptrtoint(param).
* a `__chip_gvararg_<kernel>` annotation (NUL-separated global names) tells the
runtime which global feeds each trailing arg; spirv.cc parses it, marks the
args SPVTypeKind::DeviceGlobal, and strips the annotation before the driver.
* the `__chip_var_init_<G>` kernel takes the storage address as an arg instead
of reading the global; the `__chip_var_bind_<G>` kernel is removed.
* setupAllArgs binds each DeviceGlobal arg to the global's DeviceVar address
via clSetKernelArgSVMPointer.
All gated; the default (flag ON) path is unchanged. Passes on rusticl for
TestConstantMemory, TestGlobalVarInit, and a minimal __device__ read.
…rg list copyKernelArgs / visitClientArgs must not surface the implicit device-global address arguments (they are runtime-provided, not HIP-client args), otherwise the graph kernel-arg capture path asserts on the unexpected arg kind.
…-node prepare - Only convert globals whose every use is a load in a user kernel or their own init kernel (or the own-bind store). Globals that are address-taken, accessed in non-kernel functions, or cross-referenced from another global's init (e.g. mutually-referencing pointers Foo=&Bar; Bar=&Foo) are left as program-scope globals — a safe fallback on drivers that support them. - Add the DeviceGlobal argument case to the Level Zero backend's setupAllArgs. - Allocate a kernel's module device variables in the graph kernel-node constructor before setupAllArgs binds the implicit device-global args (graph construction happens before any launch, which is where this normally happens). With these, the flag-OFF (globals-as-kernel-args) build passes the Level Zero unit suite except pre-existing flaky/unrelated tests.
Add a --no-psg flag to unit_tests.sh that configures the build with -DCHIP_ENABLE_DEVICE_PROGRAM_SCOPE_GLOBALS=OFF, then run the latest-LLVM native unit-test job twice: once with program-scope globals ON (default) and once OFF. The OFF build lowers user __device__/__constant__ globals to kernel arguments (issue #1279), so a single native run exercises both code paths on a PSG-capable backend (Level Zero / Arc).
…w fixes on rusticl The chipstar-rusticl section masked 22 device-symbol/global/constant tests as 'rusticl cannot consume program-scope CrossWorkgroup globals'. With the kernel-argument lowering (issue #1279, this PR), 18 of them now pass on the W6400 (verified against the rusticl/radeonsi W6400 with a PSG-OFF build and the CI's exact ICD/env). Removing those 18 makes the rusticl CI actually run and validate the fix instead of skipping it. The 4 that still fail are kept with corrected, accurate justifications: - hipTestSymbolInit / hipTestResetStaticVar: cross-referenced and function-local-static globals are not convertible to kernel arguments and fall back to program-scope globals (rusticl/Mesa panics on them). - TestAtomics: clLinkProgram -17 (ACO int64/system-scope atomics gap), not a device-global issue. - TestLargeGlobalVar: times out (>180s) on the W6400.
…rrency flaky After unmasking the device-global tests, the rusticl CI surfaced that the three Unit_hipGraphAddMemcpyNodeFromSymbol_Global* tests intermittently fail under the full -j4 suite on the W6400: the graph memcpy-to-symbol -> memcpy-from-symbol round-trip reads stale data under concurrent load. This is not a kernel-argument-lowering defect: the same PSG-OFF build passes these tests reliably in isolation on rusticl (60+ runs), and consistently on the Intel iGPU (OpenCL) and Level Zero. It is a rusticl/radeonsi driver ordering gap that only manifests under full-suite concurrency, so they are re-masked on rusticl only, with an accurate reason, and tracked as a #1279 follow-up. The other 15 previously-masked device-global tests remain unmasked and now actually gate the fix on rusticl.
…not flaky)
Root-caused the Unit_hipHostMalloc_CoherentAccess failure in the native
PSG-OFF sweep: it is deterministic, not flaky. Its kernel busy-waits on
clock64():
uint64_t start = clock64()/clkRate, cur;
do { cur = clock64()/clkRate - start; } while (cur < wait_sec);
With program-scope device globals OFF, clock64() returns 0 (the
__chip_clk_counter program-scope global is omitted -- the same reason the
clock sample is gated), so the loop never terminates: the kernel hangs and
the GPU watchdog aborts the queue with CL_OUT_OF_RESOURCES. That is exactly
why it passed the PSG-ON sweep and aborted the PSG-OFF sweep of the same run.
Fix: unit_tests.sh excludes this test only in the --no-psg run (it uses a
feature intentionally unavailable there); it still runs with PSG ON. Also
reverts the incorrect known_failures 'flaky' mask added on meatloaf.
46897e2 to
6f212ac
Compare
Collaborator
Author
|
/run-aurora-ci |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.