Platform
- Hardware: Intel Data Center GPU Max 1550 (Aurora @ ALCF / PVC)
- chipStar version: 2026.04.29
- Backend: Level Zero (CHIP_BE=LEVEL0)
- oneAPI version: 2025.3
Symptom
The first call to hipCreateTextureObject in a fresh Level Zero process returns hipErrorInvalidValue. A second process starting approximately 37 seconds later, running identical code on the same node, succeeds without error.
The error propagates as:
Binding of the texture object failed. CUDA error #1 (hipErrorInvalidValue)
The failure path (from source inspection):
hipCreateTextureObject → CHIPDeviceLevel0::createTexture → allocateImage → zeImageCreate fails → CHIPERR_CHECK_LOG_AND_THROW_TABLE throws → exception propagates → caught in CHIP_CATCH → returns hipErrorInvalidValue to caller.
Additionally, during the ~32 seconds before the crash, 7000+ warnings of the form:
CHIPEventLevel0::reset() called while event is recording
are emitted, suggesting the Level Zero event pool machinery is not fully initialized when zeImageCreate is called.
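To put a number on the warning storm, the captured output of a run can be scanned for the message. The following is a minimal sketch, assuming the job's stderr was redirected to a file; the function name count_reset_warnings and the log path are hypothetical and not part of chipStar or GROMACS.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count the event-reset warnings in a captured log file.
# Usage: count_reset_warnings <log-file>
count_reset_warnings() {
    log_file="$1"
    # -F matches the text literally (the message contains parentheses);
    # -c prints the number of matching lines.
    # grep exits non-zero when nothing matches, so "|| true" keeps the
    # helper usable under "set -e" while still printing 0.
    grep -F -c 'CHIPEventLevel0::reset() called while event is recording' "$log_file" || true
}
```

For example, `count_reset_warnings run1.err` (with run1.err standing in for wherever the first run's stderr was saved) prints the number of matching lines, which can be compared against the 7000+ figure reported here.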
Reproduction
Run any GROMACS mdrun with -nb gpu -pme gpu using the chipStar Level Zero backend inside a PBS job:
- Run 1 (first process on fresh node): crashes with segfault (triggered by hipErrorInvalidValue from hipCreateTextureObject)
- Run 2 (second process on same node, started ~37s after run 1 exits): succeeds with identical binary and inputs
The only difference between the two runs is invocation order and timing (first vs. second process on the node); the binary and inputs are identical.
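When reproducing this, it helps to record the exit status and wall-clock duration of each run so the two invocations can be compared directly. The sketch below is one way to do that; run_and_report is a hypothetical helper name, and the command passed to it is whatever full mdrun command line the site uses.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command, then print its exit status and duration.
# Usage: run_and_report <label> <command> [args...]
run_and_report() {
    label="$1"
    shift
    start=$(date +%s)
    status=0
    # Capture the exit status without aborting the script on failure.
    "$@" || status=$?
    end=$(date +%s)
    echo "$label exit=$status elapsed=$((end - start))s"
}
```

Calling it twice in the same PBS job, once per mdrun invocation with labels such as run1 and run2, gives one summary line per run that can be pasted into a report.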
Workaround
Pre-warm the GPU with a short run using -pme cpu (which avoids calling hipCreateTextureObject) before the real GPU-PME run. After warmup completes, zeImageCreate succeeds in the subsequent process.
Example warmup step in a PBS script:
# Warmup: 10-step run using CPU PME (no hipCreateTextureObject)
mpiexec ... mdrun -nb gpu -pme cpu -nsteps 10 ...
# Real run: GPU PME now succeeds
mpiexec ... mdrun -nb gpu -pme gpu ...
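The two-step pattern above can be wrapped so that the real run only starts if the warmup finished cleanly. This is a minimal sketch under that assumption; run_with_warmup is a hypothetical helper, and both command strings must be replaced with the site's complete command lines (the elided parts of the mpiexec commands are not filled in here).

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a warmup command, then the real command.
# Usage: run_with_warmup "<warmup command line>" "<real command line>"
run_with_warmup() {
    warmup_cmd="$1"
    real_cmd="$2"
    # Run the warmup in its own process; stop here if it fails.
    if ! sh -c "$warmup_cmd"; then
        echo "warmup failed; skipping real run" >&2
        return 1
    fi
    # The real run starts as a separate process after the warmup exits.
    sh -c "$real_cmd"
}
```

Running each step through a separate `sh -c` keeps the warmup and the real run in different processes, matching the observation that it is the subsequent process in which zeImageCreate succeeds.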
Hypothesis
The Level Zero driver on PVC appears to defer some image subsystem initialization until after the first batch of kernel/event operations completes. The ~32 seconds of event cycling activity observed in chipStar's event pool seems necessary for the driver to reach a state where zeImageCreate succeeds.
This suggests a driver-side lazy initialization race: if zeImageCreate is called before sufficient Level Zero activity has occurred in the process (or possibly before the driver's background initialization thread finishes), the call fails with an error that does not occur once the driver is "warmed up."
Related Issues
The segfault that occurs during C++ exception unwinding through chipStar's destructor (triggered by the hipErrorInvalidValue error propagation) is filed as a separate issue.
Additional Context
- The CHIPEventLevel0::reset() called while event is recording warning storm (7000+ occurrences over ~32s) immediately precedes the crash and may be a symptom of the same underlying driver state issue.
- The bug is consistently reproducible on Aurora (ALCF) with Intel Data Center GPU Max 1550 hardware and does not appear to be input-dependent.