zeImageCreate fails with ZE_RESULT_ERROR on first Level Zero context use (Aurora/PVC), succeeds on second invocation #1255

@pvelesko

Description

Platform

  • Hardware: Intel Data Center GPU Max 1550 (Aurora @ ALCF / PVC)
  • chipStar version: 2026.04.29
  • Backend: Level Zero (CHIP_BE=LEVEL0)
  • oneAPI version: 2025.3

Symptom

The first call to hipCreateTextureObject in the first Level Zero process on a fresh node returns hipErrorInvalidValue. A second process running identical code on the same node, started approximately 37 seconds after the first one exits, succeeds without error.

The error propagates as:

Binding of the texture object failed. CUDA error #1 (hipErrorInvalidValue)

The failure path (from source inspection):
hipCreateTextureObject → CHIPDeviceLevel0::createTexture → allocateImage → zeImageCreate fails → CHIPERR_CHECK_LOG_AND_THROW_TABLE throws → exception propagates → caught in CHIP_CATCH → returns hipErrorInvalidValue to caller.

Additionally, during the ~32 seconds before the crash, 7000+ warnings of the form:

CHIPEventLevel0::reset() called while event is recording

are emitted, suggesting the Level Zero event pool machinery is not fully initialized when zeImageCreate is called.
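
To quantify the warning storm for a given run, the warnings can be counted in a captured log. The following is a minimal sketch; the function name count_reset_warnings and the idea that the run's stderr was redirected to a file are assumptions, not part of chipStar or GROMACS.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count chipStar event-reset warnings in a captured log.
# Assumes the run's stderr was redirected to the file passed as $1.
count_reset_warnings() {
    local log_file="$1"
    # -F: treat the pattern as a fixed string (the parentheses are literal)
    # -c: print the number of matching lines
    # Note: grep exits with status 1 when the count is 0, but still prints 0.
    grep -F -c 'CHIPEventLevel0::reset() called while event is recording' "$log_file"
}
```

Called as count_reset_warnings on the log of a failing run and of a succeeding run, this would show whether the warning count differs between the two.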

Reproduction

Run any GROMACS mdrun with -nb gpu -pme gpu using the chipStar Level Zero backend inside a PBS job:

  • Run 1 (first process on a fresh node): crashes with a segfault (triggered by hipErrorInvalidValue from hipCreateTextureObject)
  • Run 2 (second process on same node, started ~37s after run 1 exits): succeeds with identical binary and inputs

The only difference between the two runs is the order and timing of invocation; the inputs and binary are identical.

Workaround

Pre-warm the GPU with a short run using -pme cpu (which avoids calling hipCreateTextureObject) before the real GPU-PME run. After warmup completes, zeImageCreate succeeds in the subsequent process.

Example warmup step in a PBS script:

# Warmup: 10-step run using CPU PME (no hipCreateTextureObject)
mpiexec ... mdrun -nb gpu -pme cpu -nsteps 10 ...

# Real run: GPU PME now succeeds
mpiexec ... mdrun -nb gpu -pme gpu ...
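
The two steps can also be wrapped so that the real run only starts once the warmup has exited cleanly. This is a sketch under assumptions: warmup_then_run is a hypothetical helper, and the two commands are passed as strings so the elided mpiexec arguments stay whatever the job script already uses.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a warmup command, then the real command.
# Usage: warmup_then_run "<warmup command>" "<real command>"
# Status messages go to stderr so the commands' stdout stays untouched.
warmup_then_run() {
    local warmup_cmd="$1"
    local real_cmd="$2"

    echo "warmup: starting" >&2
    if ! sh -c "$warmup_cmd"; then
        # Design choice for this sketch: do not attempt the real run
        # if the warmup itself failed.
        echo "warmup: failed, skipping the real run" >&2
        return 1
    fi

    echo "real run: starting" >&2
    # The function's exit status is that of the real command.
    sh -c "$real_cmd"
}
```

Stopping after a failed warmup is a choice made for the sketch; a job that prefers to attempt the GPU-PME run regardless could ignore the warmup's exit status instead.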

Hypothesis

The Level Zero driver on PVC appears to defer some image subsystem initialization until after the first batch of kernel/event operations completes. The ~32 seconds of event cycling activity observed in chipStar's event pool seems necessary for the driver to reach a state where zeImageCreate succeeds.

This suggests a driver-side lazy initialization race: if zeImageCreate is called before sufficient Level Zero activity has occurred in the process (or possibly before the driver's background initialization thread finishes), the call fails with an error that would otherwise not occur once the driver is "warmed up."
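
One way to probe this hypothesis is to re-run the failing command in fresh processes and record when it first succeeds. The sketch below assumes a hypothetical helper, retry_until_success, and a caller-supplied command string; it only measures attempts and wall-clock time and does not distinguish elapsed time from accumulated driver activity.

```shell
#!/usr/bin/env bash
# Hypothetical probe: re-run a command in fresh processes until it succeeds,
# logging the attempt number, exit status, and seconds since the first start.
# Usage: retry_until_success <max_attempts> <delay_seconds> "<command>"
retry_until_success() {
    local max_attempts="$1"
    local delay="$2"
    local cmd="$3"
    local attempt=1
    local status
    local start
    start=$(date +%s)

    while [ "$attempt" -le "$max_attempts" ]; do
        sh -c "$cmd"
        status=$?
        echo "attempt=$attempt status=$status elapsed=$(( $(date +%s) - start ))s" >&2
        if [ "$status" -eq 0 ]; then
            return 0
        fi
        # Wait before the next attempt, except after the last one
        if [ "$attempt" -lt "$max_attempts" ]; then
            sleep "$delay"
        fi
        attempt=$((attempt + 1))
    done
    return 1
}
```

If the first success always lands on the second attempt regardless of the delay, that would point toward prior activity rather than elapsed time; if it tracks the delay, the opposite. Either reading remains a hypothesis until confirmed against the driver.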

Related Issues

The segfault that occurs during C++ exception unwinding through chipStar's destructor (triggered by the hipErrorInvalidValue error propagation) is filed as a separate issue.

Additional Context

  • The CHIPEventLevel0::reset() called while event is recording warning storm (7000+ occurrences over ~32s) immediately precedes the crash and may be a symptom of the same underlying driver state issue.
  • The bug is consistently reproducible on Aurora (ALCF) with Intel Data Center GPU Max 1550 hardware and does not appear to be input-dependent.
