zeImageCreate fails with ZE_RESULT_ERROR on first Level Zero context use (Aurora/PVC), succeeds on second invocation #1255

## Platform

- **Hardware**: Intel Data Center GPU Max 1550 (Aurora @ ALCF / PVC)
- **chipStar version**: 2026.04.29
- **Backend**: Level Zero (`CHIP_BE=LEVEL0`)
- **oneAPI version**: 2025.3

## Symptom

The first call to `hipCreateTextureObject` in a fresh Level Zero process returns `hipErrorInvalidValue`. A second process, started approximately 37 seconds after the first one exits and running identical code on the same node, succeeds without error.

The error propagates as:
```
Binding of the texture object failed. CUDA error #1 (hipErrorInvalidValue)
```

The failure path (from source inspection):
`hipCreateTextureObject` → `CHIPDeviceLevel0::createTexture` → `allocateImage` → `zeImageCreate` fails → `CHIPERR_CHECK_LOG_AND_THROW_TABLE` throws → exception propagates → caught in `CHIP_CATCH` → returns `hipErrorInvalidValue` to caller.

Additionally, during the ~32 seconds before the crash, 7000+ warnings of the form:
```
CHIPEventLevel0::reset() called while event is recording
```
are emitted, suggesting the Level Zero event pool machinery is not fully initialized when `zeImageCreate` is called.
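
To put a number on the warning storm in a captured log, a small shell helper along these lines may be enough. It is only a sketch: the function name `count_reset_warnings` and the log file name in the usage note are illustrative placeholders, not part of chipStar or GROMACS.

```bash
#!/usr/bin/env bash
# Count lines in a captured log that contain the chipStar reset warning.
# Hypothetical helper; pass the path of your own stdout/stderr capture.
count_reset_warnings() {
    local log_file="$1"
    # -F treats the pattern as a fixed string (it contains parentheses),
    # -c prints the number of matching lines instead of the lines themselves.
    grep -cF 'CHIPEventLevel0::reset() called while event is recording' "$log_file"
}
```

For example, `count_reset_warnings run1.log` prints the number of matching lines in `run1.log` (a placeholder name for the first run's captured output).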

## Reproduction

Run any GROMACS `mdrun` with `-nb gpu -pme gpu` using the chipStar Level Zero backend inside a PBS job:

- **Run 1** (first process on fresh node): crashes with segfault (triggered by `hipErrorInvalidValue` from `hipCreateTextureObject`)
- **Run 2** (second process on same node, started ~37s after run 1 exits): succeeds with identical binary and inputs

The only difference between the two runs is invocation order (first vs. second process on the node); the binary and inputs are identical.
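
The two-run comparison can be scripted roughly as follows. This is a minimal sketch, assuming the full `mpiexec ... mdrun` command line is passed as arguments; `run_twice` and the `run1.log`/`run2.log` file names are hypothetical and chosen only for illustration.

```bash
#!/usr/bin/env bash
# Run the same command twice back to back and report each exit status.
# Output of run N is captured in runN.log in the current directory.
run_twice() {
    local i status
    for i in 1 2; do
        status=0
        # '|| status=$?' records a failure without aborting under 'set -e'.
        "$@" > "run${i}.log" 2>&1 || status=$?
        echo "run ${i}: exit status ${status}"
    done
}
```

Invoked as `run_twice <full mdrun command line>`, this prints one status line per run, so a failing first run followed by a passing second run shows up directly in the job output.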

## Workaround

Pre-warm the GPU with a short run using `-pme cpu` (which avoids calling `hipCreateTextureObject`) before the real GPU-PME run. After warmup completes, `zeImageCreate` succeeds in the subsequent process.

Example warmup step in a PBS script:
```bash
# Warmup: 10-step run using CPU PME (no hipCreateTextureObject)
mpiexec ... mdrun -nb gpu -pme cpu -nsteps 10 ...

# Real run: GPU PME now succeeds
mpiexec ... mdrun -nb gpu -pme gpu ...
```

## Hypothesis

The Level Zero driver on PVC appears to defer some image subsystem initialization until after the first batch of kernel/event operations completes. The ~32 seconds of event cycling activity observed in chipStar's event pool seems necessary for the driver to reach a state where `zeImageCreate` succeeds.

This suggests a driver-side lazy initialization race: if `zeImageCreate` is called before sufficient Level Zero activity has occurred in the process (or possibly before the driver's background initialization thread finishes), the call fails with an error that no longer occurs once the driver is "warmed up."

## Related Issues

The segfault that occurs during C++ exception unwinding through chipStar's destructor (triggered by the `hipErrorInvalidValue` error propagation) is filed as a separate issue.

## Additional Context

- The `CHIPEventLevel0::reset() called while event is recording` warning storm (7000+ occurrences over ~32s) immediately precedes the crash and may be a symptom of the same underlying driver state issue.
- The bug is consistently reproducible on Aurora (ALCF) with Intel Data Center GPU Max 1550 hardware and does not appear to be input-dependent.
