Wrong-code on Xe2 (Arc B580): 32-byte vector USM load returns the wrong row; broken emulated 64-bit addressing under register pressure

# Summary

On Intel Arc B580 (Battlemage / Xe2), IGC JIT-compiles a small SYCL kernel so
that a `sycl::vec<uint32_t, 8>` load from a 2 GiB `malloc_device` USM
allocation at a runtime 64-bit-derived index **silently returns data from the
wrong address**. Depending on configuration the same defect also shows as a
compute exception (ccs engine reset) or a hang.

The minimized reproducer (attached, ~230 lines) fails at **every** JIT
optimization level (`SYCL_PROGRAM_COMPILE_OPTIONS` `-O0`..`-O3`) and with
**both** front ends — oneAPI 2026.0.0 release `icpx` and intel/llvm
`nightly-2026-06-12` `clang++` — producing the **identical** wrong row, so the
defect is in IGC code generation, not in the front end. The same binary
passes on the OpenCL **CPU** device (kernel logic is proven correct).

First hit in production kernels (MoneroOcean mo-miner, autolykos2): there,
table reads return wrong data for ~41–50 % of indices — matching the fraction
of offsets into the 2 GiB table whose 64-bit address calculation needs a carry
into the high dword from the allocation base (Xe2 has no native 64-bit int
ops). vISA inspection shows the address arithmetic of this kernel's loads
breaking under high register pressure, while sibling low-pressure kernels with
the same source helper compile correctly. Forcing a carry-safe allocation
placement (base low32 + 2 GiB <= 4 GiB, so no in-bounds offset ever carries)
still produces wrong reads, so it is not merely a dropped carry.

## Environment

- GPU: Intel Arc B580 (Battlemage / Xe2, PCI 0xe20b)
- Ubuntu 24.04 userspace (containerized), host kernel 6.17.0-35, `xe` driver
- OpenCL runtime: NEO 26.05.37020.3, libigc2 2.28.4
  (IGC git-hash `6b2da2b8a2661adeaebaacbbbf7f35b5ee9e8124`)
- Front ends (both reproduce): oneAPI DPC++ 2026.0.0 (2026.0.0.20260331);
  intel/llvm nightly-2026-06-12 (`24322f9e1ad3`)

## Reproducer

`repro.cpp` (attached in [bug_report1_repro.zip](https://github.com/user-attachments/files/28886959/bug_report1_repro.zip)): builds a 2 GiB device table with a deterministic
per-word pattern, then 64 work items each run two chained stages of
{2 statically-unrolled BLAKE2b rounds → derive row index → load 32-byte row as
`sycl::vec<uint32_t,8>` → fold row into next stage}. The final row of work
item 0 is compared against the host recomputation.

```sh
icpx -fsycl -O2 repro.cpp -o repro
ONEAPI_DEVICE_SELECTOR=opencl:gpu SYCL_PROGRAM_COMPILE_OPTIONS=-O3 ./repro
```

Actual (B580, identical wrong row from 2026.0 and nightly builds):

```text
index   : 139774
got     : 31611897 791ba32c 5b4bd021 4b073452 7fd57710 d6e40eec 4bdb6c01 f1a56262
expect  : ea47a203 6cae3e14 c89524d9 132c8fd0 8c68604c 28054808 39cf103e a15190f0
FAIL: vector row load returned the wrong row
```

Expected: `got` == `expect`, `ok`, exit 0 — which is what the CPU-device
control prints (`ONEAPI_DEVICE_SELECTOR=opencl:cpu ./repro`).

The `got` row is a self-consistent table row of a *different* index: the
stage-1 row load returned wrong data and the chain diverged deterministically.

Knobs: `-DCHAIN=n` (hash→load stages, default 2), `-DBLAKE_ROUNDS_N=n`
(unrolled rounds per hash, default 2). Pressure matrix observed on this stack:

| config | JIT | result |
|--------|-----|--------|
| CHAIN=2 ROUNDS=2/4/7 | -O3, -O1, -O0 | FAIL (wrong row), both front ends |
| CHAIN=2 ROUNDS=1 | -O3 | 2026.0: HANG (bcs/ccs engine resets); nightly: ok |
| CHAIN=1 ROUNDS=7 | -O3 | 2026.0: compute exception `UR_RESULT_ERROR_IN_EVENT_LIST_EXEC_STATUS` (ccs reset); nightly: ok* |
| CHAIN=1 ROUNDS=7 | -O1/-O0 | 2026.0: ok |
| any, CPU device | n/a | ok |

\* The nightly "ok" at CHAIN=1 is a false negative: June 2026 nightlies emit
bitcast-heavy SPIR-V (~60 % larger for identical source) that defeats IGC's
loop unrolling on larger kernels, demoting the message arrays to scratch and
lowering register pressure below the trigger threshold (filed separately as a
front-end regression). The repro therefore unrolls rounds with compile-time
round numbers so pressure does not depend on the unroller.

## Level Zero note

The same kernels reproduce on Level Zero, but a faulting variant can hang
without any error reaching the application (`event::wait()` blocks forever
while the driver logs an LR-job cleanup). OpenCL is the practical backend for
collecting results.

## Attached diagnostics (what IGC triage usually asks for)

- **Shader dumps** [igc_dumps.zip](https://github.com/user-attachments/files/28886957/igc_dumps.zip) (`IGC_ShaderDumpEnable=1 IGC_DumpToCustomDir=...` while
  running the failing default config, JIT `-O3`):
  - `igc_dumps_2026.0/` — repro built with oneAPI 2026.0 icpx
  - `igc_dumps_nightly/` — same source built with intel/llvm nightly-2026-06-12
  - Each contains, per OpenCL program: input `.spv`, `before/afterUnification.ll`,
    `optimized.ll`, `codegen.ll`, vISA (`.visaasm`) and final ISA (`.asm`,
    `.isaasm`), `options.txt` (`-O3`) and `internal_options.txt`
    (note `-cl-intel-greater-than-4GB-buffer-required` is present).
  - The miscompiled **probe kernel** is
    `_ZTSZZ4mainENKUlRN4sycl3_V17handlerEE0_clES2_EUlNS0_7nd_itemILi1EEEE_`
    (the second `submit` in `main`); the table-build kernel
    (`...handlerEE_cl...`, separate program) compiles correctly — same
    `table + uint64(index) * 8` addressing helper, lower register pressure.
- **Compilation mode**: JIT from SPIR-V (no AOT/`-fsycl-targets`); device
  options only `SYCL_PROGRAM_COMPILE_OPTIONS=-O3` (also reproduces with
  `-O0`/`-O1`/`-O2`).
- **Stack**: Ubuntu 24.04 userspace, host kernel 6.17.0-35 (`xe` driver),
  intel-opencl-icd 26.05.37020.3, libigc2 2.28.4 / libigdfcl2 2.28.4
  (IGC git `6b2da2b8a2661adeaebaacbbbf7f35b5ee9e8124`), Level Zero loader
  libze1; GPU Arc B580, PCI 0xe20b, GuC firmware 70.65.x.
- **Determinism**: allocation base is stable on this host
  (`0xffffd556b0800000`, low32 `0xb0800000`), the failing index (139774) and
  the wrong row are bit-identical run-to-run and across both front ends.
- dmesg for the faulting/hanging variants: `GT0: Engine reset:
  engine_class=ccs` (exception form), repeated `bcs`/`ccs` resets +
  devcoredumps (hang form).

## Related (checked, not duplicates)

- intel-graphics-compiler#412 — Xe2 RA regression, but performance/spill only,
  and its "good" baseline (IGC 2.28.4) already contains this wrong-code bug.
- kokkos/kokkos#7116 — prior art for IGC zeroing the top 32 bits of 64-bit
  pointer arithmetic in device kernels (oneAPI 2023.1 / PVC).
- intel/llvm#21741 — Xe2 wrong results, but ESIMD DPAS + large-project
  device-image composition.
- intel/compute-runtime#912 — B580 hang, `uchar` load/store pattern, no
  64-bit-address component.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Wrong-code on Xe2 (Arc B580): 32-byte vector USM load returns the wrong row; broken emulated 64-bit addressing under register pressure #414

Summary

Environment

Reproducer

Level Zero note

Attached diagnostics (what IGC triage usually asks for)

Related (checked, not duplicates)

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

config	JIT	result
CHAIN=2 ROUNDS=2/4/7	-O3, -O1, -O0	FAIL (wrong row), both front ends
CHAIN=2 ROUNDS=1	-O3	2026.0: HANG (bcs/ccs engine resets); nightly: ok
CHAIN=1 ROUNDS=7	-O3	2026.0: compute exception `UR_RESULT_ERROR_IN_EVENT_LIST_EXEC_STATUS` (ccs reset); nightly: ok*
CHAIN=1 ROUNDS=7	-O1/-O0	2026.0: ok
any, CPU device	n/a	ok

Wrong-code on Xe2 (Arc B580): 32-byte vector USM load returns the wrong row; broken emulated 64-bit addressing under register pressure #414

Description

Summary

Environment

Reproducer

Level Zero note

Attached diagnostics (what IGC triage usually asks for)

Related (checked, not duplicates)

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions