Summary
On Intel Arc B580 (Battlemage / Xe2), IGC JIT-compiles a small SYCL kernel so
that a sycl::vec<uint32_t, 8> load from a 2 GiB malloc_device USM
allocation at a runtime 64-bit-derived index silently returns data from the
wrong address. Depending on configuration the same defect also shows as a
compute exception (ccs engine reset) or a hang.
The minimized reproducer (attached, ~230 lines) fails at every JIT
optimization level (SYCL_PROGRAM_COMPILE_OPTIONS -O0..-O3) and with
both front ends — oneAPI 2026.0.0 release icpx and intel/llvm
nightly-2026-06-12 clang++ — producing the identical wrong row, so the
defect is in IGC code generation, not in the front end. The same binary
passes on the OpenCL CPU device (kernel logic is proven correct).
First hit in production kernels (MoneroOcean mo-miner, autolykos2): there,
table reads return wrong data for ~41–50 % of indices — matching the fraction
of offsets into the 2 GiB table whose 64-bit address calculation needs a carry
into the high dword from the allocation base (Xe2 has no native 64-bit int
ops). vISA inspection shows the address arithmetic of this kernel's loads
breaking under high register pressure, while sibling low-pressure kernels with
the same source helper compile correctly. Forcing a carry-safe allocation
placement (base low32 + 2 GiB <= 4 GiB, so no in-bounds offset ever carries)
still produces wrong reads, so it is not merely a dropped carry.
Environment
- GPU: Intel Arc B580 (Battlemage / Xe2, PCI 0xe20b)
- Ubuntu 24.04 userspace (containerized), host kernel 6.17.0-35,
xe driver
- OpenCL runtime: NEO 26.05.37020.3, libigc2 2.28.4
(IGC git-hash 6b2da2b8a2661adeaebaacbbbf7f35b5ee9e8124)
- Front ends (both reproduce): oneAPI DPC++ 2026.0.0 (2026.0.0.20260331);
intel/llvm nightly-2026-06-12 (24322f9e1ad3)
Reproducer
repro.cpp (attached in bug_report1_repro.zip): builds a 2 GiB device table with a deterministic
per-word pattern, then 64 work items each run two chained stages of
{2 statically-unrolled BLAKE2b rounds → derive row index → load 32-byte row as
sycl::vec<uint32_t,8> → fold row into next stage}. The final row of work
item 0 is compared against the host recomputation.
icpx -fsycl -O2 repro.cpp -o repro
ONEAPI_DEVICE_SELECTOR=opencl:gpu SYCL_PROGRAM_COMPILE_OPTIONS=-O3 ./repro
Actual (B580, identical wrong row from 2026.0 and nightly builds):
index : 139774
got : 31611897 791ba32c 5b4bd021 4b073452 7fd57710 d6e40eec 4bdb6c01 f1a56262
expect : ea47a203 6cae3e14 c89524d9 132c8fd0 8c68604c 28054808 39cf103e a15190f0
FAIL: vector row load returned the wrong row
Expected: got == expect, ok, exit 0 — which is what the CPU-device
control prints (ONEAPI_DEVICE_SELECTOR=opencl:cpu ./repro).
The got row is a self-consistent table row of a different index: the
stage-1 row load returned wrong data and the chain diverged deterministically.
Knobs: -DCHAIN=n (hash→load stages, default 2), -DBLAKE_ROUNDS_N=n
(unrolled rounds per hash, default 2). Pressure matrix observed on this stack:
| config |
JIT |
result |
| CHAIN=2 ROUNDS=2/4/7 |
-O3, -O1, -O0 |
FAIL (wrong row), both front ends |
| CHAIN=2 ROUNDS=1 |
-O3 |
2026.0: HANG (bcs/ccs engine resets); nightly: ok |
| CHAIN=1 ROUNDS=7 |
-O3 |
2026.0: compute exception UR_RESULT_ERROR_IN_EVENT_LIST_EXEC_STATUS (ccs reset); nightly: ok* |
| CHAIN=1 ROUNDS=7 |
-O1/-O0 |
2026.0: ok |
| any, CPU device |
n/a |
ok |
* The nightly "ok" at CHAIN=1 is a false negative: June 2026 nightlies emit
bitcast-heavy SPIR-V (~60 % larger for identical source) that defeats IGC's
loop unrolling on larger kernels, demoting the message arrays to scratch and
lowering register pressure below the trigger threshold (filed separately as a
front-end regression). The repro therefore unrolls rounds with compile-time
round numbers so pressure does not depend on the unroller.
Level Zero note
The same kernels reproduce on Level Zero, but a faulting variant can hang
without any error reaching the application (event::wait() blocks forever
while the driver logs an LR-job cleanup). OpenCL is the practical backend for
collecting results.
Attached diagnostics (what IGC triage usually asks for)
- Shader dumps igc_dumps.zip (
IGC_ShaderDumpEnable=1 IGC_DumpToCustomDir=... while
running the failing default config, JIT -O3):
igc_dumps_2026.0/ — repro built with oneAPI 2026.0 icpx
igc_dumps_nightly/ — same source built with intel/llvm nightly-2026-06-12
- Each contains, per OpenCL program: input
.spv, before/afterUnification.ll,
optimized.ll, codegen.ll, vISA (.visaasm) and final ISA (.asm,
.isaasm), options.txt (-O3) and internal_options.txt
(note -cl-intel-greater-than-4GB-buffer-required is present).
- The miscompiled probe kernel is
_ZTSZZ4mainENKUlRN4sycl3_V17handlerEE0_clES2_EUlNS0_7nd_itemILi1EEEE_
(the second submit in main); the table-build kernel
(...handlerEE_cl..., separate program) compiles correctly — same
table + uint64(index) * 8 addressing helper, lower register pressure.
- Compilation mode: JIT from SPIR-V (no AOT/
-fsycl-targets); device
options only SYCL_PROGRAM_COMPILE_OPTIONS=-O3 (also reproduces with
-O0/-O1/-O2).
- Stack: Ubuntu 24.04 userspace, host kernel 6.17.0-35 (
xe driver),
intel-opencl-icd 26.05.37020.3, libigc2 2.28.4 / libigdfcl2 2.28.4
(IGC git 6b2da2b8a2661adeaebaacbbbf7f35b5ee9e8124), Level Zero loader
libze1; GPU Arc B580, PCI 0xe20b, GuC firmware 70.65.x.
- Determinism: allocation base is stable on this host
(0xffffd556b0800000, low32 0xb0800000), the failing index (139774) and
the wrong row are bit-identical run-to-run and across both front ends.
- dmesg for the faulting/hanging variants:
GT0: Engine reset: engine_class=ccs (exception form), repeated bcs/ccs resets +
devcoredumps (hang form).
Related (checked, not duplicates)
Summary
On Intel Arc B580 (Battlemage / Xe2), IGC JIT-compiles a small SYCL kernel so
that a
sycl::vec<uint32_t, 8>load from a 2 GiBmalloc_deviceUSMallocation at a runtime 64-bit-derived index silently returns data from the
wrong address. Depending on configuration the same defect also shows as a
compute exception (ccs engine reset) or a hang.
The minimized reproducer (attached, ~230 lines) fails at every JIT
optimization level (
SYCL_PROGRAM_COMPILE_OPTIONS-O0..-O3) and withboth front ends — oneAPI 2026.0.0 release
icpxand intel/llvmnightly-2026-06-12clang++— producing the identical wrong row, so thedefect is in IGC code generation, not in the front end. The same binary
passes on the OpenCL CPU device (kernel logic is proven correct).
First hit in production kernels (MoneroOcean mo-miner, autolykos2): there,
table reads return wrong data for ~41–50 % of indices — matching the fraction
of offsets into the 2 GiB table whose 64-bit address calculation needs a carry
into the high dword from the allocation base (Xe2 has no native 64-bit int
ops). vISA inspection shows the address arithmetic of this kernel's loads
breaking under high register pressure, while sibling low-pressure kernels with
the same source helper compile correctly. Forcing a carry-safe allocation
placement (base low32 + 2 GiB <= 4 GiB, so no in-bounds offset ever carries)
still produces wrong reads, so it is not merely a dropped carry.
Environment
xedriver(IGC git-hash
6b2da2b8a2661adeaebaacbbbf7f35b5ee9e8124)intel/llvm nightly-2026-06-12 (
24322f9e1ad3)Reproducer
repro.cpp(attached in bug_report1_repro.zip): builds a 2 GiB device table with a deterministicper-word pattern, then 64 work items each run two chained stages of
{2 statically-unrolled BLAKE2b rounds → derive row index → load 32-byte row as
sycl::vec<uint32_t,8>→ fold row into next stage}. The final row of workitem 0 is compared against the host recomputation.
Actual (B580, identical wrong row from 2026.0 and nightly builds):
Expected:
got==expect,ok, exit 0 — which is what the CPU-devicecontrol prints (
ONEAPI_DEVICE_SELECTOR=opencl:cpu ./repro).The
gotrow is a self-consistent table row of a different index: thestage-1 row load returned wrong data and the chain diverged deterministically.
Knobs:
-DCHAIN=n(hash→load stages, default 2),-DBLAKE_ROUNDS_N=n(unrolled rounds per hash, default 2). Pressure matrix observed on this stack:
UR_RESULT_ERROR_IN_EVENT_LIST_EXEC_STATUS(ccs reset); nightly: ok** The nightly "ok" at CHAIN=1 is a false negative: June 2026 nightlies emit
bitcast-heavy SPIR-V (~60 % larger for identical source) that defeats IGC's
loop unrolling on larger kernels, demoting the message arrays to scratch and
lowering register pressure below the trigger threshold (filed separately as a
front-end regression). The repro therefore unrolls rounds with compile-time
round numbers so pressure does not depend on the unroller.
Level Zero note
The same kernels reproduce on Level Zero, but a faulting variant can hang
without any error reaching the application (
event::wait()blocks foreverwhile the driver logs an LR-job cleanup). OpenCL is the practical backend for
collecting results.
Attached diagnostics (what IGC triage usually asks for)
IGC_ShaderDumpEnable=1 IGC_DumpToCustomDir=...whilerunning the failing default config, JIT
-O3):igc_dumps_2026.0/— repro built with oneAPI 2026.0 icpxigc_dumps_nightly/— same source built with intel/llvm nightly-2026-06-12.spv,before/afterUnification.ll,optimized.ll,codegen.ll, vISA (.visaasm) and final ISA (.asm,.isaasm),options.txt(-O3) andinternal_options.txt(note
-cl-intel-greater-than-4GB-buffer-requiredis present)._ZTSZZ4mainENKUlRN4sycl3_V17handlerEE0_clES2_EUlNS0_7nd_itemILi1EEEE_(the second
submitinmain); the table-build kernel(
...handlerEE_cl..., separate program) compiles correctly — sametable + uint64(index) * 8addressing helper, lower register pressure.-fsycl-targets); deviceoptions only
SYCL_PROGRAM_COMPILE_OPTIONS=-O3(also reproduces with-O0/-O1/-O2).xedriver),intel-opencl-icd 26.05.37020.3, libigc2 2.28.4 / libigdfcl2 2.28.4
(IGC git
6b2da2b8a2661adeaebaacbbbf7f35b5ee9e8124), Level Zero loaderlibze1; GPU Arc B580, PCI 0xe20b, GuC firmware 70.65.x.
(
0xffffd556b0800000, low320xb0800000), the failing index (139774) andthe wrong row are bit-identical run-to-run and across both front ends.
GT0: Engine reset: engine_class=ccs(exception form), repeatedbcs/ccsresets +devcoredumps (hang form).
Related (checked, not duplicates)
and its "good" baseline (IGC 2.28.4) already contains this wrong-code bug.
pointer arithmetic in device kernels (oneAPI 2023.1 / PVC).
device-image composition.
ucharload/store pattern, no64-bit-address component.