Skip to content

Wrong-code on Xe2 (Arc B580): 32-byte vector USM load returns the wrong row; broken emulated 64-bit addressing under register pressure #414

@MoneroOcean

Description

@MoneroOcean

Summary

On Intel Arc B580 (Battlemage / Xe2), IGC JIT-compiles a small SYCL kernel so
that a sycl::vec<uint32_t, 8> load from a 2 GiB malloc_device USM
allocation at a runtime 64-bit-derived index silently returns data from the
wrong address
. Depending on configuration the same defect also shows as a
compute exception (ccs engine reset) or a hang.

The minimized reproducer (attached, ~230 lines) fails at every JIT
optimization level (SYCL_PROGRAM_COMPILE_OPTIONS -O0..-O3) and with
both front ends — oneAPI 2026.0.0 release icpx and intel/llvm
nightly-2026-06-12 clang++ — producing the identical wrong row, so the
defect is in IGC code generation, not in the front end. The same binary
passes on the OpenCL CPU device (kernel logic is proven correct).

First hit in production kernels (MoneroOcean mo-miner, autolykos2): there,
table reads return wrong data for ~41–50 % of indices — matching the fraction
of offsets into the 2 GiB table whose 64-bit address calculation needs a carry
into the high dword from the allocation base (Xe2 has no native 64-bit int
ops). vISA inspection shows the address arithmetic of this kernel's loads
breaking under high register pressure, while sibling low-pressure kernels with
the same source helper compile correctly. Forcing a carry-safe allocation
placement (base low32 + 2 GiB <= 4 GiB, so no in-bounds offset ever carries)
still produces wrong reads, so it is not merely a dropped carry.

Environment

  • GPU: Intel Arc B580 (Battlemage / Xe2, PCI 0xe20b)
  • Ubuntu 24.04 userspace (containerized), host kernel 6.17.0-35, xe driver
  • OpenCL runtime: NEO 26.05.37020.3, libigc2 2.28.4
    (IGC git-hash 6b2da2b8a2661adeaebaacbbbf7f35b5ee9e8124)
  • Front ends (both reproduce): oneAPI DPC++ 2026.0.0 (2026.0.0.20260331);
    intel/llvm nightly-2026-06-12 (24322f9e1ad3)

Reproducer

repro.cpp (attached in bug_report1_repro.zip): builds a 2 GiB device table with a deterministic
per-word pattern, then 64 work items each run two chained stages of
{2 statically-unrolled BLAKE2b rounds → derive row index → load 32-byte row as
sycl::vec<uint32_t,8> → fold row into next stage}. The final row of work
item 0 is compared against the host recomputation.

icpx -fsycl -O2 repro.cpp -o repro
ONEAPI_DEVICE_SELECTOR=opencl:gpu SYCL_PROGRAM_COMPILE_OPTIONS=-O3 ./repro

Actual (B580, identical wrong row from 2026.0 and nightly builds):

index   : 139774
got     : 31611897 791ba32c 5b4bd021 4b073452 7fd57710 d6e40eec 4bdb6c01 f1a56262
expect  : ea47a203 6cae3e14 c89524d9 132c8fd0 8c68604c 28054808 39cf103e a15190f0
FAIL: vector row load returned the wrong row

Expected: got == expect, ok, exit 0 — which is what the CPU-device
control prints (ONEAPI_DEVICE_SELECTOR=opencl:cpu ./repro).

The got row is a self-consistent table row of a different index: the
stage-1 row load returned wrong data and the chain diverged deterministically.

Knobs: -DCHAIN=n (hash→load stages, default 2), -DBLAKE_ROUNDS_N=n
(unrolled rounds per hash, default 2). Pressure matrix observed on this stack:

config JIT result
CHAIN=2 ROUNDS=2/4/7 -O3, -O1, -O0 FAIL (wrong row), both front ends
CHAIN=2 ROUNDS=1 -O3 2026.0: HANG (bcs/ccs engine resets); nightly: ok
CHAIN=1 ROUNDS=7 -O3 2026.0: compute exception UR_RESULT_ERROR_IN_EVENT_LIST_EXEC_STATUS (ccs reset); nightly: ok*
CHAIN=1 ROUNDS=7 -O1/-O0 2026.0: ok
any, CPU device n/a ok

* The nightly "ok" at CHAIN=1 is a false negative: June 2026 nightlies emit
bitcast-heavy SPIR-V (~60 % larger for identical source) that defeats IGC's
loop unrolling on larger kernels, demoting the message arrays to scratch and
lowering register pressure below the trigger threshold (filed separately as a
front-end regression). The repro therefore unrolls rounds with compile-time
round numbers so pressure does not depend on the unroller.

Level Zero note

The same kernels reproduce on Level Zero, but a faulting variant can hang
without any error reaching the application (event::wait() blocks forever
while the driver logs an LR-job cleanup). OpenCL is the practical backend for
collecting results.

Attached diagnostics (what IGC triage usually asks for)

  • Shader dumps igc_dumps.zip (IGC_ShaderDumpEnable=1 IGC_DumpToCustomDir=... while
    running the failing default config, JIT -O3):
    • igc_dumps_2026.0/ — repro built with oneAPI 2026.0 icpx
    • igc_dumps_nightly/ — same source built with intel/llvm nightly-2026-06-12
    • Each contains, per OpenCL program: input .spv, before/afterUnification.ll,
      optimized.ll, codegen.ll, vISA (.visaasm) and final ISA (.asm,
      .isaasm), options.txt (-O3) and internal_options.txt
      (note -cl-intel-greater-than-4GB-buffer-required is present).
    • The miscompiled probe kernel is
      _ZTSZZ4mainENKUlRN4sycl3_V17handlerEE0_clES2_EUlNS0_7nd_itemILi1EEEE_
      (the second submit in main); the table-build kernel
      (...handlerEE_cl..., separate program) compiles correctly — same
      table + uint64(index) * 8 addressing helper, lower register pressure.
  • Compilation mode: JIT from SPIR-V (no AOT/-fsycl-targets); device
    options only SYCL_PROGRAM_COMPILE_OPTIONS=-O3 (also reproduces with
    -O0/-O1/-O2).
  • Stack: Ubuntu 24.04 userspace, host kernel 6.17.0-35 (xe driver),
    intel-opencl-icd 26.05.37020.3, libigc2 2.28.4 / libigdfcl2 2.28.4
    (IGC git 6b2da2b8a2661adeaebaacbbbf7f35b5ee9e8124), Level Zero loader
    libze1; GPU Arc B580, PCI 0xe20b, GuC firmware 70.65.x.
  • Determinism: allocation base is stable on this host
    (0xffffd556b0800000, low32 0xb0800000), the failing index (139774) and
    the wrong row are bit-identical run-to-run and across both front ends.
  • dmesg for the faulting/hanging variants: GT0: Engine reset: engine_class=ccs (exception form), repeated bcs/ccs resets +
    devcoredumps (hang form).

Related (checked, not duplicates)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions