Skip to content

Silent miscompile of a software-emulated FP64 reduction at -O1/O2 #413

@eternallyproud

Description

@eternallyproud

Silent miscompile of a software-emulated FP64 reduction at -O1/O2 (GPU without native FP64)

Summary

On an Intel GPU without native FP64 (Alder Lake-P iGPU, device 0xa7a0), where double is software-emulated by IGC, a oneapi::dpl::reduce over a small custom accumulator struct ({double, double, int}) returns a wrong result at -O2 while being correct at -O0 and correct on the OpenCL CPU device (which has native FP64). The reduced running_square_deviations field comes out 0 instead of 5.

This is a silent miscompilation — wrong data, no error, no crash.

Environment

GPU Intel Alder Lake-P iGPU, device id 0xa7a0 (no native FP64)
Compiler Intel oneAPI DPC++/C++ Compiler 2026.0.0
FP64 mode software emulation (OverrideDefaultFP64Settings=1 IGC_EnableDPEmulation=1)
OS Ubuntu 22.04.5 (WSL2), kernel 6.6

Reproduced on two independent stacks (so it is not fixed on current IGC):

  • IGC 2.11.12 / compute-runtime (NEO) 25.18.33578 / Level-Zero GPU 1.6.33578
  • IGC 2.34.4 / compute-runtime (NEO) 26.18.38308 / Level-Zero GPU 1.15.38308

Reproducer (self-contained, ~55 lines, SYCL + oneDPL)

#include <sycl/sycl.hpp>
#include <oneapi/dpl/execution>
#include <oneapi/dpl/numeric>
#include <oneapi/dpl/iterator>
#include <cstdio>
using size_type = int;
struct var_std {                 // numerically-stable parallel variance accumulator
  double running_sum;
  double running_square_deviations;
  size_type count;
  var_std(double t = 0, double s = 0, size_type n = 0)
      : running_sum(t), running_square_deviations(s), count(n) {}
  var_std operator+(var_std const& rhs) const {
    auto m = count, n = rhs.count;
    if (m == 0) return rhs;
    if (n == 0) return *this;
    double tm = running_sum, tn = rhs.running_sum;
    double sm = running_square_deviations, sn = rhs.running_square_deviations;
    double tmn  = tm + tn;
    double diff = ((double)n / m) * tm - tn;
    double smn  = sm + sn + ((double)m / n) / (m + n) * diff * diff;
    return {tmn, smn, m + n};
  }
};
struct to_var_std { var_std operator()(double const& v) const { return {v, 0.0, 1}; } };
struct plus_op   { template <class T> T operator()(T const& a, T const& b) const { return a + b; } };

int main(int argc, char** argv) {
  int N = argc > 1 ? atoi(argv[1]) : 4;                       // {1,2,3,4}
  sycl::queue q{sycl::default_selector_v};                    // honors ONEAPI_DEVICE_SELECTOR
  std::printf("device: %s\n", q.get_device().get_info<sycl::info::device::name>().c_str());
  auto counting = oneapi::dpl::counting_iterator<size_type>(0);
  auto values   = oneapi::dpl::make_transform_iterator(
      counting, [](size_type i) { return (double)(i + 1); });
  auto acc_it   = oneapi::dpl::make_transform_iterator(values, to_var_std{});
  var_std r = oneapi::dpl::reduce(
      oneapi::dpl::execution::make_device_policy(q),
      acc_it, acc_it + N, var_std{0, 0, 0}, plus_op{});
  std::printf("sqdev = %g   (expected 5)%s\n",
              r.running_square_deviations,
              r.running_square_deviations == 0 ? "   <-- MISCOMPILE" : "");
  return r.running_square_deviations == 0 ? 1 : 0;
}

Build / run

# FP64 software emulation must be enabled on an FP64-less iGPU:
export NEOReadDebugKeys=1 OverrideDefaultFP64Settings=1 IGC_EnableDPEmulation=1
# Cold compiler caches so IGC actually recompiles each run:
export NEO_CACHE_PERSISTENT=0 SYCL_CACHE_PERSISTENT=0
FLAGS="-fsycl -fsycl-default-sub-group-size=32 -std=gnu++17"

icpx $FLAGS -O0 repro.cpp -o repro_O0 && ./repro_O0
icpx $FLAGS -O1 repro.cpp -o repro_O1 && ./repro_O1
icpx $FLAGS -O2 repro.cpp -o repro_O2 && ./repro_O2
ONEAPI_DEVICE_SELECTOR=opencl:cpu ./repro_O2

Observed (IGC 2.11.12 / compute-runtime (NEO) 25.18.33578 / Level-Zero GPU 1.6.33578)

device: Intel(R) Graphics [0xa7a0]
GPU -O0:  sqdev = 5   (expected 5)
GPU -O1:  sqdev = 0   (expected 5)   <-- MISCOMPILE
GPU -O2:  sqdev = 0   (expected 5)   <-- MISCOMPILE
CPU -O2:  sqdev = 5   (expected 5)

Observed (IGC 2.34.4 / compute-runtime (NEO) 26.18.38308 / Level-Zero GPU 1.15.38308)

device: Intel(R) Graphics [0xa7a0]
GPU -O0:  sqdev = 5   (expected 5)
GPU -O1:  sqdev = -nan   (expected 5)   <-- MISCOMPILE
GPU -O2:  sqdev = -nan   (expected 5)   <-- MISCOMPILE
CPU -O2:  sqdev = 5   (expected 5)

Triage already done

  • Correct at -O0, wrong at -O2 on the GPU; the wrong field is exactly 0.
  • Same source is correct on the OpenCL CPU device at every optimization level → the device program is well-defined; the defect is in the GPU FP64-emulation + optimizer path.
  • It is the FP64 emulation specifically: changing doublefloat (native) makes -O2 correct; an accumulator that uses only a single emulated + is also correct. Only the emulation-heavy merge (several emulated double divides/muls) is miscompiled.
  • From IGC_ShaderDumpEnableAll dumps: the double ops survive the entire middle-end intact and are lowered to integer software emulation at CG_after_PreCompiledFuncImport. The wrong value appears after that, in the CodeGen memory-optimization stage:
    • IGC_DisableMemOpt=1 changes the result from 0 to -nan — i.e. the running_square_deviations field value never reaches the output store; MemOpt's load/store forwarding turns that uninitialized read into 0.
    • Disabling individual passes does not restore correctness: DisableCustomUnsafeOpt, DisableFastMathConstantHandling, OCLEnableReassociate=0, DisableMatchMad, DisableLoopUnroll, EarlyCSE off, SelectiveLoopUnrollForDPEmu=0, EmulationFunctionControl / InlinedEmulationThreshold. Only -O0 does.

So the trigger appears to be IGC's -O2 CodeGen memory optimization mishandling the store/load of an emulated-FP64 struct field produced by PreCompiledFuncImport.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions