Fix executable save-lisp-and-die snapshot on aarch64 Linux #1779

Open
dg1sbg wants to merge 1 commit into clasp-developers:main from dg1sbg:fix/aarch64-snapshot-objcopy

dg1sbg commented May 30, 2026

Problem

On aarch64 Linux, (save-lisp-and-die … :executable t) does not produce a working executable. The snapshot data is written correctly, but the binary-object wrap step fails:

Creating binary object from snapshot...
objcopy: architecture i386 unknown
/usr/bin/ld.gold: error: /tmp/ss-XXXXXX: file is empty
clang++: error: linker command failed with exit code 1

Two bugs in snapshot_save_impl (src/gctools/snapshotSaveLoad.cc), in the _TARGET_OS_LINUX _Executable path:

  1. Hardcoded x86-64 objcopy target. The objcopy that wraps the snapshot blob into an object hardcodes --output-target elf64-x86-64 --binary-architecture i386. On aarch64, objcopy rejects this (architecture i386 unknown), leaving the mkstemp object file empty, which ld.gold then rejects (file is empty). The Darwin path uses -sectcreate and is unaffected; x86-64 Linux happens to match the hardcoded values, which is why CI (x86-64 Linux + aarch64 macOS) never caught it.

  2. system() return value mis-checked. Both if (system(cmd.c_str()) < 0) guards only detect fork failure (system() returns -1); they miss a non-zero exit status. So the failing objcopy and the subsequent failing link are ignored and execution falls through to exit(0) — the build reports success with no artifact produced.
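The gap described in item 2 can be illustrated with a small sketch. On POSIX systems, system() returns -1 only when the child process could not be created; otherwise it returns a wait status, which is positive when the command exits non-zero. The helper name command_succeeded below is hypothetical and not taken from snapshotSaveLoad.cc, and decoding the status with WIFEXITED/WEXITSTATUS is one possible way to write the check, not necessarily what this PR does (the PR only says it detects any non-zero result).

```cpp
#include <cstdlib>
#include <string>
#include <sys/wait.h>

// Hypothetical helper (not part of Clasp): returns true only when the shell
// command was started, exited normally, and reported exit status 0.
bool command_succeeded(const std::string& cmd) {
  int status = std::system(cmd.c_str());
  if (status == -1) {
    return false; // the child process could not be created
  }
  if (!WIFEXITED(status)) {
    return false; // the command was terminated by a signal
  }
  return WEXITSTATUS(status) == 0; // a non-zero exit counts as failure
}
```

With this helper, a command such as `false` is reported as a failure, whereas a bare `system("false") < 0` test evaluates to false and lets execution continue.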

Fix

  • Select the objcopy --output-target / --binary-architecture by build architecture: __x86_64__ → elf64-x86-64/i386 (unchanged), __aarch64__ → elf64-littleaarch64/aarch64, #error otherwise.
  • Make both system() checks detect any non-zero result and exit(1) with the command and code, so a failed snapshot can no longer masquerade as success.

Testing

aarch64 Linux (Ubuntu 24.04, clang/llvm 18, boehmprecise variant): ninja snapshot-boehmprecise now produces a working ~212 MB ELF 64-bit … ARM aarch64 PIE executable that boots from its embedded snapshot and runs (smoke: *features* contains :clasp, fib(25)=75025, (compile …)+funcall work). x86-64 Linux behavior is unchanged (identical target/arch values as before).

🤖 Generated with Claude Code

On aarch64 Linux, (save-lisp-and-die ... :executable t) failed to produce a
working executable. Two bugs in snapshot_save_impl (snapshotSaveLoad.cc), in
the _TARGET_OS_LINUX _Executable path:

1. The objcopy that wraps the snapshot blob into an object hardcoded
   --output-target elf64-x86-64 --binary-architecture i386. On aarch64 objcopy
   rejects this ("architecture i386 unknown"), leaving the mkstemp object empty,
   which ld.gold then rejects ("file is empty"). Select the output target and
   binary architecture by build arch instead (elf64-littleaarch64/aarch64 on ARM64).

2. Both `if (system(cmd) < 0)` checks only detect fork failure, not a non-zero
   command exit, so the failing objcopy and link were ignored and execution fell
   through to exit(0) -- the build reported success with no artifact. Check for any
   non-zero result and exit(1) with the command and code.

Verified on aarch64 Linux (Ubuntu 24.04, llvm 18, boehmprecise): ninja
snapshot-boehmprecise now yields a working ARM64 PIE executable that boots from
its embedded snapshot. x86-64 Linux behavior is unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
dg1sbg commented Jun 1, 2026

The two failing checks here — cando/macos-latest/native and clasp/macos-latest/native — are not caused by this PR. They fail in the mp regression suite at PROCESS-ABORT-1, the same way every macOS-arm64 native run on main currently does. The other six jobs (Ubuntu native, both bytecode variants) pass, and this PR only touches the aarch64-Linux snapshot/objcopy path — exercised by the green Linux jobs.

Root cause is tracked in #1782: on macOS arm64, Clasp's native (cleavir-emitted) objects carry compact-unwind entries but no __eh_frame, so libunwind can't traverse native frames between a C++ throw and its catch. mp:abort-process depends on exactly that unwind, so it reaches std::terminate → abort instead of being caught in Process_O::runInner. Linux stays green because ELF retains .eh_frame.

So the red here is pre-existing and orthogonal to this change; it can be reviewed on the green jobs. (#1781 is a separate, related Apple-Silicon fix — JIT W^X literal writes — but does not address #1782.)
