Skip to content

Trampolines#1777

Open
Bike wants to merge 14 commits into
mainfrom
trampolines
Open

Trampolines#1777
Bike wants to merge 14 commits into
mainfrom
trampolines

Conversation

@Bike
Copy link
Copy Markdown
Member

@Bike Bike commented May 27, 2026

Adds a facility for giving bytecodes native "trampolines" so that they are visible by name in external backtraces (e.g. perf, gdb). Supersedes #1765. drmeister has done the work here, just filing a PR so I can review it.

Christian Schafmeisterr and others added 11 commits May 26, 2026 17:48
Arena-based trampolines (hand-coded x86_64), sampling profiler,
flame graph generation, trampoline-aware backtraces, command-line
extensions, and snapshot save/load support. Excludes bytecode
interpreter changes (computed gotos, VMDynRecord dynamic binding
stack) which remain on the interpreter-work branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Also demangle C++ function names
- New trampoline_aarch64.h paralleling trampoline_x86_64.h with
  hand-coded bytecode (36B) and GF (32B) trampolines using LDR
  from literal pool. Shared CIE and per-kind FDEs with full DWARF
  CFI for unwinding. Same instructions for Linux arm64 and Apple
  Silicon; macOS W^X support still needs MAP_JIT in ExecutableArena.
- Wire aarch64 templates into trampolineWork.cc via #elif __aarch64__.
- Demangle C++ symbols in sampling_profiler symbolicate_one via
  abi::__cxa_demangle for readable flame graph frames.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
At profile-start, snapshot /proc/self/maps (Linux) or dyld image list
(macOS) into a sorted executable-range cache. The SIGPROF handler
binary-searches each saved_rip during the frame-pointer walk: if the
address isn't in any executable mapping, the chain is broken (frame
compiled without -fno-omit-frame-pointer) and the walk stops cleanly
instead of following garbage pointers.

New JIT/arena pages registered dynamically via
sampling_profiler_add_executable_range(), called from
ExecutableArena::allocate() so trampolines mmap'd during profiling
are immediately recognized.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wraps a body form with sampling profiler start/stop and writes a
flame graph SVG on completion. Profiler is stopped and reset via
unwind-protect on any exit path.

  (ext:with-flame-profile ("/tmp/my.svg" :rate 197 :title "test")
    (expensive-work))

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The .dif file stored hash used to have to match the calculated hash
of the clasp_gc.sif file.  That's too restrictive
Copy link
Copy Markdown
Member Author

@Bike Bike left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mostly seems ok. some of the comments are false or misleading. I didn't read over the sampling profiler or flame graphs in great detail, but they're optional, so I'm not too concerned about those.

of course, tests are failing, so that's no good

/* Define a C++ GMP wrapper */

#pragma clang diagnostic push
#pragma clang diagnostic ignored "-Wdeprecated-literal-operator"
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment as to what this is for would be appreciated

* That profiler measures user-annotated regions; this one periodically
* snapshots whatever code is running.
*
* See Phase 4 / Phase 5 for post-mortem symbolication and flame-graph
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What phases are this referring to? (Probably phases of a Claude plan, but that's not apparent from the code, so.)

// Populate the calling thread's stack bounds for later frame-walking.
// Must be called from a non-signal context. sampling_profiler_start
// calls this automatically for the calling thread; other threads that
// should be fully profiled need to call ext:profile-register-thread
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think they do, since mpPackage.h does for all threads?

// name participates in the same "wrapper:name" -> unique-name substitution
// (the suffix "_end" survives unchanged), so each trampoline gets its own
// matching end marker symbol.
__attribute__((used, noinline)) void WRAPPER_END_MARKER() asm("wrapper:name_end");
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since we're hardcoding the instructions, none of this should be necessary, and i don't see it being used


return_type bytecode_call(uint64_t pc, void *closure, uint64_t nargs, void **args);
// Indirect through a global function pointer rather than calling bytecode_call
// directly. Each compiled trampoline's call topology is now identical
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"now identical" - claude seems prone to putting progress reports in comments, but they'd be more appropriate in commits

* the preceding CIE (= cie_size + 4).
* - FDE's PC range = code_size.
* Because every slot has the same layout, the bytes are identical across
* slots and a single memcpy is all that's required.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

phrased misleadingly - each copy needs to be patched with some addresses, so in the end they will have different bytes. And while there's only one explicit memcpy call, the template is copied into a std::vector first, which probably does the like of memcpy.

bool install_template(const uint8_t* tramp_bytes, size_t tramp_size,
const uint8_t* cie_bytes, size_t cie_len,
const uint8_t* fde_bytes, size_t fde_len) {
std::lock_guard<std::mutex> g(_init_lock);
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checking if _initialized is true first (presumably with acquire) should be faster than grabbing a lock every time we compile anything.


(defun join (list sep)
(with-output-to-string (out)
(loop for cell on list
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(loop for (item . rest) on list do (write-string item out) when rest do (write-char sep out))

}

// Read the interrupted instruction pointer out of the context structure.
// x86_64 only for now — portable to arm64 when we need it.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so... not portable?


thread_local ThreadStackBounds t_stack_bounds{0, 0, false};

static void populate_stack_bounds_for_this_thread() {
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I already do almost exactly this for scanning in the mmtk branch. so I guess that'll have to get merged eventually

Christian Schafmeisterr and others added 3 commits June 1, 2026 08:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants