Trampolines #1777
Arena-based trampolines (hand-coded x86_64), sampling profiler, flame graph generation, trampoline-aware backtraces, command-line extensions, and snapshot save/load support. Excludes bytecode interpreter changes (computed gotos, VMDynRecord dynamic binding stack), which remain on the interpreter-work branch.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Also demangle C++ function names
- New trampoline_aarch64.h paralleling trampoline_x86_64.h with hand-coded bytecode (36B) and GF (32B) trampolines using LDR from literal pool. Shared CIE and per-kind FDEs with full DWARF CFI for unwinding. Same instructions for Linux arm64 and Apple Silicon; macOS W^X support still needs MAP_JIT in ExecutableArena.
- Wire aarch64 templates into trampolineWork.cc via #elif __aarch64__.
- Demangle C++ symbols in sampling_profiler symbolicate_one via abi::__cxa_demangle for readable flame graph frames.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
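For readers unfamiliar with the demangling step mentioned in this commit, here is a minimal sketch of how abi::__cxa_demangle (declared in <cxxabi.h>) is typically used. The helper name demangle_or_raw is hypothetical and is not taken from the PR; the actual symbolicate_one code may be structured differently.

```cpp
#include <cxxabi.h>
#include <cstdlib>
#include <string>

// Hypothetical helper, not the PR's code: returns the demangled form of
// `symbol` when it is a valid mangled C++ name, otherwise the input unchanged.
inline std::string demangle_or_raw(const char* symbol) {
  if (symbol == nullptr) return std::string();
  int status = 0;
  // With a null output buffer, __cxa_demangle allocates the result with
  // malloc; the caller is responsible for freeing it.
  char* demangled = abi::__cxa_demangle(symbol, nullptr, nullptr, &status);
  if (status != 0 || demangled == nullptr) {
    std::free(demangled);  // free(nullptr) is a no-op
    return std::string(symbol);
  }
  std::string result(demangled);
  std::free(demangled);
  return result;
}
```

Because the call allocates, a helper like this belongs in post-mortem symbolication rather than inside a signal handler.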
At profile-start, snapshot /proc/self/maps (Linux) or the dyld image list (macOS) into a sorted executable-range cache. The SIGPROF handler binary-searches each saved_rip during the frame-pointer walk: if the address isn't in any executable mapping, the chain is broken (frame compiled without -fno-omit-frame-pointer) and the walk stops cleanly instead of following garbage pointers.
New JIT/arena pages are registered dynamically via sampling_profiler_add_executable_range(), called from ExecutableArena::allocate(), so trampolines mmap'd during profiling are immediately recognized.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
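The lookup described in this commit can be sketched roughly as follows. This is an illustration only: ExecRange, sort_ranges, and is_executable are hypothetical names, and the sketch assumes the ranges are non-overlapping and half-open. It does not cover how the real cache is populated or how new ranges are published safely to a running signal handler.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical representation of one executable mapping: [start, end).
struct ExecRange {
  uintptr_t start;  // inclusive
  uintptr_t end;    // exclusive
};

// Sort by start address so lookups can binary-search. Intended to run at
// profile start, outside any signal handler.
inline void sort_ranges(std::vector<ExecRange>& ranges) {
  std::sort(ranges.begin(), ranges.end(),
            [](const ExecRange& a, const ExecRange& b) { return a.start < b.start; });
}

// Hand-written binary search with no allocation or library calls, so it is
// plausible to call from a signal handler. Assumes sorted, non-overlapping ranges.
inline bool is_executable(const ExecRange* ranges, size_t count, uintptr_t addr) {
  size_t lo = 0, hi = count;
  while (lo < hi) {
    size_t mid = lo + (hi - lo) / 2;
    if (ranges[mid].start <= addr) lo = mid + 1;
    else hi = mid;
  }
  // lo is now the number of ranges whose start is <= addr; the candidate
  // containing range, if any, is the last of those.
  return lo > 0 && addr < ranges[lo - 1].end;
}
```

A frame walk would call is_executable on each saved return address and stop at the first miss.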
Wraps a body form with sampling profiler start/stop and writes a
flame graph SVG on completion. Profiler is stopped and reset via
unwind-protect on any exit path.
(ext:with-flame-profile ("/tmp/my.svg" :rate 197 :title "test")
(expensive-work))
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The hash stored in the .dif file used to have to match the calculated hash of the clasp_gc.sif file. That's too restrictive.
| /* Define a C++ GMP wrapper */
|
| #pragma clang diagnostic push
| #pragma clang diagnostic ignored "-Wdeprecated-literal-operator"
A comment as to what this is for would be appreciated.
| * That profiler measures user-annotated regions; this one periodically
| * snapshots whatever code is running.
| *
| * See Phase 4 / Phase 5 for post-mortem symbolication and flame-graph
What phases is this referring to? (Probably phases of a Claude plan, but that's not apparent from the code, so.)
| // Populate the calling thread's stack bounds for later frame-walking.
| // Must be called from a non-signal context. sampling_profiler_start
| // calls this automatically for the calling thread; other threads that
| // should be fully profiled need to call ext:profile-register-thread
I don't think they do, since mpPackage.h does this for all threads?
| // name participates in the same "wrapper:name" -> unique-name substitution
| // (the suffix "_end" survives unchanged), so each trampoline gets its own
| // matching end marker symbol.
| __attribute__((used, noinline)) void WRAPPER_END_MARKER() asm("wrapper:name_end");
Since we're hardcoding the instructions, none of this should be necessary, and I don't see it being used.
| return_type bytecode_call(uint64_t pc, void *closure, uint64_t nargs, void **args);
| // Indirect through a global function pointer rather than calling bytecode_call
| // directly. Each compiled trampoline's call topology is now identical
"now identical" - claude seems prone to putting progress reports in comments, but they'd be more appropriate in commits
| * the preceding CIE (= cie_size + 4).
| * - FDE's PC range = code_size.
| * Because every slot has the same layout, the bytes are identical across
| * slots and a single memcpy is all that's required.
This is phrased misleadingly: each copy needs to be patched with some addresses, so in the end they will have different bytes. And while there's only one explicit memcpy call, the template is copied into a std::vector first, which probably does something like memcpy.
| bool install_template(const uint8_t* tramp_bytes, size_t tramp_size,
| const uint8_t* cie_bytes, size_t cie_len,
| const uint8_t* fde_bytes, size_t fde_len) {
| std::lock_guard<std::mutex> g(_init_lock);
Checking if _initialized is true first (presumably with acquire) should be faster than grabbing a lock every time we compile anything.
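To make the suggestion concrete, here is a minimal sketch of the check-then-lock pattern, assuming _initialized can be made a std::atomic<bool>. The class TemplateInstaller, the method install_once, and the counter are hypothetical stand-ins; what install_template actually returns and does after initialization is not shown in the quoted diff.

```cpp
#include <atomic>
#include <mutex>

// Hypothetical stand-in for the class that owns install_template.
class TemplateInstaller {
public:
  // Returns true only for the call that actually performs the installation.
  bool install_once() {
    // Fast path: an acquire load pairs with the release store below, so a
    // thread that sees `true` also sees everything written during install.
    if (_initialized.load(std::memory_order_acquire)) return false;

    std::lock_guard<std::mutex> g(_init_lock);
    // Re-check under the lock: another thread may have finished while this
    // one was waiting. The mutex already orders this load.
    if (_initialized.load(std::memory_order_relaxed)) return false;

    ++_install_count;  // placeholder for copying and patching the template
    _initialized.store(true, std::memory_order_release);
    return true;
  }

  int install_count() const { return _install_count; }

private:
  std::atomic<bool> _initialized{false};
  std::mutex _init_lock;
  int _install_count = 0;
};
```

After the first successful install, later callers take only the atomic load and never touch the mutex.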
| (defun join (list sep)
| (with-output-to-string (out)
| (loop for cell on list
(loop for (item . rest) on list do (write-string item out) when rest do (write-char sep out))
| }
|
| // Read the interrupted instruction pointer out of the context structure.
| // x86_64 only for now — portable to arm64 when we need it. |
|
| thread_local ThreadStackBounds t_stack_bounds{0, 0, false};
|
| static void populate_stack_bounds_for_this_thread() {
I already do almost exactly this for scanning in the mmtk branch, so I guess that'll have to get merged eventually.
Rerun analyze once we move back to main
Adds a facility for giving bytecodes native "trampolines" so that they are visible by name in external backtraces (e.g. perf, gdb). Supersedes #1765. drmeister has done the work here, just filing a PR so I can review it.