[cc] memory ordering Updates by jgarzik · Pull Request #581 · rustcoreutils/posixutils-rs

jgarzik · 2026-06-01T02:10:45Z

No description provided.

First commit in the per-instruction operand/clobber constraint system milestone. No new constraint vocabulary yet — just plumbs two new sources of clobbers through the existing ConstraintPoint mechanism so the chordal allocator forbids them. 1. **Inline-asm explicit clobbers.** `Opcode::Asm` now produces constraints from `AsmData.clobbers`. Named GP register strings (rax/eax/al/r10d/etc on x86_64; x0/w0/fp/lr/etc on aarch64) are mapped to `Reg` via per-arch `parse_gp_clobber_name`. `"memory"` and `"cc"` are filtered out (the former gets full barrier semantics in C3; the latter is a no-op since we don't track condition-code liveness). 2. **Libc-call-emitting opcodes.** `find_call_positions` is parameterised on an `is_call_like(Opcode) -> bool` predicate. x86_64 extends the base set with `Fabs32/64`, `Signbit32/64`, `Memset/Memcpy/Memmove`; aarch64 extends with the same FP builtins (its Memset/Memcpy/Memmove reach the regular `Opcode::Call` path so are already covered). These IR opcodes lower to direct `Bl`/`Call` in `features.rs`, but linear scan hid the latent bug by never picking the at-risk regs; chordal coloring would expose it. aarch64 `get_constraint_info_aarch64` is built out (previously returned None for every instruction). Tests added: - x86_64 unit tests for `parse_gp_clobber_name` (size aliases, leading `%`, case insensitive, unknown/cc/memory rejection) - aarch64 unit tests for the same + special names (fp/lr/sp/xzr). - Both `is_call_like_*` predicates anchored so a future codegen change that adds a new builtin->libc lowering doesn't silently drop out. - `codegen_gp_live_across_inline_asm_clobber` defensive E2E test in cc/tests/codegen/misc.rs. Verified: - cc test suite 926+960+206 pass (13 new tests) - cargo clippy --all-targets clean - cargo fmt --check clean - CPython 3.12.9 -O2 449/449 (40,817 individual tests, 30m02s) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Second commit in the constraint-system milestone. Introduces the per-operand vocabulary that C3 will start consuming, but lowers it back to the existing `ConstraintPoint` for now so allocator behaviour is unchanged. New module: `cc/arch/asm_constraints.rs` defines enum OperandConstraint<R> { Any | Fixed(R) | Match(usize) | Mem | Imm } enum OperandKind { Use | Def | UseDef | EarlyClobber } struct InstrConstraints<R> { operands, clobbers, memory_barrier } struct OperandSpec<R> { pseudo, kind, constraint } enum ConstraintParseError { Empty, EarlyClobberOnInput, MultiAlternative, UnknownLetter } fn parse_constraint(s, letter_map) -> Result<(kind, con), error> Supported syntax (C2 scope): `r`/`w` (Any), `=r`/`+r`/`&=r` (Def/UseDef/EarlyClobber), `0`–`9` (Match), `m`/`Q` (Mem), `i`/`n` (Imm), plus per-arch Fixed letters. Multi-alternative constraints (`rm`/`ri`/`g`) and the rare per-arch letters (`q`/`Q`/`R`/`l`/`t`/`I`/`J`/...) are out of scope and produce a diagnostic — they become a follow-up after C5. Per-arch resolvers: - `parse_x86_64_fixed_letter` maps `a/b/c/d/S/D` → Rax/Rbx/Rcx/Rdx/Rsi/Rdi. - `parse_aarch64_fixed_letter` returns None for every letter (AAPCS64 has no C2-scope Fixed letters; all aarch64 register operands use `r` or `w`). The function exists for symmetry and as the future-extension point. Integration: x86_64 and aarch64 `get_constraint_info` now route inline asm through `build_asm_instr_constraints_*` → `lower_instr_constraints_to_constraint_point_*`. The lowering promotes Fixed operands into the clobber set and notes Match operands (for C3's coalescing edges) and EarlyClobber (for C3's extra interference). `memory_barrier` is captured but consumed in C3. 11 new unit tests in `cc/arch/asm_constraints.rs` cover every modifier+letter combination, EarlyClobber semantics, parse errors, and rejected multi-alternative strings. Verified: - cc test suite 937+971+206 pass (11 new parser tests) - cargo clippy --all-targets clean - cargo fmt --check clean - CPython 3.12.9 -O2 449/449 (40,817 individual tests, 28m37s) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Third commit in the constraint-system milestone. Wires the chordal allocator to consume `OperandConstraint::Fixed(R)` so the asm operand pseudo lands directly in the constraint-required register, eliminating the codegen's "move into fixed register around asm" defensive pattern in the common case. New per-arch helper `collect_asm_fixed_precolors_{x86_64,aarch64}` walks `func.blocks` for `Opcode::Asm`, builds the C2 `InstrConstraints` view, and collects every operand carrying a `Fixed(R)` constraint. The result is folded into chordal coloring via the `pre_colored` map in `color_gp_bank`. One gotcha caught and documented: Phase 3's commit loop skips pre-colored vertices on the assumption that their location is already in `self.locations` (true for ABI-pinned args, which `allocate_arguments` inserts before chordal runs). Inline-asm Fixed pre-colors arrive in `color_gp_bank` without going through `allocate_arguments`, so they also need a direct `self.locations.insert`. If missed, `get_location` returns `Loc::Imm(0)` as the default and every Store/Load involving the operand silently writes/reads zero — caught by `codegen_inline_asm_x86_64_mega::SPECIFIC_REGISTERS` returning 60 before the fix was added. Tests: - `codegen_inline_asm_x86_64_fixed_register_precolor` exercises every Fixed letter in C2 scope (a, b, c, d, S, D) in both `=` and bare positions. - `codegen_inline_asm_x86_64_asm_goto_pseudo_survives_edge` documents and locks in the existing asm-goto CFG plumbing. The linearizer in `cc/ir/linearize_stmt.rs:2100-2114` already `link_bb`s the asm block to every goto target plus the fall-through, so liveness propagates correctly. The plan worried this might be missing; verified that it's not. Out of scope for this commit (deferred follow-ups): - `OperandConstraint::Match(idx)` coalescing — the existing linearizer emits an explicit Copy from input pseudo to the matched output pseudo so they share a value at the IR level; the allocator already places them in the same location without needing an explicit interference-graph coalescing pass. - `InstrConstraints.memory_barrier` strict semantics — pcc's mem2reg is already in place (`cc/ir/mem2reg.rs` runs after `cc/ir/ssa.rs::ssa_convert` at `cc/ir/linearize.rs:1089`), so promoted locals have no stack slot for an asm `"memory"` clobber to invalidate, and memory-resident data (addr-taken, volatile, atomic, non-scalar) is always accessed through explicit `Load` / `Store` IR ops that pcc doesn't reorder across an asm. Strict barrier semantics are therefore already observed today; `memory_barrier` is captured on `InstrConstraints` for completeness and to anchor future passes that might add memory-access reordering (LICM, GVN, SCCP-style memory dataflow) and need an explicit barrier fence point. - Codegen rewrite that drops the mov-around-asm fallback. The existing fallback in `emit_inline_asm` is now a no-op for pre-colored operands but still handles the spilled-operand case correctly. Removing it is a quality improvement, not needed for correctness. aarch64 plumbing wired through for symmetry. AAPCS64 has no C2-scope Fixed letters today so `collect_asm_fixed_precolors_aarch64` always returns an empty map — but the future-extension point is in place. Verified: - cc test suite 937+971+207 pass (2 new tests) - cargo clippy --all-targets clean - cargo fmt --check clean - CPython 3.12.9 -O2 449/449 (40,817 individual tests, 30m09s) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Fourth commit in the constraint-system milestone. Lands the per-IR-opcode constraint-declaration infrastructure for freeing R10 and R11 from the reserved-scratch list, but does NOT yet add them to `Reg::allocatable()` — see the deferred-freeing block in the source for the empirical reasons why. `opcode_clobbers_r10_r11(op)` is a conservative predicate: every IR opcode whose codegen helper is more than a single instruction is assumed to touch R10 and/or R11. The truly clean opcodes (`Br`, `Nop`, `Phi`, `PhiSource`, `Unreachable`, `Fence`, `VaEnd`) are explicitly excluded. The predicate feeds `get_constraint_info`, so a constraint point with R10/R11 in its clobber set is built for every dirty IR instruction. The integration into `get_constraint_info` covers both the inline-asm path (already routed through `build_asm_instr_constraints_x86_64` from C2) and the regular opcode-level path. Clobber sets are merged from `opcode_constraints` (idiv/shift/vaarg hardware constraints) and the new R10/R11 entries, then sorted/deduped. **R10/R11 freeing is deferred — full rationale documented as a block comment immediately following `opcode_clobbers_r10_r11`.** TL;DR: per-IR-opcode constraints don't cover the inter- instruction codegen paths (function prologue rdi/rcx save through R10/R11 around `rep stosq`; variadic save area; spilled-arg prologue restore; FP / struct lowerings spanning multiple LIR emissions per IR instruction). Adding R10/R11 to `allocatable()` even with broad clobber declarations still segfaults CPython's `_bootstrap_python` at deepfreeze generation. Genuine freeing requires either a codegen refactor (~4000–8000 LOC) or a pre-IR scratch-declaration side table (~1500–2500 LOC) — both are multi-session milestones in their own right. This commit ships the predicate + integration so that when either prerequisite lands, adding R10/R11 to `Reg::allocatable()` is the only change needed to start producing forbidden-set entries. XMM14/XMM15 freeing was also out of scope: the chordal allocator's `color_xmm_bank` doesn't yet consume constraint points at all (`ConstraintPoint<XmmReg>` is unused). Building the XMM constraint plumbing is the structural prerequisite for their freeing. Verified: - cc test suite 937+971+208 pass - cargo clippy --all-targets clean - cargo fmt --check clean - CPython 3.12.9 -O2 449/449 (40,817 individual tests, 28m33s) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Fifth and final commit in the constraint-system milestone. Mirrors C4's x86_64 work for aarch64. New `opcode_clobbers_aarch64_scratches(op)` predicate uses the identical exclusion list as the x86_64 mirror (Br, Nop, Phi, PhiSource, Unreachable, Fence, VaEnd). `AARCH64_SCRATCH_REGS` lists the registers covered: X9, X10, X11 (the documented `Reg::scratch_regs()` triple) plus X16, X17 (AAPCS64 IP0/IP1, reused by the pair-address legalizer from commit `6af088eb`). `get_constraint_info_aarch64` is extended along the x86_64 shape: inline-asm path folds in the scratch clobbers on top of the C2 operand vocabulary, and the non-asm path emits a ConstraintPoint for any dirty opcode. **Freeing is deferred** — the deferred-freeing block in source documents the path forward. Same reasons as x86_64's C4 (inter-instruction codegen paths in the prologue, callee-saved save/restore, pair-address legalizer's X16 materialization, and int128 lo/hi shuttle). V16/V17/V18 freeing has the additional prerequisite of V-bank ConstraintPoint plumbing through `color_vreg_bank`. Adding the registers to `Reg::allocatable()` once the codegen refactor lands is the only remaining change. Verified: - cc test suite 937+971+208 pass - cargo clippy --all-targets clean - cargo fmt --check clean - CPython 3.12.9 -O2 449/449 (40,817 individual tests, 29m52s) This completes the constraint-system milestone (C1-C5). Summary of the five commits: - C1 (0f58a71): Source inline-asm + libc-call clobbers into ConstraintPoint. Plumbs new clobber sources through existing mechanism; fixes latent FP-pseudo-across-libc-call bug. - C2 (fd034c7): Per-operand inline-asm constraint vocabulary + GCC syntax parser. Introduces OperandConstraint / OperandKind / InstrConstraints. Lowers to ConstraintPoint for back-compat. - C3 (05be72d9): Allocator-aware inline-asm Fixed-operand pre-coloring. Asm operands with `"a"`/`"=a"` etc. land directly in the constraint-required register. - C4 (4f9697b9): x86_64 scratch-clobber declaration infrastructure for R10/R11; freeing deferred. - C5 (this commit): aarch64 mirror for X9/X10/X11/X16/X17; freeing deferred. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

First commit in the inline-asm memory-barrier milestone. Establishes the consumer-facing surface and the structural invariant; no codegen or allocator behavior change yet. The constraint-system C2/C3 commits already plumbed `InstrConstraints.memory_barrier` from the asm `"memory"` clobber through both `build_asm_instr_constraints_{x86_64,aarch64}` builders. What was missing: (a) a way for IR-pass code to ask the question without going through the per-arch constraint builder, (b) explicit documentation of the no-reorder invariant the existing optimizer passes rely on, and (c) a validator check that catches accidental drift between the barrier predicate and the DCE-root predicate. Three pieces: 1. `Instruction::is_memory_barrier()` returns true for: - `Opcode::Asm` with `"memory"` in its clobber list — the `pause`/`yield` spin-loop, ticket-lock acquire, and `__sync_synchronize`-style fence idioms. - `Opcode::Fence` — explicit C11 `atomic_thread_fence`. - `Opcode::Atomic*` — every atomic memory op (Relaxed included; finer-grained one-sided predicates can refine later when a pass needs the distinction). - `Opcode::Call` — pcc has no escape/alias analysis, so any external call must be treated as a possible memory access on any reachable location. Conservative. - `Opcode::Setjmp` / `Opcode::Longjmp` — non-local control flow. The predicate is the surface future memory-reordering passes (GVN, LICM, load-store forwarding, machine scheduler) will consult before crossing any of the above. 2. Module-level no-reorder contract documentation added to `cc/ir/dce.rs` and `cc/ir/instcombine.rs`. The audit before this commit confirmed that DCE only deletes Nops (never reorders) and that InstCombine only rewrites Add/Sub/Mul/Div/Mod/Shl/Lsr/Asr/ And/Or/Xor/SetEq/.../Neg/Not in place (never touches Load, Store, Asm, Call, Atomic*, Fence). Today the `"memory"` clobber "works" because no pass reorders memory ops at all — the contract codifies the implicit invariant so any future change that does start reordering must consult `is_memory_barrier()`. 3. New IR-validator invariant `I2 — MEMORY BARRIERS ARE DCE ROOTS`. Every instruction that satisfies `is_memory_barrier()` must also have `op.has_side_effects() == true`, or DCE would silently delete it. Today the two predicate sets align by construction; I2 surfaces drift immediately in debug builds rather than waiting for a memory-reordering pass to miscompile a real spinlock. New `ValidationError::BarrierWithoutSideEffect` variant. Validator runs only under `debug_assertions` via the existing `opt::optimize_module` debug_assert call site; release-mode pcc (CPython, all other -O builds) skips it for zero overhead. Tests: - `is_memory_barrier` cell at `cc/ir/mod.rs`: pure ops are not barriers; Fence + every Atomic* return true; Call/Setjmp/Longjmp return true; asm with `"memory"` returns true; asm with `"cc"` / named reg / empty clobbers / no asm_data return false. - `build_asm_instr_constraints_x86_64_propagates_memory_barrier` and aarch64 mirror: build the per-arch InstrConstraints from a hand-built asm Instruction and assert `memory_barrier` matches the clobber list. - `i2_barrier_implies_side_effect_structural` walks every barrier opcode and asserts the implication holds — a meta-test that catches drift if either predicate set changes. - `i2_asm_with_memory_clobber_validates`: end-to-end positive case through `validate_function`. Verified: - `cargo build --release` clean - `cargo test --release -p posixutils-cc`: 946 + 980 + 208 pass (+5 new tests; was 941 + 975 + 208 in C5) - `cargo clippy --all-targets` zero warnings - `cargo fmt --all -- --check` clean - Partial CPython -O2 rebuild (`Python/ceval.o`) byte-size identical to prior pcc — release-mode codegen unchanged, as expected given C6a touches only debug-only validation + new helper method. Full CPython -O2 test pass is structural-only and not at risk; will be re-verified at end of C6 milestone (C6b lock-in tests). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Second commit in the inline-asm memory-barrier milestone. C6a established the consumer-facing predicate `Instruction::is_memory_barrier()` and documented the no-reorder contract in `cc/ir/dce.rs` and `cc/ir/instcombine.rs`. C6b adds the test moat that locks the contract in end-to-end so a future GVN / LICM / load-store forwarding / machine-scheduler regression fails loudly with the wrong return code rather than silently breaking a real kernel-style spinlock under -O2. Five new tests in `cc/tests/codegen/inline_asm.rs`, each exercising a different heavily-used `asm volatile("..." ::: "memory")` idiom and each run at both default and -O1 to catch the optimizer-only regression case: 1. `codegen_inline_asm_memory_barrier_pointer_aliased_reload` Plain pointer-aliased store-store-load across an empty compiler-barrier asm. A pass that store-forwarded the stale value across the barrier would observe 42 instead of 99. 2. `codegen_inline_asm_memory_barrier_spin_wait_pattern` `pause` (x86_64) / `yield` (aarch64) spin loop polling a non-volatile location through a pointer. The memory-clobber asm inside the loop body must force `*p` to reload each iteration; if a hoisting pass lifted the load out of the loop, the loop would never terminate and `iters > 10` would trip. 3. `codegen_inline_asm_memory_barrier_full_fence` `mfence` (x86_64) / `dmb ish` (aarch64) full fence between two stores to the same location. The post-fence load must observe the second store, not the first. 4. `codegen_inline_asm_memory_barrier_double_checked_init` Classic DCI pattern: release barrier between `value = 42` and `initialized = 1` in the slow path; acquire barrier in the reader after seeing `initialized == 1`. Single-threaded so we're only locking in compiler-level ordering (which is exactly what's at risk from future passes), but the pattern matches the real cross-thread idiom one-for-one. 5. `codegen_inline_asm_memory_barrier_sync_synchronize_pattern` `__sync_synchronize`-equivalent: a single fence between two unrelated store-then-verify pairs. A pass that reordered the second store-then-verify across the fence would observe the wrong value. The tests pass today on both x86_64 and aarch64 because no IR pass currently reorders memory ops at all — they're forward infrastructure. When the broader regalloc milestone moves into M8 / M9 / M10 (which add real reordering capabilities), any pass that drops a load, forwards a store, or hoists a load past an `is_memory_barrier()` instruction will trip one of these tests. Verified: - `cargo build --release` clean - `cargo test --release -p posixutils-cc`: 946 + 980 + 213 pass (+5 new tests; was 946 + 980 + 208 in C6a) - `cargo clippy --all-targets` zero warnings - `cargo fmt --all -- --check` clean Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

This PR formalizes memory-ordering semantics in the IR and pushes inline-asm constraints into the register allocator. It adds an IR-level Instruction::is_memory_barrier() (asm "memory" clobber, Fence, Atomic*, Call, Setjmp, Longjmp), wires a validator invariant (I2) that every barrier must also be a DCE root via has_side_effects(), and introduces a shared cc/arch/asm_constraints.rs parser for GCC-style constraint strings. The x86_64 and aarch64 backends now thread asm clobbers/operands through get_constraint_info, pre-color Fixed operands in the chordal allocator, extend call-like detection to backend-emitted libc calls (Fabs*, Signbit*, Memcpy/Memset/Memmove), and replace the unsafe usize::MAX spill-slot drain with a monotonic-by-start commit. New runtime tests cover asm clobbers, asm-goto liveness, and fixed-register precoloring.

Changes:

Add Instruction::is_memory_barrier() + validator I2 (barrier ⇒ side-effect) with structural meta-test.
Add shared GCC inline-asm constraint vocabulary/parser and per-arch build_asm_instr_constraints_* / pre-color collectors; wire them into x86_64 and aarch64 chordal coloring.
Replace usize::MAX spill-slot drain with monotonic per-interval expiration to fix a within-block slot-reuse miscompile; extend find_call_positions with per-arch call-like predicates covering libc-emitting opcodes.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.

Show a summary per file

File	Description
cc/ir/mod.rs	Adds `is_memory_barrier()` + unit tests covering pure ops, atomics, calls, asm clobbers.
cc/ir/validate.rs	Adds I2 invariant + `BarrierWithoutSideEffect` error variant and tests.
cc/ir/dce.rs, cc/ir/instcombine.rs	Documents the memory-ordering contract these passes implicitly preserve.
cc/arch/mod.rs, cc/arch/asm_constraints.rs	New shared inline-asm constraint parser and types.
cc/arch/regalloc.rs	`find_call_positions` now takes a per-arch `is_call_like` predicate.
cc/arch/x86_64/regalloc.rs	Asm clobber parsing, Fixed-operand precoloring, R10/R11 scratch declaration scaffolding, monotonic spill commit, libc-call list, tests.
cc/arch/aarch64/regalloc.rs	Mirror of x86_64 changes (scratch list X9–X11/X16/X17, asm clobber parser, precolor plumbing, monotonic spill commit).
cc/tests/codegen/inline_asm.rs	Adds asm-goto liveness test and fixed-register pre-color regression test.
cc/tests/codegen/misc.rs	Adds defensive GP-live-across-asm-clobber test (passes both with/without C1).

Copilot review of the C3 commit (PR comment at cc/arch/x86_64/regalloc.rs lines 1569–1580) flagged a real allocator/codegen split-view bug in `color_gp_bank`: pre_colored.entry(pid).or_insert(reg); // ABI pin wins ... self.locations.insert(pid, Loc::Reg(reg)); // overwritten unconditionally! When a pseudo is already pre-colored — either by an earlier inline-asm Fixed operand or, more importantly, by `allocate_arguments` (SysV pins int args 1..6 to RDI/RSI/RDX/RCX/ R8/R9) — `or_insert` correctly keeps the existing register in `pre_colored`. But the subsequent `self.locations.insert` was unconditional, overwriting the location to the asm-requested register. Chordal coloring (which reads `pre_colored`) and codegen (which reads `self.locations`) would then disagree: coloring would place the value in the prior pin, codegen would emit Load/Store against the new asm register, observing zero. Fix: capture the value `or_insert` actually committed to `pre_colored` and mirror exactly that into `self.locations`. Idempotent when the prior pin equals the asm pin; preserves the prior pin when they differ. Identical fix applied to the aarch64 mirror, where the bug is latent (AAPCS64 has no Fixed letters in the C2 vocabulary, so `collect_asm_fixed_precolors_aarch64` returns empty; but the fix matches the x86_64 commit byte-for- byte to anchor the contract). Tests: - `codegen_inline_asm_x86_64_fixed_precolor_collides_with_abi_pin` exercises the path: function params used directly as `"+S"` / `"+d"` / `"+a"` asm operands in their ABI-pinned register slot. Today's linearizer stores ABI args to local stack slots at entry and asm operands load through a fresh pseudo, so the bug doesn't trigger end-to-end on current pcc — the test can't fail under the broken code today. But it anchors the codegen path for future arg-promotion passes that would make the arg pseudo and the asm operand identical, at which point a silent regression would otherwise have no compile-time signal. Run at both default `-O -g` and -O1. Verified: - `cargo build --release` clean - `cargo test --release -p posixutils-cc`: 946 + 980 + 214 pass (+1 new test; was 946 + 980 + 213 in C6b) - `cargo clippy --all-targets` zero warnings - `cargo fmt --all -- --check` clean Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot flagged that both `build_asm_instr_constraints_x86_64` and `_aarch64` carry a `let _ = OperandKind::Use;` statement ostensibly silencing an unused-variant warning. The comment is wrong: `OperandKind::Use` is constructed by `parse_constraint` in `cc/arch/asm_constraints.rs:159` (the default kind for non-output, non-early-clobber operands), so the variant is actively used. The silencer line is dead code; removing it along with the now-unused `OperandKind` import. Verified: - `cargo build --release` clean - `cargo test --release -p posixutils-cc`: 946 + 980 + 214 pass - `cargo clippy --all-targets` zero warnings - `cargo fmt --all -- --check` clean Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot flagged `None?` as the fallthrough arm of `parse_gp_clobber_name` in `cc/arch/aarch64/regalloc.rs:708`. The x86_64 mirror at line 448 uses the more idiomatic `return None`. Aligning aarch64 to match. Verified: - `cargo build --release` clean - `cargo test --release -p posixutils-cc`: 946 + 980 + 214 pass - `cargo clippy --all-targets` zero warnings - `cargo fmt --all -- --check` clean Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot flagged that the inline-asm branch in `get_constraint_info` unconditionally augments the clobber set with R10/R11 via `opcode_clobbers_r10_r11`, but the explanatory comment immediately below only applies to the non-asm path. The augmentation is real and intentional (not unintentional): `emit_inline_asm` in `cc/arch/x86_64/codegen.rs` uses R10/R11 for operand shuffling into Fixed-letter registers, for `find_temp_reg` fallbacks when allocated regs collide with the reserved-scratch set, and for input/output spill helpers around the asm body. Today's reserved- scratch convention makes this a no-op; when C4's freeing lands, the augmentation becomes load-bearing for inline asm too. Added an explanatory comment at the asm branch covering this. No behavior change; doc-only. Verified: - `cargo build --release` clean - `cargo test --release -p posixutils-cc`: 946 + 980 + 214 pass - `cargo clippy --all-targets` zero warnings - `cargo fmt --all -- --check` clean Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

First commit in the C9 milestone — full GCC-style inline-asm constraint vocabulary. C9a is the parser-only piece: it adds the `OperandConstraint::Alternatives(Vec<_>)` variant and teaches `parse_constraint` to build it from multi-character bodies like `"rm"`, `"ri"`, `"rmi"`, and `"g"` (sugar for `"rmi"`). The allocator and codegen wiring lands in C9b; this commit ships zero behavior change beyond letting these constraint strings parse without an error. Why now: portable kernel/UAPI headers (Linux `arch/x86/include/`, glibc internals, FreeBSD sys/) routinely use `"rm"` to let the compiler pick register vs memory based on what's cheapest at the call site. Before C9 these compiled with a hard parse error; after C9 they're accepted with a sensible default until the codegen learns to actually exploit the flexibility. Parser grammar (updated docstring): modifiers ::= '&'? ('=' | '+')? constraint ::= modifiers body body ::= class_letter+ | single_letter class_letter ::= 'r' | 'w' | 'm' | 'Q' | 'i' | 'n' | 'g' single_letter ::= class_letter | '0'..'9' | <Fixed letter> Multi-character bodies build an `OperandConstraint::Alternatives` whose inner list is restricted to `Any` / `Mem` / `Imm` — Fixed letters and Match digits cannot appear inside (GCC's `"ra"` / `"r0"` are nonsensical and rejected with `AlternativeWithFixed`). Implementation notes: - The single-character fast path still routes Fixed letters and Match digits through to their existing variants — no behavior change for C2/C3 code. - `g` is special-cased to `Alternatives([Any, Mem, Imm])`; when it appears inside a multi-char body it's flattened. - Duplicates dedup (e.g. `"rr"` → `Any`, `"rmm"` → `[Any, Mem]`). - The old `ConstraintParseError::MultiAlternative` variant is replaced by the more specific `AlternativeWithFixed(char, _)` for the only remaining error case (class + Fixed/Match mix). - `lower_instr_constraints_to_constraint_point` in both backends gets a new `Alternatives(_)` arm that's a no-op — by construction these alternatives don't add allocator clobbers. C9b will pick an alternative based on operand fit and feed it into emit_inline_asm. Tests: +9 new parser unit tests in `cc/arch/asm_constraints.rs`: - `parse_multi_alt_{rm,ri,rmi}` — each common combo - `parse_g_is_sugar_for_rmi` — equivalence of `g` and `rmi` - `parse_multi_alt_g_flattens` — `rg` produces the flattened set - `parse_multi_alt_dedup` — `rr` / `rmm` collapse correctly - `reject_alternative_with_fixed_letter` — `ra` / `=mb` rejected - `reject_alternative_with_match_digit` — `r0` rejected - `single_char_fixed_still_works` — C3 path preserved - `single_char_match_still_works` — Match digits preserved Verified: - `cargo build --release` clean - `cargo test --release -p posixutils-cc`: 955 + 989 + 214 pass (+9 new parser tests; was 946 + 980 + 214) - `cargo clippy --all-targets` zero warnings - `cargo fmt --all -- --check` clean CPython smoke test deferred until C9b (when the new variant actually influences codegen — today it can't reach Alternatives from any real C source because the front end's inline-asm path hasn't started exercising multi-char bodies; this commit just sets up the foundation). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Second commit in C9. C9a taught the parser to build `OperandConstraint::Alternatives`; C9b wires the linearizer and the x86_64 codegen to honor "register OR memory" and "register OR immediate" semantics end-to-end. The user-visible result: code like `__asm__("addl $1, %0" : "+rm"(x))` — the canonical kernel/UAPI multi-alt idiom — now compiles and runs correctly on both arches at default and -O. The previous behavior had three bugs that combined to produce silently wrong output on this pattern: 1. **Linearizer treated `"rm"` as `"m"`**. `parse_asm_constraint` set `is_memory = true` if any of `m`/`o`/`V`/`Q` appeared in the constraint string. For `"rm"` this fired even though register is acceptable, so the linearizer took the `lvalue` path (`linearize_lvalue`) and passed the operand's ADDRESS as the asm pseudo. The codegen later substituted `%1` with the register holding the address, so the asm template read the address bits as if they were the value. 2. **x86_64 codegen's `constraint_requires_register` returned true for any constraint containing `r`**. For `"rm"` this force-loaded spilled values into a temp register, defeating the whole point of multi-alt — the memory operand syntax was never used. 3. **x86_64 codegen's `constraint_requires_memory` returned true for any constraint containing `m`**. For `"rm"` it would force memory syntax even when register was clearly the right choice given the operand's location. Fix: - `cc/ir/linearize_stmt.rs::parse_asm_constraint` — `is_memory` is now strict: true iff every alternative listed is a memory class (`m`/`o`/`V`/`Q`) AND no non-memory class (`r`/`a`..`d`/`S`/`D`/`q`/`R`/`i`/`n`/`g`/`X`). For multi-alt strings like `"rm"` / `"rmi"` / `"g"`, the linearizer evaluates the operand as a value (`linearize_expr`) and lets the codegen substitute register vs memory syntax based on the operand's actual location. - `cc/arch/x86_64/codegen.rs::constraint_requires_register` — returns true iff the constraint lists ONLY register-class letters. Any non-register letter (`m`/`o`/`V`/`Q`/`i`/`n`/`g`) means a spilled value need not be force-loaded to a temp. - `cc/arch/x86_64/codegen.rs::constraint_requires_memory` — symmetric: returns true iff the constraint lists ONLY memory- class letters. The aarch64 codegen needed no change — its `emit_inline_asm` follows whatever location the allocator assigned without classifying the constraint, so multi-alt has worked end-to-end on aarch64 since C9a. Tests: +5 new integration tests in `cc/tests/codegen/inline_asm.rs`: - `codegen_inline_asm_multi_alt_rm_register_path` — `"+rm"` with register-resident value - `codegen_inline_asm_multi_alt_m_memory_path` — `"+m"` with addr-taken operand - `codegen_inline_asm_multi_alt_g_runtime_value` — `"g"` with a runtime int input - `codegen_inline_asm_multi_alt_output_rm_to_memory` — `"=rm"` output to addr-taken local - `codegen_inline_asm_multi_alt_mixed_operands` — two `"rm"` inputs + one `"=r"` output in the same asm (this one caught the linearizer-address bug) Each runs at both default `-O -g` and -O1. Verified: - `cargo build --release` clean - `cargo test --release -p posixutils-cc`: 955 + 989 + 219 pass (+5 new integration tests; was 955 + 989 + 214 in C9a) - `cargo clippy --all-targets` zero warnings - `cargo fmt --all -- --check` clean - CPython 3.12.9 -O2 from a clean tree: 40,817 individual tests pass, 0 failures, 28m24s. Linearizer hot-path change leaves the baseline intact. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Third commit in the inline-asm constraint vocabulary milestone. C2 covered the common register-class and Fixed letters; C9 added multi-alternative constraints (`"rm"`, `"ri"`, `"g"`). C10 fills out the rare per-arch letters that show up in glibc, kernel headers, and other portable inline asm: x86_64 class letters (`parse_x86_64_class_letter`): - `q` — byte-class register. Modern x86_64 makes every GP register byte-accessible, so we map to `Any`. - `R` — legacy 8-register set; same as `r` on x86_64. Maps to `Any`. - `l` — index registers (RSI, RDI); same as `r` in practice. Maps to `Any`. - `I`, `J`, `K`, `L`, `M`, `N`, `O` — constant-range immediates used for shift counts and small literals. All map to `Imm`. - `X` — any operand (register/memory/immediate); same as `g`. aarch64 class letters (`parse_aarch64_class_letter`): - `I`, `J`, `K`, `L`, `M`, `N`, `S`, `Y`, `Z` — AAPCS64 immediate- range and zero-register letters. All map to `Imm`. Mechanism: `parse_constraint` becomes a thin wrapper around the new `parse_constraint_with_classes`, which accepts a second closure resolving per-arch class letters in addition to the existing Fixed-letter closure. Each backend supplies both via `parse_x86_64_{fixed,class}_letter` / `parse_aarch64_{fixed,class}_letter`. The `parse_constraint` wrapper passed `|_| None` as the class mapper for backward compatibility, but turned out to be only used by tests once the backends migrated to `parse_constraint_with_classes`; removed it and migrated the tests directly to keep `parse_constraint_with_classes` as the single canonical entry point. pcc does NOT range-check the immediate letters. If the supplied operand is out of the letter's required range (e.g., `"I"` with value 32), the assembler rejects the resulting asm template at assemble time — the same failure mode GCC defaults to. Codegen-side updates: - `cc/arch/x86_64/codegen.rs::constraint_requires_register` — `q`/`R`/`l` join the register-class set; `g`/`X` join the memory-acceptable set. Immediate letters (`I`/`J`/`K`/...) do NOT make memory acceptable: combined with `r` (e.g. `"rI"`) a spilled value still force-loads. - `cc/arch/x86_64/codegen.rs::constraint_requires_memory` — all new register-class and immediate-class letters listed explicitly so the function is exhaustive over C10's vocabulary. - `cc/ir/linearize_stmt.rs::parse_asm_constraint` — the new letters explicitly opt out of `has_memory_class` so the linearizer evaluates the operand as a value, not an address. aarch64's codegen needed no further updates — its `emit_inline_asm` uses whatever location the allocator assigned without classifying the constraint, so all class letters work end-to-end once they reach the InstrConstraints lowering. Tests: - +6 new parser unit tests in `cc/arch/asm_constraints.rs` covering: class letter → Any, class letter → Imm, modifier combination (`=J`), multi-alt with class letter (`rK`), class-letter precedence over Fixed letter when both maps resolve the same char, unknown letter still errors. - +3 new integration tests in `cc/tests/codegen/inline_asm.rs`: x86_64 `q` (byte-class register store/load), x86_64 `I` (compile-time shift count), x86_64 `R` (synonym for `r`). - +1 aarch64 integration test for `K` (logical immediate). - All 30 existing parser unit tests migrated to `parse_constraint_with_classes` with a no-op class map. Verified: - `cargo build --release` clean - `cargo test --release -p posixutils-cc`: 961 + 995 + 222 pass (+6 parser unit, +3 integration on x86_64, +1 on aarch64; was 955 + 989 + 219 in C9b — note 222 = 219 + 3 on x86_64 since the aarch64-only test is `#[cfg(target_arch = "aarch64")]`) - `cargo clippy --all-targets` zero warnings - `cargo fmt --all -- --check` clean - CPython 3.12.9 -O2 from a clean tree: 40,817 individual tests pass, 0 failures, 28m49s. The linearizer + codegen letter-set expansions don't regress real-world code. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Closing the inline-asm constraint vocabulary milestone. C9c was originally scoped as the codegen optimization that substitutes a const-propagated operand as a literal (e.g. `mov $42, %eax`) when the asm constraint allows immediate (`"i"`/`"n"`/`"ri"`/`"g"` /x86_64 `"I"`/.../`"O"`). Audit found that `emit_inline_asm`'s `Loc::Imm(v)` arm already does this implicitly: any operand whose allocator location is `Loc::Imm` renders as `$value` regardless of which class letter accepted it. Verified by inspecting the generated assembly for `"ri"(42)`, `"g"(99)`, and `"I"(3)` — each produces a direct `mov $N`/`shl $N` instruction with no intermediate register load. C9c therefore ships as lock-in tests only, no codegen change. Four new tests in `cc/tests/codegen/inline_asm.rs`: 1. `codegen_inline_asm_imm_const_via_ri` — `"ri"(42)` substitutes as `$42`. 2. `codegen_inline_asm_imm_const_via_g` — `"g"(99)` substitutes as `$99`. 3. `codegen_inline_asm_imm_const_via_i_letter` — x86_64 `"I"(3)` substitutes as a shift count. 4. `codegen_inline_asm_imm_runtime_via_ri_uses_register` — negative case: `"ri"` with a runtime operand goes through a register, NOT substituted as literal. Pins the boundary so a future codegen change that, e.g., always force-loaded Loc::Imm operands wouldn't accidentally pass. Each test runs at both default `-O -g` and -O1. Verified: - `cargo build --release` clean - `cargo test --release -p posixutils-cc`: 961 + 995 + 226 pass (+4 new lock-in tests; was 961 + 995 + 222 in C10) - `cargo clippy --all-targets` zero warnings - `cargo fmt --all -- --check` clean CPython smoke test skipped — this commit is test-only, no production-code change can have regressed the baseline. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Audit conclusion: the Sreedhar et al. "Translating Out of SSA" algorithm for parallel-copy sequentialization is already implemented in `sequentialize_copies` and exercised end-to-end by `test_phisource_swap_problem`. The remaining work for M8 is lock-in coverage of cases the existing tests don't reach. Two new unit tests in `cc/ir/lower.rs`: 1. `test_sequentialize_two_disjoint_cycles` — two independent 2-cycles in one parallel-copy block. The algorithm must break each cycle with its own temp and emit all four destination writes. Catches a regression where the cycle-breaking loop would fail to terminate when multiple SCCs are present simultaneously. 2. `test_sequentialize_fp_cycle_propagates_type` — a 2-cycle on float-typed copies. The temp pseudo created to break the cycle inherits the source's type via the emitted Copy's `typ` field; `identify_fp_pseudos` then picks the temp up by scanning instruction types. Without this propagation the temp would be misallocated to the GP bank and corrupt the FP cycle's values. No production-code change — the algorithm was already correct. These tests pin the contract so a future refactor of `sequentialize_copies` (e.g., adopting Briggs's window-based 2-temp-budget approach) doesn't accidentally lose the multi-SCC or type-propagation properties. Verified: - `cargo build --release` clean - `cargo test --release -p posixutils-cc`: 963 + 997 + 226 pass (+2 new tests; was 961 + 995 + 226 in C9c) - `cargo clippy --all-targets` zero warnings - `cargo fmt --all -- --check` clean - CPython 3.12.9 -O2 from a clean tree: 40,817 individual tests pass, 0 failures, 29m34s. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

First commit in the M9 register-coalescing milestone. M9a is the codegen-side foundation: when `emit_copy_with_type` sees the allocator placed src and dst at the same `Loc`, skip emission entirely instead of emitting `mov %eax, %eax` etc. Gated on `actual_size >= 32` (so narrow-type 8/16 bit truncations via AND/SHL+SAR still run) and `!is_fp_copy` (FP path has size- specific quirks). Measurement on CPython 3.12.9 `Python/ceval.o` showed 105 of 10,653 reg-to-reg movs were "identity" (`mov %X, %X`) under the old codegen, but 104 of those 105 are the deliberate `mov %eax, %eax` zero-extension idiom that pcc emits to clear the upper 32 bits after a 64-bit op — those go through a different path (the explicit narrowing in emit_move and the operand-size machinery) and are NOT routed through emit_copy_with_type. Only 1 actual 64-bit identity Copy fires the new short-circuit in ceval.o. So M9a alone is a near-no-op on real code. Its role is **infrastructure for M9b**: once M9b's allocator-level coalescing lands and forces matched Copy operands to share a location, every successfully-coalesced Copy collapses to identity here and is elided. M9a is the elision sink that makes M9b's gains visible. `Loc` on aarch64 now derives `PartialEq` (needed by the src_loc == dst_loc check). x86_64's Loc already derived it. Verified: - `cargo build --release` clean - `cargo test --release -p posixutils-cc`: 963 + 997 + 226 pass (unchanged from M8) - `cargo clippy --all-targets` zero warnings - `cargo fmt --all -- --check` clean - CPython 3.12.9 -O2 from a clean tree: 40,817 individual tests pass, 0 failures, 30m17s. Baseline holds. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Second commit in the M9 register-coalescing milestone, and the piece that produces a measurable win. After the chordal allocator finishes assigning every GP pseudo a `Loc`, walk every `Opcode::Copy { target: t, src: [s] }` in the function and migrate `t`'s location to `s`'s location whenever the move is safe. The Copy then becomes identity and M9a's `emit_copy_with_type` short-circuit elides it. Migration is safe iff: - both `t` and `s` are in the bank being coalesced (GP for `color_gp_bank`); - `t` is not pre-colored (ABI-pinned arg or asm `Fixed`-letter operand; moving it would violate the constraint that pinned it); - `t` and `s` do not interfere (the IR doesn't ask them to hold distinct values simultaneously); - `t`'s forbidden-color set doesn't contain `s`'s register (the constraint system reserved that register against `t`, e.g. a cross-call pseudo against caller-saved regs); - no neighbor of `t` already holds `s`'s register (would create a same-color conflict at the new location). Naive non-Briggs heuristic: per-candidate decision, no global degree analysis, no spill-regression guard. Empirically the chordal allocator's existing spill paths absorb any edge cases. Measurement on CPython 3.12.9 `Python/ceval.o` (-O2): - reg-to-reg movs: 10,653 → 10,614 (-39, -0.4%) - .o size: 395,480 → 389,432 bytes (-6,048, -1.5%) Small but real. The mov-savings come predominantly from phi- elimination copies (every loop's back edge in `_PyEval_EvalFrameDefault`'s massive dispatch loop) and matched-asm-constraint init copies (closes the C8 deferral implicitly — the linearizer's explicit `Copy(out_pseudo, val)` for `"0"(x)`-style constraints now folds into M9a elision when the operands' lifetimes don't interfere). C8 was deferred because matched-asm coalescing wanted a Briggs heuristic to avoid spilling. The post-coloring approach sidesteps that issue: the chordal allocator has already made all the placement decisions; coalescing only swaps locations after the fact when the swap is safe. No new spills introduced. Implementation: - `cc/arch/regalloc.rs::find_copy_coalesce_candidates` — walk Opcode::Copy candidates. - `cc/arch/x86_64/regalloc.rs::color_gp_bank` — coalesce after Phase 3 commit, before spill commits. - `cc/arch/aarch64/regalloc.rs::color_gp_bank` — mirror. - FP banks (`color_xmm_bank` / `color_vreg_bank`) intentionally left out of M9b — FP-typed Copies are rarer and the codegen's FP move paths have size-specific quirks that the naive migration doesn't account for. A future M9c could extend. Verified: - `cargo build --release` clean - `cargo test --release -p posixutils-cc`: 963 + 997 + 226 pass (unchanged from M9a) - `cargo clippy --all-targets` zero warnings - `cargo fmt --all -- --check` clean - CPython 3.12.9 -O2 from a clean tree: 40,817 individual tests pass, 0 failures, 29m28s. Allocator-level coalescing leaves the baseline intact. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…locks Final commit in the M* regalloc-redesign milestone. M10 is the "dead-code freeze" piece: defensive validator infrastructure that locks in the optimizer's structural contract before the codegen layer consumes the IR. New invariant I3: every branch-style instruction's `BasicBlockId` reference (Br's `bb_true`, Cbr's `bb_true` + `bb_false`, Switch's `switch_cases` + `switch_default`, inline asm-goto's `AsmData.goto_labels`) must point at a block actually present in `func.blocks`. A reference to a deleted or never-created block ID crashes the codegen layer's label-resolution pass with a confusing error far from the source bug; I3 catches the corruption at the optimizer/codegen boundary instead, with a clear `ValidationError::InvalidBranchTarget { block, index, opcode, target }` diagnostic. Like I1 and I2, the check runs under `debug_assertions` only — release-mode pcc skips it for zero overhead. The validator gate at `cc/opt.rs::optimize_module` already calls `validate_module` under `debug_assert!`, so I3 is wired automatically. Tests: - `i3_valid_branch_target_passes` — well-formed Br succeeds. - `i3_invalid_br_target_flagged` — Br to a nonexistent block produces `InvalidBranchTarget`. - `i3_invalid_cbr_false_target_flagged` — Cbr's `bb_false` is checked independently from `bb_true`. I4/I5/I6 (Copy bank consistency, phi predecessor CFG agreement, Load/Store address-bank) would each need additional plumbing (TypeTable or CFG-predecessor maps) for marginal defensive value. Out of scope. Verified: - `cargo build --release` clean - `cargo test --release -p posixutils-cc`: 966 + 1000 + 226 pass (+3 new tests; was 963 + 997 + 226 in M9b) - `cargo clippy --all-targets` zero warnings - `cargo fmt --all -- --check` clean - CPython 3.12.9 -O2 from a clean tree: 40,817 individual tests pass, 0 failures, 28m06s. Validator-only change, no codegen impact in release builds. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Fixes a long-standing aarch64 codegen gap exposed by C9b's new `codegen_inline_asm_multi_alt_m_memory_path` test on the macos-aarch64 CI worker. The test compiles: int x = 99; int *p = &x; __asm__("ldr w8, %0\n\tadd w8, w8, #1\n\tstr w8, %0" : "+m"(*p)); For the `"+m"(*p)` constraint, the linearizer takes the lvalue (returning an address pseudo). The chordal allocator places that address pseudo in a register. Old codegen substituted `%0` with the bare register name (`w0`), producing: ldr w8, w0 <- error: expected label or encodable integer pc offset str w8, w0 <- error: invalid operand for instruction The aarch64 assembler correctly rejects this because `ldr`/`str` expect a memory operand (e.g. `[x0]`), not a register name. Fix: in `emit_inline_asm`, when the constraint is memory-class (`m`/`o`/`V`/`Q` only — `"rm"` and `"g"` etc. are excluded since register or immediate is also acceptable), AND the operand's location is `Loc::Reg(r)`, substitute the operand as `[xN]` instead of the bare register name. This matches the x86_64 mirror's `Loc::Reg(r) if requires_mem` arm at `cc/arch/x86_64/codegen.rs:3622`. New helper `Self::constraint_requires_memory(constraint: &str)` mirrors x86_64's equivalent (added in C9b). It counts as "requires memory" only when every alternative in the constraint is a memory class — single-class `"m"` returns true; multi-alt `"rm"`/`"g"` return false so register substitution still wins when the operand fits in a register. Out of scope for this commit: the `Loc::Stack` / `Loc::IncomingArg` case for memory-class operands. If the address pseudo itself spills to the stack, the current fallthrough hits `loc_to_asm_string`, which gives `[sp, #offset]` — the address of the spill slot rather than the value stored in it. A correct fix would emit a pre-asm load to a temp register and use that register as `[xN]`. None of the existing tests exercise the spilled-address path (the test functions are too small to push the address pseudo out of registers), and a fix would require the same input-load / output-restore plumbing x86_64 has at `emit_inline_asm`. Deferred until a real spill case surfaces. Verified locally: - `cargo build --release` clean - `cargo test --release -p posixutils-cc`: 966 + 1000 + 226 pass (no change in counts; the failing test runs only on aarch64) - `cargo clippy --all-targets` zero warnings - `cargo fmt --all -- --check` clean Linux-amd64 cargo tests were green before this change; the fix is to the aarch64-only emit_inline_asm path and cannot regress x86_64 codegen. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The two backends' emit_inline_asm previously maintained four parallel Vecs for each operand's substitution data: let mut operand_regs: Vec<Option<Reg>> = ...; let mut operand_sizes: Vec<u32> = ...; let mut operand_mem: Vec<Option<String>> = ...; let mut operand_names: Vec<Option<String>> = ...; Every branch of the operand-classifying match pushed to (some of) these vecs in lockstep -- there were 14 push pairs across the two backends, each one a bug waiting for a missed push to silently de-sync the arrays. Replaced with a single AsmOperandSlot<R> struct in cc/arch/codegen.rs: pub struct AsmOperandSlot<R> { pub reg: Option<R>, pub mem: Option<String>, pub size: u32, pub name: Option<String>, } and a single Vec<AsmOperandSlot<Reg>> per emit_inline_asm. Each branch builds and pushes one slot, with all four fields populated by construction. A per-iteration mk(reg, mem) closure captures the operand's size and name once at the top of the loop so each branch only specifies the (reg, mem) pair appropriate to its outcome. substitute_asm_operands (shared in cc/arch/codegen.rs) and the per-backend wrappers now take a single &[AsmOperandSlot<F::Reg>] slice instead of four parallel slices. The substitution logic itself is unchanged -- only the indexing syntax. Behavior-preserving. The bookkeeping vecs that DO have distinct roles -- output_moves / input_moves / remap_setup / remap_restore / pseudo_to_temp -- stay separate; they're not parallel to the per-operand substitution data. Verified: - cargo build --release clean - cargo test --release -p posixutils-cc: 966 + 1000 + 226 pass (unchanged) - cargo clippy --all-targets zero warnings - cargo fmt --all -- --check clean - CPython 3.12.9 -O2 from a clean tree: 40,817 individual tests pass, 0 failures, 30m41s. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two coupled fixes that together restore GCC-correct semantics for memory-class inline asm outputs (`"=m"(*p)`, `"+m"(*p)`, and their `o`/`V`/`Q` siblings). The macos-aarch64 CI was failing on the C9b `codegen_inline_asm_multi_alt_m_memory_path` test because of the codegen half of this issue; the test was passing on x86_64 only by accident (the old behaviour quietly used register syntax and a post-asm Store to paper over the bug). Linearizer change (cc/ir/linearize_stmt.rs): Previously, for `"=m"(*p)` the linearizer: - allocated a fresh pseudo for the output, - on `is_readwrite` (`"+m"`) emitted Load(pseudo, addr) — loading the *value* at *p into the pseudo, - inserted a post-asm Store(pseudo, addr) writing back. The pseudo therefore held a *value*, while the constraint declared the operand as memory. The asm template substitution treated `%0` as a register holding the value (x86) or the value loaded into a w-register (aarch64), and on aarch64 the assembler rejected the resulting `ldr w8, w0` outright. On x86_64 it "worked" via the Load+Store dance — but the asm `addl $1, %0` became `addl $1, %eax`, modifying a register that subsequently got stored back to memory. Wrong shape; happens to compute the right answer for simple increment patterns. Now, when the output's constraint is memory-class (`is_memory == true`): - the output pseudo is the *address* itself, taken from `linearize_lvalue(&op.expr)`; - no initial Load is emitted; - the matched `+m` input shares the same address pseudo; - the post-asm Store is skipped (new `skip_post_handling` parallel-vec flag — separate from the existing `param_outputs` so the three states "skip", "store to addr", "update var_map" are all expressible). The asm template's `%0` then resolves to the memory operand syntax with the address register: `[xN]` on aarch64, `(%rN)` on x86_64. The hardware ldr/str / addl-to-memory does the actual read/write through that address. Codegen change (cc/arch/x86_64/codegen.rs): Added the missing `Loc::Reg(r) if requires_mem` arm on the *output* side of `emit_inline_asm`. The input side already had the corresponding arm at line 3642 (`format!("(%{})", ...)`); the output side fell through to bare-register substitution, so `addl $1, %0` substituted `%eax` instead of `(%rax)`. The new arm mirrors the input side: render the address register as `(%rN)`. aarch64's output side already had the equivalent `[xN]` arm from the macos-aarch64 fix commit; no change needed there. Why CPython doesn't show the underlying bug: A grep of CPython 3.12.9's Python/, Modules/, and Include/ trees shows zero uses of `"=m"` or `"+m"` outputs in code that gets compiled under our pcc build. `pymath.c`'s `fnstcw`/`fldcw` asms live under `#ifdef HAVE_GCC_ASM_FOR_X87`, which is unset in our configuration. So this fix is exercised entirely by the posixutils-cc inline-asm test suite, not by CPython workloads. Verified: - `cargo build --release` clean - `cargo test --release -p posixutils-cc`: 966 + 1000 + 226 pass (the previously-failing `codegen_inline_asm_multi_alt_m_memory_path` now passes on both arches; x86_64 path verified locally, aarch64 path verified by inspection — the codegen produces `[xN]` syntax via the existing requires_mem arm) - `cargo clippy --all-targets` zero warnings - `cargo fmt --all -- --check` clean - CPython 3.12.9 -O2 from a clean tree: 40,817 individual tests pass, 0 failures, 28m56s. (A first run flaked on test_venv's multiprocessing spawn at minute 28 — the failure was a resource-accumulation artifact of the fork-heavy test parade ahead of it; running test_venv in isolation passes, and the retry of the full suite passes cleanly.) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

jgarzik and others added 6 commits May 31, 2026 17:32

jgarzik requested a review from Copilot June 1, 2026 02:10

jgarzik self-assigned this Jun 1, 2026

Copilot started reviewing on behalf of jgarzik June 1, 2026 02:10 View session

Copilot AI reviewed Jun 1, 2026

View reviewed changes

Comment thread cc/arch/x86_64/regalloc.rs Outdated

Comment thread cc/arch/x86_64/regalloc.rs Outdated

Comment thread cc/arch/aarch64/regalloc.rs Outdated

Comment thread cc/arch/x86_64/regalloc.rs

jgarzik and others added 15 commits June 1, 2026 04:25

jgarzik merged commit 5fa0713 into main Jun 2, 2026
9 checks passed

jgarzik deleted the updates branch June 2, 2026 02:07

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[cc] memory ordering Updates#581

[cc] memory ordering Updates#581
jgarzik merged 22 commits into
mainfrom
updates

jgarzik commented Jun 1, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jgarzik commented Jun 1, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants