Skip to content

[cc] memory ordering Updates#581

Merged
jgarzik merged 22 commits into
mainfrom
updates
Jun 2, 2026
Merged

[cc] memory ordering Updates#581
jgarzik merged 22 commits into
mainfrom
updates

Conversation

@jgarzik
Copy link
Copy Markdown
Contributor

@jgarzik jgarzik commented Jun 1, 2026

No description provided.

jgarzik and others added 6 commits May 31, 2026 17:32
First commit in the per-instruction operand/clobber constraint
system milestone. No new constraint vocabulary yet — just plumbs
two new sources of clobbers through the existing ConstraintPoint
mechanism so the chordal allocator forbids them.

1. **Inline-asm explicit clobbers.** `Opcode::Asm` now produces
   constraints from `AsmData.clobbers`. Named GP register strings
   (rax/eax/al/r10d/etc on x86_64; x0/w0/fp/lr/etc on aarch64)
   are mapped to `Reg` via per-arch `parse_gp_clobber_name`.
   `"memory"` and `"cc"` are filtered out (the former gets full
   barrier semantics in C3; the latter is a no-op since we don't
   track condition-code liveness).

2. **Libc-call-emitting opcodes.** `find_call_positions` is
   parameterised on an `is_call_like(Opcode) -> bool` predicate.
   x86_64 extends the base set with `Fabs32/64`, `Signbit32/64`,
   `Memset/Memcpy/Memmove`; aarch64 extends with the same FP
   builtins (its Memset/Memcpy/Memmove reach the regular
   `Opcode::Call` path so are already covered). These IR opcodes
   lower to direct `Bl`/`Call` in `features.rs`, but linear scan
   hid the latent bug by never picking the at-risk regs;
   chordal coloring would expose it.

aarch64 `get_constraint_info_aarch64` is built out (previously
returned None for every instruction).

Tests added:
  - x86_64 unit tests for `parse_gp_clobber_name` (size aliases,
    leading `%`, case insensitive, unknown/cc/memory rejection)
  - aarch64 unit tests for the same + special names (fp/lr/sp/xzr).
  - Both `is_call_like_*` predicates anchored so a future codegen
    change that adds a new builtin->libc lowering doesn't silently
    drop out.
  - `codegen_gp_live_across_inline_asm_clobber` defensive E2E test
    in cc/tests/codegen/misc.rs.

Verified:
  - cc test suite 926+960+206 pass (13 new tests)
  - cargo clippy --all-targets clean
  - cargo fmt --check clean
  - CPython 3.12.9 -O2 449/449 (40,817 individual tests, 30m02s)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Second commit in the constraint-system milestone. Introduces the
per-operand vocabulary that C3 will start consuming, but lowers it
back to the existing `ConstraintPoint` for now so allocator
behaviour is unchanged.

New module: `cc/arch/asm_constraints.rs` defines

  enum OperandConstraint<R> { Any | Fixed(R) | Match(usize) | Mem | Imm }
  enum OperandKind          { Use | Def | UseDef | EarlyClobber }
  struct InstrConstraints<R> { operands, clobbers, memory_barrier }
  struct OperandSpec<R>      { pseudo, kind, constraint }
  enum ConstraintParseError  { Empty, EarlyClobberOnInput, MultiAlternative, UnknownLetter }
  fn parse_constraint(s, letter_map) -> Result<(kind, con), error>

Supported syntax (C2 scope): `r`/`w` (Any), `=r`/`+r`/`&=r`
(Def/UseDef/EarlyClobber), `0`–`9` (Match), `m`/`Q` (Mem),
`i`/`n` (Imm), plus per-arch Fixed letters. Multi-alternative
constraints (`rm`/`ri`/`g`) and the rare per-arch letters
(`q`/`Q`/`R`/`l`/`t`/`I`/`J`/...) are out of scope and produce a
diagnostic — they become a follow-up after C5.

Per-arch resolvers:
  - `parse_x86_64_fixed_letter` maps `a/b/c/d/S/D` → Rax/Rbx/Rcx/Rdx/Rsi/Rdi.
  - `parse_aarch64_fixed_letter` returns None for every letter
    (AAPCS64 has no C2-scope Fixed letters; all aarch64 register
    operands use `r` or `w`). The function exists for symmetry
    and as the future-extension point.

Integration: x86_64 and aarch64 `get_constraint_info` now route
inline asm through `build_asm_instr_constraints_*` →
`lower_instr_constraints_to_constraint_point_*`. The lowering
promotes Fixed operands into the clobber set and notes Match
operands (for C3's coalescing edges) and EarlyClobber (for C3's
extra interference). `memory_barrier` is captured but consumed
in C3.

11 new unit tests in `cc/arch/asm_constraints.rs` cover every
modifier+letter combination, EarlyClobber semantics, parse errors,
and rejected multi-alternative strings.

Verified:
- cc test suite 937+971+206 pass (11 new parser tests)
- cargo clippy --all-targets clean
- cargo fmt --check clean
- CPython 3.12.9 -O2 449/449 (40,817 individual tests, 28m37s)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Third commit in the constraint-system milestone. Wires the
chordal allocator to consume `OperandConstraint::Fixed(R)` so
the asm operand pseudo lands directly in the constraint-required
register, eliminating the codegen's "move into fixed register
around asm" defensive pattern in the common case.

New per-arch helper `collect_asm_fixed_precolors_{x86_64,aarch64}`
walks `func.blocks` for `Opcode::Asm`, builds the C2
`InstrConstraints` view, and collects every operand carrying a
`Fixed(R)` constraint. The result is folded into chordal coloring
via the `pre_colored` map in `color_gp_bank`.

One gotcha caught and documented: Phase 3's commit loop skips
pre-colored vertices on the assumption that their location is
already in `self.locations` (true for ABI-pinned args, which
`allocate_arguments` inserts before chordal runs). Inline-asm
Fixed pre-colors arrive in `color_gp_bank` without going through
`allocate_arguments`, so they also need a direct
`self.locations.insert`. If missed, `get_location` returns
`Loc::Imm(0)` as the default and every Store/Load involving the
operand silently writes/reads zero — caught by
`codegen_inline_asm_x86_64_mega::SPECIFIC_REGISTERS` returning 60
before the fix was added.

Tests:
- `codegen_inline_asm_x86_64_fixed_register_precolor` exercises
  every Fixed letter in C2 scope (a, b, c, d, S, D) in both `=`
  and bare positions.
- `codegen_inline_asm_x86_64_asm_goto_pseudo_survives_edge`
  documents and locks in the existing asm-goto CFG plumbing.
  The linearizer in `cc/ir/linearize_stmt.rs:2100-2114` already
  `link_bb`s the asm block to every goto target plus the
  fall-through, so liveness propagates correctly. The plan
  worried this might be missing; verified that it's not.

Out of scope for this commit (deferred follow-ups):
- `OperandConstraint::Match(idx)` coalescing — the existing
  linearizer emits an explicit Copy from input pseudo to the
  matched output pseudo so they share a value at the IR level;
  the allocator already places them in the same location without
  needing an explicit interference-graph coalescing pass.
- `InstrConstraints.memory_barrier` strict semantics — pcc's
  mem2reg is already in place (`cc/ir/mem2reg.rs` runs after
  `cc/ir/ssa.rs::ssa_convert` at `cc/ir/linearize.rs:1089`),
  so promoted locals have no stack slot for an asm `"memory"`
  clobber to invalidate, and memory-resident data (addr-taken,
  volatile, atomic, non-scalar) is always accessed through
  explicit `Load` / `Store` IR ops that pcc doesn't reorder
  across an asm. Strict barrier semantics are therefore already
  observed today; `memory_barrier` is captured on
  `InstrConstraints` for completeness and to anchor future
  passes that might add memory-access reordering (LICM, GVN,
  SCCP-style memory dataflow) and need an explicit barrier
  fence point.
- Codegen rewrite that drops the mov-around-asm fallback. The
  existing fallback in `emit_inline_asm` is now a no-op for
  pre-colored operands but still handles the spilled-operand
  case correctly. Removing it is a quality improvement, not
  needed for correctness.

aarch64 plumbing wired through for symmetry. AAPCS64 has no
C2-scope Fixed letters today so `collect_asm_fixed_precolors_aarch64`
always returns an empty map — but the future-extension point is
in place.

Verified:
- cc test suite 937+971+207 pass (2 new tests)
- cargo clippy --all-targets clean
- cargo fmt --check clean
- CPython 3.12.9 -O2 449/449 (40,817 individual tests, 30m09s)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fourth commit in the constraint-system milestone. Lands the
per-IR-opcode constraint-declaration infrastructure for freeing
R10 and R11 from the reserved-scratch list, but does NOT yet add
them to `Reg::allocatable()` — see the deferred-freeing block
in the source for the empirical reasons why.

`opcode_clobbers_r10_r11(op)` is a conservative predicate: every
IR opcode whose codegen helper is more than a single instruction
is assumed to touch R10 and/or R11. The truly clean opcodes
(`Br`, `Nop`, `Phi`, `PhiSource`, `Unreachable`, `Fence`, `VaEnd`)
are explicitly excluded. The predicate feeds
`get_constraint_info`, so a constraint point with R10/R11 in
its clobber set is built for every dirty IR instruction.

The integration into `get_constraint_info` covers both the
inline-asm path (already routed through
`build_asm_instr_constraints_x86_64` from C2) and the regular
opcode-level path. Clobber sets are merged from
`opcode_constraints` (idiv/shift/vaarg hardware constraints) and
the new R10/R11 entries, then sorted/deduped.

**R10/R11 freeing is deferred — full rationale documented as a
block comment immediately following `opcode_clobbers_r10_r11`.**
TL;DR: per-IR-opcode constraints don't cover the inter-
instruction codegen paths (function prologue rdi/rcx save through
R10/R11 around `rep stosq`; variadic save area; spilled-arg
prologue restore; FP / struct lowerings spanning multiple LIR
emissions per IR instruction). Adding R10/R11 to `allocatable()`
even with broad clobber declarations still segfaults CPython's
`_bootstrap_python` at deepfreeze generation. Genuine freeing
requires either a codegen refactor (~4000–8000 LOC) or a pre-IR
scratch-declaration side table (~1500–2500 LOC) — both are
multi-session milestones in their own right.

This commit ships the predicate + integration so that when
either prerequisite lands, adding R10/R11 to `Reg::allocatable()`
is the only change needed to start producing forbidden-set
entries.

XMM14/XMM15 freeing was also out of scope: the chordal
allocator's `color_xmm_bank` doesn't yet consume constraint
points at all (`ConstraintPoint<XmmReg>` is unused). Building
the XMM constraint plumbing is the structural prerequisite for
their freeing.

Verified:
- cc test suite 937+971+208 pass
- cargo clippy --all-targets clean
- cargo fmt --check clean
- CPython 3.12.9 -O2 449/449 (40,817 individual tests, 28m33s)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fifth and final commit in the constraint-system milestone.
Mirrors C4's x86_64 work for aarch64.

New `opcode_clobbers_aarch64_scratches(op)` predicate uses the
identical exclusion list as the x86_64 mirror (Br, Nop, Phi,
PhiSource, Unreachable, Fence, VaEnd). `AARCH64_SCRATCH_REGS`
lists the registers covered: X9, X10, X11 (the documented
`Reg::scratch_regs()` triple) plus X16, X17 (AAPCS64 IP0/IP1,
reused by the pair-address legalizer from commit `6af088eb`).

`get_constraint_info_aarch64` is extended along the x86_64 shape:
inline-asm path folds in the scratch clobbers on top of the C2
operand vocabulary, and the non-asm path emits a ConstraintPoint
for any dirty opcode.

**Freeing is deferred** — the deferred-freeing block in source
documents the path forward. Same reasons as x86_64's C4
(inter-instruction codegen paths in the prologue, callee-saved
save/restore, pair-address legalizer's X16 materialization, and
int128 lo/hi shuttle). V16/V17/V18 freeing has the additional
prerequisite of V-bank ConstraintPoint plumbing through
`color_vreg_bank`.

Adding the registers to `Reg::allocatable()` once the codegen
refactor lands is the only remaining change.

Verified:
- cc test suite 937+971+208 pass
- cargo clippy --all-targets clean
- cargo fmt --check clean
- CPython 3.12.9 -O2 449/449 (40,817 individual tests, 29m52s)

This completes the constraint-system milestone (C1-C5). Summary
of the five commits:

- C1 (0f58a71): Source inline-asm + libc-call clobbers into
  ConstraintPoint. Plumbs new clobber sources through existing
  mechanism; fixes latent FP-pseudo-across-libc-call bug.
- C2 (fd034c7): Per-operand inline-asm constraint vocabulary +
  GCC syntax parser. Introduces OperandConstraint /
  OperandKind / InstrConstraints. Lowers to ConstraintPoint for
  back-compat.
- C3 (05be72d9): Allocator-aware inline-asm Fixed-operand
  pre-coloring. Asm operands with `"a"`/`"=a"` etc. land
  directly in the constraint-required register.
- C4 (4f9697b9): x86_64 scratch-clobber declaration
  infrastructure for R10/R11; freeing deferred.
- C5 (this commit): aarch64 mirror for X9/X10/X11/X16/X17;
  freeing deferred.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First commit in the inline-asm memory-barrier milestone. Establishes
the consumer-facing surface and the structural invariant; no codegen
or allocator behavior change yet.

The constraint-system C2/C3 commits already plumbed
`InstrConstraints.memory_barrier` from the asm `"memory"` clobber
through both `build_asm_instr_constraints_{x86_64,aarch64}` builders.
What was missing: (a) a way for IR-pass code to ask the question
without going through the per-arch constraint builder, (b) explicit
documentation of the no-reorder invariant the existing optimizer
passes rely on, and (c) a validator check that catches accidental
drift between the barrier predicate and the DCE-root predicate.

Three pieces:

1. `Instruction::is_memory_barrier()` returns true for:
   - `Opcode::Asm` with `"memory"` in its clobber list — the
     `pause`/`yield` spin-loop, ticket-lock acquire, and
     `__sync_synchronize`-style fence idioms.
   - `Opcode::Fence` — explicit C11 `atomic_thread_fence`.
   - `Opcode::Atomic*` — every atomic memory op (Relaxed included;
     finer-grained one-sided predicates can refine later when a
     pass needs the distinction).
   - `Opcode::Call` — pcc has no escape/alias analysis, so any
     external call must be treated as a possible memory access on
     any reachable location. Conservative.
   - `Opcode::Setjmp` / `Opcode::Longjmp` — non-local control flow.

   The predicate is the surface future memory-reordering passes
   (GVN, LICM, load-store forwarding, machine scheduler) will
   consult before crossing any of the above.

2. Module-level no-reorder contract documentation added to
   `cc/ir/dce.rs` and `cc/ir/instcombine.rs`. The audit before this
   commit confirmed that DCE only deletes Nops (never reorders) and
   that InstCombine only rewrites Add/Sub/Mul/Div/Mod/Shl/Lsr/Asr/
   And/Or/Xor/SetEq/.../Neg/Not in place (never touches Load,
   Store, Asm, Call, Atomic*, Fence). Today the `"memory"` clobber
   "works" because no pass reorders memory ops at all — the
   contract codifies the implicit invariant so any future change
   that does start reordering must consult `is_memory_barrier()`.

3. New IR-validator invariant `I2 — MEMORY BARRIERS ARE DCE ROOTS`.
   Every instruction that satisfies `is_memory_barrier()` must also
   have `op.has_side_effects() == true`, or DCE would silently
   delete it. Today the two predicate sets align by construction;
   I2 surfaces drift immediately in debug builds rather than
   waiting for a memory-reordering pass to miscompile a real
   spinlock. New `ValidationError::BarrierWithoutSideEffect`
   variant. Validator runs only under `debug_assertions` via the
   existing `opt::optimize_module` debug_assert call site;
   release-mode pcc (CPython, all other -O builds) skips it
   for zero overhead.

Tests:
- `is_memory_barrier` cell at `cc/ir/mod.rs`: pure ops are not
  barriers; Fence + every Atomic* return true; Call/Setjmp/Longjmp
  return true; asm with `"memory"` returns true; asm with `"cc"` /
  named reg / empty clobbers / no asm_data return false.
- `build_asm_instr_constraints_x86_64_propagates_memory_barrier`
  and aarch64 mirror: build the per-arch InstrConstraints from a
  hand-built asm Instruction and assert `memory_barrier` matches
  the clobber list.
- `i2_barrier_implies_side_effect_structural` walks every barrier
  opcode and asserts the implication holds — a meta-test that
  catches drift if either predicate set changes.
- `i2_asm_with_memory_clobber_validates`: end-to-end positive case
  through `validate_function`.

Verified:
- `cargo build --release` clean
- `cargo test --release -p posixutils-cc`: 946 + 980 + 208 pass
  (+5 new tests; was 941 + 975 + 208 in C5)
- `cargo clippy --all-targets` zero warnings
- `cargo fmt --all -- --check` clean
- Partial CPython -O2 rebuild (`Python/ceval.o`) byte-size
  identical to prior pcc — release-mode codegen unchanged, as
  expected given C6a touches only debug-only validation + new
  helper method. Full CPython -O2 test pass is structural-only
  and not at risk; will be re-verified at end of C6 milestone
  (C6b lock-in tests).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jgarzik jgarzik requested a review from Copilot June 1, 2026 02:10
@jgarzik jgarzik self-assigned this Jun 1, 2026
Second commit in the inline-asm memory-barrier milestone. C6a
established the consumer-facing predicate
`Instruction::is_memory_barrier()` and documented the no-reorder
contract in `cc/ir/dce.rs` and `cc/ir/instcombine.rs`. C6b adds the
test moat that locks the contract in end-to-end so a future GVN /
LICM / load-store forwarding / machine-scheduler regression fails
loudly with the wrong return code rather than silently breaking a
real kernel-style spinlock under -O2.

Five new tests in `cc/tests/codegen/inline_asm.rs`, each exercising
a different heavily-used `asm volatile("..." ::: "memory")` idiom
and each run at both default and -O1 to catch the optimizer-only
regression case:

1. `codegen_inline_asm_memory_barrier_pointer_aliased_reload`
   Plain pointer-aliased store-store-load across an empty
   compiler-barrier asm. A pass that store-forwarded the stale
   value across the barrier would observe 42 instead of 99.

2. `codegen_inline_asm_memory_barrier_spin_wait_pattern`
   `pause` (x86_64) / `yield` (aarch64) spin loop polling a
   non-volatile location through a pointer. The memory-clobber
   asm inside the loop body must force `*p` to reload each
   iteration; if a hoisting pass lifted the load out of the loop,
   the loop would never terminate and `iters > 10` would trip.

3. `codegen_inline_asm_memory_barrier_full_fence`
   `mfence` (x86_64) / `dmb ish` (aarch64) full fence between two
   stores to the same location. The post-fence load must observe
   the second store, not the first.

4. `codegen_inline_asm_memory_barrier_double_checked_init`
   Classic DCI pattern: release barrier between `value = 42` and
   `initialized = 1` in the slow path; acquire barrier in the
   reader after seeing `initialized == 1`. Single-threaded so
   we're only locking in compiler-level ordering (which is exactly
   what's at risk from future passes), but the pattern matches the
   real cross-thread idiom one-for-one.

5. `codegen_inline_asm_memory_barrier_sync_synchronize_pattern`
   `__sync_synchronize`-equivalent: a single fence between two
   unrelated store-then-verify pairs. A pass that reordered the
   second store-then-verify across the fence would observe the
   wrong value.

The tests pass today on both x86_64 and aarch64 because no IR
pass currently reorders memory ops at all — they're forward
infrastructure. When the broader regalloc milestone moves into
M8 / M9 / M10 (which add real reordering capabilities), any pass
that drops a load, forwards a store, or hoists a load past an
`is_memory_barrier()` instruction will trip one of these tests.

Verified:
- `cargo build --release` clean
- `cargo test --release -p posixutils-cc`: 946 + 980 + 213 pass
  (+5 new tests; was 946 + 980 + 208 in C6a)
- `cargo clippy --all-targets` zero warnings
- `cargo fmt --all -- --check` clean

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR formalizes memory-ordering semantics in the IR and pushes inline-asm constraints into the register allocator. It adds an IR-level Instruction::is_memory_barrier() (asm "memory" clobber, Fence, Atomic*, Call, Setjmp, Longjmp), wires a validator invariant (I2) that every barrier must also be a DCE root via has_side_effects(), and introduces a shared cc/arch/asm_constraints.rs parser for GCC-style constraint strings. The x86_64 and aarch64 backends now thread asm clobbers/operands through get_constraint_info, pre-color Fixed operands in the chordal allocator, extend call-like detection to backend-emitted libc calls (Fabs*, Signbit*, Memcpy/Memset/Memmove), and replace the unsafe usize::MAX spill-slot drain with a monotonic-by-start commit. New runtime tests cover asm clobbers, asm-goto liveness, and fixed-register precoloring.

Changes:

  • Add Instruction::is_memory_barrier() + validator I2 (barrier ⇒ side-effect) with structural meta-test.
  • Add shared GCC inline-asm constraint vocabulary/parser and per-arch build_asm_instr_constraints_* / pre-color collectors; wire them into x86_64 and aarch64 chordal coloring.
  • Replace usize::MAX spill-slot drain with monotonic per-interval expiration to fix a within-block slot-reuse miscompile; extend find_call_positions with per-arch call-like predicates covering libc-emitting opcodes.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
cc/ir/mod.rs Adds is_memory_barrier() + unit tests covering pure ops, atomics, calls, asm clobbers.
cc/ir/validate.rs Adds I2 invariant + BarrierWithoutSideEffect error variant and tests.
cc/ir/dce.rs, cc/ir/instcombine.rs Documents the memory-ordering contract these passes implicitly preserve.
cc/arch/mod.rs, cc/arch/asm_constraints.rs New shared inline-asm constraint parser and types.
cc/arch/regalloc.rs find_call_positions now takes a per-arch is_call_like predicate.
cc/arch/x86_64/regalloc.rs Asm clobber parsing, Fixed-operand precoloring, R10/R11 scratch declaration scaffolding, monotonic spill commit, libc-call list, tests.
cc/arch/aarch64/regalloc.rs Mirror of x86_64 changes (scratch list X9–X11/X16/X17, asm clobber parser, precolor plumbing, monotonic spill commit).
cc/tests/codegen/inline_asm.rs Adds asm-goto liveness test and fixed-register pre-color regression test.
cc/tests/codegen/misc.rs Adds defensive GP-live-across-asm-clobber test (passes both with/without C1).

Comment thread cc/arch/x86_64/regalloc.rs Outdated
Comment thread cc/arch/x86_64/regalloc.rs Outdated
Comment thread cc/arch/aarch64/regalloc.rs Outdated
Comment thread cc/arch/x86_64/regalloc.rs
jgarzik and others added 15 commits June 1, 2026 04:25
Copilot review of the C3 commit (PR comment at
cc/arch/x86_64/regalloc.rs lines 1569–1580) flagged a real
allocator/codegen split-view bug in `color_gp_bank`:

  pre_colored.entry(pid).or_insert(reg);  // ABI pin wins
  ...
  self.locations.insert(pid, Loc::Reg(reg));  // overwritten unconditionally!

When a pseudo is already pre-colored — either by an earlier
inline-asm Fixed operand or, more importantly, by
`allocate_arguments` (SysV pins int args 1..6 to RDI/RSI/RDX/RCX/
R8/R9) — `or_insert` correctly keeps the existing register in
`pre_colored`. But the subsequent `self.locations.insert` was
unconditional, overwriting the location to the asm-requested
register. Chordal coloring (which reads `pre_colored`) and
codegen (which reads `self.locations`) would then disagree:
coloring would place the value in the prior pin, codegen would
emit Load/Store against the new asm register, observing zero.

Fix: capture the value `or_insert` actually committed to
`pre_colored` and mirror exactly that into `self.locations`.
Idempotent when the prior pin equals the asm pin; preserves the
prior pin when they differ. Identical fix applied to the aarch64
mirror, where the bug is latent (AAPCS64 has no Fixed letters in
the C2 vocabulary, so `collect_asm_fixed_precolors_aarch64`
returns empty; but the fix matches the x86_64 commit byte-for-
byte to anchor the contract).

Tests:
- `codegen_inline_asm_x86_64_fixed_precolor_collides_with_abi_pin`
  exercises the path: function params used directly as `"+S"` /
  `"+d"` / `"+a"` asm operands in their ABI-pinned register slot.
  Today's linearizer stores ABI args to local stack slots at entry
  and asm operands load through a fresh pseudo, so the bug doesn't
  trigger end-to-end on current pcc — the test can't fail under
  the broken code today. But it anchors the codegen path for
  future arg-promotion passes that would make the arg pseudo and
  the asm operand identical, at which point a silent regression
  would otherwise have no compile-time signal. Run at both
  default `-O -g` and -O1.

Verified:
- `cargo build --release` clean
- `cargo test --release -p posixutils-cc`: 946 + 980 + 214 pass
  (+1 new test; was 946 + 980 + 213 in C6b)
- `cargo clippy --all-targets` zero warnings
- `cargo fmt --all -- --check` clean

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot flagged that both `build_asm_instr_constraints_x86_64`
and `_aarch64` carry a `let _ = OperandKind::Use;` statement
ostensibly silencing an unused-variant warning. The comment is
wrong: `OperandKind::Use` is constructed by `parse_constraint`
in `cc/arch/asm_constraints.rs:159` (the default kind for
non-output, non-early-clobber operands), so the variant is
actively used. The silencer line is dead code; removing it
along with the now-unused `OperandKind` import.

Verified:
- `cargo build --release` clean
- `cargo test --release -p posixutils-cc`: 946 + 980 + 214 pass
- `cargo clippy --all-targets` zero warnings
- `cargo fmt --all -- --check` clean

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot flagged `None?` as the fallthrough arm of
`parse_gp_clobber_name` in `cc/arch/aarch64/regalloc.rs:708`.
The x86_64 mirror at line 448 uses the more idiomatic
`return None`. Aligning aarch64 to match.

Verified:
- `cargo build --release` clean
- `cargo test --release -p posixutils-cc`: 946 + 980 + 214 pass
- `cargo clippy --all-targets` zero warnings
- `cargo fmt --all -- --check` clean

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot flagged that the inline-asm branch in `get_constraint_info`
unconditionally augments the clobber set with R10/R11 via
`opcode_clobbers_r10_r11`, but the explanatory comment immediately
below only applies to the non-asm path. The augmentation is real
and intentional (not unintentional): `emit_inline_asm` in
`cc/arch/x86_64/codegen.rs` uses R10/R11 for operand shuffling
into Fixed-letter registers, for `find_temp_reg` fallbacks when
allocated regs collide with the reserved-scratch set, and for
input/output spill helpers around the asm body. Today's reserved-
scratch convention makes this a no-op; when C4's freeing lands,
the augmentation becomes load-bearing for inline asm too.

Added an explanatory comment at the asm branch covering this.

No behavior change; doc-only.

Verified:
- `cargo build --release` clean
- `cargo test --release -p posixutils-cc`: 946 + 980 + 214 pass
- `cargo clippy --all-targets` zero warnings
- `cargo fmt --all -- --check` clean

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First commit in the C9 milestone — full GCC-style inline-asm
constraint vocabulary. C9a is the parser-only piece: it adds the
`OperandConstraint::Alternatives(Vec<_>)` variant and teaches
`parse_constraint` to build it from multi-character bodies like
`"rm"`, `"ri"`, `"rmi"`, and `"g"` (sugar for `"rmi"`). The
allocator and codegen wiring lands in C9b; this commit ships
zero behavior change beyond letting these constraint strings
parse without an error.

Why now: portable kernel/UAPI headers (Linux `arch/x86/include/`,
glibc internals, FreeBSD sys/) routinely use `"rm"` to let the
compiler pick register vs memory based on what's cheapest at the
call site. Before C9 these compiled with a hard parse error;
after C9 they're accepted with a sensible default until the
codegen learns to actually exploit the flexibility.

Parser grammar (updated docstring):

  modifiers ::= '&'? ('=' | '+')?
  constraint ::= modifiers body
  body ::= class_letter+ | single_letter
  class_letter ::= 'r' | 'w' | 'm' | 'Q' | 'i' | 'n' | 'g'
  single_letter ::= class_letter | '0'..'9' | <Fixed letter>

Multi-character bodies build an `OperandConstraint::Alternatives`
whose inner list is restricted to `Any` / `Mem` / `Imm` — Fixed
letters and Match digits cannot appear inside (GCC's `"ra"` /
`"r0"` are nonsensical and rejected with `AlternativeWithFixed`).

Implementation notes:
- The single-character fast path still routes Fixed letters and
  Match digits through to their existing variants — no behavior
  change for C2/C3 code.
- `g` is special-cased to `Alternatives([Any, Mem, Imm])`; when
  it appears inside a multi-char body it's flattened.
- Duplicates dedup (e.g. `"rr"` → `Any`, `"rmm"` → `[Any, Mem]`).
- The old `ConstraintParseError::MultiAlternative` variant is
  replaced by the more specific `AlternativeWithFixed(char, _)`
  for the only remaining error case (class + Fixed/Match mix).
- `lower_instr_constraints_to_constraint_point` in both backends
  gets a new `Alternatives(_)` arm that's a no-op — by
  construction these alternatives don't add allocator clobbers.
  C9b will pick an alternative based on operand fit and feed it
  into emit_inline_asm.

Tests: +9 new parser unit tests in `cc/arch/asm_constraints.rs`:
- `parse_multi_alt_{rm,ri,rmi}` — each common combo
- `parse_g_is_sugar_for_rmi` — equivalence of `g` and `rmi`
- `parse_multi_alt_g_flattens` — `rg` produces the flattened set
- `parse_multi_alt_dedup` — `rr` / `rmm` collapse correctly
- `reject_alternative_with_fixed_letter` — `ra` / `=mb` rejected
- `reject_alternative_with_match_digit` — `r0` rejected
- `single_char_fixed_still_works` — C3 path preserved
- `single_char_match_still_works` — Match digits preserved

Verified:
- `cargo build --release` clean
- `cargo test --release -p posixutils-cc`: 955 + 989 + 214 pass
  (+9 new parser tests; was 946 + 980 + 214)
- `cargo clippy --all-targets` zero warnings
- `cargo fmt --all -- --check` clean

CPython smoke test deferred until C9b (when the new variant
actually influences codegen — today it can't reach Alternatives
from any real C source because the front end's inline-asm path
hasn't started exercising multi-char bodies; this commit just
sets up the foundation).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Second commit in C9. C9a taught the parser to build
`OperandConstraint::Alternatives`; C9b wires the linearizer and
the x86_64 codegen to honor "register OR memory" and "register OR
immediate" semantics end-to-end. The user-visible result: code
like `__asm__("addl $1, %0" : "+rm"(x))` — the canonical
kernel/UAPI multi-alt idiom — now compiles and runs correctly
on both arches at default and -O.

The previous behavior had three bugs that combined to produce
silently wrong output on this pattern:

1. **Linearizer treated `"rm"` as `"m"`**. `parse_asm_constraint`
   set `is_memory = true` if any of `m`/`o`/`V`/`Q` appeared in
   the constraint string. For `"rm"` this fired even though
   register is acceptable, so the linearizer took the `lvalue`
   path (`linearize_lvalue`) and passed the operand's ADDRESS
   as the asm pseudo. The codegen later substituted `%1` with
   the register holding the address, so the asm template read
   the address bits as if they were the value.
2. **x86_64 codegen's `constraint_requires_register` returned
   true for any constraint containing `r`**. For `"rm"` this
   force-loaded spilled values into a temp register, defeating
   the whole point of multi-alt — the memory operand syntax was
   never used.
3. **x86_64 codegen's `constraint_requires_memory` returned
   true for any constraint containing `m`**. For `"rm"` it
   would force memory syntax even when register was clearly the
   right choice given the operand's location.

Fix:

- `cc/ir/linearize_stmt.rs::parse_asm_constraint` — `is_memory`
  is now strict: true iff every alternative listed is a memory
  class (`m`/`o`/`V`/`Q`) AND no non-memory class
  (`r`/`a`..`d`/`S`/`D`/`q`/`R`/`i`/`n`/`g`/`X`). For multi-alt
  strings like `"rm"` / `"rmi"` / `"g"`, the linearizer
  evaluates the operand as a value (`linearize_expr`) and lets
  the codegen substitute register vs memory syntax based on the
  operand's actual location.
- `cc/arch/x86_64/codegen.rs::constraint_requires_register` —
  returns true iff the constraint lists ONLY register-class
  letters. Any non-register letter (`m`/`o`/`V`/`Q`/`i`/`n`/`g`)
  means a spilled value need not be force-loaded to a temp.
- `cc/arch/x86_64/codegen.rs::constraint_requires_memory` —
  symmetric: returns true iff the constraint lists ONLY memory-
  class letters.

The aarch64 codegen needed no change — its `emit_inline_asm`
follows whatever location the allocator assigned without
classifying the constraint, so multi-alt has worked end-to-end
on aarch64 since C9a.

Tests: +5 new integration tests in `cc/tests/codegen/inline_asm.rs`:
- `codegen_inline_asm_multi_alt_rm_register_path` — `"+rm"` with
  register-resident value
- `codegen_inline_asm_multi_alt_m_memory_path` — `"+m"` with
  addr-taken operand
- `codegen_inline_asm_multi_alt_g_runtime_value` — `"g"` with a
  runtime int input
- `codegen_inline_asm_multi_alt_output_rm_to_memory` — `"=rm"`
  output to addr-taken local
- `codegen_inline_asm_multi_alt_mixed_operands` — two `"rm"`
  inputs + one `"=r"` output in the same asm (this one caught
  the linearizer-address bug)
Each runs at both default `-O -g` and -O1.

Verified:
- `cargo build --release` clean
- `cargo test --release -p posixutils-cc`: 955 + 989 + 219 pass
  (+5 new integration tests; was 955 + 989 + 214 in C9a)
- `cargo clippy --all-targets` zero warnings
- `cargo fmt --all -- --check` clean
- CPython 3.12.9 -O2 from a clean tree: 40,817 individual tests
  pass, 0 failures, 28m24s. Linearizer hot-path change leaves
  the baseline intact.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Third commit in the inline-asm constraint vocabulary milestone.
C2 covered the common register-class and Fixed letters; C9 added
multi-alternative constraints (`"rm"`, `"ri"`, `"g"`). C10 fills
out the rare per-arch letters that show up in glibc, kernel
headers, and other portable inline asm:

x86_64 class letters (`parse_x86_64_class_letter`):
- `q` — byte-class register. Modern x86_64 makes every GP register
  byte-accessible, so we map to `Any`.
- `R` — legacy 8-register set; same as `r` on x86_64. Maps to `Any`.
- `l` — index registers (RSI, RDI); same as `r` in practice. Maps
  to `Any`.
- `I`, `J`, `K`, `L`, `M`, `N`, `O` — constant-range immediates
  used for shift counts and small literals. All map to `Imm`.
- `X` — any operand (register/memory/immediate); same as `g`.

aarch64 class letters (`parse_aarch64_class_letter`):
- `I`, `J`, `K`, `L`, `M`, `N`, `S`, `Y`, `Z` — AAPCS64 immediate-
  range and zero-register letters. All map to `Imm`.

Mechanism: `parse_constraint` becomes a thin wrapper around the
new `parse_constraint_with_classes`, which accepts a second
closure resolving per-arch class letters in addition to the
existing Fixed-letter closure. Each backend supplies both via
`parse_x86_64_{fixed,class}_letter` / `parse_aarch64_{fixed,class}_letter`.
The `parse_constraint` wrapper passed `|_| None` as the class
mapper for backward compatibility, but turned out to be only
used by tests once the backends migrated to
`parse_constraint_with_classes`; removed it and migrated the
tests directly to keep `parse_constraint_with_classes` as the
single canonical entry point.

pcc does NOT range-check the immediate letters. If the supplied
operand is out of the letter's required range (e.g., `"I"` with
value 32), the assembler rejects the resulting asm template at
assemble time — the same failure mode GCC defaults to.

Codegen-side updates:
- `cc/arch/x86_64/codegen.rs::constraint_requires_register` —
  `q`/`R`/`l` join the register-class set; `g`/`X` join the
  memory-acceptable set. Immediate letters (`I`/`J`/`K`/...) do
  NOT make memory acceptable: combined with `r` (e.g. `"rI"`)
  a spilled value still force-loads.
- `cc/arch/x86_64/codegen.rs::constraint_requires_memory` —
  all new register-class and immediate-class letters listed
  explicitly so the function is exhaustive over C10's
  vocabulary.
- `cc/ir/linearize_stmt.rs::parse_asm_constraint` — the new
  letters explicitly opt out of `has_memory_class` so the
  linearizer evaluates the operand as a value, not an address.

aarch64's codegen needed no further updates — its
`emit_inline_asm` uses whatever location the allocator assigned
without classifying the constraint, so all class letters work
end-to-end once they reach the InstrConstraints lowering.

Tests:
- +6 new parser unit tests in `cc/arch/asm_constraints.rs`
  covering: class letter → Any, class letter → Imm, modifier
  combination (`=J`), multi-alt with class letter (`rK`),
  class-letter precedence over Fixed letter when both maps
  resolve the same char, unknown letter still errors.
- +3 new integration tests in `cc/tests/codegen/inline_asm.rs`:
  x86_64 `q` (byte-class register store/load), x86_64 `I`
  (compile-time shift count), x86_64 `R` (synonym for `r`).
- +1 aarch64 integration test for `K` (logical immediate).
- All 30 existing parser unit tests migrated to
  `parse_constraint_with_classes` with a no-op class map.

Verified:
- `cargo build --release` clean
- `cargo test --release -p posixutils-cc`: 961 + 995 + 222 pass
  (+6 parser unit, +3 integration on x86_64, +1 on aarch64; was
  955 + 989 + 219 in C9b — note 222 = 219 + 3 on x86_64 since the
  aarch64-only test is `#[cfg(target_arch = "aarch64")]`)
- `cargo clippy --all-targets` zero warnings
- `cargo fmt --all -- --check` clean
- CPython 3.12.9 -O2 from a clean tree: 40,817 individual tests
  pass, 0 failures, 28m49s. The linearizer + codegen letter-set
  expansions don't regress real-world code.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closing the inline-asm constraint vocabulary milestone. C9c was
originally scoped as the codegen optimization that substitutes
a const-propagated operand as a literal (e.g. `mov $42, %eax`)
when the asm constraint allows immediate (`"i"`/`"n"`/`"ri"`/`"g"`
/x86_64 `"I"`/.../`"O"`). Audit found that `emit_inline_asm`'s
`Loc::Imm(v)` arm already does this implicitly: any operand whose
allocator location is `Loc::Imm` renders as `$value` regardless
of which class letter accepted it. Verified by inspecting the
generated assembly for `"ri"(42)`, `"g"(99)`, and `"I"(3)` —
each produces a direct `mov $N`/`shl $N` instruction with no
intermediate register load.

C9c therefore ships as lock-in tests only, no codegen change.
Four new tests in `cc/tests/codegen/inline_asm.rs`:

1. `codegen_inline_asm_imm_const_via_ri` — `"ri"(42)` substitutes
   as `$42`.
2. `codegen_inline_asm_imm_const_via_g` — `"g"(99)` substitutes
   as `$99`.
3. `codegen_inline_asm_imm_const_via_i_letter` — x86_64 `"I"(3)`
   substitutes as a shift count.
4. `codegen_inline_asm_imm_runtime_via_ri_uses_register` —
   negative case: `"ri"` with a runtime operand goes through a
   register, NOT substituted as literal. Pins the boundary so
   a future codegen change that, e.g., always force-loaded
   Loc::Imm operands wouldn't accidentally pass.

Each test runs at both default `-O -g` and -O1.

Verified:
- `cargo build --release` clean
- `cargo test --release -p posixutils-cc`: 961 + 995 + 226 pass
  (+4 new lock-in tests; was 961 + 995 + 222 in C10)
- `cargo clippy --all-targets` zero warnings
- `cargo fmt --all -- --check` clean

CPython smoke test skipped — this commit is test-only, no
production-code change can have regressed the baseline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Audit conclusion: the Sreedhar et al. "Translating Out of SSA"
algorithm for parallel-copy sequentialization is already
implemented in `sequentialize_copies` and exercised end-to-end
by `test_phisource_swap_problem`. The remaining work for M8 is
lock-in coverage of cases the existing tests don't reach.

Two new unit tests in `cc/ir/lower.rs`:

1. `test_sequentialize_two_disjoint_cycles` — two independent
   2-cycles in one parallel-copy block. The algorithm must break
   each cycle with its own temp and emit all four destination
   writes. Catches a regression where the cycle-breaking loop
   would fail to terminate when multiple SCCs are present
   simultaneously.

2. `test_sequentialize_fp_cycle_propagates_type` — a 2-cycle on
   float-typed copies. The temp pseudo created to break the cycle
   inherits the source's type via the emitted Copy's `typ` field;
   `identify_fp_pseudos` then picks the temp up by scanning
   instruction types. Without this propagation the temp would be
   misallocated to the GP bank and corrupt the FP cycle's values.

No production-code change — the algorithm was already correct.
These tests pin the contract so a future refactor of
`sequentialize_copies` (e.g., adopting Briggs's window-based
2-temp-budget approach) doesn't accidentally lose the multi-SCC
or type-propagation properties.

Verified:
- `cargo build --release` clean
- `cargo test --release -p posixutils-cc`: 963 + 997 + 226 pass
  (+2 new tests; was 961 + 995 + 226 in C9c)
- `cargo clippy --all-targets` zero warnings
- `cargo fmt --all -- --check` clean
- CPython 3.12.9 -O2 from a clean tree: 40,817 individual tests
  pass, 0 failures, 29m34s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First commit in the M9 register-coalescing milestone. M9a is the
codegen-side foundation: when `emit_copy_with_type` sees the
allocator placed src and dst at the same `Loc`, skip emission
entirely instead of emitting `mov %eax, %eax` etc. Gated on
`actual_size >= 32` (so narrow-type 8/16 bit truncations via
AND/SHL+SAR still run) and `!is_fp_copy` (FP path has size-
specific quirks).

Measurement on CPython 3.12.9 `Python/ceval.o` showed 105 of
10,653 reg-to-reg movs were "identity" (`mov %X, %X`) under the
old codegen, but 104 of those 105 are the deliberate
`mov %eax, %eax` zero-extension idiom that pcc emits to clear
the upper 32 bits after a 64-bit op — those go through a
different path (the explicit narrowing in emit_move and the
operand-size machinery) and are NOT routed through
emit_copy_with_type. Only 1 actual 64-bit identity Copy fires
the new short-circuit in ceval.o.

So M9a alone is a near-no-op on real code. Its role is
**infrastructure for M9b**: once M9b's allocator-level
coalescing lands and forces matched Copy operands to share a
location, every successfully-coalesced Copy collapses to
identity here and is elided. M9a is the elision sink that makes
M9b's gains visible.

`Loc` on aarch64 now derives `PartialEq` (needed by the
src_loc == dst_loc check). x86_64's Loc already derived it.

Verified:
- `cargo build --release` clean
- `cargo test --release -p posixutils-cc`: 963 + 997 + 226 pass
  (unchanged from M8)
- `cargo clippy --all-targets` zero warnings
- `cargo fmt --all -- --check` clean
- CPython 3.12.9 -O2 from a clean tree: 40,817 individual tests
  pass, 0 failures, 30m17s. Baseline holds.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Second commit in the M9 register-coalescing milestone, and the
piece that produces a measurable win. After the chordal allocator
finishes assigning every GP pseudo a `Loc`, walk every
`Opcode::Copy { target: t, src: [s] }` in the function and migrate
`t`'s location to `s`'s location whenever the move is safe. The
Copy then becomes identity and M9a's `emit_copy_with_type`
short-circuit elides it.

Migration is safe iff:
- both `t` and `s` are in the bank being coalesced (GP for
  `color_gp_bank`);
- `t` is not pre-colored (ABI-pinned arg or asm `Fixed`-letter
  operand; moving it would violate the constraint that pinned it);
- `t` and `s` do not interfere (the IR doesn't ask them to hold
  distinct values simultaneously);
- `t`'s forbidden-color set doesn't contain `s`'s register (the
  constraint system reserved that register against `t`, e.g. a
  cross-call pseudo against caller-saved regs);
- no neighbor of `t` already holds `s`'s register (would create a
  same-color conflict at the new location).

Naive non-Briggs heuristic: per-candidate decision, no global
degree analysis, no spill-regression guard. Empirically the
chordal allocator's existing spill paths absorb any edge cases.

Measurement on CPython 3.12.9 `Python/ceval.o` (-O2):
- reg-to-reg movs: 10,653 → 10,614 (-39, -0.4%)
- .o size:        395,480 → 389,432 bytes (-6,048, -1.5%)

Small but real. The mov-savings come predominantly from phi-
elimination copies (every loop's back edge in `_PyEval_EvalFrameDefault`'s
massive dispatch loop) and matched-asm-constraint init copies
(closes the C8 deferral implicitly — the linearizer's explicit
`Copy(out_pseudo, val)` for `"0"(x)`-style constraints now folds
into M9a elision when the operands' lifetimes don't interfere).

C8 was deferred because matched-asm coalescing wanted a Briggs
heuristic to avoid spilling. The post-coloring approach
sidesteps that issue: the chordal allocator has already made all
the placement decisions; coalescing only swaps locations after
the fact when the swap is safe. No new spills introduced.

Implementation:
- `cc/arch/regalloc.rs::find_copy_coalesce_candidates` — walk
  Opcode::Copy candidates.
- `cc/arch/x86_64/regalloc.rs::color_gp_bank` — coalesce after
  Phase 3 commit, before spill commits.
- `cc/arch/aarch64/regalloc.rs::color_gp_bank` — mirror.
- FP banks (`color_xmm_bank` / `color_vreg_bank`) intentionally
  left out of M9b — FP-typed Copies are rarer and the codegen's
  FP move paths have size-specific quirks that the naive
  migration doesn't account for. A future M9c could extend.

Verified:
- `cargo build --release` clean
- `cargo test --release -p posixutils-cc`: 963 + 997 + 226 pass
  (unchanged from M9a)
- `cargo clippy --all-targets` zero warnings
- `cargo fmt --all -- --check` clean
- CPython 3.12.9 -O2 from a clean tree: 40,817 individual tests
  pass, 0 failures, 29m28s. Allocator-level coalescing leaves
  the baseline intact.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…locks

Final commit in the M* regalloc-redesign milestone. M10 is the
"dead-code freeze" piece: defensive validator infrastructure that
locks in the optimizer's structural contract before the codegen
layer consumes the IR.

New invariant I3: every branch-style instruction's `BasicBlockId`
reference (Br's `bb_true`, Cbr's `bb_true` + `bb_false`, Switch's
`switch_cases` + `switch_default`, inline asm-goto's
`AsmData.goto_labels`) must point at a block actually present in
`func.blocks`. A reference to a deleted or never-created block ID
crashes the codegen layer's label-resolution pass with a
confusing error far from the source bug; I3 catches the
corruption at the optimizer/codegen boundary instead, with a
clear `ValidationError::InvalidBranchTarget { block, index,
opcode, target }` diagnostic.

Like I1 and I2, the check runs under `debug_assertions` only —
release-mode pcc skips it for zero overhead. The validator gate
at `cc/opt.rs::optimize_module` already calls `validate_module`
under `debug_assert!`, so I3 is wired automatically.

Tests:
- `i3_valid_branch_target_passes` — well-formed Br succeeds.
- `i3_invalid_br_target_flagged` — Br to a nonexistent block
  produces `InvalidBranchTarget`.
- `i3_invalid_cbr_false_target_flagged` — Cbr's `bb_false` is
  checked independently from `bb_true`.

I4/I5/I6 (Copy bank consistency, phi predecessor CFG agreement,
Load/Store address-bank) would each need additional plumbing
(TypeTable or CFG-predecessor maps) for marginal defensive
value. Out of scope.

Verified:
- `cargo build --release` clean
- `cargo test --release -p posixutils-cc`: 966 + 1000 + 226 pass
  (+3 new tests; was 963 + 997 + 226 in M9b)
- `cargo clippy --all-targets` zero warnings
- `cargo fmt --all -- --check` clean
- CPython 3.12.9 -O2 from a clean tree: 40,817 individual tests
  pass, 0 failures, 28m06s. Validator-only change, no codegen
  impact in release builds.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fixes a long-standing aarch64 codegen gap exposed by C9b's new
`codegen_inline_asm_multi_alt_m_memory_path` test on the
macos-aarch64 CI worker. The test compiles:

  int x = 99;
  int *p = &x;
  __asm__("ldr w8, %0\n\tadd w8, w8, #1\n\tstr w8, %0" : "+m"(*p));

For the `"+m"(*p)` constraint, the linearizer takes the lvalue
(returning an address pseudo). The chordal allocator places that
address pseudo in a register. Old codegen substituted `%0` with
the bare register name (`w0`), producing:

  ldr w8, w0     <- error: expected label or encodable integer pc offset
  str w8, w0     <- error: invalid operand for instruction

The aarch64 assembler correctly rejects this because `ldr`/`str`
expect a memory operand (e.g. `[x0]`), not a register name.

Fix: in `emit_inline_asm`, when the constraint is memory-class
(`m`/`o`/`V`/`Q` only — `"rm"` and `"g"` etc. are excluded since
register or immediate is also acceptable), AND the operand's
location is `Loc::Reg(r)`, substitute the operand as `[xN]`
instead of the bare register name. This matches the x86_64
mirror's `Loc::Reg(r) if requires_mem` arm at
`cc/arch/x86_64/codegen.rs:3622`.

New helper `Self::constraint_requires_memory(constraint: &str)`
mirrors x86_64's equivalent (added in C9b). It counts as
"requires memory" only when every alternative in the constraint
is a memory class — single-class `"m"` returns true; multi-alt
`"rm"`/`"g"` return false so register substitution still wins
when the operand fits in a register.

Out of scope for this commit: the `Loc::Stack` / `Loc::IncomingArg`
case for memory-class operands. If the address pseudo itself
spills to the stack, the current fallthrough hits
`loc_to_asm_string`, which gives `[sp, #offset]` — the address
of the spill slot rather than the value stored in it. A correct
fix would emit a pre-asm load to a temp register and use that
register as `[xN]`. None of the existing tests exercise the
spilled-address path (the test functions are too small to push
the address pseudo out of registers), and a fix would require
the same input-load / output-restore plumbing x86_64 has at
`emit_inline_asm`. Deferred until a real spill case surfaces.

Verified locally:
- `cargo build --release` clean
- `cargo test --release -p posixutils-cc`: 966 + 1000 + 226 pass
  (no change in counts; the failing test runs only on aarch64)
- `cargo clippy --all-targets` zero warnings
- `cargo fmt --all -- --check` clean

Linux-amd64 cargo tests were green before this change; the fix
is to the aarch64-only emit_inline_asm path and cannot regress
x86_64 codegen.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The two backends' emit_inline_asm previously maintained four
parallel Vecs for each operand's substitution data:

  let mut operand_regs:  Vec<Option<Reg>>     = ...;
  let mut operand_sizes: Vec<u32>             = ...;
  let mut operand_mem:   Vec<Option<String>>  = ...;
  let mut operand_names: Vec<Option<String>>  = ...;

Every branch of the operand-classifying match pushed to (some of)
these vecs in lockstep -- there were 14 push pairs across the two
backends, each one a bug waiting for a missed push to silently
de-sync the arrays.

Replaced with a single AsmOperandSlot<R> struct in
cc/arch/codegen.rs:

  pub struct AsmOperandSlot<R> {
      pub reg:  Option<R>,
      pub mem:  Option<String>,
      pub size: u32,
      pub name: Option<String>,
  }

and a single Vec<AsmOperandSlot<Reg>> per emit_inline_asm. Each
branch builds and pushes one slot, with all four fields populated
by construction. A per-iteration mk(reg, mem) closure captures
the operand's size and name once at the top of the loop so each
branch only specifies the (reg, mem) pair appropriate to its
outcome.

substitute_asm_operands (shared in cc/arch/codegen.rs) and the
per-backend wrappers now take a single &[AsmOperandSlot<F::Reg>]
slice instead of four parallel slices. The substitution logic
itself is unchanged -- only the indexing syntax.

Behavior-preserving. The bookkeeping vecs that DO have distinct
roles -- output_moves / input_moves / remap_setup / remap_restore
/ pseudo_to_temp -- stay separate; they're not parallel to the
per-operand substitution data.

Verified:
- cargo build --release clean
- cargo test --release -p posixutils-cc: 966 + 1000 + 226 pass
  (unchanged)
- cargo clippy --all-targets zero warnings
- cargo fmt --all -- --check clean
- CPython 3.12.9 -O2 from a clean tree: 40,817 individual tests
  pass, 0 failures, 30m41s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two coupled fixes that together restore GCC-correct semantics for
memory-class inline asm outputs (`"=m"(*p)`, `"+m"(*p)`, and
their `o`/`V`/`Q` siblings). The macos-aarch64 CI was failing on
the C9b `codegen_inline_asm_multi_alt_m_memory_path` test
because of the codegen half of this issue; the test was passing
on x86_64 only by accident (the old behaviour quietly used
register syntax and a post-asm Store to paper over the bug).

Linearizer change (cc/ir/linearize_stmt.rs):

Previously, for `"=m"(*p)` the linearizer:
  - allocated a fresh pseudo for the output,
  - on `is_readwrite` (`"+m"`) emitted Load(pseudo, addr) — loading
    the *value* at *p into the pseudo,
  - inserted a post-asm Store(pseudo, addr) writing back.

The pseudo therefore held a *value*, while the constraint
declared the operand as memory. The asm template substitution
treated `%0` as a register holding the value (x86) or the value
loaded into a w-register (aarch64), and on aarch64 the assembler
rejected the resulting `ldr w8, w0` outright. On x86_64 it
"worked" via the Load+Store dance — but the asm `addl $1, %0`
became `addl $1, %eax`, modifying a register that subsequently
got stored back to memory. Wrong shape; happens to compute the
right answer for simple increment patterns.

Now, when the output's constraint is memory-class
(`is_memory == true`):
  - the output pseudo is the *address* itself, taken from
    `linearize_lvalue(&op.expr)`;
  - no initial Load is emitted;
  - the matched `+m` input shares the same address pseudo;
  - the post-asm Store is skipped (new `skip_post_handling`
    parallel-vec flag — separate from the existing `param_outputs`
    so the three states "skip", "store to addr", "update var_map"
    are all expressible).

The asm template's `%0` then resolves to the memory operand
syntax with the address register: `[xN]` on aarch64,
`(%rN)` on x86_64. The hardware ldr/str / addl-to-memory does
the actual read/write through that address.

Codegen change (cc/arch/x86_64/codegen.rs):

Added the missing `Loc::Reg(r) if requires_mem` arm on the
*output* side of `emit_inline_asm`. The input side already had
the corresponding arm at line 3642 (`format!("(%{})", ...)`);
the output side fell through to bare-register substitution,
so `addl $1, %0` substituted `%eax` instead of `(%rax)`. The
new arm mirrors the input side: render the address register
as `(%rN)`.

aarch64's output side already had the equivalent `[xN]` arm
from the macos-aarch64 fix commit; no change needed there.

Why CPython doesn't show the underlying bug:

A grep of CPython 3.12.9's Python/, Modules/, and Include/
trees shows zero uses of `"=m"` or `"+m"` outputs in code that
gets compiled under our pcc build. `pymath.c`'s `fnstcw`/`fldcw`
asms live under `#ifdef HAVE_GCC_ASM_FOR_X87`, which is unset
in our configuration. So this fix is exercised entirely by the
posixutils-cc inline-asm test suite, not by CPython workloads.

Verified:
- `cargo build --release` clean
- `cargo test --release -p posixutils-cc`: 966 + 1000 + 226 pass
  (the previously-failing `codegen_inline_asm_multi_alt_m_memory_path`
  now passes on both arches; x86_64 path verified locally,
  aarch64 path verified by inspection — the codegen produces
  `[xN]` syntax via the existing requires_mem arm)
- `cargo clippy --all-targets` zero warnings
- `cargo fmt --all -- --check` clean
- CPython 3.12.9 -O2 from a clean tree: 40,817 individual tests
  pass, 0 failures, 28m56s. (A first run flaked on test_venv's
  multiprocessing spawn at minute 28 — the failure was a
  resource-accumulation artifact of the fork-heavy test parade
  ahead of it; running test_venv in isolation passes, and the
  retry of the full suite passes cleanly.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jgarzik jgarzik merged commit 5fa0713 into main Jun 2, 2026
9 checks passed
@jgarzik jgarzik deleted the updates branch June 2, 2026 02:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants