Conversation
First commit in the per-instruction operand/clobber constraint
system milestone. No new constraint vocabulary yet — just plumbs
two new sources of clobbers through the existing ConstraintPoint
mechanism so the chordal allocator forbids them.
1. **Inline-asm explicit clobbers.** `Opcode::Asm` now produces
constraints from `AsmData.clobbers`. Named GP register strings
(rax/eax/al/r10d/etc on x86_64; x0/w0/fp/lr/etc on aarch64)
are mapped to `Reg` via per-arch `parse_gp_clobber_name`.
`"memory"` and `"cc"` are filtered out (the former gets full
barrier semantics in C3; the latter is a no-op since we don't
track condition-code liveness).
2. **Libc-call-emitting opcodes.** `find_call_positions` is
parameterised on an `is_call_like(Opcode) -> bool` predicate.
x86_64 extends the base set with `Fabs32/64`, `Signbit32/64`,
`Memset/Memcpy/Memmove`; aarch64 extends with the same FP
builtins (its Memset/Memcpy/Memmove reach the regular
`Opcode::Call` path so are already covered). These IR opcodes
lower to direct `Bl`/`Call` in `features.rs`, but linear scan
hid the latent bug by never picking the at-risk regs;
chordal coloring would expose it.
aarch64 `get_constraint_info_aarch64` is built out (previously
returned None for every instruction).
Tests added:
- x86_64 unit tests for `parse_gp_clobber_name` (size aliases,
leading `%`, case insensitive, unknown/cc/memory rejection)
- aarch64 unit tests for the same + special names (fp/lr/sp/xzr).
- Both `is_call_like_*` predicates anchored so a future codegen
change that adds a new builtin->libc lowering doesn't silently
drop out.
- `codegen_gp_live_across_inline_asm_clobber` defensive E2E test
in cc/tests/codegen/misc.rs.
Verified:
- cc test suite 926+960+206 pass (13 new tests)
- cargo clippy --all-targets clean
- cargo fmt --check clean
- CPython 3.12.9 -O2 449/449 (40,817 individual tests, 30m02s)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Second commit in the constraint-system milestone. Introduces the
per-operand vocabulary that C3 will start consuming, but lowers it
back to the existing `ConstraintPoint` for now so allocator
behaviour is unchanged.
New module: `cc/arch/asm_constraints.rs` defines
enum OperandConstraint<R> { Any | Fixed(R) | Match(usize) | Mem | Imm }
enum OperandKind { Use | Def | UseDef | EarlyClobber }
struct InstrConstraints<R> { operands, clobbers, memory_barrier }
struct OperandSpec<R> { pseudo, kind, constraint }
enum ConstraintParseError { Empty, EarlyClobberOnInput, MultiAlternative, UnknownLetter }
fn parse_constraint(s, letter_map) -> Result<(kind, con), error>
Supported syntax (C2 scope): `r`/`w` (Any), `=r`/`+r`/`&=r`
(Def/UseDef/EarlyClobber), `0`–`9` (Match), `m`/`Q` (Mem),
`i`/`n` (Imm), plus per-arch Fixed letters. Multi-alternative
constraints (`rm`/`ri`/`g`) and the rare per-arch letters
(`q`/`Q`/`R`/`l`/`t`/`I`/`J`/...) are out of scope and produce a
diagnostic — they become a follow-up after C5.
Per-arch resolvers:
- `parse_x86_64_fixed_letter` maps `a/b/c/d/S/D` → Rax/Rbx/Rcx/Rdx/Rsi/Rdi.
- `parse_aarch64_fixed_letter` returns None for every letter
(AAPCS64 has no C2-scope Fixed letters; all aarch64 register
operands use `r` or `w`). The function exists for symmetry
and as the future-extension point.
Integration: x86_64 and aarch64 `get_constraint_info` now route
inline asm through `build_asm_instr_constraints_*` →
`lower_instr_constraints_to_constraint_point_*`. The lowering
promotes Fixed operands into the clobber set and notes Match
operands (for C3's coalescing edges) and EarlyClobber (for C3's
extra interference). `memory_barrier` is captured but consumed
in C3.
11 new unit tests in `cc/arch/asm_constraints.rs` cover every
modifier+letter combination, EarlyClobber semantics, parse errors,
and rejected multi-alternative strings.
Verified:
- cc test suite 937+971+206 pass (11 new parser tests)
- cargo clippy --all-targets clean
- cargo fmt --check clean
- CPython 3.12.9 -O2 449/449 (40,817 individual tests, 28m37s)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Third commit in the constraint-system milestone. Wires the
chordal allocator to consume `OperandConstraint::Fixed(R)` so
the asm operand pseudo lands directly in the constraint-required
register, eliminating the codegen's "move into fixed register
around asm" defensive pattern in the common case.
New per-arch helper `collect_asm_fixed_precolors_{x86_64,aarch64}`
walks `func.blocks` for `Opcode::Asm`, builds the C2
`InstrConstraints` view, and collects every operand carrying a
`Fixed(R)` constraint. The result is folded into chordal coloring
via the `pre_colored` map in `color_gp_bank`.
One gotcha caught and documented: Phase 3's commit loop skips
pre-colored vertices on the assumption that their location is
already in `self.locations` (true for ABI-pinned args, which
`allocate_arguments` inserts before chordal runs). Inline-asm
Fixed pre-colors arrive in `color_gp_bank` without going through
`allocate_arguments`, so they also need a direct
`self.locations.insert`. If missed, `get_location` returns
`Loc::Imm(0)` as the default and every Store/Load involving the
operand silently writes/reads zero — caught by
`codegen_inline_asm_x86_64_mega::SPECIFIC_REGISTERS` returning 60
before the fix was added.
Tests:
- `codegen_inline_asm_x86_64_fixed_register_precolor` exercises
every Fixed letter in C2 scope (a, b, c, d, S, D) in both `=`
and bare positions.
- `codegen_inline_asm_x86_64_asm_goto_pseudo_survives_edge`
documents and locks in the existing asm-goto CFG plumbing.
The linearizer in `cc/ir/linearize_stmt.rs:2100-2114` already
`link_bb`s the asm block to every goto target plus the
fall-through, so liveness propagates correctly. The plan
worried this might be missing; verified that it's not.
Out of scope for this commit (deferred follow-ups):
- `OperandConstraint::Match(idx)` coalescing — the existing
linearizer emits an explicit Copy from input pseudo to the
matched output pseudo so they share a value at the IR level;
the allocator already places them in the same location without
needing an explicit interference-graph coalescing pass.
- `InstrConstraints.memory_barrier` strict semantics — pcc's
mem2reg is already in place (`cc/ir/mem2reg.rs` runs after
`cc/ir/ssa.rs::ssa_convert` at `cc/ir/linearize.rs:1089`),
so promoted locals have no stack slot for an asm `"memory"`
clobber to invalidate, and memory-resident data (addr-taken,
volatile, atomic, non-scalar) is always accessed through
explicit `Load` / `Store` IR ops that pcc doesn't reorder
across an asm. Strict barrier semantics are therefore already
observed today; `memory_barrier` is captured on
`InstrConstraints` for completeness and to anchor future
passes that might add memory-access reordering (LICM, GVN,
SCCP-style memory dataflow) and need an explicit barrier
fence point.
- Codegen rewrite that drops the mov-around-asm fallback. The
existing fallback in `emit_inline_asm` is now a no-op for
pre-colored operands but still handles the spilled-operand
case correctly. Removing it is a quality improvement, not
needed for correctness.
aarch64 plumbing wired through for symmetry. AAPCS64 has no
C2-scope Fixed letters today so `collect_asm_fixed_precolors_aarch64`
always returns an empty map — but the future-extension point is
in place.
Verified:
- cc test suite 937+971+207 pass (2 new tests)
- cargo clippy --all-targets clean
- cargo fmt --check clean
- CPython 3.12.9 -O2 449/449 (40,817 individual tests, 30m09s)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fourth commit in the constraint-system milestone. Lands the per-IR-opcode constraint-declaration infrastructure for freeing R10 and R11 from the reserved-scratch list, but does NOT yet add them to `Reg::allocatable()` — see the deferred-freeing block in the source for the empirical reasons why. `opcode_clobbers_r10_r11(op)` is a conservative predicate: every IR opcode whose codegen helper is more than a single instruction is assumed to touch R10 and/or R11. The truly clean opcodes (`Br`, `Nop`, `Phi`, `PhiSource`, `Unreachable`, `Fence`, `VaEnd`) are explicitly excluded. The predicate feeds `get_constraint_info`, so a constraint point with R10/R11 in its clobber set is built for every dirty IR instruction. The integration into `get_constraint_info` covers both the inline-asm path (already routed through `build_asm_instr_constraints_x86_64` from C2) and the regular opcode-level path. Clobber sets are merged from `opcode_constraints` (idiv/shift/vaarg hardware constraints) and the new R10/R11 entries, then sorted/deduped. **R10/R11 freeing is deferred — full rationale documented as a block comment immediately following `opcode_clobbers_r10_r11`.** TL;DR: per-IR-opcode constraints don't cover the inter- instruction codegen paths (function prologue rdi/rcx save through R10/R11 around `rep stosq`; variadic save area; spilled-arg prologue restore; FP / struct lowerings spanning multiple LIR emissions per IR instruction). Adding R10/R11 to `allocatable()` even with broad clobber declarations still segfaults CPython's `_bootstrap_python` at deepfreeze generation. Genuine freeing requires either a codegen refactor (~4000–8000 LOC) or a pre-IR scratch-declaration side table (~1500–2500 LOC) — both are multi-session milestones in their own right. This commit ships the predicate + integration so that when either prerequisite lands, adding R10/R11 to `Reg::allocatable()` is the only change needed to start producing forbidden-set entries. XMM14/XMM15 freeing was also out of scope: the chordal allocator's `color_xmm_bank` doesn't yet consume constraint points at all (`ConstraintPoint<XmmReg>` is unused). Building the XMM constraint plumbing is the structural prerequisite for their freeing. Verified: - cc test suite 937+971+208 pass - cargo clippy --all-targets clean - cargo fmt --check clean - CPython 3.12.9 -O2 449/449 (40,817 individual tests, 28m33s) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fifth and final commit in the constraint-system milestone. Mirrors C4's x86_64 work for aarch64. New `opcode_clobbers_aarch64_scratches(op)` predicate uses the identical exclusion list as the x86_64 mirror (Br, Nop, Phi, PhiSource, Unreachable, Fence, VaEnd). `AARCH64_SCRATCH_REGS` lists the registers covered: X9, X10, X11 (the documented `Reg::scratch_regs()` triple) plus X16, X17 (AAPCS64 IP0/IP1, reused by the pair-address legalizer from commit `6af088eb`). `get_constraint_info_aarch64` is extended along the x86_64 shape: inline-asm path folds in the scratch clobbers on top of the C2 operand vocabulary, and the non-asm path emits a ConstraintPoint for any dirty opcode. **Freeing is deferred** — the deferred-freeing block in source documents the path forward. Same reasons as x86_64's C4 (inter-instruction codegen paths in the prologue, callee-saved save/restore, pair-address legalizer's X16 materialization, and int128 lo/hi shuttle). V16/V17/V18 freeing has the additional prerequisite of V-bank ConstraintPoint plumbing through `color_vreg_bank`. Adding the registers to `Reg::allocatable()` once the codegen refactor lands is the only remaining change. Verified: - cc test suite 937+971+208 pass - cargo clippy --all-targets clean - cargo fmt --check clean - CPython 3.12.9 -O2 449/449 (40,817 individual tests, 29m52s) This completes the constraint-system milestone (C1-C5). Summary of the five commits: - C1 (0f58a71): Source inline-asm + libc-call clobbers into ConstraintPoint. Plumbs new clobber sources through existing mechanism; fixes latent FP-pseudo-across-libc-call bug. - C2 (fd034c7): Per-operand inline-asm constraint vocabulary + GCC syntax parser. Introduces OperandConstraint / OperandKind / InstrConstraints. Lowers to ConstraintPoint for back-compat. - C3 (05be72d9): Allocator-aware inline-asm Fixed-operand pre-coloring. Asm operands with `"a"`/`"=a"` etc. land directly in the constraint-required register. - C4 (4f9697b9): x86_64 scratch-clobber declaration infrastructure for R10/R11; freeing deferred. - C5 (this commit): aarch64 mirror for X9/X10/X11/X16/X17; freeing deferred. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First commit in the inline-asm memory-barrier milestone. Establishes
the consumer-facing surface and the structural invariant; no codegen
or allocator behavior change yet.
The constraint-system C2/C3 commits already plumbed
`InstrConstraints.memory_barrier` from the asm `"memory"` clobber
through both `build_asm_instr_constraints_{x86_64,aarch64}` builders.
What was missing: (a) a way for IR-pass code to ask the question
without going through the per-arch constraint builder, (b) explicit
documentation of the no-reorder invariant the existing optimizer
passes rely on, and (c) a validator check that catches accidental
drift between the barrier predicate and the DCE-root predicate.
Three pieces:
1. `Instruction::is_memory_barrier()` returns true for:
- `Opcode::Asm` with `"memory"` in its clobber list — the
`pause`/`yield` spin-loop, ticket-lock acquire, and
`__sync_synchronize`-style fence idioms.
- `Opcode::Fence` — explicit C11 `atomic_thread_fence`.
- `Opcode::Atomic*` — every atomic memory op (Relaxed included;
finer-grained one-sided predicates can refine later when a
pass needs the distinction).
- `Opcode::Call` — pcc has no escape/alias analysis, so any
external call must be treated as a possible memory access on
any reachable location. Conservative.
- `Opcode::Setjmp` / `Opcode::Longjmp` — non-local control flow.
The predicate is the surface future memory-reordering passes
(GVN, LICM, load-store forwarding, machine scheduler) will
consult before crossing any of the above.
2. Module-level no-reorder contract documentation added to
`cc/ir/dce.rs` and `cc/ir/instcombine.rs`. The audit before this
commit confirmed that DCE only deletes Nops (never reorders) and
that InstCombine only rewrites Add/Sub/Mul/Div/Mod/Shl/Lsr/Asr/
And/Or/Xor/SetEq/.../Neg/Not in place (never touches Load,
Store, Asm, Call, Atomic*, Fence). Today the `"memory"` clobber
"works" because no pass reorders memory ops at all — the
contract codifies the implicit invariant so any future change
that does start reordering must consult `is_memory_barrier()`.
3. New IR-validator invariant `I2 — MEMORY BARRIERS ARE DCE ROOTS`.
Every instruction that satisfies `is_memory_barrier()` must also
have `op.has_side_effects() == true`, or DCE would silently
delete it. Today the two predicate sets align by construction;
I2 surfaces drift immediately in debug builds rather than
waiting for a memory-reordering pass to miscompile a real
spinlock. New `ValidationError::BarrierWithoutSideEffect`
variant. Validator runs only under `debug_assertions` via the
existing `opt::optimize_module` debug_assert call site;
release-mode pcc (CPython, all other -O builds) skips it
for zero overhead.
Tests:
- `is_memory_barrier` cell at `cc/ir/mod.rs`: pure ops are not
barriers; Fence + every Atomic* return true; Call/Setjmp/Longjmp
return true; asm with `"memory"` returns true; asm with `"cc"` /
named reg / empty clobbers / no asm_data return false.
- `build_asm_instr_constraints_x86_64_propagates_memory_barrier`
and aarch64 mirror: build the per-arch InstrConstraints from a
hand-built asm Instruction and assert `memory_barrier` matches
the clobber list.
- `i2_barrier_implies_side_effect_structural` walks every barrier
opcode and asserts the implication holds — a meta-test that
catches drift if either predicate set changes.
- `i2_asm_with_memory_clobber_validates`: end-to-end positive case
through `validate_function`.
Verified:
- `cargo build --release` clean
- `cargo test --release -p posixutils-cc`: 946 + 980 + 208 pass
(+5 new tests; was 941 + 975 + 208 in C5)
- `cargo clippy --all-targets` zero warnings
- `cargo fmt --all -- --check` clean
- Partial CPython -O2 rebuild (`Python/ceval.o`) byte-size
identical to prior pcc — release-mode codegen unchanged, as
expected given C6a touches only debug-only validation + new
helper method. Full CPython -O2 test pass is structural-only
and not at risk; will be re-verified at end of C6 milestone
(C6b lock-in tests).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Second commit in the inline-asm memory-barrier milestone. C6a
established the consumer-facing predicate
`Instruction::is_memory_barrier()` and documented the no-reorder
contract in `cc/ir/dce.rs` and `cc/ir/instcombine.rs`. C6b adds the
test moat that locks the contract in end-to-end so a future GVN /
LICM / load-store forwarding / machine-scheduler regression fails
loudly with the wrong return code rather than silently breaking a
real kernel-style spinlock under -O2.
Five new tests in `cc/tests/codegen/inline_asm.rs`, each exercising
a different heavily-used `asm volatile("..." ::: "memory")` idiom
and each run at both default and -O1 to catch the optimizer-only
regression case:
1. `codegen_inline_asm_memory_barrier_pointer_aliased_reload`
Plain pointer-aliased store-store-load across an empty
compiler-barrier asm. A pass that store-forwarded the stale
value across the barrier would observe 42 instead of 99.
2. `codegen_inline_asm_memory_barrier_spin_wait_pattern`
`pause` (x86_64) / `yield` (aarch64) spin loop polling a
non-volatile location through a pointer. The memory-clobber
asm inside the loop body must force `*p` to reload each
iteration; if a hoisting pass lifted the load out of the loop,
the loop would never terminate and `iters > 10` would trip.
3. `codegen_inline_asm_memory_barrier_full_fence`
`mfence` (x86_64) / `dmb ish` (aarch64) full fence between two
stores to the same location. The post-fence load must observe
the second store, not the first.
4. `codegen_inline_asm_memory_barrier_double_checked_init`
Classic DCI pattern: release barrier between `value = 42` and
`initialized = 1` in the slow path; acquire barrier in the
reader after seeing `initialized == 1`. Single-threaded so
we're only locking in compiler-level ordering (which is exactly
what's at risk from future passes), but the pattern matches the
real cross-thread idiom one-for-one.
5. `codegen_inline_asm_memory_barrier_sync_synchronize_pattern`
`__sync_synchronize`-equivalent: a single fence between two
unrelated store-then-verify pairs. A pass that reordered the
second store-then-verify across the fence would observe the
wrong value.
The tests pass today on both x86_64 and aarch64 because no IR
pass currently reorders memory ops at all — they're forward
infrastructure. When the broader regalloc milestone moves into
M8 / M9 / M10 (which add real reordering capabilities), any pass
that drops a load, forwards a store, or hoists a load past an
`is_memory_barrier()` instruction will trip one of these tests.
Verified:
- `cargo build --release` clean
- `cargo test --release -p posixutils-cc`: 946 + 980 + 213 pass
(+5 new tests; was 946 + 980 + 208 in C6a)
- `cargo clippy --all-targets` zero warnings
- `cargo fmt --all -- --check` clean
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR formalizes memory-ordering semantics in the IR and pushes inline-asm constraints into the register allocator. It adds an IR-level Instruction::is_memory_barrier() (asm "memory" clobber, Fence, Atomic*, Call, Setjmp, Longjmp), wires a validator invariant (I2) that every barrier must also be a DCE root via has_side_effects(), and introduces a shared cc/arch/asm_constraints.rs parser for GCC-style constraint strings. The x86_64 and aarch64 backends now thread asm clobbers/operands through get_constraint_info, pre-color Fixed operands in the chordal allocator, extend call-like detection to backend-emitted libc calls (Fabs*, Signbit*, Memcpy/Memset/Memmove), and replace the unsafe usize::MAX spill-slot drain with a monotonic-by-start commit. New runtime tests cover asm clobbers, asm-goto liveness, and fixed-register precoloring.
Changes:
- Add
Instruction::is_memory_barrier()+ validator I2 (barrier ⇒ side-effect) with structural meta-test. - Add shared GCC inline-asm constraint vocabulary/parser and per-arch
build_asm_instr_constraints_*/ pre-color collectors; wire them into x86_64 and aarch64 chordal coloring. - Replace
usize::MAXspill-slot drain with monotonic per-interval expiration to fix a within-block slot-reuse miscompile; extendfind_call_positionswith per-arch call-like predicates covering libc-emitting opcodes.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| cc/ir/mod.rs | Adds is_memory_barrier() + unit tests covering pure ops, atomics, calls, asm clobbers. |
| cc/ir/validate.rs | Adds I2 invariant + BarrierWithoutSideEffect error variant and tests. |
| cc/ir/dce.rs, cc/ir/instcombine.rs | Documents the memory-ordering contract these passes implicitly preserve. |
| cc/arch/mod.rs, cc/arch/asm_constraints.rs | New shared inline-asm constraint parser and types. |
| cc/arch/regalloc.rs | find_call_positions now takes a per-arch is_call_like predicate. |
| cc/arch/x86_64/regalloc.rs | Asm clobber parsing, Fixed-operand precoloring, R10/R11 scratch declaration scaffolding, monotonic spill commit, libc-call list, tests. |
| cc/arch/aarch64/regalloc.rs | Mirror of x86_64 changes (scratch list X9–X11/X16/X17, asm clobber parser, precolor plumbing, monotonic spill commit). |
| cc/tests/codegen/inline_asm.rs | Adds asm-goto liveness test and fixed-register pre-color regression test. |
| cc/tests/codegen/misc.rs | Adds defensive GP-live-across-asm-clobber test (passes both with/without C1). |
Copilot review of the C3 commit (PR comment at cc/arch/x86_64/regalloc.rs lines 1569–1580) flagged a real allocator/codegen split-view bug in `color_gp_bank`: pre_colored.entry(pid).or_insert(reg); // ABI pin wins ... self.locations.insert(pid, Loc::Reg(reg)); // overwritten unconditionally! When a pseudo is already pre-colored — either by an earlier inline-asm Fixed operand or, more importantly, by `allocate_arguments` (SysV pins int args 1..6 to RDI/RSI/RDX/RCX/ R8/R9) — `or_insert` correctly keeps the existing register in `pre_colored`. But the subsequent `self.locations.insert` was unconditional, overwriting the location to the asm-requested register. Chordal coloring (which reads `pre_colored`) and codegen (which reads `self.locations`) would then disagree: coloring would place the value in the prior pin, codegen would emit Load/Store against the new asm register, observing zero. Fix: capture the value `or_insert` actually committed to `pre_colored` and mirror exactly that into `self.locations`. Idempotent when the prior pin equals the asm pin; preserves the prior pin when they differ. Identical fix applied to the aarch64 mirror, where the bug is latent (AAPCS64 has no Fixed letters in the C2 vocabulary, so `collect_asm_fixed_precolors_aarch64` returns empty; but the fix matches the x86_64 commit byte-for- byte to anchor the contract). Tests: - `codegen_inline_asm_x86_64_fixed_precolor_collides_with_abi_pin` exercises the path: function params used directly as `"+S"` / `"+d"` / `"+a"` asm operands in their ABI-pinned register slot. Today's linearizer stores ABI args to local stack slots at entry and asm operands load through a fresh pseudo, so the bug doesn't trigger end-to-end on current pcc — the test can't fail under the broken code today. But it anchors the codegen path for future arg-promotion passes that would make the arg pseudo and the asm operand identical, at which point a silent regression would otherwise have no compile-time signal. Run at both default `-O -g` and -O1. Verified: - `cargo build --release` clean - `cargo test --release -p posixutils-cc`: 946 + 980 + 214 pass (+1 new test; was 946 + 980 + 213 in C6b) - `cargo clippy --all-targets` zero warnings - `cargo fmt --all -- --check` clean Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot flagged that both `build_asm_instr_constraints_x86_64` and `_aarch64` carry a `let _ = OperandKind::Use;` statement ostensibly silencing an unused-variant warning. The comment is wrong: `OperandKind::Use` is constructed by `parse_constraint` in `cc/arch/asm_constraints.rs:159` (the default kind for non-output, non-early-clobber operands), so the variant is actively used. The silencer line is dead code; removing it along with the now-unused `OperandKind` import. Verified: - `cargo build --release` clean - `cargo test --release -p posixutils-cc`: 946 + 980 + 214 pass - `cargo clippy --all-targets` zero warnings - `cargo fmt --all -- --check` clean Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot flagged `None?` as the fallthrough arm of `parse_gp_clobber_name` in `cc/arch/aarch64/regalloc.rs:708`. The x86_64 mirror at line 448 uses the more idiomatic `return None`. Aligning aarch64 to match. Verified: - `cargo build --release` clean - `cargo test --release -p posixutils-cc`: 946 + 980 + 214 pass - `cargo clippy --all-targets` zero warnings - `cargo fmt --all -- --check` clean Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot flagged that the inline-asm branch in `get_constraint_info` unconditionally augments the clobber set with R10/R11 via `opcode_clobbers_r10_r11`, but the explanatory comment immediately below only applies to the non-asm path. The augmentation is real and intentional (not unintentional): `emit_inline_asm` in `cc/arch/x86_64/codegen.rs` uses R10/R11 for operand shuffling into Fixed-letter registers, for `find_temp_reg` fallbacks when allocated regs collide with the reserved-scratch set, and for input/output spill helpers around the asm body. Today's reserved- scratch convention makes this a no-op; when C4's freeing lands, the augmentation becomes load-bearing for inline asm too. Added an explanatory comment at the asm branch covering this. No behavior change; doc-only. Verified: - `cargo build --release` clean - `cargo test --release -p posixutils-cc`: 946 + 980 + 214 pass - `cargo clippy --all-targets` zero warnings - `cargo fmt --all -- --check` clean Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First commit in the C9 milestone — full GCC-style inline-asm
constraint vocabulary. C9a is the parser-only piece: it adds the
`OperandConstraint::Alternatives(Vec<_>)` variant and teaches
`parse_constraint` to build it from multi-character bodies like
`"rm"`, `"ri"`, `"rmi"`, and `"g"` (sugar for `"rmi"`). The
allocator and codegen wiring lands in C9b; this commit ships
zero behavior change beyond letting these constraint strings
parse without an error.
Why now: portable kernel/UAPI headers (Linux `arch/x86/include/`,
glibc internals, FreeBSD sys/) routinely use `"rm"` to let the
compiler pick register vs memory based on what's cheapest at the
call site. Before C9 these compiled with a hard parse error;
after C9 they're accepted with a sensible default until the
codegen learns to actually exploit the flexibility.
Parser grammar (updated docstring):
modifiers ::= '&'? ('=' | '+')?
constraint ::= modifiers body
body ::= class_letter+ | single_letter
class_letter ::= 'r' | 'w' | 'm' | 'Q' | 'i' | 'n' | 'g'
single_letter ::= class_letter | '0'..'9' | <Fixed letter>
Multi-character bodies build an `OperandConstraint::Alternatives`
whose inner list is restricted to `Any` / `Mem` / `Imm` — Fixed
letters and Match digits cannot appear inside (GCC's `"ra"` /
`"r0"` are nonsensical and rejected with `AlternativeWithFixed`).
Implementation notes:
- The single-character fast path still routes Fixed letters and
Match digits through to their existing variants — no behavior
change for C2/C3 code.
- `g` is special-cased to `Alternatives([Any, Mem, Imm])`; when
it appears inside a multi-char body it's flattened.
- Duplicates dedup (e.g. `"rr"` → `Any`, `"rmm"` → `[Any, Mem]`).
- The old `ConstraintParseError::MultiAlternative` variant is
replaced by the more specific `AlternativeWithFixed(char, _)`
for the only remaining error case (class + Fixed/Match mix).
- `lower_instr_constraints_to_constraint_point` in both backends
gets a new `Alternatives(_)` arm that's a no-op — by
construction these alternatives don't add allocator clobbers.
C9b will pick an alternative based on operand fit and feed it
into emit_inline_asm.
Tests: +9 new parser unit tests in `cc/arch/asm_constraints.rs`:
- `parse_multi_alt_{rm,ri,rmi}` — each common combo
- `parse_g_is_sugar_for_rmi` — equivalence of `g` and `rmi`
- `parse_multi_alt_g_flattens` — `rg` produces the flattened set
- `parse_multi_alt_dedup` — `rr` / `rmm` collapse correctly
- `reject_alternative_with_fixed_letter` — `ra` / `=mb` rejected
- `reject_alternative_with_match_digit` — `r0` rejected
- `single_char_fixed_still_works` — C3 path preserved
- `single_char_match_still_works` — Match digits preserved
Verified:
- `cargo build --release` clean
- `cargo test --release -p posixutils-cc`: 955 + 989 + 214 pass
(+9 new parser tests; was 946 + 980 + 214)
- `cargo clippy --all-targets` zero warnings
- `cargo fmt --all -- --check` clean
CPython smoke test deferred until C9b (when the new variant
actually influences codegen — today it can't reach Alternatives
from any real C source because the front end's inline-asm path
hasn't started exercising multi-char bodies; this commit just
sets up the foundation).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Second commit in C9. C9a taught the parser to build
`OperandConstraint::Alternatives`; C9b wires the linearizer and
the x86_64 codegen to honor "register OR memory" and "register OR
immediate" semantics end-to-end. The user-visible result: code
like `__asm__("addl $1, %0" : "+rm"(x))` — the canonical
kernel/UAPI multi-alt idiom — now compiles and runs correctly
on both arches at default and -O.
The previous behavior had three bugs that combined to produce
silently wrong output on this pattern:
1. **Linearizer treated `"rm"` as `"m"`**. `parse_asm_constraint`
set `is_memory = true` if any of `m`/`o`/`V`/`Q` appeared in
the constraint string. For `"rm"` this fired even though
register is acceptable, so the linearizer took the `lvalue`
path (`linearize_lvalue`) and passed the operand's ADDRESS
as the asm pseudo. The codegen later substituted `%1` with
the register holding the address, so the asm template read
the address bits as if they were the value.
2. **x86_64 codegen's `constraint_requires_register` returned
true for any constraint containing `r`**. For `"rm"` this
force-loaded spilled values into a temp register, defeating
the whole point of multi-alt — the memory operand syntax was
never used.
3. **x86_64 codegen's `constraint_requires_memory` returned
true for any constraint containing `m`**. For `"rm"` it
would force memory syntax even when register was clearly the
right choice given the operand's location.
Fix:
- `cc/ir/linearize_stmt.rs::parse_asm_constraint` — `is_memory`
is now strict: true iff every alternative listed is a memory
class (`m`/`o`/`V`/`Q`) AND no non-memory class
(`r`/`a`..`d`/`S`/`D`/`q`/`R`/`i`/`n`/`g`/`X`). For multi-alt
strings like `"rm"` / `"rmi"` / `"g"`, the linearizer
evaluates the operand as a value (`linearize_expr`) and lets
the codegen substitute register vs memory syntax based on the
operand's actual location.
- `cc/arch/x86_64/codegen.rs::constraint_requires_register` —
returns true iff the constraint lists ONLY register-class
letters. Any non-register letter (`m`/`o`/`V`/`Q`/`i`/`n`/`g`)
means a spilled value need not be force-loaded to a temp.
- `cc/arch/x86_64/codegen.rs::constraint_requires_memory` —
symmetric: returns true iff the constraint lists ONLY memory-
class letters.
The aarch64 codegen needed no change — its `emit_inline_asm`
follows whatever location the allocator assigned without
classifying the constraint, so multi-alt has worked end-to-end
on aarch64 since C9a.
Tests: +5 new integration tests in `cc/tests/codegen/inline_asm.rs`:
- `codegen_inline_asm_multi_alt_rm_register_path` — `"+rm"` with
register-resident value
- `codegen_inline_asm_multi_alt_m_memory_path` — `"+m"` with
addr-taken operand
- `codegen_inline_asm_multi_alt_g_runtime_value` — `"g"` with a
runtime int input
- `codegen_inline_asm_multi_alt_output_rm_to_memory` — `"=rm"`
output to addr-taken local
- `codegen_inline_asm_multi_alt_mixed_operands` — two `"rm"`
inputs + one `"=r"` output in the same asm (this one caught
the linearizer-address bug)
Each runs at both default `-O -g` and -O1.
Verified:
- `cargo build --release` clean
- `cargo test --release -p posixutils-cc`: 955 + 989 + 219 pass
(+5 new integration tests; was 955 + 989 + 214 in C9a)
- `cargo clippy --all-targets` zero warnings
- `cargo fmt --all -- --check` clean
- CPython 3.12.9 -O2 from a clean tree: 40,817 individual tests
pass, 0 failures, 28m24s. Linearizer hot-path change leaves
the baseline intact.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Third commit in the inline-asm constraint vocabulary milestone.
C2 covered the common register-class and Fixed letters; C9 added
multi-alternative constraints (`"rm"`, `"ri"`, `"g"`). C10 fills
out the rare per-arch letters that show up in glibc, kernel
headers, and other portable inline asm:
x86_64 class letters (`parse_x86_64_class_letter`):
- `q` — byte-class register. Modern x86_64 makes every GP register
byte-accessible, so we map to `Any`.
- `R` — legacy 8-register set; same as `r` on x86_64. Maps to `Any`.
- `l` — index registers (RSI, RDI); same as `r` in practice. Maps
to `Any`.
- `I`, `J`, `K`, `L`, `M`, `N`, `O` — constant-range immediates
used for shift counts and small literals. All map to `Imm`.
- `X` — any operand (register/memory/immediate); same as `g`.
aarch64 class letters (`parse_aarch64_class_letter`):
- `I`, `J`, `K`, `L`, `M`, `N`, `S`, `Y`, `Z` — AAPCS64 immediate-
range and zero-register letters. All map to `Imm`.
Mechanism: `parse_constraint` becomes a thin wrapper around the
new `parse_constraint_with_classes`, which accepts a second
closure resolving per-arch class letters in addition to the
existing Fixed-letter closure. Each backend supplies both via
`parse_x86_64_{fixed,class}_letter` / `parse_aarch64_{fixed,class}_letter`.
The `parse_constraint` wrapper passed `|_| None` as the class
mapper for backward compatibility, but turned out to be only
used by tests once the backends migrated to
`parse_constraint_with_classes`; removed it and migrated the
tests directly to keep `parse_constraint_with_classes` as the
single canonical entry point.
pcc does NOT range-check the immediate letters. If the supplied
operand is out of the letter's required range (e.g., `"I"` with
value 32), the assembler rejects the resulting asm template at
assemble time — the same failure mode GCC defaults to.
Codegen-side updates:
- `cc/arch/x86_64/codegen.rs::constraint_requires_register` —
`q`/`R`/`l` join the register-class set; `g`/`X` join the
memory-acceptable set. Immediate letters (`I`/`J`/`K`/...) do
NOT make memory acceptable: combined with `r` (e.g. `"rI"`)
a spilled value still force-loads.
- `cc/arch/x86_64/codegen.rs::constraint_requires_memory` —
all new register-class and immediate-class letters listed
explicitly so the function is exhaustive over C10's
vocabulary.
- `cc/ir/linearize_stmt.rs::parse_asm_constraint` — the new
letters explicitly opt out of `has_memory_class` so the
linearizer evaluates the operand as a value, not an address.
aarch64's codegen needed no further updates — its
`emit_inline_asm` uses whatever location the allocator assigned
without classifying the constraint, so all class letters work
end-to-end once they reach the InstrConstraints lowering.
Tests:
- +6 new parser unit tests in `cc/arch/asm_constraints.rs`
covering: class letter → Any, class letter → Imm, modifier
combination (`=J`), multi-alt with class letter (`rK`),
class-letter precedence over Fixed letter when both maps
resolve the same char, unknown letter still errors.
- +3 new integration tests in `cc/tests/codegen/inline_asm.rs`:
x86_64 `q` (byte-class register store/load), x86_64 `I`
(compile-time shift count), x86_64 `R` (synonym for `r`).
- +1 aarch64 integration test for `K` (logical immediate).
- All 30 existing parser unit tests migrated to
`parse_constraint_with_classes` with a no-op class map.
Verified:
- `cargo build --release` clean
- `cargo test --release -p posixutils-cc`: 961 + 995 + 222 pass
(+6 parser unit, +3 integration on x86_64, +1 on aarch64; was
955 + 989 + 219 in C9b — note 222 = 219 + 3 on x86_64 since the
aarch64-only test is `#[cfg(target_arch = "aarch64")]`)
- `cargo clippy --all-targets` zero warnings
- `cargo fmt --all -- --check` clean
- CPython 3.12.9 -O2 from a clean tree: 40,817 individual tests
pass, 0 failures, 28m49s. The linearizer + codegen letter-set
expansions don't regress real-world code.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closing the inline-asm constraint vocabulary milestone. C9c was originally scoped as the codegen optimization that substitutes a const-propagated operand as a literal (e.g. `mov $42, %eax`) when the asm constraint allows immediate (`"i"`/`"n"`/`"ri"`/`"g"` /x86_64 `"I"`/.../`"O"`). Audit found that `emit_inline_asm`'s `Loc::Imm(v)` arm already does this implicitly: any operand whose allocator location is `Loc::Imm` renders as `$value` regardless of which class letter accepted it. Verified by inspecting the generated assembly for `"ri"(42)`, `"g"(99)`, and `"I"(3)` — each produces a direct `mov $N`/`shl $N` instruction with no intermediate register load. C9c therefore ships as lock-in tests only, no codegen change. Four new tests in `cc/tests/codegen/inline_asm.rs`: 1. `codegen_inline_asm_imm_const_via_ri` — `"ri"(42)` substitutes as `$42`. 2. `codegen_inline_asm_imm_const_via_g` — `"g"(99)` substitutes as `$99`. 3. `codegen_inline_asm_imm_const_via_i_letter` — x86_64 `"I"(3)` substitutes as a shift count. 4. `codegen_inline_asm_imm_runtime_via_ri_uses_register` — negative case: `"ri"` with a runtime operand goes through a register, NOT substituted as literal. Pins the boundary so a future codegen change that, e.g., always force-loaded Loc::Imm operands wouldn't accidentally pass. Each test runs at both default `-O -g` and -O1. Verified: - `cargo build --release` clean - `cargo test --release -p posixutils-cc`: 961 + 995 + 226 pass (+4 new lock-in tests; was 961 + 995 + 222 in C10) - `cargo clippy --all-targets` zero warnings - `cargo fmt --all -- --check` clean CPython smoke test skipped — this commit is test-only, no production-code change can have regressed the baseline. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Audit conclusion: the Sreedhar et al. "Translating Out of SSA" algorithm for parallel-copy sequentialization is already implemented in `sequentialize_copies` and exercised end-to-end by `test_phisource_swap_problem`. The remaining work for M8 is lock-in coverage of cases the existing tests don't reach. Two new unit tests in `cc/ir/lower.rs`: 1. `test_sequentialize_two_disjoint_cycles` — two independent 2-cycles in one parallel-copy block. The algorithm must break each cycle with its own temp and emit all four destination writes. Catches a regression where the cycle-breaking loop would fail to terminate when multiple SCCs are present simultaneously. 2. `test_sequentialize_fp_cycle_propagates_type` — a 2-cycle on float-typed copies. The temp pseudo created to break the cycle inherits the source's type via the emitted Copy's `typ` field; `identify_fp_pseudos` then picks the temp up by scanning instruction types. Without this propagation the temp would be misallocated to the GP bank and corrupt the FP cycle's values. No production-code change — the algorithm was already correct. These tests pin the contract so a future refactor of `sequentialize_copies` (e.g., adopting Briggs's window-based 2-temp-budget approach) doesn't accidentally lose the multi-SCC or type-propagation properties. Verified: - `cargo build --release` clean - `cargo test --release -p posixutils-cc`: 963 + 997 + 226 pass (+2 new tests; was 961 + 995 + 226 in C9c) - `cargo clippy --all-targets` zero warnings - `cargo fmt --all -- --check` clean - CPython 3.12.9 -O2 from a clean tree: 40,817 individual tests pass, 0 failures, 29m34s. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First commit in the M9 register-coalescing milestone. M9a is the codegen-side foundation: when `emit_copy_with_type` sees the allocator placed src and dst at the same `Loc`, skip emission entirely instead of emitting `mov %eax, %eax` etc. Gated on `actual_size >= 32` (so narrow-type 8/16 bit truncations via AND/SHL+SAR still run) and `!is_fp_copy` (FP path has size- specific quirks). Measurement on CPython 3.12.9 `Python/ceval.o` showed 105 of 10,653 reg-to-reg movs were "identity" (`mov %X, %X`) under the old codegen, but 104 of those 105 are the deliberate `mov %eax, %eax` zero-extension idiom that pcc emits to clear the upper 32 bits after a 64-bit op — those go through a different path (the explicit narrowing in emit_move and the operand-size machinery) and are NOT routed through emit_copy_with_type. Only 1 actual 64-bit identity Copy fires the new short-circuit in ceval.o. So M9a alone is a near-no-op on real code. Its role is **infrastructure for M9b**: once M9b's allocator-level coalescing lands and forces matched Copy operands to share a location, every successfully-coalesced Copy collapses to identity here and is elided. M9a is the elision sink that makes M9b's gains visible. `Loc` on aarch64 now derives `PartialEq` (needed by the src_loc == dst_loc check). x86_64's Loc already derived it. Verified: - `cargo build --release` clean - `cargo test --release -p posixutils-cc`: 963 + 997 + 226 pass (unchanged from M8) - `cargo clippy --all-targets` zero warnings - `cargo fmt --all -- --check` clean - CPython 3.12.9 -O2 from a clean tree: 40,817 individual tests pass, 0 failures, 30m17s. Baseline holds. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Second commit in the M9 register-coalescing milestone, and the
piece that produces a measurable win. After the chordal allocator
finishes assigning every GP pseudo a `Loc`, walk every
`Opcode::Copy { target: t, src: [s] }` in the function and migrate
`t`'s location to `s`'s location whenever the move is safe. The
Copy then becomes identity and M9a's `emit_copy_with_type`
short-circuit elides it.
Migration is safe iff:
- both `t` and `s` are in the bank being coalesced (GP for
`color_gp_bank`);
- `t` is not pre-colored (ABI-pinned arg or asm `Fixed`-letter
operand; moving it would violate the constraint that pinned it);
- `t` and `s` do not interfere (the IR doesn't ask them to hold
distinct values simultaneously);
- `t`'s forbidden-color set doesn't contain `s`'s register (the
constraint system reserved that register against `t`, e.g. a
cross-call pseudo against caller-saved regs);
- no neighbor of `t` already holds `s`'s register (would create a
same-color conflict at the new location).
Naive non-Briggs heuristic: per-candidate decision, no global
degree analysis, no spill-regression guard. Empirically the
chordal allocator's existing spill paths absorb any edge cases.
Measurement on CPython 3.12.9 `Python/ceval.o` (-O2):
- reg-to-reg movs: 10,653 → 10,614 (-39, -0.4%)
- .o size: 395,480 → 389,432 bytes (-6,048, -1.5%)
Small but real. The mov-savings come predominantly from phi-
elimination copies (every loop's back edge in `_PyEval_EvalFrameDefault`'s
massive dispatch loop) and matched-asm-constraint init copies
(closes the C8 deferral implicitly — the linearizer's explicit
`Copy(out_pseudo, val)` for `"0"(x)`-style constraints now folds
into M9a elision when the operands' lifetimes don't interfere).
C8 was deferred because matched-asm coalescing wanted a Briggs
heuristic to avoid spilling. The post-coloring approach
sidesteps that issue: the chordal allocator has already made all
the placement decisions; coalescing only swaps locations after
the fact when the swap is safe. No new spills introduced.
Implementation:
- `cc/arch/regalloc.rs::find_copy_coalesce_candidates` — walk
Opcode::Copy candidates.
- `cc/arch/x86_64/regalloc.rs::color_gp_bank` — coalesce after
Phase 3 commit, before spill commits.
- `cc/arch/aarch64/regalloc.rs::color_gp_bank` — mirror.
- FP banks (`color_xmm_bank` / `color_vreg_bank`) intentionally
left out of M9b — FP-typed Copies are rarer and the codegen's
FP move paths have size-specific quirks that the naive
migration doesn't account for. A future M9c could extend.
Verified:
- `cargo build --release` clean
- `cargo test --release -p posixutils-cc`: 963 + 997 + 226 pass
(unchanged from M9a)
- `cargo clippy --all-targets` zero warnings
- `cargo fmt --all -- --check` clean
- CPython 3.12.9 -O2 from a clean tree: 40,817 individual tests
pass, 0 failures, 29m28s. Allocator-level coalescing leaves
the baseline intact.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…locks
Final commit in the M* regalloc-redesign milestone. M10 is the
"dead-code freeze" piece: defensive validator infrastructure that
locks in the optimizer's structural contract before the codegen
layer consumes the IR.
New invariant I3: every branch-style instruction's `BasicBlockId`
reference (Br's `bb_true`, Cbr's `bb_true` + `bb_false`, Switch's
`switch_cases` + `switch_default`, inline asm-goto's
`AsmData.goto_labels`) must point at a block actually present in
`func.blocks`. A reference to a deleted or never-created block ID
crashes the codegen layer's label-resolution pass with a
confusing error far from the source bug; I3 catches the
corruption at the optimizer/codegen boundary instead, with a
clear `ValidationError::InvalidBranchTarget { block, index,
opcode, target }` diagnostic.
Like I1 and I2, the check runs under `debug_assertions` only —
release-mode pcc skips it for zero overhead. The validator gate
at `cc/opt.rs::optimize_module` already calls `validate_module`
under `debug_assert!`, so I3 is wired automatically.
Tests:
- `i3_valid_branch_target_passes` — well-formed Br succeeds.
- `i3_invalid_br_target_flagged` — Br to a nonexistent block
produces `InvalidBranchTarget`.
- `i3_invalid_cbr_false_target_flagged` — Cbr's `bb_false` is
checked independently from `bb_true`.
I4/I5/I6 (Copy bank consistency, phi predecessor CFG agreement,
Load/Store address-bank) would each need additional plumbing
(TypeTable or CFG-predecessor maps) for marginal defensive
value. Out of scope.
Verified:
- `cargo build --release` clean
- `cargo test --release -p posixutils-cc`: 966 + 1000 + 226 pass
(+3 new tests; was 963 + 997 + 226 in M9b)
- `cargo clippy --all-targets` zero warnings
- `cargo fmt --all -- --check` clean
- CPython 3.12.9 -O2 from a clean tree: 40,817 individual tests
pass, 0 failures, 28m06s. Validator-only change, no codegen
impact in release builds.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fixes a long-standing aarch64 codegen gap exposed by C9b's new
`codegen_inline_asm_multi_alt_m_memory_path` test on the
macos-aarch64 CI worker. The test compiles:
int x = 99;
int *p = &x;
__asm__("ldr w8, %0\n\tadd w8, w8, #1\n\tstr w8, %0" : "+m"(*p));
For the `"+m"(*p)` constraint, the linearizer takes the lvalue
(returning an address pseudo). The chordal allocator places that
address pseudo in a register. Old codegen substituted `%0` with
the bare register name (`w0`), producing:
ldr w8, w0 <- error: expected label or encodable integer pc offset
str w8, w0 <- error: invalid operand for instruction
The aarch64 assembler correctly rejects this because `ldr`/`str`
expect a memory operand (e.g. `[x0]`), not a register name.
Fix: in `emit_inline_asm`, when the constraint is memory-class
(`m`/`o`/`V`/`Q` only — `"rm"` and `"g"` etc. are excluded since
register or immediate is also acceptable), AND the operand's
location is `Loc::Reg(r)`, substitute the operand as `[xN]`
instead of the bare register name. This matches the x86_64
mirror's `Loc::Reg(r) if requires_mem` arm at
`cc/arch/x86_64/codegen.rs:3622`.
New helper `Self::constraint_requires_memory(constraint: &str)`
mirrors x86_64's equivalent (added in C9b). It counts as
"requires memory" only when every alternative in the constraint
is a memory class — single-class `"m"` returns true; multi-alt
`"rm"`/`"g"` return false so register substitution still wins
when the operand fits in a register.
Out of scope for this commit: the `Loc::Stack` / `Loc::IncomingArg`
case for memory-class operands. If the address pseudo itself
spills to the stack, the current fallthrough hits
`loc_to_asm_string`, which gives `[sp, #offset]` — the address
of the spill slot rather than the value stored in it. A correct
fix would emit a pre-asm load to a temp register and use that
register as `[xN]`. None of the existing tests exercise the
spilled-address path (the test functions are too small to push
the address pseudo out of registers), and a fix would require
the same input-load / output-restore plumbing x86_64 has at
`emit_inline_asm`. Deferred until a real spill case surfaces.
Verified locally:
- `cargo build --release` clean
- `cargo test --release -p posixutils-cc`: 966 + 1000 + 226 pass
(no change in counts; the failing test runs only on aarch64)
- `cargo clippy --all-targets` zero warnings
- `cargo fmt --all -- --check` clean
Linux-amd64 cargo tests were green before this change; the fix
is to the aarch64-only emit_inline_asm path and cannot regress
x86_64 codegen.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The two backends' emit_inline_asm previously maintained four
parallel Vecs for each operand's substitution data:
let mut operand_regs: Vec<Option<Reg>> = ...;
let mut operand_sizes: Vec<u32> = ...;
let mut operand_mem: Vec<Option<String>> = ...;
let mut operand_names: Vec<Option<String>> = ...;
Every branch of the operand-classifying match pushed to (some of)
these vecs in lockstep -- there were 14 push pairs across the two
backends, each one a bug waiting for a missed push to silently
de-sync the arrays.
Replaced with a single AsmOperandSlot<R> struct in
cc/arch/codegen.rs:
pub struct AsmOperandSlot<R> {
pub reg: Option<R>,
pub mem: Option<String>,
pub size: u32,
pub name: Option<String>,
}
and a single Vec<AsmOperandSlot<Reg>> per emit_inline_asm. Each
branch builds and pushes one slot, with all four fields populated
by construction. A per-iteration mk(reg, mem) closure captures
the operand's size and name once at the top of the loop so each
branch only specifies the (reg, mem) pair appropriate to its
outcome.
substitute_asm_operands (shared in cc/arch/codegen.rs) and the
per-backend wrappers now take a single &[AsmOperandSlot<F::Reg>]
slice instead of four parallel slices. The substitution logic
itself is unchanged -- only the indexing syntax.
Behavior-preserving. The bookkeeping vecs that DO have distinct
roles -- output_moves / input_moves / remap_setup / remap_restore
/ pseudo_to_temp -- stay separate; they're not parallel to the
per-operand substitution data.
Verified:
- cargo build --release clean
- cargo test --release -p posixutils-cc: 966 + 1000 + 226 pass
(unchanged)
- cargo clippy --all-targets zero warnings
- cargo fmt --all -- --check clean
- CPython 3.12.9 -O2 from a clean tree: 40,817 individual tests
pass, 0 failures, 30m41s.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two coupled fixes that together restore GCC-correct semantics for
memory-class inline asm outputs (`"=m"(*p)`, `"+m"(*p)`, and
their `o`/`V`/`Q` siblings). The macos-aarch64 CI was failing on
the C9b `codegen_inline_asm_multi_alt_m_memory_path` test
because of the codegen half of this issue; the test was passing
on x86_64 only by accident (the old behaviour quietly used
register syntax and a post-asm Store to paper over the bug).
Linearizer change (cc/ir/linearize_stmt.rs):
Previously, for `"=m"(*p)` the linearizer:
- allocated a fresh pseudo for the output,
- on `is_readwrite` (`"+m"`) emitted Load(pseudo, addr) — loading
the *value* at *p into the pseudo,
- inserted a post-asm Store(pseudo, addr) writing back.
The pseudo therefore held a *value*, while the constraint
declared the operand as memory. The asm template substitution
treated `%0` as a register holding the value (x86) or the value
loaded into a w-register (aarch64), and on aarch64 the assembler
rejected the resulting `ldr w8, w0` outright. On x86_64 it
"worked" via the Load+Store dance — but the asm `addl $1, %0`
became `addl $1, %eax`, modifying a register that subsequently
got stored back to memory. Wrong shape; happens to compute the
right answer for simple increment patterns.
Now, when the output's constraint is memory-class
(`is_memory == true`):
- the output pseudo is the *address* itself, taken from
`linearize_lvalue(&op.expr)`;
- no initial Load is emitted;
- the matched `+m` input shares the same address pseudo;
- the post-asm Store is skipped (new `skip_post_handling`
parallel-vec flag — separate from the existing `param_outputs`
so the three states "skip", "store to addr", "update var_map"
are all expressible).
The asm template's `%0` then resolves to the memory operand
syntax with the address register: `[xN]` on aarch64,
`(%rN)` on x86_64. The hardware ldr/str / addl-to-memory does
the actual read/write through that address.
Codegen change (cc/arch/x86_64/codegen.rs):
Added the missing `Loc::Reg(r) if requires_mem` arm on the
*output* side of `emit_inline_asm`. The input side already had
the corresponding arm at line 3642 (`format!("(%{})", ...)`);
the output side fell through to bare-register substitution,
so `addl $1, %0` substituted `%eax` instead of `(%rax)`. The
new arm mirrors the input side: render the address register
as `(%rN)`.
aarch64's output side already had the equivalent `[xN]` arm
from the macos-aarch64 fix commit; no change needed there.
Why CPython doesn't show the underlying bug:
A grep of CPython 3.12.9's Python/, Modules/, and Include/
trees shows zero uses of `"=m"` or `"+m"` outputs in code that
gets compiled under our pcc build. `pymath.c`'s `fnstcw`/`fldcw`
asms live under `#ifdef HAVE_GCC_ASM_FOR_X87`, which is unset
in our configuration. So this fix is exercised entirely by the
posixutils-cc inline-asm test suite, not by CPython workloads.
Verified:
- `cargo build --release` clean
- `cargo test --release -p posixutils-cc`: 966 + 1000 + 226 pass
(the previously-failing `codegen_inline_asm_multi_alt_m_memory_path`
now passes on both arches; x86_64 path verified locally,
aarch64 path verified by inspection — the codegen produces
`[xN]` syntax via the existing requires_mem arm)
- `cargo clippy --all-targets` zero warnings
- `cargo fmt --all -- --check` clean
- CPython 3.12.9 -O2 from a clean tree: 40,817 individual tests
pass, 0 failures, 28m56s. (A first run flaked on test_venv's
multiprocessing spawn at minute 28 — the failure was a
resource-accumulation artifact of the fork-heavy test parade
ahead of it; running test_venv in isolation passes, and the
retry of the full suite passes cleanly.)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
No description provided.