
Add linear length penalty as a summed GRPO advantage #2829

Draft
hallerite wants to merge 66 commits into main from feat/grpo-linear-length-penalty

hallerite commented Jun 16, 2026

Stacked on #2746 (base: feat/algorithm-abstraction).

What

Brings the MAI linear length penalty (from #2702) into the algorithm abstraction as a third length-penalty arm on the GRPO-family algorithms, plus an optional length-weighted baseline. No new algorithm type — length penalties stay "just configs" under [orchestrator.advantage.length_penalty].

  • type = "linear" (LinearLengthPenaltyConfig): subtracts coef * pass_rate * (completion tokens / orchestrator.seq_len) from each reward before the baseline, where pass_rate is the group's mean reward — so reliably-solved problems get the strongest concision pressure and never-solved groups get none. Optional gate_by_correctness restricts it to correct rollouts (reward == 1).
  • length_weighted_baseline on GRPO: use sum(len_i · reward_i) / sum(len_i) as the baseline instead of the plain mean.
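
A minimal sketch of the two knobs above, assuming rewards and completion token counts arrive as plain lists. The function names, the `coef` argument name, and the list-based inputs are illustrative assumptions, not the signatures in this PR.

```python
from typing import Sequence


def linear_length_penalties(
    rewards: Sequence[float],
    completion_lens: Sequence[int],
    coef: float,
    max_seq_len: int,
    gate_by_correctness: bool = False,
) -> list[float]:
    """Per-rollout penalty: coef * pass_rate * (completion tokens / max_seq_len)."""
    pass_rate = sum(rewards) / len(rewards)  # the group's mean reward
    out = []
    for reward, length in zip(rewards, completion_lens):
        # With gating on, only correct rollouts (reward == 1) are penalized.
        gated_out = gate_by_correctness and reward != 1
        out.append(0.0 if gated_out else coef * pass_rate * length / max_seq_len)
    return out


def length_weighted_mean(values: Sequence[float], completion_lens: Sequence[int]) -> float:
    """Baseline sum(len_i * value_i) / sum(len_i)."""
    return sum(l * v for l, v in zip(completion_lens, values)) / sum(completion_lens)


rewards = [1.0, 1.0, 0.0, 0.0]
lens = [100, 300, 200, 400]

penalties = linear_length_penalties(rewards, lens, coef=0.5, max_seq_len=1000)
gated = linear_length_penalties(rewards, lens, coef=0.5, max_seq_len=1000, gate_by_correctness=True)
# A never-solved group has pass_rate 0, so every penalty vanishes.
unsolved = linear_length_penalties([0.0] * 4, lens, coef=0.5, max_seq_len=1000)
lw_baseline = length_weighted_mean(rewards, lens)
```

With a pass rate of 0.5, the longest rollout here carries the largest penalty, and gating zeroes the penalty on the two incorrect rollouts.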

The existing tokens / turns efficiency-shaping arms are unchanged.

The penalty is a summed advantage, not a folded reward

Because centering is linear:

center(reward − penalty) = center(reward) + center(−penalty)

So the linear penalty is modeled as a standalone additive advantage that sums onto plain GRPO, rather than threading penalty/baseline knobs through one shared group-norm function. advantage.py now exposes three pure, cohesive functions:

  • grpo_advantage(group, length_weighted_baseline): plain GRPO (owns the baseline choice).
  • length_penalty_advantage(group, config, max_seq_len, length_weighted_baseline): the group-centered negative penalty −(pᵢ − baseline) (owns max_seq_len).
  • efficiency_shaping_advantage(group, config): the tokens/turns arms (non-additive; replaces the baseline).

GRPOAlgorithm.score_group composes GRPO + linear penalty by summation. tokens/turns remain non-additive shaping. The decomposition is exact only when both terms use the same baseline operator, so length_weighted_baseline is threaded into the penalty term too (it centers by the plain or the token-length-weighted mean to match the GRPO term).
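
The baseline-matching requirement can be checked on one hand-sized group. The sketch below is illustrative only: the helper names are hypothetical and the penalties are fixed numbers rather than the output of the real config. It compares the folded form with the summed form under matched and mismatched baseline operators.

```python
def plain_mean(values, lens):
    return sum(values) / len(values)


def length_weighted_mean(values, lens):
    return sum(l * v for l, v in zip(lens, values)) / sum(lens)


def centered(values, lens, baseline):
    b = baseline(values, lens)
    return [v - b for v in values]


rewards = [1.0, 1.0, 0.0, 0.0]
lens = [100, 300, 200, 400]
penalties = [0.025, 0.075, 0.05, 0.1]  # illustrative per-rollout penalties


def folded(baseline):
    """Fold the penalty into the reward, then center once."""
    shaped = [r - p for r, p in zip(rewards, penalties)]
    return centered(shaped, lens, baseline)


def summed(grpo_baseline, penalty_baseline):
    """Center reward and negative penalty separately, then add the two advantages."""
    grpo = centered(rewards, lens, grpo_baseline)
    pen = centered([-p for p in penalties], lens, penalty_baseline)
    return [g + q for g, q in zip(grpo, pen)]


def max_gap(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


# Same operator on both terms: the two forms agree up to float rounding.
gap_plain = max_gap(folded(plain_mean), summed(plain_mean, plain_mean))
gap_weighted = max_gap(
    folded(length_weighted_mean), summed(length_weighted_mean, length_weighted_mean)
)
# Length-weighted GRPO term but plain-mean penalty term: the forms diverge.
gap_mismatched = max_gap(
    folded(length_weighted_mean), summed(length_weighted_mean, plain_mean)
)
```

In this group the mismatched gap equals the difference between the two penalty baselines (0.075 length-weighted vs. 0.0625 plain), which is the offset the penalty term picks up when it centers with the wrong operator.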

Wiring

max_seq_len (the penalty denominator) is injected onto the Algorithm by build_algorithm and threaded from the orchestrator (config.seq_len) through TrainEnvs — so it touches only the GRPO path rather than every algorithm's constructor.

Verification

Numerically validated against a faithful reimplementation of #2702's original folded default_advantage_fn over 2000 random groups per case: all four gate_by_correctness × length_weighted_baseline combinations are bit-for-bit identical. (An earlier cut centered the penalty by the plain mean unconditionally and diverged from #2702 when the linear penalty met length_weighted_baseline; the fix threads the baseline operator into the penalty term — see the second commit.)
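
A reduced sketch of this kind of randomized check, for readers who want to reproduce the idea without the repo. It is not the PR's test: the functions are hypothetical stand-ins, rewards are assumed binary, and the comparison uses a small tolerance rather than claiming bit-for-bit equality.

```python
import random


def baseline(values, lens, length_weighted):
    if length_weighted:
        return sum(l * v for l, v in zip(lens, values)) / sum(lens)
    return sum(values) / len(values)


def penalties(rewards, lens, coef, max_seq_len, gate):
    pass_rate = sum(rewards) / len(rewards)
    return [
        0.0 if gate and r != 1 else coef * pass_rate * l / max_seq_len
        for r, l in zip(rewards, lens)
    ]


def folded(rewards, lens, pens, length_weighted):
    shaped = [r - p for r, p in zip(rewards, pens)]
    b = baseline(shaped, lens, length_weighted)
    return [s - b for s in shaped]


def summed(rewards, lens, pens, length_weighted):
    # Both terms are centered with the same baseline operator.
    rb = baseline(rewards, lens, length_weighted)
    pb = baseline(pens, lens, length_weighted)
    return [(r - rb) - (p - pb) for r, p in zip(rewards, pens)]


def worst_gap(gate, length_weighted, trials=2000, seed=0):
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        n = rng.randint(2, 16)
        rewards = [float(rng.random() < 0.5) for _ in range(n)]
        lens = [rng.randint(1, 4096) for _ in range(n)]
        pens = penalties(rewards, lens, coef=0.1, max_seq_len=4096, gate=gate)
        a = folded(rewards, lens, pens, length_weighted)
        b = summed(rewards, lens, pens, length_weighted)
        worst = max(worst, max(abs(x - y) for x, y in zip(a, b)))
    return worst


# One entry per gate_by_correctness x length_weighted_baseline combination.
gaps = {(gate, lw): worst_gap(gate, lw) for gate in (False, True) for lw in (False, True)}
```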

uv run pytest tests/unit/orchestrator -q → all passing. New unit tests cover pass-rate scaling, the GRPO + penalty sum equivalence (vs. folding penalty into reward, plain and length-weighted), correctness gating, zero pass-rate no-op, and the end-to-end GRPOAlgorithm.score_group path including max_seq_len injection. Lint clean.

🤖 Generated with Claude Code

hallerite and others added 30 commits June 9, 2026 19:58
…d, sft_distill, self_distill, echo)

Replace the global training_mode enum with a per-env Algorithm abstraction:
a preset bundle of (1) sampling source, (2) scoring (group advantage +
async token scorer), and (3) per-token loss routing. The trainer becomes
algorithm-blind: routing ships per token on the wire and the trainer
executes three fixed loss cores (rl / ce / teacher_kl).

- configs: new prime_rl.configs.algorithm with AlgorithmConfig presets,
  component-level overrides, compatibility validation (incl. the
  group-relative-advantage-with-group_size=1 footgun warning);
  training_mode kept as a deprecated alias
- orchestrator: per-env algorithm; dispatcher selects student/teacher pool
  per env (no mode branches); OPD teacher logprobs moved out of
  finalize_train_batch into a bounded-concurrency token scorer;
  demo-conditioned teacher scorer for SDFT; interleave_rollout can tag
  env-observation tokens for ECHO
- wire: TrainingSample/MicroBatch carry loss_core + optional per-token
  cores/weights/advantages (omit_defaults — plain GRPO wire unchanged);
  packer no longer bins by mode
- trainer: unified per-token loss routing, bit-for-bit with the previous
  rl/opd/sft loss fns on pure batches

Validated: 443 CPU unit tests + GPU loss/batch tests; live 2-GPU smoke
runs for grpo (reverse_text), opd (teacher pool + alias path), and echo
(multi-turn alphabet-sort, per-token routing verified on the wire).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…hm strategy object

"Teacher" is no longer a concept anywhere in the system. There is the live
policy (reserved registry key "policy") and named frozen hosted models under
[orchestrator.models.<key>]; algorithm components hold references into that
registry. The same entry can serve any number of envs' algorithms, and
self_distill can point its demo scorer at "policy" itself — the SDFT paper's
setting, zero extra deployments.

- configs: scorer types logprobs/demo_logprobs with required model refs;
  sampling.source is a registry key; algorithm.model shorthand folds into the
  unresolved component; orchestrator.teacher and training_mode deleted;
  student renamed policy; registry validation (refs resolve, entries used,
  "policy" reserved, degenerate logprobs@policy rejected)
- runtime: ModelRegistry + per-env Algorithm strategy object as the sole
  interpreter of AlgorithmConfig; dispatcher/sink/orchestrator call hooks and
  never branch on algorithm config; liveness drives cache salting, sampling
  logprobs, and off-policy aging (frozen-sourced rollouts no longer age)
- wire/trainer: ref_logprobs, LOSS_CORE_REF_KL, loss action ref_kl,
  time/scoring metric
- fixes found by the new SDFT smoke: resolved-config round-trip (shorthands
  are now write-only / excluded from dumps) and apply_chat_template returning
  BatchEncoding on newer transformers
- configs/debug/training_modes -> configs/debug/algorithms (+ self_distill.toml
  running SDFT against the live policy); docs/skills updated

Smokes (2 GPU, 5 steps each): grpo 0.120->0.382, opd-via-registry
0.147->0.647, self_distill-vs-policy 0.068->0.181, echo multi-turn 32/32
trainable.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…gy ontology

Every training signal is an advantage — varying in granularity (group-scalar
vs per-token) and evaluation site (orchestrator vs trainer). The advantage
union absorbs the token scorers (logprobs -> ref_kl, demo_logprobs ->
demo_ref_kl), the action-token loss core derives from the strategy instead of
being configured (loss.action deleted), and runtime AdvantageStrategy objects
own both execution points: group-time assign() and ship-time score().

Also fixes a shorthand-folding regression: resolve_preset's component
assignment polluted model_fields_set, so any [orchestrator.advantage]
shorthand differing from the preset raised a bogus conflict error.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ction

# Conflicts:
#	packages/prime-rl-configs/src/prime_rl/configs/orchestrator.py
#	src/prime_rl/orchestrator/dispatcher.py
#	src/prime_rl/orchestrator/orchestrator.py
A bin mixing ref-bearing samples (opd/self_distill) with ref-less ones
(grpo/echo) extended ref_logprobs without backfilling or padding, shifting
it out of alignment with input_ids. Mirror the rewards/loss_core_ids
pattern with 0.0 placeholders (already the outside-the-mask filler used by
the demo scorer and pad_micro_batch).
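
A minimal sketch of the backfill pattern, assuming samples are plain dicts with `input_ids` and an optional `ref_logprobs` list. The dict layout and the function name are hypothetical; the real packer works on the wire structs.

```python
def pack_ref_logprobs(samples):
    """Concatenate samples into one bin, keeping ref_logprobs aligned to input_ids."""
    input_ids = []
    ref_logprobs = None  # stays None while no sample in the bin carries refs
    for sample in samples:
        ids = sample["input_ids"]
        ref = sample.get("ref_logprobs")
        if ref is not None and len(ref) != len(ids):
            raise ValueError("ref_logprobs must align with input_ids")
        if ref is not None and ref_logprobs is None:
            # First ref-bearing sample: backfill the earlier ref-less tokens.
            ref_logprobs = [0.0] * len(input_ids)
        input_ids.extend(ids)
        if ref_logprobs is not None:
            # Ref-less samples after that point are padded with 0.0 placeholders.
            ref_logprobs.extend(ref if ref is not None else [0.0] * len(ids))
    return input_ids, ref_logprobs


mixed_ids, mixed_refs = pack_ref_logprobs(
    [
        {"input_ids": [1, 2, 3]},                              # grpo-style, no refs
        {"input_ids": [4, 5], "ref_logprobs": [-0.1, -0.2]},   # opd-style
        {"input_ids": [6]},                                    # grpo-style again
    ]
)
pure_ids, pure_refs = pack_ref_logprobs([{"input_ids": [1, 2]}, {"input_ids": [3]}])
```

A bin with no ref-bearing samples keeps `ref_logprobs` as `None`, so the plain GRPO case ships nothing extra in this sketch.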

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Misaligned parallel arrays (the ref_logprobs packing bug class) now fail
loudly at pack time instead of corrupting training silently.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The config surface key is now [orchestrator.algo] (per-env: algo = {...});
the wire/trainer routing vocabulary is loss_type (LOSS_TYPE_RL/CE/REF_KL,
TrainingSample.loss_type, token_loss_types, MicroBatch.loss_type_ids,
advantage.action_loss_type). Also scrubs stale token-scorer mentions from
the ref_kl error message and the configs skill.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The field is now `model` (HostedModelConfig); `[orchestrator.policy]` and
`[orchestrator.student]` fold in as aliases, with the canonical key winning
at the leaf so CLI --model.<k> overrides aliased TOML. Flat ModelConfig keys
still re-nest ([orchestrator.model] name = ...). Shared-field propagation
checks all spellings for conflicts.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Two envs with different algorithms in one run — exercises heterogeneous
train batches (ref_logprobs-bearing OPD samples packed with ref-less GRPO
samples). Validated 50 steps on 2 GPUs, eval 0.652->0.836.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…r references

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… registry

prime-rl now assumes it only ever hosts the trainable policy. Frozen models
are external endpoints declared inline on the algorithm component that uses
them (FrozenModelConfig: model.name + required client.base_url) — no more
[orchestrator.models] namespace or runtime ModelRegistry. Each env's
Algorithm builds and readies its own frozen pools in async setup(); the
dispatcher reads algorithm.sampling_pool and gets the policy pool directly.

References are "policy" | inline config; demo_ref_kl now defaults to
"policy" (the SDFT setting needs zero config). The algo.model shorthand
folds with fill-or-agree semantics, which also fixes the two Bugbot
findings (redundant-but-consistent model rejected; advantage shorthand
clearing a folded model).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
A frozen model reference is the client config we already have plus the one
request-level datum it lacks: the served model's name. Drops the nested
{model, client} shape — TOML reads `[orchestrator.algo.model]` with
name + base_url. Also fixes the rl entrypoint's frozen-endpoint warning,
which still read the deleted [orchestrator.models] dict.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The registry's reserved-key concept is gone; "policy" survives only as the
Literal arm of ModelReference, which the type already enforces. Inline the
string at the comparison sites.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Replace the LOSS_TYPE_* int constants with a LossType IntEnum. The scalar
TrainingSample.loss_type field carries the enum (msgspec validates
membership on decode, so a corrupt wire value fails loudly); the per-token
arrays stay list[int] — the trainer tensorizes them immediately, so
per-token enum wrapping on decode would be pure overhead. Wire bytes are
unchanged: members encode as plain ints.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…gies

AdvantageOutputs.token_advantages carries optional per-rollout lists
aligned to completion tokens; finalize_group pads prompt positions and
stamps TrainingSample.token_advantages (the trainer side already
preferred it over the scalar broadcast). Rollouts that split into
several samples or misaligned lengths are rejected loudly.

Verified end-to-end on a 5-step reverse-text run: a custom strategy
emitting alternating scalar/scalar*0.5 advantages shows the exact
pattern in the trainer token export for all 128 sequences.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
algo/algorithm.py (Algorithm runtime + frozen pools + score_train_batch),
algo/strategies.py (AdvantageStrategy objects), algo/advantage.py (pure
math + the custom-fn interface, moved), algo/routing.py (wire stamping).
Public surface re-exported from the package root; the user-facing custom
advantage import is now prime_rl.orchestrator.algo.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…frozen_pool/connected_pools

The names sounded like prime-rl hosts the frozen models. It only opens
client pools to externally hosted endpoints — connecting, never
launching; shutdown closes clients, never servers. Docstrings aligned.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Replace per-token loss-type routing (one loss type per token) with a sum
of three components -- rl, ce, ref_kl -- each with its own per-token
weight stream and its own global normalization:

  L = sum(L_rl)/N_rl + sum(L_ce)/N_ce + sum(L_ref_kl)/N_ref_kl

Wire: loss_type / token_loss_types / token_loss_weights (and the
LossType enum) are replaced by optional rl_weights / ce_weights /
ref_kl_weights streams. Absent streams mean rl weight 1.0 on the loss
mask, so plain GRPO ships nothing extra and keeps the no-sync hot path.
A weight scales its component's per-token loss; 0.0 removes the token
from the component's mask and denominator; components may overlap on the
same token (gradients sum). Per-env component weights fold into the
streams orchestrator-side, so the trainer stays algorithm-blind.
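
A single-process sketch of the summed loss, under the assumption that each denominator N counts the tokens with a nonzero weight in that component. Function names and the list-based inputs are hypothetical; the trainer operates on tensors and gathers the denominators across ranks.

```python
def component_term(losses, weights):
    """sum(weight * loss) over member tokens, divided by the member count."""
    members = [(loss, w) for loss, w in zip(losses, weights) if w != 0.0]
    if not members:
        return 0.0  # an empty component contributes nothing
    return sum(loss * w for loss, w in members) / len(members)


def total_loss(rl_loss, ce_loss, ref_kl_loss, loss_mask,
               rl_weights=None, ce_weights=None, ref_kl_weights=None):
    n = len(loss_mask)
    if rl_weights is None:
        # Absent rl stream: weight 1.0 on the loss mask.
        rl_weights = [1.0 if m else 0.0 for m in loss_mask]
    if ce_weights is None:
        ce_weights = [0.0] * n
    if ref_kl_weights is None:
        ref_kl_weights = [0.0] * n
    return (
        component_term(rl_loss, rl_weights)
        + component_term(ce_loss, ce_weights)
        + component_term(ref_kl_loss, ref_kl_weights)
    )


# Token layout: prompt, action, observation, action.
loss_mask = [False, True, False, True]
rl_loss = [0.0, 2.0, 0.0, 4.0]
ce_loss = [0.0, 0.0, 5.0, 0.0]
ref_kl_loss = [0.0, 0.0, 0.0, 0.0]

plain = total_loss(rl_loss, ce_loss, ref_kl_loss, loss_mask)
echo = total_loss(rl_loss, ce_loss, ref_kl_loss, loss_mask,
                  ce_weights=[0.0, 0.0, 0.1, 0.0])
```

Adding the ce stream on the observation token changes only the ce term; the rl term keeps its denominator of two action tokens.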

Per-component normalization fixes the echo dilution bug: observation
tokens no longer flip into completion_mask (the ce stream trains them),
so they leave the rl denominator and the rl term's effective per-token
learning rate no longer scales with the batch's obs/action ratio. The
three denominators come from one batched all-reduce, so every rank
issues the same collective.

Smoke-validated 5 steps each (exit 0): grpo eval 0.13->0.53 (hot path),
echo on alphabet-sort (ce stream on obs tokens), opd eval 0.067->0.684
(ref_kl stream, rl stream zeroed, inline frozen reference).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…nput

Preset resolution moves from after-validators that mutate validated models
to a mode="before" merge on raw dicts: _PRESETS is now a plain table of
component deltas from grpo (the field defaults), merged under the user's
keys before any model is built. This deletes the preset lambda factories,
the __pydantic_fields_set__.discard provenance surgery, and the
resolve_algorithm fold/re-validate machinery — field provenance is now
exactly what the user wrote, so nothing needs fixing up after the fact.

- AlgorithmConfig components are non-optional with grpo defaults; presets
  encode only their deviation. A typeless advantage override now inherits
  the preset's strategy type (merge is discriminator-aware: a differing
  type replaces the strategy wholesale).
- The orchestrator/env `advantage` shorthands fold into algo.advantage in
  a before-validator; env algorithm inheritance stays a small
  after-validator (covers default-constructed envs). Each AlgorithmConfig
  validates exactly once.
- fold_model_shorthand is inlined back into the fold_model validator (no
  re-callable indirection needed anymore).

Validated: full unit suite, resolved-config round-trip, legacy [[env]]
layout, and a 5-step grpo smoke (reward 0.13 -> 0.55, 128/128 trainable).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…d model roles

The AdvantageStrategy layer dissolves into named runtime classes
(GRPOAlgorithm, OPDAlgorithm, OPSDAlgorithm, SFTDistillAlgorithm,
RewardAlgorithm, CustomAlgorithm) — each owns its group-time assign() and
ship-time score() directly, so reading a class top to bottom reads the
algorithm, and writing your own is subclassing Algorithm and overriding
the same two methods. Orchestration duplication between similar
algorithms (OPD/OPSD) is accepted; shared math stays as plain functions
in algo/advantage.py. Dispatch is keyed on advantage.type (preset names
are vetted parameterizations: echo builds GRPOAlgorithm with observation
routing); the reference pool moves from the strategy to the base class.

Algorithms also declare what they need: action_loss_type as a ClassVar
on the class, and model_role ("teacher" on the distillation algorithms),
which makes [orchestrator.algo.teacher] a parse-time alias for the model
shorthand and puts the same word in validation errors. Roles stay
algorithm-local: flow code and the wire keep branching on liveness only.

Config changes are additive (teacher alias, role-aware messages); the
preset/component layer and the wire are untouched.

Validated: full unit suite, 5-step grpo and self_distill GPU smokes
(both exit 0; OPSD scoring against the live policy pool).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…nizer leaves the runtime

Text -> token ids always goes through the renderer — the same path the
policy's own prompts take — so OPSD's demo-conditioned scoring prefix is
rendered with renderer.render_ids instead of the raw HF chat template.
demo_ref_kl now requires a renderer, validated at config time: rendering
the scoring prefix differently than the policy's prompts is rejected
rather than approximated, which is what the tokenizer fallback was.

With the fallback gone the tokenizer has no consumer left in the
algorithm runtime, so the constructor flattens to
Algorithm(config, policy_pool, renderer) — no context wrapper.

Validated: full unit suite, 5-step self_distill GPU smoke exit 0
(prefix rendered via Qwen3Renderer.render_ids against the policy pool).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…the env Sampler

An algorithm is credit assignment and loss routing, fused: one mapping from
a finalized rollout to per-token (loss component, weight). algo.loss is
deleted — echo becomes a proper advantage type (EchoAdvantageConfig /
EchoAlgorithm) and every preset delta is now a component-type swap. The
bundle's other half, sampling, gets its own runtime object: Sampler owns the
generating pool, frozen-source connection, sampling args, and liveness;
future replay/branching strategies extend there, not the algorithm.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Group-norm scalars on OPD rollouts were dead weight: ref_kl_loss_fn zeroes
the scalar gradient, so they only steered the DPPO mask direction and the
zero-advantage filter — which dropped uniform-reward OPD groups despite
their full teacher-KL signal. A/B on reverse-text (50 steps) ties with the
scalar baseline (0.825 vs 0.828).

OPDAlgorithm loses its assign override (advantage stays None, like OPSD);
RefKLAdvantageConfig drops group_relative and the dead length_penalty.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
ref_kl_loss_fn drops the dead advantage-sign machinery: scalar advantages
are never shipped by ref_kl algorithms, so the sign predicates were
uniformly false and the DPPO mask already degenerated to its low side —
write the one-sided trust region explicitly and stop reading
inputs.advantages (bit-identical for OPD/OPSD; deletes three
constant-zero metrics that diluted same-named rl metrics in mixed bins).

Config validation now also rejects frozen sampling.source for the
ref_kl-family advantages — ref_kl consumes the same live-policy sampling
logprobs as rl (importance ratio, trust region).

Dead code: Algorithm.name, the score_batch wrapper (strategy-era
indirection), ALGORITHM_CLASSES re-export, AdvantageOutputs None-entry
advantages (no producer since the no-scalar algorithms stopped calling
assign_advantages). The orchestrator.advantage shorthand default drops
its never-read constructed value.

Stale naming and docs: token scorers -> reference scoring, "env
algorithm's sampling pool" -> env sampler's pool, model-registry and
[orchestrator.models.*] references, rl-mode-batches loss docstring,
adv_tau pure-distillation claim, wrong [[orchestrator.filters]] key,
and docs/algorithms.md drops the Difficulty Pools / ODF sections —
those features were removed in the orchestrator v2 refactor (#2639).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ction

# Conflicts:
#	skills/configs/SKILL.md
#	src/prime_rl/trainer/rl/data.py
#	tests/unit/test_configs.py
MaxRL (arXiv:2602.02710) approximates maximum-likelihood training of the
implicit success probability instead of pass@1: the policy gradient
averaged over successful rollouts only is unbiased for the
order-group_size truncation of the ML objective's pass@k expansion. In
estimator form that is one change to GRPO — normalize the centered group
reward by the group MEAN instead of the standard deviation, upweighting
low-pass-rate examples like 1/p. group_size becomes the truncation order
(REINFORCE at 1, exact ML in the limit).

New 'max_rl' advantage type + preset: MaxRLAdvantageConfig,
max_rl_advantage_fn, MaxRLAlgorithm, a reverse-text debug config, docs
rows, and a unit test for the estimator. Groups with zero mean reward
carry zero advantages (the paper's no-success convention — the
zero-advantage filter drops them). Everything else rides the existing
GRPO path: policy sampling (enforced by the rl-component guard), rl
loss component, group barrier.
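
A minimal sketch of the estimator as described here: center by the group mean, then divide by the group mean instead of the standard deviation. The function name is a hypothetical stand-in and does not mirror the signature of max_rl_advantage_fn.

```python
def max_rl_advantages_sketch(rewards):
    """Centered group reward normalized by the group mean."""
    mean = sum(rewards) / len(rewards)
    if mean == 0:
        # No-success convention: the whole group carries zero advantage.
        return [0.0] * len(rewards)
    return [(r - mean) / mean for r in rewards]


hard = max_rl_advantages_sketch([1.0, 0.0, 0.0, 0.0])   # pass rate 0.25
easy = max_rl_advantages_sketch([1.0, 1.0, 1.0, 0.0])   # pass rate 0.75
unsolved = max_rl_advantages_sketch([0.0, 0.0, 0.0, 0.0])
```

The single success in the low-pass-rate group gets a much larger advantage than each success in the high-pass-rate group, which is the 1/p upweighting described above.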

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…self

A preset name with explicit advantage/sampling keys is now a parse-time
error instead of a merge: a modified preset is not the preset, so the
config must state what it actually runs. Only the model/teacher
shorthand may accompany a name (the distillation presets are incomplete
without an endpoint by design). Assembly stays cheap — presets are thin
deltas, so a variant costs one explicit 'type' key.

Deletes the merge machinery: _merge_preset_delta and the
discriminator-aware typeless override (advantage = { max_concurrent }
under opd silently inheriting ref_kl) are gone; the preset validator
inserts components, never merges. 'name' becomes write-only input sugar
(excluded from dumps, like 'model') so resolved configs round-trip as
plain component assemblies; the orchestrator startup log now reports
advantage types instead of preset labels. The advantage shorthand gets
a preset-aware error instead of silently relabeling an inherited
preset.

echo.toml's lambda override becomes the assembled spelling, and the
debug configs spell algo as an [orchestrator.algo] section instead of
an inline table.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
hallerite and others added 27 commits June 11, 2026 23:09
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Echo's selection surface generalizes from the observations="tool"|"all"
binary to a role table: each env-provided message role (system / user /
assistant / tool) trains at its own alpha, selected via the renderer's
per-token attribution. An optional filter hook (import_path + kwargs,
matching the custom advantage/loss precedent) narrows the selection per
rollout with one keep-mask per trajectory step.

- completion_obs_mask (bool) -> completion_obs_weights (float): the
  per-token weight carries its role's alpha, so stamping folds it into
  ce_weights directly and stamp_loss_routing drops the scalar
  observation_weight parameter. Orchestrator-internal as before.
- The echo preset is unchanged in meaning: tool-response bodies at 0.1.
  Setting any role replaces the whole table.
- Echo now always requires the renderer (role selection needs
  attribution); the blanket "all" mode is gone — assemble the roles
  you want instead.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…eted

The advantage type now names the algorithm — group_norm -> grpo,
ref_kl -> opd, demo_ref_kl -> opsd, supervised -> sft (config classes
renamed to match) — and each type's class defaults are its vetted
setting, so 'type = "opd"' with a teacher IS on-policy distillation.

With type-plus-defaults equal to the preset for every algorithm, the
preset layer had nothing left to do: AlgorithmName, _PRESETS, the name
field, and the atomicity guard are deleted. The model/teacher shorthand
survives and now folds by the type's own declarations (model_role ->
advantage.model for opd/opsd; source_role -> sampling.source for sft).
sampling.source loses its None state (it existed only for preset
resolution); sft without a frozen source is rejected at validation —
CE on the policy's own tokens was never the vetted meaning.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- docs/algorithms.md: ref_kl -> opd in the advantage tables, add the
  missing max_rl/reward/custom rows, fix the frozen-model base_url and
  custom-Algorithm wording, make the length_penalty example
  self-sufficient, drop the Per-Env Advantage section (duplicate of
  Per-Env Algorithms)
- configs/debug/algorithms: README gains max_rl and uses the real type
  names, comments lose leftover preset vocabulary, one wandb project
  for the whole folder
- docs/training.md / skills/configs/SKILL.md: complete type lists and
  a union example that parses on its own

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
A fully truncated distillation sample (prompt >= trainer seq_len) loses
all its nonzero ce/ref_kl tokens to prepare_sample's truncation while
its stamped all-zero rl_weights suppress the rl branch; with every
component empty, compute_loss returned the Python float 0.0 and
loss.backward() crashed. Seed the rl accumulator with a graph-attached
zero so the degenerate batch trains as a zero-gradient no-op (main's
behavior) and every rank still runs backward, keeping FSDP collectives
in sync.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- ref_kl_loss_fn emitted the same trust-region metric keys as the rl
  loss fn into one shared dict, so mixed batches (per-env algorithms)
  averaged two different trust-region definitions into one wandb
  series. Namespaced as ref_kl/*; the wandb noise filter gets matching
  prefixes and the ref_kl value series is unchanged.
- prepare_sample: a sample with rl member tokens but no advantage now
  raises instead of silently training with advantage 0.0 — the
  orchestrator always stamps a scalar, so a missing one is a producer
  bug (ce/ref_kl-only samples still default to 0.0 legitimately).
- pad_micro_batch: padding fills every weight stream with 0.0 instead
  of the pack-boundary defaults; padding is loss-masked so this is
  training-equivalent, and padded pure-ce batches now read as rl-empty
  in token export, which keys off nonzero weights.
- test_prepare_batch_packs_mixed_components: sorted() multiset checks
  replaced with exact positional asserts in both pack orders, pinning
  STREAM_FILL backfill alignment across the bin boundary.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
algo/algorithm.py (all eight classes in one file) splits into one
module per algorithm — grpo, echo, max_rl, opd, opsd, sft, reward,
custom — plus base.py holding Algorithm, connect_frozen_pool, and
score_train_batch. The dispatch table and build_algorithm move to the
package __init__; the shared group-norm assign moves to advantage.py
as assign_group_norm. No behavior change; external imports all go
through the package and are unchanged.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Echo's selection state (echo_roles / echo_filter_fn on the base class)
was pipeline-visible configuration for behavior that lived in
trajectories.py. Replace it with Algorithm.observation_weights(output)
— one per-token ce-weight list per trajectory step; None (the default)
masks all observations out. EchoAlgorithm owns the whole selection
(role table, attribution lookup, user filter + its shape validation);
interleave_rollout just validates alignment and slices each extension
span; the train sink calls the hook and passes data. A custom
algorithm can now implement any observation-token policy by overriding
one method instead of forking interleave_rollout.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… phase entry points

The three hooks are stages of one compilation (rollouts in, component
weight streams out), but the sink still hand-composed phase 1
(observation_weights + interleave_rollout). Algorithm.build_samples now
drives it, completing the pattern: the pipeline hands the algorithm its
rollout / group / batch (build_samples / finalize_group / score) and
never composes algorithm internals; subclasses override only the hooks.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…cher]

The alias existed; the shipped configs still used the role-neutral
'model' spelling. Configs should say what they mean — every
teacher-meaning table flips to 'teacher' (per Mika's review).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
There is never a scalar advantage anywhere in the pipeline:

- Wire: TrainingSample.advantage + token_advantages collapse into one
  advantages: list[float] | None stream (the fourth stream next to the
  rl/ce/ref_kl weights). None = no rl credit assigned (opd/opsd) — legal
  only for samples without live rl member tokens; prepare_sample keeps
  the producer-bug tripwire.
- TrainRollout carries the same single field, aligned to its samples'
  completion tokens (concatenated in step order); rollout dumps keep a
  scalar view (mean) for logging.
- Advantage-fn API: AdvantageOutputs deleted. Functions return
  list[list[float]] aligned to inputs.completion_lengths, with
  inputs.broadcast(...) spreading uniform group credit — GRPO's
  reward-minus-mean is internal math the fn broadcasts on the way out.
- stamp_advantages (replaces spread_token_advantages) pads prompt
  positions with 0.0 and slices the stream across samples.
- ZeroAdvantageFilter checks for all-zero streams; logged advantage
  distributions use per-rollout means.
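
A minimal sketch of the broadcast-then-stamp flow, assuming each sample is described only by its prompt and completion lengths. Both helpers are hypothetical stand-ins for inputs.broadcast(...) and stamp_advantages, whose real signatures are not shown here.

```python
def broadcast_sketch(scalars, completion_lengths):
    """Spread one scalar per rollout over that rollout's completion tokens."""
    return [[s] * n for s, n in zip(scalars, completion_lengths)]


def stamp_advantages_sketch(advantages, sample_shapes):
    """Slice a rollout's advantage stream across its samples.

    advantages: one value per completion token, concatenated in step order.
    sample_shapes: (prompt_len, completion_len) per sample, in step order.
    Returns one per-token stream per sample, with prompt positions at 0.0.
    """
    expected = sum(completion_len for _, completion_len in sample_shapes)
    if len(advantages) != expected:
        raise ValueError("advantage stream does not match completion tokens")
    streams = []
    offset = 0
    for prompt_len, completion_len in sample_shapes:
        chunk = advantages[offset:offset + completion_len]
        streams.append([0.0] * prompt_len + list(chunk))
        offset += completion_len
    return streams


# Two rollouts with reward-minus-mean credit of +0.5 and -0.5.
per_rollout = broadcast_sketch([0.5, -0.5], [3, 2])
# The first rollout split into two samples: (prompt 2, completion 1) and (prompt 3, completion 2).
stamped = stamp_advantages_sketch(per_rollout[0], [(2, 1), (3, 2)])
```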

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… frozen-source runs

The renderer object and the renderer client were conflated: frozen-source
runs (sft) had renderer=None forced on them because the renderer-client
sampling path doesn't apply to an external endpoint — which also denied
them the renderer as the canonical messages -> ids path, so sft backfill
fell back to apply_chat_template with the longest-common-prefix repair.

Decoupled: the renderer object exists whenever configured (sft backfill
tokenizes the teacher's messages into the student's token space through
it); the renderer-client sampling path is wired onto the policy pool only
when a train env actually samples from the policy. The
_force_no_renderer_without_policy_sampling validator is gone; pool_size
is rejected when the sampling path never runs. The sft configs now set
the student's renderer explicitly.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Two statements of the same boundary:

- Constructor: Algorithm(advantage, policy_pool, renderer) — the
  component it interprets plus the two host-owned resources. The bundle
  dissolves at construction (build_algorithm dispatches on it, then
  passes only the advantage; the sibling Sampler already took only
  sampling). Nothing in the runtime reads config.sampling — sampling
  provenance reaches algorithms on the rollouts themselves.
- setup() moves into the subclasses that connect something: the base
  keeps the no-op hook plus connect() (resolving "policy" to the host's
  pool untracked, frozen references to fresh tracked pools); opd and
  opsd own self.teacher_pool under the role name they declare. The
  base class no longer reaches into subclass config fields, and the
  algorithms without references stop carrying a dead reference_pool.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The class surface is now exactly the override surface: two declarations
(action_loss_type, model_role), lifecycle (setup/connect), and three
hooks named for what they produce — observation_weights /
assign_advantages (was assign) / score. The drivers leave the class and
join score_train_batch as module-level phase functions the pipeline
calls: build_samples(algorithm, ...) per arrival, finalize_group(
algorithm, ...) per group, score_train_batch(...) per batch ship.

advantage.py's assign_advantages helper becomes apply_advantage_fn,
freeing the name for the hook and saying what it does: run an advantage
function over one group and write the streams.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
assign_advantages runs before filtering (filters read the streams);
query_references (was score) runs after, on survivors only. The third
hook is gone: observation_weights leaked interleave's step coordinates
into the API and its arrival timing was convenience, not constraint.

Interleaving now records observation-token provenance generically —
obs_spans on each sample map merged completion positions back to
trajectory-step coordinates ([completion_start, step_idx,
step_prompt_start, length]), the provenance-completing sibling of
completion_mask. Echo consumes the spans at group time inside its
assign_advantages: role attribution looked up lazily from rollout.raw,
user filter applied, ce weights written; everything echo-specific now
lives in echo.py. Recording the merge's decision (rather than handing
the algorithm in early or re-deriving the walk) keeps agreement with
interleave structural instead of disciplinary.
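
As a rough illustration of how a consumer might read obs_spans, the sketch
below maps a merged completion position back to trajectory-step coordinates.
The span semantics are an assumption inferred from the field names above
(each span covers `length` merged positions starting at `completion_start`,
mirroring step `step_idx` from `step_prompt_start` onward), and `locate` is
a hypothetical helper, not a function in the PR.

```python
def locate(obs_spans, completion_pos):
    """Return (step_idx, step_position) for a merged completion position.

    Assumes each span is (completion_start, step_idx, step_prompt_start, length).
    Returns None when the position is not an observation token in any span.
    """
    for completion_start, step_idx, step_prompt_start, length in obs_spans:
        offset = completion_pos - completion_start
        if 0 <= offset < length:
            return step_idx, step_prompt_start + offset
    return None


# Two made-up observation spans on one merged sample.
obs_spans = [(4, 1, 10, 3), (12, 2, 20, 2)]

inside_first = locate(obs_spans, 5)    # second token of the first span
inside_second = locate(obs_spans, 13)  # second token of the second span
outside = locate(obs_spans, 8)         # an action token between the spans
```

Because the spans record the merge's own decision, a lookup like this never
has to re-walk the trajectory to agree with interleave.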

Sample construction is pure pipeline again: build_samples is gone and
the sink calls interleave_rollout directly; score_train_batch is
renamed query_batch_references to match its hook.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…bs_weights dies

The intermediate field was a message passed between two halves of
finalize_group through mutable state on the wire struct. Echo now writes
sample.ce_weights itself (the ce component's membership IS its weights —
the trainer never ANDs them with the loss mask), the same way rlcsd
writes sample.advantages directly. stamp_loss_routing shrinks to its one
job: route action tokens into the declared component via the loss mask,
merging with (never clobbering) streams an algorithm already wrote.
Samples where echo selects nothing ship no ce stream at all.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… pair

Advantage functions receive list[TrainRollout] — the same objects the
hooks see — instead of the AdvantageInputs side-struct, which is
deleted. Its two jobs were already subsumed: completion lengths derive
from rollout.samples (broadcast(rollouts, values) is now a module
helper), and its extensibility role belongs to the rollout itself per
the provenance principle. A custom advantage fn is now exactly the
assign_advantages hook body without the class.

The ship-phase driver query_batch_references becomes finalize_batch,
pairing with finalize_group: drivers are named by their barrier, hooks
by their action.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The algorithm surface becomes three hooks, one per pipeline barrier, each
handed a RolloutView (a writable handle exposing only what is valid at its
stage — raw, samples, reward, assign_advantages — never not-yet-assigned
credit or pipeline-internal lifecycle fields):

- score_rollout(rollout)  — one rollout, on arrival: rollout-local credit
  (reward) or observation ce weights (echo, via obs_spans). No siblings.
- score_group(group)      — the cohort, before filtering: group-relative
  credit (grpo / max_rl / sft / custom).
- score_batch(batch)      — survivors, after filtering, async: the only
  stage with model access (opd / opsd teacher prefills).

Scope and timing are one ladder — each wider scope is unlocked by a later
barrier — so model access naturally attaches to the batch stage where the
I/O batches. This homes the two cases the old two-hook API mislocated:
reward was forced through the group hook despite being rollout-local, and
echo's observation weighting was a loop inside the group hook; both now sit
at their honest stage.

RolloutView (orchestrator/types.py) wraps TrainRollout and folds the old
broadcast helper into assign_advantages(scalar | list): a scalar broadcasts
over the rollout's completion tokens, a list is taken per-token. Advantages
stay per-token everywhere stored or shipped — the scalar is only a write
convenience. Advantage fns now take list[RolloutView] and return per-rollout
scalars (default/max_rl) or per-token lists (custom); broadcast is deleted.
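
A minimal sketch of the scalar-or-list write convenience described above.
`RolloutViewSketch`, `completion_len`, and `advantages` are illustrative
stand-ins, not the actual RolloutView in orchestrator/types.py, which wraps
a TrainRollout and may span several samples.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class RolloutViewSketch:
    """Hypothetical stand-in for RolloutView; only the write path is modeled."""

    completion_len: int  # number of completion tokens (assumed field)
    advantages: list[float] = field(default_factory=list)

    def assign_advantages(self, value: float | list[float]) -> None:
        if isinstance(value, (int, float)):
            # Scalar: broadcast over every completion token.
            self.advantages = [float(value)] * self.completion_len
            return
        # List: taken per token, so its length must match the completion.
        if len(value) != self.completion_len:
            raise ValueError("per-token advantages must match completion length")
        self.advantages = [float(v) for v in value]


view = RolloutViewSketch(completion_len=3)

view.assign_advantages(0.5)  # scalar write
scalar_result = list(view.advantages)

view.assign_advantages([0.1, 0.2, 0.3])  # per-token write
list_result = list(view.advantages)
```

Either way the stored stream is per-token; the scalar form only saves the
caller from building the repeated list.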

The pipeline drives the hooks through three module-level phase functions
(finalize_rollout in process_rollout, finalize_group, finalize_batch); it
never branches on algorithm type. 490 unit tests; grpo + echo (multi-turn)
GPU smokes green across the rollout and group stages.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
score_rollout / score_group / score_batch are now all `async def`, and the
module-level drivers (finalize_rollout, finalize_group) await them.
score_batch was already async (teacher prefills); making the other two async
is essentially free and lets any stage do I/O — a process reward model at
arrival, or a judge at group time whose signal a pre-batch filter then reads.
A hook that only does advantage math simply never awaits.
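
The shape of this change can be sketched as follows. The class names, the
dict-based rollouts, and the mean-centering body are assumptions for
illustration; only the pattern (coroutine hooks, an awaiting driver, a hook
body that never awaits) comes from the description above.

```python
import asyncio


class AlgorithmSketch:
    """Hypothetical base class: hooks are coroutines with no-op defaults."""

    async def score_rollout(self, rollout: dict) -> None:
        return None

    async def score_group(self, group: list) -> None:
        return None


class MeanCenteredSketch(AlgorithmSketch):
    async def score_group(self, group: list) -> None:
        # Pure advantage math: the body never awaits, so it runs straight through.
        mean = sum(r["reward"] for r in group) / len(group)
        for r in group:
            r["advantage"] = r["reward"] - mean


async def finalize_group(algorithm: AlgorithmSketch, group: list) -> None:
    # The driver awaits the hook whether or not the hook performs any I/O.
    await algorithm.score_group(group)


group = [{"reward": 1.0}, {"reward": 0.0}]
asyncio.run(finalize_group(MeanCenteredSketch(), group))
advantages = [r["advantage"] for r in group]
```

A hook that does need I/O (for example a judge call) would simply await
inside the same method, with no change to the driver.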

TrainSink.process_group becomes async to await finalize_group (add /
process_rollout were already async). The per-algorithm overrides
(reward.score_rollout; grpo/max_rl/sft/custom.score_group; echo.score_rollout)
gain the `async` keyword; their bodies are unchanged. Unit tests wrap the
now-coroutine hook calls in asyncio.run(). 28 algorithm unit tests green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Teacher-sourced SFT distillation rolls the teacher out message-based, then
backfills the transcript into the student's token space for the CE target.
That backfill must be faithful to the student's actual chat template, because
the student is sampled and parsed in its own format at inference. The
reverse-text SFT configs set the renderer to `qwen3`, which reimplements the
stock Qwen format and injects an empty `<think></think>` block — but the
student (Qwen3-0.6B-Reverse-Text-SFT) ships a custom template that emits no
such block. The injected tokens corrupt the distillation target and slow
convergence (~step 13 vs main's ~step 8).

Switch the backfill renderer to `default` (DefaultRenderer), which wraps the
student tokenizer's apply_chat_template and is faithful to any custom template
— matching main, where the renderer is forced off for teacher-SFT so backfill
goes straight through apply_chat_template. In a pure-SFT run
orchestrator.renderer is used only for backfill (the student never samples;
the teacher does, message-based; eval is server-side), so this is the complete
fix. Applies to all four teacher-SFT configs: the reverse_text_rl_sft CI smoke
and the three debug sft_distill variants.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The sft/opd smokes ran only 5 steps and asserted eval reward rose between the
first and last eval — too few steps for distillation to surface (~step 13)
and dominated by truncation noise on 16 examples, making the test flaky. Bump
max_steps to 20 and replace the endpoint-compare with
check_final_eval_reward_above, which asserts the final eval reward clears a
threshold (0.5): a stable signal that the run actually converged.
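
To make the difference between the two assertions concrete, here is a small
sketch. Both functions are hypothetical stand-ins (the real
check_final_eval_reward_above may take different arguments), and the reward
series is made-up illustrative data, not a logged run.

```python
def rose_between_endpoints(eval_rewards):
    """Old style: pass only if the last eval beats the first eval."""
    return eval_rewards[-1] > eval_rewards[0]


def final_reward_above(eval_rewards, threshold=0.5):
    """New style: pass if the final eval clears a fixed threshold."""
    return eval_rewards[-1] > threshold


# Illustrative series: the run is well above threshold but noisy at the ends.
noisy_but_converged = [0.62, 0.55, 0.71, 0.60]

endpoint_verdict = rose_between_endpoints(noisy_but_converged)
threshold_verdict = final_reward_above(noisy_but_converged)
```

On a series like this the endpoint comparison fails on noise alone, while
the threshold check reports what the test actually cares about.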

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Reconcile main's #2723 (sequence-packing rewrite) and #2798 (IPO loss)
with the per-token-stream component architecture. The component model is
the spine; main's two feature PRs fold into it:

- batch.py: adopt main's bin-based packer (_MicroBatchBin, FFD, FLOP-aware
  balanced_partition) and graft the component weight streams
  (rl/ce/ref_kl + STREAM_FILL backfill) and ref_logprobs into
  prepare_sample / _materialize_bin / pad_micro_batch. The training_mode
  packing gate is dropped — loss routing is per token, so heterogeneous
  samples pack together freely.
- loss.py: port ipo_loss_fn into the component model (IPOLossConfig becomes
  an rl-loss option in setup_rl_loss_fn, with the component loss_weights
  applied); main's training_mode dispatch (setup_loss_fns / opd / sft)
  is dropped.
- transport/types.py: MicroBatch keeps ref_logprobs + the weight streams
  and gains main's sequence_lengths; teacher_logprobs / training_mode gone.
- train.py: split the packed sequence by micro_batch.sequence_lengths
  (#2723) in the component compute_loss call.
- test_batch.py: keep main's packer/balance tests + the PR's stream tests;
  drop the training_mode-packing test (replaced by the mixed-component
  packing test); identify real/padding batches by content, since
  balanced_partition reorders.
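
For readers unfamiliar with FFD, the following is a generic
first-fit-decreasing sketch over token lengths. It is not main's
_MicroBatchBin packer: it ignores FLOP-aware balancing and the weight
streams, and `first_fit_decreasing` and `capacity` are names chosen here.

```python
def first_fit_decreasing(lengths, capacity):
    """Pack sample indices into bins of at most `capacity` tokens.

    Samples are visited longest first; each goes into the first bin with room.
    """
    bins = []   # each bin is a list of sample indices
    loads = []  # running token count per bin
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    for idx in order:
        n = lengths[idx]
        if n > capacity:
            raise ValueError(f"sample {idx} ({n} tokens) exceeds capacity")
        for b, load in enumerate(loads):
            if load + n <= capacity:
                bins[b].append(idx)
                loads[b] += n
                break
        else:
            # No existing bin had room: open a new one.
            bins.append([idx])
            loads.append(n)
    return bins


bins = first_fit_decreasing([700, 300, 500, 200, 400], capacity=1000)
```

Note that the output order differs from the input order, which is the same
reason the tests above identify batches by content rather than position.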

176 CPU unit tests pass; ruff clean; trainer modules import.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds the MAI linear length penalty (from #2702) to the GRPO-family
algorithms on top of the algorithm abstraction. A new `linear` arm joins
the existing `tokens` / `turns` length penalties, plus an optional
`length_weighted_baseline` on GRPO.

The linear penalty is modeled as a *separate additive advantage*: each
reward is reduced by `coef * pass_rate * (completion tokens / seq_len)`
before centering, which — because centering is linear — is identical to
summing a standalone, group-centered penalty advantage onto plain GRPO:

    center(reward - penalty) = center(reward) + center(-penalty)

So `advantage.py` exposes pure, cohesive functions — `grpo_advantage`,
`length_penalty_advantage`, `efficiency_shaping_advantage` — and
`GRPOAlgorithm.score_group` composes GRPO + linear penalty by summation
instead of folding extra knobs into one shared function. `tokens`/`turns`
stay non-additive shaping; `length_weighted_baseline` stays a GRPO baseline
choice.
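
The identity can be checked numerically with a few lines. This is a sketch
with made-up rewards and lengths; `center` and `linear_penalty` are
illustrative helpers written for this example, not the functions in
advantage.py, and gating by correctness is left out.

```python
def center(values):
    """Subtract the plain group mean from every value."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def linear_penalty(rewards, lengths, coef, seq_len):
    """coef * pass_rate * (completion tokens / seq_len), per rollout."""
    pass_rate = sum(rewards) / len(rewards)  # the group's mean reward
    return [coef * pass_rate * (n / seq_len) for n in lengths]


rewards = [1.0, 1.0, 0.0, 1.0]
lengths = [100, 400, 250, 50]
penalty = linear_penalty(rewards, lengths, coef=0.1, seq_len=1000)

# Folded: penalize the reward first, then center.
folded = center([r - p for r, p in zip(rewards, penalty)])

# Summed: center each term on its own, then add.
summed = [a + b for a, b in zip(center(rewards), center([-p for p in penalty]))]

# A never-solved group has pass_rate 0, so it receives no penalty at all.
unsolved = linear_penalty([0.0, 0.0], [10, 20], coef=0.1, seq_len=1000)
```

`folded` and `summed` agree up to floating-point rounding, which is what
lets the penalty live in its own function and be added onto plain GRPO.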

`max_seq_len` (the penalty denominator, = orchestrator.seq_len) is injected
onto the Algorithm by `build_algorithm` and threaded from the orchestrator
through TrainEnvs, so it touches only the GRPO path rather than every
algorithm's constructor.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…th-weighted baseline

The summed decomposition is exact only when both terms use the same
baseline operator. length_penalty_advantage centered the penalty by the
plain mean unconditionally, so combining the linear penalty with
length_weighted_baseline diverged from #2702 (which length-weight-centered
reward-minus-penalty). Thread length_weighted_baseline into the penalty term
so it centers the same way as the GRPO term.

Verified numerically against a faithful #2702 reimplementation: all four
gate_by_correctness × length_weighted_baseline combinations are now
bit-for-bit identical. Adds a regression test for the combined path.
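
A small numerical sketch of why the baseline operators must match. The
helpers and the numbers are illustrative assumptions, not the PR's code; the
length-weighted baseline follows the sum(len_i * x_i) / sum(len_i) form.

```python
def center_mean(values, lengths):
    """Center by the plain mean (lengths unused; kept for a uniform signature)."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def center_length_weighted(values, lengths):
    """Center by the length-weighted baseline sum(len_i * x_i) / sum(len_i)."""
    baseline = sum(n * v for n, v in zip(lengths, values)) / sum(lengths)
    return [v - baseline for v in values]


def add(a, b):
    return [x + y for x, y in zip(a, b)]


rewards = [1.0, 0.0, 1.0]
lengths = [100, 300, 600]
penalty = [0.01, 0.03, 0.06]  # arbitrary per-rollout penalties for illustration
neg_penalty = [-p for p in penalty]

# Reference: length-weight-center reward-minus-penalty in one pass.
folded = center_length_weighted(
    [r - p for r, p in zip(rewards, penalty)], lengths
)

# Same operator on both terms: reproduces the folded result.
matched = add(
    center_length_weighted(rewards, lengths),
    center_length_weighted(neg_penalty, lengths),
)

# Penalty centered by the plain mean instead: drifts by a constant offset.
mismatched = add(
    center_length_weighted(rewards, lengths),
    center_mean(neg_penalty, lengths),
)

max_matched_gap = max(abs(a - b) for a, b in zip(folded, matched))
max_mismatched_gap = max(abs(a - b) for a, b in zip(folded, mismatched))
```

With these numbers the mismatched path is off by roughly 0.0127 on every
rollout, the gap between the two baselines of the penalty term.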

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@hallerite hallerite force-pushed the feat/algorithm-abstraction branch from 87c38c7 to d33b3f4 Compare June 24, 2026 05:56
Base automatically changed from feat/algorithm-abstraction to main June 27, 2026 19:15