Add linear length penalty as a summed GRPO advantage #2829
Draft
hallerite wants to merge 66 commits into
…d, sft_distill, self_distill, echo)
Replace the global training_mode enum with a per-env Algorithm abstraction: a preset bundle of (1) sampling source, (2) scoring (group advantage + async token scorer), and (3) per-token loss routing. The trainer becomes algorithm-blind: routing ships per token on the wire and the trainer executes three fixed loss cores (rl / ce / teacher_kl).
- configs: new prime_rl.configs.algorithm with AlgorithmConfig presets, component-level overrides, compatibility validation (incl. the group-relative-advantage-with-group_size=1 footgun warning); training_mode kept as a deprecated alias
- orchestrator: per-env algorithm; dispatcher selects student/teacher pool per env (no mode branches); OPD teacher logprobs moved out of finalize_train_batch into a bounded-concurrency token scorer; demo-conditioned teacher scorer for SDFT; interleave_rollout can tag env-observation tokens for ECHO
- wire: TrainingSample/MicroBatch carry loss_core + optional per-token cores/weights/advantages (omit_defaults; plain GRPO wire unchanged); packer no longer bins by mode
- trainer: unified per-token loss routing, bit-for-bit with the previous rl/opd/sft loss fns on pure batches
Validated: 443 CPU unit tests + GPU loss/batch tests; live 2-GPU smoke runs for grpo (reverse_text), opd (teacher pool + alias path), and echo (multi-turn alphabet-sort, per-token routing verified on the wire).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…hm strategy object
"Teacher" is no longer a concept anywhere in the system. There is the live policy (reserved registry key "policy") and named frozen hosted models under [orchestrator.models.<key>]; algorithm components hold references into that registry. The same entry can serve any number of envs' algorithms, and self_distill can point its demo scorer at "policy" itself: the SDFT paper's setting, zero extra deployments.
- configs: scorer types logprobs/demo_logprobs with required model refs; sampling.source is a registry key; algorithm.model shorthand folds into the unresolved component; orchestrator.teacher and training_mode deleted; student renamed policy; registry validation (refs resolve, entries used, "policy" reserved, degenerate logprobs@policy rejected)
- runtime: ModelRegistry + per-env Algorithm strategy object as the sole interpreter of AlgorithmConfig; dispatcher/sink/orchestrator call hooks and never branch on algorithm config; liveness drives cache salting, sampling logprobs, and off-policy aging (frozen-sourced rollouts no longer age)
- wire/trainer: ref_logprobs, LOSS_CORE_REF_KL, loss action ref_kl, time/scoring metric
- fixes found by the new SDFT smoke: resolved-config round-trip (shorthands are now write-only / excluded from dumps) and apply_chat_template returning BatchEncoding on newer transformers
- configs/debug/training_modes -> configs/debug/algorithms (+ self_distill.toml running SDFT against the live policy); docs/skills updated
Smokes (2 GPU, 5 steps each): grpo 0.120->0.382, opd-via-registry 0.147->0.647, self_distill-vs-policy 0.068->0.181, echo multi-turn 32/32 trainable.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…gy ontology Every training signal is an advantage — varying in granularity (group-scalar vs per-token) and evaluation site (orchestrator vs trainer). The advantage union absorbs the token scorers (logprobs -> ref_kl, demo_logprobs -> demo_ref_kl), the action-token loss core derives from the strategy instead of being configured (loss.action deleted), and runtime AdvantageStrategy objects own both execution points: group-time assign() and ship-time score(). Also fixes a shorthand-folding regression: resolve_preset's component assignment polluted model_fields_set, so any [orchestrator.advantage] shorthand differing from the preset raised a bogus conflict error. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ction
# Conflicts:
#   packages/prime-rl-configs/src/prime_rl/configs/orchestrator.py
#   src/prime_rl/orchestrator/dispatcher.py
#   src/prime_rl/orchestrator/orchestrator.py
A bin mixing ref-bearing samples (opd/self_distill) with ref-less ones (grpo/echo) extended ref_logprobs without backfilling or padding, shifting it out of alignment with input_ids. Mirror the rewards/loss_core_ids pattern with 0.0 placeholders (already the outside-the-mask filler used by the demo scorer and pad_micro_batch). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
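The backfill-and-pad pattern described above can be sketched as follows. This is a minimal illustration, not the PR's packer: `Sample`, `pack_bin`, and `REF_FILL` are hypothetical names, and the real bin carries more parallel arrays than the two shown here.

```python
from dataclasses import dataclass
from typing import Optional

REF_FILL = 0.0  # placeholder for tokens that carry no reference logprob


@dataclass
class Sample:
    input_ids: list[int]
    # None for ref-less samples (grpo/echo); one value per token otherwise.
    ref_logprobs: Optional[list[float]] = None


def pack_bin(samples: list[Sample]) -> tuple[list[int], Optional[list[float]]]:
    """Concatenate samples while keeping ref_logprobs aligned with input_ids."""
    input_ids: list[int] = []
    ref_logprobs: Optional[list[float]] = None
    for sample in samples:
        if sample.ref_logprobs is not None and ref_logprobs is None:
            # First ref-bearing sample: backfill everything packed so far.
            ref_logprobs = [REF_FILL] * len(input_ids)
        if ref_logprobs is not None:
            if sample.ref_logprobs is not None:
                ref_logprobs.extend(sample.ref_logprobs)
            else:
                # Ref-less sample after a ref-bearing one: pad with placeholders.
                ref_logprobs.extend([REF_FILL] * len(sample.input_ids))
        input_ids.extend(sample.input_ids)
    # Misaligned parallel arrays should fail at pack time, not during training.
    if ref_logprobs is not None and len(ref_logprobs) != len(input_ids):
        raise ValueError("ref_logprobs is misaligned with input_ids")
    return input_ids, ref_logprobs
```

A bin made only of ref-less samples returns `None` for the stream, so nothing extra is carried in that case.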
Misaligned parallel arrays (the ref_logprobs packing bug class) now fail loudly at pack time instead of corrupting training silently. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The config surface key is now [orchestrator.algo] (per-env: algo = {...});
the wire/trainer routing vocabulary is loss_type (LOSS_TYPE_RL/CE/REF_KL,
TrainingSample.loss_type, token_loss_types, MicroBatch.loss_type_ids,
advantage.action_loss_type). Also scrubs stale token-scorer mentions from
the ref_kl error message and the configs skill.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The field is now `model` (HostedModelConfig); `[orchestrator.policy]` and `[orchestrator.student]` fold in as aliases, with the canonical key winning at the leaf so CLI --model.<k> overrides aliased TOML. Flat ModelConfig keys still re-nest ([orchestrator.model] name = ...). Shared-field propagation checks all spellings for conflicts. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Two envs with different algorithms in one run — exercises heterogeneous train batches (ref_logprobs-bearing OPD samples packed with ref-less GRPO samples). Validated 50 steps on 2 GPUs, eval 0.652->0.836. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…r references Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… registry prime-rl now assumes it only ever hosts the trainable policy. Frozen models are external endpoints declared inline on the algorithm component that uses them (FrozenModelConfig: model.name + required client.base_url) — no more [orchestrator.models] namespace or runtime ModelRegistry. Each env's Algorithm builds and readies its own frozen pools in async setup(); the dispatcher reads algorithm.sampling_pool and gets the policy pool directly. References are "policy" | inline config; demo_ref_kl now defaults to "policy" (the SDFT setting needs zero config). The algo.model shorthand folds with fill-or-agree semantics, which also fixes the two Bugbot findings (redundant-but-consistent model rejected; advantage shorthand clearing a folded model). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
A frozen model reference is the client config we already have plus the one
request-level datum it lacks: the served model's name. Drops the nested
{model, client} shape — TOML reads `[orchestrator.algo.model]` with
name + base_url. Also fixes the rl entrypoint's frozen-endpoint warning,
which still read the deleted [orchestrator.models] dict.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The registry's reserved-key concept is gone; "policy" survives only as the Literal arm of ModelReference, which the type already enforces. Inline the string at the comparison sites. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Replace the LOSS_TYPE_* int constants with a LossType IntEnum. The scalar TrainingSample.loss_type field carries the enum (msgspec validates membership on decode, so a corrupt wire value fails loudly); the per-token arrays stay list[int] — the trainer tensorizes them immediately, so per-token enum wrapping on decode would be pure overhead. Wire bytes are unchanged: members encode as plain ints. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
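A small sketch of why an IntEnum keeps the wire bytes unchanged while still rejecting corrupt values. The member values below are assumptions for illustration, and the standard-library `json` module stands in for the msgspec encoding the commit refers to.

```python
import json
from enum import IntEnum


class LossType(IntEnum):
    # Numeric values are assumed here; the PR does not state them.
    RL = 0
    CE = 1
    REF_KL = 2


def encode_sample(loss_type: LossType) -> str:
    # IntEnum members serialize as plain ints, so the payload is unchanged.
    return json.dumps({"loss_type": loss_type})


def decode_loss_type(raw: int) -> LossType:
    # Constructing the enum validates membership; unknown ints raise ValueError.
    return LossType(raw)
```

Per-token arrays would stay plain `list[int]` in this picture, since wrapping every element in the enum buys nothing once the values are tensorized.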
…gies AdvantageOutputs.token_advantages carries optional per-rollout lists aligned to completion tokens; finalize_group pads prompt positions and stamps TrainingSample.token_advantages (the trainer side already preferred it over the scalar broadcast). Rollouts that split into several samples or misaligned lengths are rejected loudly. Verified end-to-end on a 5-step reverse-text run: a custom strategy emitting alternating scalar/scalar*0.5 advantages shows the exact pattern in the trainer token export for all 128 sequences. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
algo/algorithm.py (Algorithm runtime + frozen pools + score_train_batch), algo/strategies.py (AdvantageStrategy objects), algo/advantage.py (pure math + the custom-fn interface, moved), algo/routing.py (wire stamping). Public surface re-exported from the package root; the user-facing custom advantage import is now prime_rl.orchestrator.algo. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…frozen_pool/connected_pools The names sounded like prime-rl hosts the frozen models. It only opens client pools to externally hosted endpoints — connecting, never launching; shutdown closes clients, never servers. Docstrings aligned. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Replace per-token loss-type routing (one loss type per token) with a sum of three components -- rl, ce, ref_kl -- each with its own per-token weight stream and its own global normalization:

    L = sum(L_rl)/N_rl + sum(L_ce)/N_ce + sum(L_ref_kl)/N_ref_kl

Wire: loss_type / token_loss_types / token_loss_weights (and the LossType enum) are replaced by optional rl_weights / ce_weights / ref_kl_weights streams. Absent streams mean rl weight 1.0 on the loss mask, so plain GRPO ships nothing extra and keeps the no-sync hot path. A weight scales its component's per-token loss; 0.0 removes the token from the component's mask and denominator; components may overlap on the same token (gradients sum). Per-env component weights fold into the streams orchestrator-side, so the trainer stays algorithm-blind.

Per-component normalization fixes the echo dilution bug: observation tokens no longer flip into completion_mask (the ce stream trains them), so they leave the rl denominator and the rl term's effective per-token learning rate no longer scales with the batch's obs/action ratio. The three denominators come from one batched all-reduce, so every rank issues the same collective.

Smoke-validated 5 steps each (exit 0): grpo eval 0.13->0.53 (hot path), echo on alphabet-sort (ce stream on obs tokens), opd eval 0.067->0.684 (ref_kl stream, rl stream zeroed, inline frozen reference).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
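The summed loss can be sketched in plain Python on a single process. This is an illustration under stated assumptions: per-token losses arrive as lists, N counts the tokens with nonzero weight in each component, and absent ce / ref_kl streams are treated as all-zero. The function names are hypothetical, and the real trainer works on tensors with an all-reduce for the denominators.

```python
from typing import Optional


def component_term(losses: list[float], weights: list[float]) -> float:
    """Weighted sum of one component, normalized by its own member count."""
    members = [(loss, w) for loss, w in zip(losses, weights) if w != 0.0]
    if not members:
        return 0.0  # an empty component contributes nothing
    return sum(loss * w for loss, w in members) / len(members)


def summed_loss(
    rl: list[float],
    ce: list[float],
    ref_kl: list[float],
    loss_mask: list[bool],
    rl_weights: Optional[list[float]] = None,
    ce_weights: Optional[list[float]] = None,
    ref_kl_weights: Optional[list[float]] = None,
) -> float:
    n = len(loss_mask)
    if rl_weights is None:
        # Absent stream: rl weight 1.0 on the loss mask (the plain GRPO path).
        rl_weights = [1.0 if m else 0.0 for m in loss_mask]
    if ce_weights is None:
        ce_weights = [0.0] * n  # assumption: absent ce stream means no ce tokens
    if ref_kl_weights is None:
        ref_kl_weights = [0.0] * n  # assumption: same for ref_kl
    return (
        component_term(rl, rl_weights)
        + component_term(ce, ce_weights)
        + component_term(ref_kl, ref_kl_weights)
    )
```

Because each component divides by its own count, adding ce-trained observation tokens to a batch leaves the rl term untouched, which is the dilution fix the commit describes.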
…nput
Preset resolution moves from after-validators that mutate validated models to a mode="before" merge on raw dicts: _PRESETS is now a plain table of component deltas from grpo (the field defaults), merged under the user's keys before any model is built. This deletes the preset lambda factories, the __pydantic_fields_set__.discard provenance surgery, and the resolve_algorithm fold/re-validate machinery. Field provenance is now exactly what the user wrote, so nothing needs fixing up after the fact.
- AlgorithmConfig components are non-optional with grpo defaults; presets encode only their deviation. A typeless advantage override now inherits the preset's strategy type (merge is discriminator-aware: a differing type replaces the strategy wholesale).
- The orchestrator/env `advantage` shorthands fold into algo.advantage in a before-validator; env algorithm inheritance stays a small after-validator (covers default-constructed envs). Each AlgorithmConfig validates exactly once.
- fold_model_shorthand is inlined back into the fold_model validator (no re-callable indirection needed anymore).
Validated: full unit suite, resolved-config round-trip, legacy [[env]] layout, and a 5-step grpo smoke (reward 0.13 -> 0.55, 128/128 trainable).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…d model roles
The AdvantageStrategy layer dissolves into named runtime classes
(GRPOAlgorithm, OPDAlgorithm, OPSDAlgorithm, SFTDistillAlgorithm,
RewardAlgorithm, CustomAlgorithm) — each owns its group-time assign() and
ship-time score() directly, so reading a class top to bottom reads the
algorithm, and writing your own is subclassing Algorithm and overriding
the same two methods. Orchestration duplication between similar
algorithms (OPD/OPSD) is accepted; shared math stays as plain functions
in algo/advantage.py. Dispatch is keyed on advantage.type (preset names
are vetted parameterizations: echo builds GRPOAlgorithm with observation
routing); the reference pool moves from the strategy to the base class.
Algorithms also declare what they need: action_loss_type as a ClassVar
on the class, and model_role ("teacher" on the distillation algorithms),
which makes [orchestrator.algo.teacher] a parse-time alias for the model
shorthand and puts the same word in validation errors. Roles stay
algorithm-local: flow code and the wire keep branching on liveness only.
Config changes are additive (teacher alias, role-aware messages); the
preset/component layer and the wire are untouched.
Validated: full unit suite, 5-step grpo and self_distill GPU smokes
(both exit 0; OPSD scoring against the live policy pool).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…nizer leaves the runtime Text -> token ids always goes through the renderer — the same path the policy's own prompts take — so OPSD's demo-conditioned scoring prefix is rendered with renderer.render_ids instead of the raw HF chat template. demo_ref_kl now requires a renderer, validated at config time: rendering the scoring prefix differently than the policy's prompts is rejected rather than approximated, which is what the tokenizer fallback was. With the fallback gone the tokenizer has no consumer left in the algorithm runtime, so the constructor flattens to Algorithm(config, policy_pool, renderer) — no context wrapper. Validated: full unit suite, 5-step self_distill GPU smoke exit 0 (prefix rendered via Qwen3Renderer.render_ids against the policy pool). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…the env Sampler An algorithm is credit assignment and loss routing, fused: one mapping from a finalized rollout to per-token (loss component, weight). algo.loss is deleted — echo becomes a proper advantage type (EchoAdvantageConfig / EchoAlgorithm) and every preset delta is now a component-type swap. The bundle's other half, sampling, gets its own runtime object: Sampler owns the generating pool, frozen-source connection, sampling args, and liveness; future replay/branching strategies extend there, not the algorithm. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Group-norm scalars on OPD rollouts were dead weight: ref_kl_loss_fn zeroes the scalar gradient, so they only steered the DPPO mask direction and the zero-advantage filter — which dropped uniform-reward OPD groups despite their full teacher-KL signal. A/B on reverse-text (50 steps) ties with the scalar baseline (0.825 vs 0.828). OPDAlgorithm loses its assign override (advantage stays None, like OPSD); RefKLAdvantageConfig drops group_relative and the dead length_penalty. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
ref_kl_loss_fn drops the dead advantage-sign machinery: scalar advantages are never shipped by ref_kl algorithms, so the sign predicates were uniformly false and the DPPO mask already degenerated to its low side — write the one-sided trust region explicitly and stop reading inputs.advantages (bit-identical for OPD/OPSD; deletes three constant-zero metrics that diluted same-named rl metrics in mixed bins). Config validation now also rejects frozen sampling.source for the ref_kl-family advantages — ref_kl consumes the same live-policy sampling logprobs as rl (importance ratio, trust region). Dead code: Algorithm.name, the score_batch wrapper (strategy-era indirection), ALGORITHM_CLASSES re-export, AdvantageOutputs None-entry advantages (no producer since the no-scalar algorithms stopped calling assign_advantages). The orchestrator.advantage shorthand default drops its never-read constructed value. Stale naming and docs: token scorers -> reference scoring, "env algorithm's sampling pool" -> env sampler's pool, model-registry and [orchestrator.models.*] references, rl-mode-batches loss docstring, adv_tau pure-distillation claim, wrong [[orchestrator.filters]] key, and docs/algorithms.md drops the Difficulty Pools / ODF sections — those features were removed in the orchestrator v2 refactor (#2639). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ction
# Conflicts:
#   skills/configs/SKILL.md
#   src/prime_rl/trainer/rl/data.py
#   tests/unit/test_configs.py
MaxRL (arXiv:2602.02710) approximates maximum-likelihood training of the implicit success probability instead of pass@1: the policy gradient averaged over successful rollouts only is unbiased for the order-group_size truncation of the ML objective's pass@k expansion. In estimator form that is one change to GRPO — normalize the centered group reward by the group MEAN instead of the standard deviation, upweighting low-pass-rate examples like 1/p. group_size becomes the truncation order (REINFORCE at 1, exact ML in the limit). New 'max_rl' advantage type + preset: MaxRLAdvantageConfig, max_rl_advantage_fn, MaxRLAlgorithm, a reverse-text debug config, docs rows, and a unit test for the estimator. Groups with zero mean reward carry zero advantages (the paper's no-success convention — the zero-advantage filter drops them). Everything else rides the existing GRPO path: policy sampling (enforced by the rl-component guard), rl loss component, group barrier. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
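The estimator change reads as follows in a minimal sketch, assuming non-negative rewards (for example binary pass/fail). The function name is hypothetical and is not the PR's `max_rl_advantage_fn`.

```python
def max_rl_advantages_sketch(rewards: list[float]) -> list[float]:
    """Center the group rewards, then divide by the group MEAN instead of the std."""
    mean = sum(rewards) / len(rewards)
    if mean == 0.0:
        # No-success convention: the whole group carries zero advantage.
        return [0.0] * len(rewards)
    return [(r - mean) / mean for r in rewards]
```

With binary rewards and pass rate p, a successful rollout gets (1 - p) / p, so hard prompts with a low pass rate are upweighted roughly like 1/p, while a group where every rollout succeeds gets all-zero advantages.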
…self
A preset name with explicit advantage/sampling keys is now a parse-time
error instead of a merge: a modified preset is not the preset, so the
config must state what it actually runs. Only the model/teacher
shorthand may accompany a name (the distillation presets are incomplete
without an endpoint by design). Assembly stays cheap — presets are thin
deltas, so a variant costs one explicit 'type' key.
Deletes the merge machinery: _merge_preset_delta and the
discriminator-aware typeless override (advantage = { max_concurrent }
under opd silently inheriting ref_kl) are gone; the preset validator
inserts components, never merges. 'name' becomes write-only input sugar
(excluded from dumps, like 'model') so resolved configs round-trip as
plain component assemblies; the orchestrator startup log now reports
advantage types instead of preset labels. The advantage shorthand gets
a preset-aware error instead of silently relabeling an inherited
preset.
echo.toml's lambda override becomes the assembled spelling, and the
debug configs spell algo as an [orchestrator.algo] section instead of
an inline table.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Echo's selection surface generalizes from the observations="tool"|"all" binary to a role table: each env-provided message role (system / user / assistant / tool) trains at its own alpha, selected via the renderer's per-token attribution. An optional filter hook (import_path + kwargs, matching the custom advantage/loss precedent) narrows the selection per rollout with one keep-mask per trajectory step.
- completion_obs_mask (bool) -> completion_obs_weights (float): the per-token weight carries its role's alpha, so stamping folds it into ce_weights directly and stamp_loss_routing drops the scalar observation_weight parameter. Orchestrator-internal as before.
- The echo preset is unchanged in meaning: tool-response bodies at 0.1. Setting any role replaces the whole table.
- Echo now always requires the renderer (role selection needs attribution); the blanket "all" mode is gone, so assemble the roles you want instead.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
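A sketch of how a role table could turn per-token attribution into ce weights. The attribution format (one role name per completion token, `None` for policy-generated tokens) and the function name are assumptions for illustration; the optional filter hook is left out.

```python
from typing import Optional


def observation_ce_weights(
    token_roles: list[Optional[str]],
    role_alphas: dict[str, float],
) -> list[float]:
    """Map each token's attributed role to that role's alpha (0.0 if unselected)."""
    weights: list[float] = []
    for role in token_roles:
        if role is None:
            # Policy-generated tokens are not observations; they get no ce weight.
            weights.append(0.0)
        else:
            weights.append(role_alphas.get(role, 0.0))
    return weights
```

For example, the table `{"tool": 0.1}` trains only tool-response tokens at 0.1, which matches the preset meaning stated above; adding `"user": 0.05` would select user tokens as well.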
…eted The advantage type now names the algorithm — group_norm -> grpo, ref_kl -> opd, demo_ref_kl -> opsd, supervised -> sft (config classes renamed to match) — and each type's class defaults are its vetted setting, so 'type = "opd"' with a teacher IS on-policy distillation. With type-plus-defaults equal to the preset for every algorithm, the preset layer had nothing left to do: AlgorithmName, _PRESETS, the name field, and the atomicity guard are deleted. The model/teacher shorthand survives and now folds by the type's own declarations (model_role -> advantage.model for opd/opsd; source_role -> sampling.source for sft). sampling.source loses its None state (it existed only for preset resolution); sft without a frozen source is rejected at validation — CE on the policy's own tokens was never the vetted meaning. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- docs/algorithms.md: ref_kl -> opd in the advantage tables, add the missing max_rl/reward/custom rows, fix the frozen-model base_url and custom-Algorithm wording, make the length_penalty example self-sufficient, drop the Per-Env Advantage section (duplicate of Per-Env Algorithms)
- configs/debug/algorithms: README gains max_rl and uses the real type names, comments lose leftover preset vocabulary, one wandb project for the whole folder
- docs/training.md / skills/configs/SKILL.md: complete type lists and a union example that parses on its own
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
A fully truncated distillation sample (prompt >= trainer seq_len) loses all its nonzero ce/ref_kl tokens to prepare_sample's truncation while its stamped all-zero rl_weights suppress the rl branch; with every component empty, compute_loss returned the Python float 0.0 and loss.backward() crashed. Seed the rl accumulator with a graph-attached zero so the degenerate batch trains as a zero-gradient no-op (main's behavior) and every rank still runs backward, keeping FSDP collectives in sync. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- ref_kl_loss_fn emitted the same trust-region metric keys as the rl loss fn into one shared dict, so mixed batches (per-env algorithms) averaged two different trust-region definitions into one wandb series. Namespaced as ref_kl/*; the wandb noise filter gets matching prefixes and the ref_kl value series is unchanged.
- prepare_sample: a sample with rl member tokens but no advantage now raises instead of silently training with advantage 0.0. The orchestrator always stamps a scalar, so a missing one is a producer bug (ce/ref_kl-only samples still default to 0.0 legitimately).
- pad_micro_batch: padding fills every weight stream with 0.0 instead of the pack-boundary defaults; padding is loss-masked so this is training-equivalent, and padded pure-ce batches now read as rl-empty in token export, which keys off nonzero weights.
- test_prepare_batch_packs_mixed_components: sorted() multiset checks replaced with exact positional asserts in both pack orders, pinning STREAM_FILL backfill alignment across the bin boundary.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
algo/algorithm.py (all eight classes in one file) splits into one module per algorithm — grpo, echo, max_rl, opd, opsd, sft, reward, custom — plus base.py holding Algorithm, connect_frozen_pool, and score_train_batch. The dispatch table and build_algorithm move to the package __init__; the shared group-norm assign moves to advantage.py as assign_group_norm. No behavior change; external imports all go through the package and are unchanged. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Echo's selection state (echo_roles / echo_filter_fn on the base class) was pipeline-visible configuration for behavior that lived in trajectories.py. Replace it with Algorithm.observation_weights(output) — one per-token ce-weight list per trajectory step; None (the default) masks all observations out. EchoAlgorithm owns the whole selection (role table, attribution lookup, user filter + its shape validation); interleave_rollout just validates alignment and slices each extension span; the train sink calls the hook and passes data. A custom algorithm can now implement any observation-token policy by overriding one method instead of forking interleave_rollout. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… phase entry points The three hooks are stages of one compilation (rollouts in, component weight streams out), but the sink still hand-composed phase 1 (observation_weights + interleave_rollout). Algorithm.build_samples now drives it, completing the pattern: the pipeline hands the algorithm its rollout / group / batch (build_samples / finalize_group / score) and never composes algorithm internals; subclasses override only the hooks. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…cher] The alias existed; the shipped configs still used the role-neutral 'model' spelling. Configs should say what they mean — every teacher-meaning table flips to 'teacher' (per Mika's review). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
There is never a scalar advantage anywhere in the pipeline:
- Wire: TrainingSample.advantage + token_advantages collapse into one advantages: list[float] | None stream (the fourth stream next to the rl/ce/ref_kl weights). None = no rl credit assigned (opd/opsd), legal only for samples without live rl member tokens; prepare_sample keeps the producer-bug tripwire.
- TrainRollout carries the same single field, aligned to its samples' completion tokens (concatenated in step order); rollout dumps keep a scalar view (mean) for logging.
- Advantage-fn API: AdvantageOutputs deleted. Functions return list[list[float]] aligned to inputs.completion_lengths, with inputs.broadcast(...) spreading uniform group credit; GRPO's reward-minus-mean is internal math the fn broadcasts on the way out.
- stamp_advantages (replaces spread_token_advantages) pads prompt positions with 0.0 and slices the stream across samples.
- ZeroAdvantageFilter checks for all-zero streams; logged advantage distributions use per-rollout means.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
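The pad-and-slice step can be sketched like this. It is a simplified stand-in rather than the PR's stamp_advantages: `SampleSketch` and `stamp_advantages_sketch` are hypothetical names, and a sample is reduced to its prompt and completion lengths.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SampleSketch:
    prompt_len: int
    completion_len: int
    advantages: Optional[list[float]] = None  # one value per token once stamped


def stamp_advantages_sketch(samples: list[SampleSketch], stream: list[float]) -> None:
    """Slice a rollout-level completion-token stream across its samples."""
    expected = sum(s.completion_len for s in samples)
    if len(stream) != expected:
        # Misaligned lengths are rejected loudly rather than truncated or padded.
        raise ValueError(f"stream has {len(stream)} values, expected {expected}")
    offset = 0
    for sample in samples:
        chunk = stream[offset : offset + sample.completion_len]
        # Prompt positions carry no credit, so they are padded with 0.0.
        sample.advantages = [0.0] * sample.prompt_len + chunk
        offset += sample.completion_len
```

Each sample ends up with one advantage per token (prompt plus completion), which keeps the stream aligned with the other per-token arrays on the wire.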
… frozen-source runs The renderer object and the renderer client were conflated: frozen-source runs (sft) had renderer=None forced on them because the renderer-client sampling path doesn't apply to an external endpoint — which also denied them the renderer as the canonical messages -> ids path, so sft backfill fell back to apply_chat_template with the longest-common-prefix repair. Decoupled: the renderer object exists whenever configured (sft backfill tokenizes the teacher's messages into the student's token space through it); the renderer-client sampling path is wired onto the policy pool only when a train env actually samples from the policy. The _force_no_renderer_without_policy_sampling validator is gone; pool_size is rejected when the sampling path never runs. The sft configs now set the student's renderer explicitly. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Two statements of the same boundary: - Constructor: Algorithm(advantage, policy_pool, renderer) — the component it interprets plus the two host-owned resources. The bundle dissolves at construction (build_algorithm dispatches on it, then passes only the advantage; the sibling Sampler already took only sampling). Nothing in the runtime reads config.sampling — sampling provenance reaches algorithms on the rollouts themselves. - setup() moves into the subclasses that connect something: the base keeps the no-op hook plus connect() (resolving "policy" to the host's pool untracked, frozen references to fresh tracked pools); opd and opsd own self.teacher_pool under the role name they declare. The base class no longer reaches into subclass config fields, and the algorithms without references stop carrying a dead reference_pool. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The class surface is now exactly the override surface: two declarations (action_loss_type, model_role), lifecycle (setup/connect), and three hooks named for what they produce — observation_weights / assign_advantages (was assign) / score. The drivers leave the class and join score_train_batch as module-level phase functions the pipeline calls: build_samples(algorithm, ...) per arrival, finalize_group( algorithm, ...) per group, score_train_batch(...) per batch ship. advantage.py's assign_advantages helper becomes apply_advantage_fn, freeing the name for the hook and saying what it does: run an advantage function over one group and write the streams. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
assign_advantages runs before filtering (filters read the streams); query_references (was score) runs after, on survivors only. The third hook is gone: observation_weights leaked interleave's step coordinates into the API and its arrival timing was convenience, not constraint. Interleaving now records observation-token provenance generically — obs_spans on each sample map merged completion positions back to trajectory-step coordinates ([completion_start, step_idx, step_prompt_start, length]), the provenance-completing sibling of completion_mask. Echo consumes the spans at group time inside its assign_advantages: role attribution looked up lazily from rollout.raw, user filter applied, ce weights written; everything echo-specific now lives in echo.py. Recording the merge's decision (rather than handing the algorithm in early or re-deriving the walk) keeps agreement with interleave structural instead of disciplinary. Sample construction is pure pipeline again: build_samples is gone and the sink calls interleave_rollout directly; score_train_batch is renamed query_batch_references to match its hook. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…bs_weights dies The intermediate field was a message passed between two halves of finalize_group through mutable state on the wire struct. Echo now writes sample.ce_weights itself (the ce component's membership IS its weights — the trainer never ANDs them with the loss mask), the same way rlcsd writes sample.advantages directly. stamp_loss_routing shrinks to its one job: route action tokens into the declared component via the loss mask, merging with (never clobbering) streams an algorithm already wrote. Samples where echo selects nothing ship no ce stream at all. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… pair Advantage functions receive list[TrainRollout] — the same objects the hooks see — instead of the AdvantageInputs side-struct, which is deleted. Its two jobs were already subsumed: completion lengths derive from rollout.samples (broadcast(rollouts, values) is now a module helper), and its extensibility role belongs to the rollout itself per the provenance principle. A custom advantage fn is now exactly the assign_advantages hook body without the class. The ship-phase driver query_batch_references becomes finalize_batch, pairing with finalize_group: drivers are named by their barrier, hooks by their action. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The algorithm surface becomes three hooks, one per pipeline barrier, each handed a RolloutView (a writable handle exposing only what is valid at its stage: raw, samples, reward, assign_advantages; never not-yet-assigned credit or pipeline-internal lifecycle fields):
- score_rollout(rollout): one rollout, on arrival; rollout-local credit (reward) or observation ce weights (echo, via obs_spans). No siblings.
- score_group(group): the cohort, before filtering; group-relative credit (grpo / max_rl / sft / custom).
- score_batch(batch): survivors, after filtering, async; the only stage with model access (opd / opsd teacher prefills).
Scope and timing are one ladder (each wider scope is unlocked by a later barrier), so model access naturally attaches to the batch stage where the I/O batches. This homes the two cases the old two-hook API mislocated: reward was forced through the group hook despite being rollout-local, and echo's observation weighting was a loop inside the group hook; both are now their honest stage.
RolloutView (orchestrator/types.py) wraps TrainRollout and folds the old broadcast helper into assign_advantages(scalar | list): a scalar broadcasts over the rollout's completion tokens, a list is taken per-token. Advantages stay per-token everywhere stored or shipped; the scalar is only a write convenience. advantage fns now take list[RolloutView] and return per-rollout scalars (default/max_rl) or per-token lists (custom); broadcast is deleted.
The pipeline drives the hooks through three module-level phase functions (finalize_rollout in process_rollout, finalize_group, finalize_batch); it never branches on algorithm type.
490 unit tests; grpo + echo (multi-turn) GPU smokes green across the rollout and group stages.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
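The scalar-or-list write convenience can be sketched as below. `RolloutViewSketch` is a hypothetical stand-in that tracks only the completion-token count; the real RolloutView wraps a TrainRollout and exposes more than this.

```python
from typing import Optional, Union


class RolloutViewSketch:
    """Toy view: stores advantages per completion token, however they are written."""

    def __init__(self, num_completion_tokens: int) -> None:
        self.num_completion_tokens = num_completion_tokens
        self.advantages: Optional[list[float]] = None

    def assign_advantages(self, value: Union[float, list[float]]) -> None:
        if isinstance(value, (int, float)):
            # A scalar broadcasts over every completion token.
            self.advantages = [float(value)] * self.num_completion_tokens
            return
        if len(value) != self.num_completion_tokens:
            raise ValueError("per-token advantages must match the completion length")
        # A list is taken per token, unchanged.
        self.advantages = [float(v) for v in value]
```

Either way the stored form is per-token, so downstream code (filters, wire stamping) never needs a scalar special case.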
score_rollout / score_group / score_batch are now all `async def`, and the module-level drivers (finalize_rollout, finalize_group) await them. score_batch was already async (teacher prefills); making the other two async is essentially free and lets any stage do I/O — a process reward model at arrival, or a judge at group time whose signal a pre-batch filter then reads. A hook that only does advantage math simply never awaits. TrainSink.process_group becomes async to await finalize_group (add / process_rollout were already async). The per-algorithm overrides (reward.score_rollout; grpo/max_rl/sft/custom.score_group; echo.score_rollout) gain the `async` keyword; their bodies are unchanged. Unit tests wrap the now-coroutine hook calls in asyncio.run(). 28 algorithm unit tests green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
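As a rough illustration of why making every hook `async def` costs little, the sketch below shows one hook that only does arithmetic and one that awaits I/O, both driven by the same awaiting driver. The class names are hypothetical, and the hooks take plain reward lists here instead of RolloutViews to keep the example self-contained.

```python
import asyncio


class SyncMathAlgorithm:
    """Hypothetical hook that only does advantage math: async, but never awaits."""

    async def score_group(self, rewards: list[float]) -> list[float]:
        mean = sum(rewards) / len(rewards)
        return [r - mean for r in rewards]


class JudgeAlgorithm:
    """Hypothetical hook that does I/O at group time (simulated with sleep)."""

    async def score_group(self, rewards: list[float]) -> list[float]:
        await asyncio.sleep(0)  # placeholder for a real judge call
        return [r * 2.0 for r in rewards]


async def finalize_group(algorithm, rewards: list[float]) -> list[float]:
    # The driver awaits the hook and never branches on the algorithm type.
    return await algorithm.score_group(rewards)


# Tests wrap the coroutine in asyncio.run(), as the commit describes.
math_result = asyncio.run(finalize_group(SyncMathAlgorithm(), [1.0, 0.0, 1.0, 0.0]))
judge_result = asyncio.run(finalize_group(JudgeAlgorithm(), [1.0, 0.0]))
```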
Teacher-sourced SFT distillation rolls the teacher out message-based, then backfills the transcript into the student's token space for the CE target. That backfill must be faithful to the student's actual chat template, because the student is sampled and parsed in its own format at inference. The reverse-text SFT configs set the renderer to `qwen3`, which reimplements the stock Qwen format and injects an empty `<think></think>` block — but the student (Qwen3-0.6B-Reverse-Text-SFT) ships a custom template that emits no such block. The injected tokens corrupt the distillation target and slow convergence (~step 13 vs main's ~step 8). Switch the backfill renderer to `default` (DefaultRenderer), which wraps the student tokenizer's apply_chat_template and is faithful to any custom template — matching main, where the renderer is forced off for teacher-SFT so backfill goes straight through apply_chat_template. In a pure-SFT run orchestrator.renderer is used only for backfill (the student never samples; the teacher does, message-based; eval is server-side), so this is the complete fix. Applies to all four teacher-SFT configs: the reverse_text_rl_sft CI smoke and the three debug sft_distill variants. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The sft/opd smokes ran only 5 steps and asserted eval reward rose between the first and last eval — too few steps for distillation to surface (~step 13) and dominated by truncation noise on 16 examples, making the test flaky. Bump max_steps to 20 and replace the endpoint-compare with check_final_eval_reward_above, which asserts the final eval reward clears a threshold (0.5): a stable signal that the run actually converged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Reconcile main's #2723 (sequence-packing rewrite) and #2798 (IPO loss) with the per-token-stream component architecture. The component model is the spine; main's two feature PRs fold into it:

- batch.py: adopt main's bin-based packer (_MicroBatchBin, FFD, FLOP-aware balanced_partition) and graft the component weight streams (rl/ce/ref_kl + STREAM_FILL backfill) and ref_logprobs into prepare_sample / _materialize_bin / pad_micro_batch. The training_mode packing gate is dropped: loss routing is per token, so heterogeneous samples pack together freely.
- loss.py: port ipo_loss_fn into the component model (IPOLossConfig becomes an rl-loss option in setup_rl_loss_fn, with the component loss_weights applied); main's training_mode dispatch (setup_loss_fns / opd / sft) is dropped.
- transport/types.py: MicroBatch keeps ref_logprobs + the weight streams and gains main's sequence_lengths; teacher_logprobs / training_mode gone.
- train.py: split the packed sequence by micro_batch.sequence_lengths (#2723) in the component compute_loss call.
- test_batch.py: keep main's packer/balance tests + the PR's stream tests; drop the training_mode-packing test (replaced by the mixed-component packing test); identify real/padding batches by content, since balanced_partition reorders.

176 CPU unit tests pass; ruff clean; trainer modules import.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds the MAI linear length penalty (from #2702) to the GRPO-family algorithms on top of the algorithm abstraction. A new `linear` arm joins the existing `tokens` / `turns` length penalties, plus an optional `length_weighted_baseline` on GRPO.

The linear penalty is modeled as a *separate additive advantage*: each reward is reduced by `coef * pass_rate * (completion tokens / seq_len)` before centering, which, because centering is linear, is identical to summing a standalone, group-centered penalty advantage onto plain GRPO:

    center(reward - penalty) = center(reward) + center(-penalty)

So `advantage.py` exposes pure, cohesive functions (`grpo_advantage`, `length_penalty_advantage`, `efficiency_shaping_advantage`) and `GRPOAlgorithm.score_group` composes GRPO + linear penalty by summation instead of folding extra knobs into one shared function. `tokens`/`turns` stay non-additive shaping; `length_weighted_baseline` stays a GRPO baseline choice.

`max_seq_len` (the penalty denominator, = orchestrator.seq_len) is injected onto the Algorithm by `build_algorithm` and threaded from the orchestrator through TrainEnvs, so it touches only the GRPO path rather than every algorithm's constructor.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
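The linearity identity above can be checked numerically with a small sketch. The helper names `center` and `linear_penalty` are illustrative, not the functions in `advantage.py`, and the group values are random placeholders; the comparison uses a floating-point tolerance and does not claim bit-for-bit equality.

```python
import random


def center(values: list[float]) -> list[float]:
    """Subtract the plain group mean (the baseline) from every value."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def linear_penalty(rewards, lengths, coef, seq_len):
    """coef * pass_rate * (completion tokens / seq_len), per rollout."""
    pass_rate = sum(rewards) / len(rewards)
    return [coef * pass_rate * (n / seq_len) for n in lengths]


random.seed(0)
rewards = [float(random.random() < 0.5) for _ in range(8)]
lengths = [random.randint(1, 2048) for _ in range(8)]
penalty = linear_penalty(rewards, lengths, coef=0.1, seq_len=2048)

# Folded: penalize the reward first, then center once.
folded = center([r - p for r, p in zip(rewards, penalty)])
# Summed: center reward and negative penalty separately, then add.
summed = [g + q for g, q in zip(center(rewards), center([-p for p in penalty]))]

max_gap = max(abs(a - b) for a, b in zip(folded, summed))
```

The two paths agree up to rounding because subtracting a mean is a linear map, so it distributes over the sum of reward and negative penalty.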
…th-weighted baseline The summed decomposition is exact only when both terms use the same baseline operator. length_penalty_advantage centered the penalty by the plain mean unconditionally, so combining the linear penalty with length_weighted_baseline diverged from #2702 (which length-weight-centered reward-minus-penalty). Thread length_weighted_baseline into the penalty term so it centers the same way as the GRPO term. Verified numerically against a faithful #2702 reimplementation: all four gate_by_correctness × length_weighted_baseline combinations are now bit-for-bit identical. Adds a regression test for the combined path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
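A minimal sketch of the failure this commit fixes: the sum matches the folded form only when the reward term and the penalty term are centered by the same baseline operator. The two centering helpers and the sample numbers are illustrative assumptions, not code or data from the PR.

```python
def center_plain(values, lengths):
    """Center by the plain mean (lengths accepted only for a uniform signature)."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def center_length_weighted(values, lengths):
    """Center by sum(len_i * value_i) / sum(len_i)."""
    baseline = sum(n * v for n, v in zip(lengths, values)) / sum(lengths)
    return [v - baseline for v in values]


rewards = [1.0, 0.0, 1.0, 1.0]
lengths = [100, 400, 200, 300]
penalty = [0.01, 0.04, 0.02, 0.03]  # placeholder per-rollout penalties
shaped = [r - p for r, p in zip(rewards, penalty)]
neg_penalty = [-p for p in penalty]

# Reference: length-weight-center the penalized reward in one pass.
folded = center_length_weighted(shaped, lengths)

# Same operator on both terms: matches the folded form.
matched = [
    a + b
    for a, b in zip(
        center_length_weighted(rewards, lengths),
        center_length_weighted(neg_penalty, lengths),
    )
]
# Mixed operators (penalty centered by the plain mean): diverges.
mixed = [
    a + b
    for a, b in zip(
        center_length_weighted(rewards, lengths),
        center_plain(neg_penalty, lengths),
    )
]

matched_gap = max(abs(a - b) for a, b in zip(folded, matched))
mixed_gap = max(abs(a - b) for a, b in zip(folded, mixed))
```

With these numbers the mixed path is off by a constant equal to the difference between the two penalty baselines, which is why the baseline choice has to be threaded into the penalty term.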
Stacked on #2746 (base: `feat/algorithm-abstraction`).

What

Brings the MAI linear length penalty (from #2702) into the algorithm abstraction as a third length-penalty arm on the GRPO-family algorithms, plus an optional length-weighted baseline. No new algorithm type: length penalties stay "just configs" under `[orchestrator.advantage.length_penalty]`.

- `type = "linear"` (`LinearLengthPenaltyConfig`): subtracts `coef * pass_rate * (completion tokens / orchestrator.seq_len)` from each reward before the baseline, where `pass_rate` is the group's mean reward, so reliably-solved problems get the strongest concision pressure and never-solved groups get none. Optional `gate_by_correctness` restricts it to correct rollouts (reward == 1).
- `length_weighted_baseline` on GRPO: use `sum(len_i · reward_i) / sum(len_i)` as the baseline instead of the plain mean.

The existing `tokens`/`turns` efficiency-shaping arms are unchanged.

The penalty is a summed advantage, not a folded reward

Because centering is linear:

    center(reward - penalty) = center(reward) + center(-penalty)

So the linear penalty is modeled as a standalone additive advantage that sums onto plain GRPO, rather than threading penalty/baseline knobs through one shared group-norm function.

`advantage.py` now exposes three pure, cohesive functions:

- `grpo_advantage(group, length_weighted_baseline)`: plain GRPO (owns the baseline choice).
- `length_penalty_advantage(group, config, max_seq_len, length_weighted_baseline)`: the group-centered negative penalty `−(pᵢ − baseline)` (owns `max_seq_len`).
- `efficiency_shaping_advantage(group, config)`: `tokens`/`turns` (non-additive; replaces the baseline).

`GRPOAlgorithm.score_group` composes GRPO + linear penalty by summation. `tokens`/`turns` remain non-additive shaping. The decomposition is exact only when both terms use the same baseline operator, so `length_weighted_baseline` is threaded into the penalty term too (it centers by the plain or the token-length-weighted mean to match the GRPO term).

Wiring

`max_seq_len` (the penalty denominator) is injected onto the `Algorithm` by `build_algorithm` and threaded from the orchestrator (`config.seq_len`) through `TrainEnvs`, so it touches only the GRPO path rather than every algorithm's constructor.

Verification

Numerically validated against a faithful reimplementation of #2702's original folded `default_advantage_fn` over 2000 random groups per case: all four `gate_by_correctness` × `length_weighted_baseline` combinations are bit-for-bit identical. (An earlier cut centered the penalty by the plain mean unconditionally and diverged from #2702 when the linear penalty met `length_weighted_baseline`; the fix threads the baseline operator into the penalty term. See the second commit.)

`uv run pytest tests/unit/orchestrator -q` → all passing. New unit tests cover pass-rate scaling, the GRPO + penalty sum equivalence (vs. folding penalty into reward, plain and length-weighted), correctness gating, zero pass-rate no-op, and the end-to-end `GRPOAlgorithm.score_group` path including `max_seq_len` injection. Lint clean.

🤖 Generated with Claude Code
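The pass-rate scaling, correctness gating, and zero pass-rate no-op covered by the new tests can be sketched as follows. `linear_length_penalty` is a hypothetical helper written for illustration, not the PR's `length_penalty_advantage`; in particular, treating gating as "incorrect rollouts receive zero penalty" is an assumption drawn from the description of `gate_by_correctness`.

```python
def linear_length_penalty(rewards, lengths, coef, seq_len, gate_by_correctness=False):
    """Per-rollout penalty: coef * pass_rate * (completion tokens / seq_len)."""
    pass_rate = sum(rewards) / len(rewards)  # the group's mean reward
    penalties = []
    for reward, n_tokens in zip(rewards, lengths):
        # Assumed gating rule: only rollouts with reward == 1 are penalized.
        gated_out = gate_by_correctness and reward != 1.0
        penalties.append(0.0 if gated_out else coef * pass_rate * (n_tokens / seq_len))
    return penalties


lengths = [512, 1024, 2048, 256]

# Never-solved group: pass_rate is 0, so the penalty is a no-op.
never_solved = linear_length_penalty([0.0] * 4, lengths, coef=0.2, seq_len=2048)

# Half-solved group: every rollout is penalized in proportion to its length.
half_solved = linear_length_penalty([1.0, 0.0, 1.0, 0.0], lengths, coef=0.2, seq_len=2048)

# Gated: the same group, but only the correct rollouts carry a penalty.
gated = linear_length_penalty(
    [1.0, 0.0, 1.0, 0.0], lengths, coef=0.2, seq_len=2048, gate_by_correctness=True
)
```

These penalties would still need to be negated and group-centered with the chosen baseline before being summed onto the GRPO advantage.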