THUDM · jvmncs · Jun 4, 2026 · Jun 4, 2026 · Jun 4, 2026 · Jun 11, 2026
diff --git a/docs/en/advanced/delta-weight-sync.md b/docs/en/advanced/delta-weight-sync.md
@@ -4,6 +4,7 @@
 - [Quick Start](#quick-start)
 - [Mode vs Transport](#mode-vs-transport)
 - [How It Works](#how-it-works)
+- [Publish-Only Disk Delta](#publish-only-disk-delta)
 - [Encoding Choice](#encoding-choice)
 - [Why Not Colocated](#why-not-colocated)
 
@@ -92,6 +93,35 @@ For both transports, the receiver ends up calling the same `_apply_delta_payload
 
 Selective overwrite has no arithmetic — the receiver writes the trainer's exact bytes at changed positions — so it's lossless by construction and there's no notion of drift to fight with periodic base re-syncs.
 
+## Publish-Only Disk Delta
+
+The disk path above pushes each version to known engines: rank 0 calls every engine's `update_weights_from_disk(load_format="delta")` and the sync ends when all engines acknowledge. That requires stable engine handles. When the serving side is an elastic fleet that consumes published versions on its own schedule — e.g. behind an [opaque HTTP rollout endpoint](external-rollout-engines.md#opaque-http-rollout-endpoint) — invert the direction with publish-only mode:
+
+```bash
+--update-weight-mode delta
+--update-weight-transport disk
+--update-weight-delta-publish-only
+--custom-delta-publish-path my_pkg.publish.publish_delta
+--update-weight-delta-keep-files
+```
+
+Instead of firing per-engine RPCs, rank 0 invokes your publish hook once per sync, after every delta file has been written and the optional `--custom-delta-pre-push-path` hook has committed:
+
+```python
+def publish_delta(args, version_dir: str, files: list[str], weight_version: str, rollout_engines) -> list | None:
+    ...  # e.g. upload version_dir to object storage, then announce weight_version
+```
+
+If the hook returns a list of Ray ObjectRefs, slime awaits them before the version counts as settled. Publish-only mode differs from the direct disk path in the following ways:
+
+- **One complete version per sync.** Direct disk transport publishes at each pass boundary so receivers can overlap apply with later encoding; publish-only defers everything to finalize, so external consumers never observe a partially published version.
+- **Publish wait is configurable.** By default, `--update-weight-delta-publish-wait=next-sync` leaves the dispatched publish in flight across the next training step and settles it at the start of the next sync (or on disconnect). A failed publish therefore surfaces one sync late, on rank 0. Set `--update-weight-delta-publish-wait=sync` when the publish hook should block `update_weights`, for example because it polls an external rollout fleet until enough replicas report the new version ready.
+- **Engines are left alone.** Generation is not paused, caches are not flushed, and no update RPCs are issued; consumers decide when to pick up a version. If the rollout endpoint supports request-level weight constraints, attach them from a `--custom-rollout-request-hook-path` hook so requests routed to lagging replicas fail and retry before spending rollout compute on unusable samples.
+- **No cleanup.** slime cannot know when consumers finish reading a version, so `--update-weight-delta-keep-files` is required and version-directory lifecycle belongs to you (e.g. the publish hook can prune old versions once uploaded).
+- **No-op versions still publish.** If a sync produces no changed bytes, the hook is still called with an empty file list so consumers' version counters can advance.
+
+`--update-weight-delta-root` optionally names a root directory for publish-side metadata; it defaults to the parent of `--update-weight-disk-dir` and is passed through to hooks via `args`.
+
 ## Encoding Choice
 
 `--update-weight-encoding` picks how positions are packed. All three share the same on-wire layout (`__positions__` uint8 blob + `__values__` tensor + per-param manifest); decoder dispatches on the metadata.

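The publish hook contract described in the Publish-Only Disk Delta section above can be illustrated with a minimal sketch. It is not slime's implementation: `publish_root` is a hypothetical attribute standing in for an object-storage target, the `LATEST.json` manifest is an invented announcement format, and the code assumes `files` entries are either absolute paths or paths relative to `version_dir`. It copies the version first and announces it last, so a consumer that polls the manifest never sees a partially published version, and it still announces when the file list is empty.

```python
import json
import os
import shutil
import tempfile


def publish_delta(args, version_dir, files, weight_version, rollout_engines):
    """Copy one complete delta version to a staging root, then announce it."""
    # Hypothetical attribute (not a slime flag); falls back to a temp directory.
    publish_root = getattr(args, "publish_root", None) or os.path.join(
        tempfile.gettempdir(), "delta-published"
    )
    dest = os.path.join(publish_root, str(weight_version))
    os.makedirs(dest, exist_ok=True)

    # Assumption: entries are absolute or relative to version_dir.
    for name in files:
        src = name if os.path.isabs(name) else os.path.join(version_dir, name)
        shutil.copy2(src, os.path.join(dest, os.path.basename(name)))

    # Announce only after every file is in place. os.replace swaps the
    # manifest atomically, so readers see either the old or the new version.
    manifest = {
        "weight_version": weight_version,
        "files": sorted(os.path.basename(name) for name in files),
    }
    tmp_path = os.path.join(publish_root, "LATEST.json.tmp")
    with open(tmp_path, "w") as f:
        json.dump(manifest, f)
    os.replace(tmp_path, os.path.join(publish_root, "LATEST.json"))

    # No Ray ObjectRefs to await: the publish finished synchronously.
    return None
```

Because this sketch does its work inline and returns `None`, there is nothing left in flight for slime to settle; a hook that dispatches remote upload tasks would return their ObjectRefs instead.
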
diff --git a/docs/en/advanced/external-rollout-engines.md b/docs/en/advanced/external-rollout-engines.md
@@ -2,13 +2,15 @@
 
 An external rollout engine is an SGLang engine that is not launched by the slime training job. Another system deploys and owns the engine lifecycle; slime connects to those engines during training, registers a router, and syncs updated actor weights when needed.
 
-This page is a roadmap. Use it to decide when to use `--rollout-external-engine-addrs`, when to stay with `--sglang-config`, and which weight-update path to pick for external deployments.
+This page is a roadmap. Use it to decide when to use `--rollout-external-engine-addrs`, when to use `--rollout-http-endpoint-url`, when to stay with `--sglang-config`, and which weight-update path to pick for external deployments.
 
 ## Where To Start
 
 | Goal | Recommended entry point |
 | :--- | :--- |
 | Engines are already launched externally and slime should only connect for rollout | `--rollout-external-engine-addrs` |
+| Rollout serving is an elastic fleet behind a single HTTP URL, with no stable per-engine handles | `--rollout-http-endpoint-url` |
+| The serving side pulls published weight versions instead of receiving direct update RPCs | `--update-weight-delta-publish-only`, see [Publish-Only Disk Delta](delta-weight-sync.md#publish-only-disk-delta) |
 | slime should still launch engines, but you need PD disaggregation, multi-model serving, heterogeneous server groups, or per-group overrides | [SGLang Config](sglang-config.md) |
 | Trainer and external engines can form an NCCL group | Default `--update-weight-mode full --update-weight-transport nccl` |
 | Trainer and external engines cannot form an NCCL group, but can see the same filesystem path | `--update-weight-mode full --update-weight-transport disk` |
@@ -38,6 +40,27 @@ slime queries each engine's `/server_info` or `/get_server_info` endpoint and in
 
 This path fits deployments where serving is owned outside the training job: a separate inference cluster, a separate Ray cluster, manually warmed SGLang engines, or a rollout service managed by another orchestrator.
 
+## Opaque HTTP Rollout Endpoint
+
+`--rollout-external-engine-addrs` still assumes SGLang engines with stable addresses: slime queries `/server_info` per engine, registers each one with a router, and pushes weight updates to known engine handles. Some deployments cannot offer that contract — for example a serverless or autoscaled inference fleet behind one URL, where workers come and go and no worker-management API is exposed. For those, point slime at the endpoint directly:
+
+```bash
+python train.py \
+  --rollout-http-endpoint-url https://rollout.example.com \
+  ...
+```
+
+In this mode slime launches no engines and no router, and assumes nothing about the endpoint beyond the generation route: rollout requests POST to `{url}/generate`, and `get_model_url(args, ...)` in custom rollout functions resolves to the endpoint as well. No rollout GPUs are allocated in the placement group, `/server_info` is never queried, and slime fault tolerance does not manage the fleet — recovery is the endpoint operator's job. `--rollout-http-endpoint-url` and `--rollout-external-engine-addrs` are mutually exclusive.
+
+Two companion flags adapt the default SGLang rollout to an endpoint that lacks router APIs:
+
+- `--rollout-http-endpoint-abort-strategy {cancel-only,router-workers}`: how `abort` behaves between rollouts. `cancel-only` (the default when an endpoint URL is set) cancels slime's local pending generation tasks without calling the router's worker-list or per-worker abort APIs. `router-workers` keeps the existing router-based abort and remains the default otherwise. Note that `cancel-only` does not collect partial samples, so it does not compose with `--partial-rollout`.
+- `--custom-rollout-request-hook-path`: optional hook called before each default SGLang `/generate` request. Signature: `def hook(args, sample, request) -> None | dict`. The `request` dict contains `url`, `payload`, `headers`, `max_retries`, `retry_sleep`, `rollout_id`, and `evaluation`; mutate it in place or return a dict of updates.
+
+Use the request hook for rollout-endpoint admission control. For example, a hook may attach `"weight_version": {"exact_version": <ready_version>}` or `"weight_version": {"min_required_version": <minimum_version>}` and increase `max_retries`/`retry_sleep`. Those request fields avoid wasted rollout compute when an opaque router sends the request to a replica that has not loaded a usable version yet. They do not define slime's off-policy or staleness semantics; the trainer schedule and loss/correction path still decide which versions are valid.
+
+For weight sync, an elastic fleet usually cannot receive per-engine `update_weights_from_disk` RPCs either. Combine the endpoint with publish-only delta sync, where the trainer publishes each complete weight version through a custom hook and the serving side consumes it on its own schedule — see [Publish-Only Disk Delta](delta-weight-sync.md#publish-only-disk-delta). If request-level minimum-version retry is enough, leave publish-only in its default pipelined mode. If the publish hook polls rollout-fleet status and you want the next rollout dispatch to wait for that readiness threshold, set `--update-weight-delta-publish-wait=sync`.
+
 ## Relationship With `--sglang-config`
 
 `--rollout-external-engine-addrs` and `--sglang-config` are mutually exclusive because they own different boundaries:
@@ -108,8 +131,9 @@ For encoding choices, wire layout, receiver-side selective overwrite, and tuning
 - External engines can use an independent SGLang environment; they do not need the slime or Megatron training environment.
 - Disk transport supports different GPU models or vendors between training and rollout, as long as SGLang supports the target hardware and model format.
 - Disk transport requires trainer and SGLang engines to see the same `--update-weight-disk-dir` path; a path visible only to the trainer is not enough.
-- External engines are not recovered by slime fault tolerance; their lifecycle belongs to the external deployment system.
-- `--sglang-config` and `--rollout-external-engine-addrs` are mutually exclusive.
+- External engines are not recovered by slime fault tolerance; their lifecycle belongs to the external deployment system. The same applies to fleets behind `--rollout-http-endpoint-url`.
+- `--sglang-config` and `--rollout-external-engine-addrs` are mutually exclusive, as are `--rollout-external-engine-addrs` and `--rollout-http-endpoint-url`.
+- An opaque HTTP endpoint only needs to serve the generation route; worker-management APIs are never called. If the fleet cannot accept direct weight-update RPCs, use publish-only delta sync.
 - Delta mode does not support `--colocate`, because colocated sync uses CUDA IPC handles and delta encoding does not reduce the actual transfer.
 
 ## Related Work

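The request hook described in the Opaque HTTP Rollout Endpoint section above can be sketched as follows. This is an illustration under stated assumptions, not code from slime: `min_ready_weight_version` is a hypothetical attribute for however the trainer tracks the minimum acceptable version, the retry values are arbitrary, and the sketch assumes returned updates are merged key by key into `request`, which is why it returns a complete `payload` rather than a partial one.

```python
def weight_version_hook(args, sample, request):
    """Attach a minimum weight version and a more patient retry policy."""
    # Hypothetical attribute (not a slime flag).
    min_version = getattr(args, "min_ready_weight_version", None)
    if min_version is None:
        return None  # leave the request unchanged

    # Copy the payload so the original request dict is not mutated.
    payload = dict(request["payload"])
    payload["weight_version"] = {"min_required_version": min_version}

    # Return updates instead of mutating in place; both styles are allowed
    # by the hook contract. Retry values here are illustrative only.
    return {
        "payload": payload,
        "max_retries": max(request["max_retries"], 10),
        "retry_sleep": max(request["retry_sleep"], 2.0),
    }
```

A hook like this would be passed as an import path to `--custom-rollout-request-hook-path`; whether the endpoint honors the `weight_version` field depends entirely on the serving side.
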
diff --git a/docs/en/get_started/customization.md b/docs/en/get_started/customization.md
@@ -28,6 +28,7 @@ Below is a summary of all available customization interfaces and their purposes.
 | [`--custom-megatron-init-path`](#17-megatron-hooks) | Custom initialization after Megatron setup. |
 | [`--custom-megatron-before-log-prob-hook-path`](#17-megatron-hooks) | Custom logic before log probability computation. |
 | [`--custom-megatron-before-train-step-hook-path`](#17-megatron-hooks) | Custom logic before each training step. |
+| [`--custom-rollout-request-hook-path`](#19-rollout-request-hook---custom-rollout-request-hook-path) | Customize each default SGLang `/generate` request before dispatch. |
 
 ## Agentic workflows through customization interfaces
 
@@ -457,6 +458,25 @@ Stabilize MoE RL training by recording and replaying expert routing decisions to
 | `--use-routing-replay` | Forward-backward routing consistency in training. ([arXiv:2507.18071](https://arxiv.org/abs/2507.18071)) |
 | `--use-rollout-routing-replay` | R3: Replay routing from rollout during training. Supported by slime's default `sglang_router` path. ([arXiv:2510.11370](https://arxiv.org/abs/2510.11370)) |
 
+---
+
+### 19. Rollout Request Hook (`--custom-rollout-request-hook-path`)
+
+**Signature**:
+```python
+def hook(args, sample, request) -> None | dict
+```
+
+**Purpose**: Customize each default SGLang rollout `/generate` request before it
+is sent. `request` contains `url`, `payload`, `headers`, `max_retries`,
+`retry_sleep`, `rollout_id`, and `evaluation`. Mutate it in place or return a
+dict of updates.
+
+This hook is useful for external rollout providers that need request-level
+admission control, for example adding `payload["weight_version"]` so a request
+routed to a lagging replica fails and retries before spending rollout compute
+on unusable samples.
+
 ## Testing Custom Function Paths
 
 slime also provides CPU-only contract tests for customization interfaces. These tests resolve components through import-path strings, so they can validate both built-in hooks and user-defined implementations passed through the same CLI arguments used by training.
@@ -470,7 +490,7 @@ The tests live under `tests/plugin_contracts/` and are grouped by hook shape:
 - `tests/plugin_contracts/test_plugin_path_loading_contracts.py`
   Covers `--eval-function-path`, `--custom-rm-path`, `--dynamic-sampling-filter-path`, `--buffer-filter-path`, `--data-source-path`, `--rollout-sample-filter-path`, and `--rollout-all-samples-process-path`
 - `tests/plugin_contracts/test_plugin_runtime_hook_contracts.py`
-  Covers `--custom-rollout-log-function-path`, `--custom-eval-rollout-log-function-path`, `--custom-reward-post-process-path`, `--custom-convert-samples-to-train-data-path`, and `--rollout-data-postprocess-path`
+  Covers `--custom-rollout-log-function-path`, `--custom-eval-rollout-log-function-path`, `--custom-reward-post-process-path`, `--custom-convert-samples-to-train-data-path`, `--rollout-data-postprocess-path`, and `--custom-rollout-request-hook-path`
 
 Run all customization contract tests locally:
 

diff --git a/docs/zh/advanced/delta-weight-sync.md b/docs/zh/advanced/delta-weight-sync.md
@@ -4,6 +4,7 @@
 - [快速开始](#快速开始)
 - [同步模式与传输方式](#同步模式与传输方式)
 - [工作原理](#工作原理)
+- [Publish-Only 磁盘 Delta](#publish-only-磁盘-delta)
 - [编码选择](#编码选择)
 - [为何不支持 colocated](#为何不支持-colocated)
 
@@ -88,6 +89,35 @@ Delta NCCL 和 delta 磁盘共用同一条发送管线、同一种 wire 布局
 
 选择性覆写没有任何算术运算 —— 接收端在变化位置直接写入训练端的精确字节 —— 因此天然无损，也不存在数值漂移问题，无需周期性 base 同步。
 
+## Publish-Only 磁盘 Delta
+
+上面的磁盘路径把每个版本推送给已知 engine：rank 0 调用每个 engine 的 `update_weights_from_disk(load_format="delta")`，所有 engine 确认后同步才结束。这要求 engine 句柄稳定。当 serving 侧是一个按自己节奏消费已发布版本的弹性集群——例如位于 [opaque HTTP rollout endpoint](external-rollout-engines.md#opaque-http-rollout-endpoint) 之后——可以用 publish-only 模式反转方向：
+
+```bash
+--update-weight-mode delta
+--update-weight-transport disk
+--update-weight-delta-publish-only
+--custom-delta-publish-path my_pkg.publish.publish_delta
+--update-weight-delta-keep-files
+```
+
+rank 0 不再发出 per-engine RPC，而是在每次同步中调用一次你的 publish hook——此时所有 delta 文件已经写完，可选的 `--custom-delta-pre-push-path` hook 也已提交：
+
+```python
+def publish_delta(args, version_dir: str, files: list[str], weight_version: str, rollout_engines) -> list | None:
+    ...  # 例如把 version_dir 上传到对象存储，然后公告 weight_version
+```
+
+返回的 Ray ObjectRef 会在该版本视为完成之前被等待。与直接磁盘路径的行为差异：
+
+- **每次同步发布一个完整版本。** 直接磁盘传输在每个 pass 边界发布，让接收端的 apply 与后续编码重叠；publish-only 把所有发布推迟到 finalize，外部消费者永远不会看到只发布了一半的版本。
+- **发布等待可配置。** 默认 `--update-weight-delta-publish-wait=next-sync` 会让已派发的 publish 在下一个训练 step 期间保持 in flight，并在下一次同步开始时（或 disconnect 时）结算。因此 publish 失败会晚一个同步周期才在 rank 0 上暴露。如果 publish hook 会轮询外部 rollout 集群、并且希望下一次 rollout dispatch 等到足够副本就绪后再开始，可以设置 `--update-weight-delta-publish-wait=sync`。
+- **不打扰 engine。** 不暂停生成、不清空 cache、不发出任何 update RPC；消费者自己决定何时拉取新版本。如果 rollout endpoint 支持请求级权重约束，可以在 `--custom-rollout-request-hook-path` hook 中附加这些约束，让路由到落后副本的请求尽早失败并重试，避免生成不可用样本。
+- **不做清理。** slime 无法知道消费者何时读完一个版本，所以必须加 `--update-weight-delta-keep-files`，版本目录的生命周期由你负责（例如 publish hook 可以在上传完成后清理旧版本）。
+- **空 delta 也会发布。** 如果某次同步没有任何字节变化，hook 仍会以空文件列表被调用，让消费者的版本计数得以推进。
+
+`--update-weight-delta-root` 可选地指定发布侧元数据的根目录；缺省为 `--update-weight-disk-dir` 的父目录，并通过 `args` 透传给 hook。
+
 ## 编码选择
 
 `--update-weight-encoding` 决定位置如何打包。三种编码共用同一种 wire 布局（`__positions__` uint8 块 + `__values__` 张量 + per-param manifest），解码端根据 metadata 分派。