feat(rollouts) external rollouts endpoint with publish-only weight sync #2071

Open: jvmncs wants to merge 5 commits into THUDM:main from modal-projects:jvmncs/rollout-endpoint

Conversation

jvmncs commented Jun 12, 2026

Motivation

slime currently assumes it owns the rollout backend: it launches or registers SGLang engines (or fixed --rollout-external-engine-addrs) and pushes weight updates to each engine via per-engine RPCs. This PR lets slime train against an elastic, externally managed inference fleet behind a single HTTP endpoint — one with no stable per-engine handles and no SGLang-router worker-management APIs, where workers may scale up/down mid-run. Three opt-in pieces compose to support this:

  1. Opaque HTTP rollout endpoint — send /generate requests to a base URL without launching or registering any SGLang workers.
  2. Version-pinned rollout requests — generation payloads can pin an exact weight version, so an elastic fleet only serves samples from the intended policy version.
  3. Publish-only disk delta sync — instead of direct update_weights_from_disk RPCs, the trainer publishes each complete delta version through a custom hook (e.g. to shared storage that the endpoint's workers consume), and the publish overlaps the next training step.

All features are off by default; existing behavior is unchanged when the new flags are unset.

Modifications

Opaque HTTP rollout endpoint (--rollout-http-endpoint-url)

  • New slime/backends/sglang_utils/http_endpoint.py: URL normalization/validation and HttpEndpointRolloutServer, a no-engine rollout server stub (no offload/onload, no fault-tolerance recover).
  • slime/ray/rollout.py returns it from the server-startup path; get_model_url() in sglang_rollout.py returns the endpoint URL (with the requested route appended) and never assumes router APIs exist.
  • slime/ray/placement_group.py allocates no rollout GPUs in this mode.
  • Mutually exclusive with --rollout-external-engine-addrs (validated).
  • --rollout-http-endpoint-abort-strategy {cancel-only,router-workers}: cancel-only (the default when an endpoint is set) cancels local pending tasks without calling router /list_workers; the existing router-based abort is refactored into _drain_aborted_pending_tasks and remains the default otherwise.
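The URL handling described above can be sketched roughly as follows. This is a minimal illustration, not the code in `http_endpoint.py`: the helper names `normalize_endpoint_url` and `endpoint_route_url` are hypothetical, and the exact validation rules (rejecting query strings and fragments, for instance) are assumptions.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_endpoint_url(url: str) -> str:
    """Validate a base URL and strip trailing slashes (hypothetical helper)."""
    parts = urlsplit(url.strip())
    if parts.scheme not in ("http", "https"):
        raise ValueError(f"endpoint URL must be http(s): {url!r}")
    if not parts.netloc:
        raise ValueError(f"endpoint URL has no host: {url!r}")
    # Assumption: a base URL should not carry a query string or fragment.
    if parts.query or parts.fragment:
        raise ValueError(f"endpoint URL must not carry a query or fragment: {url!r}")
    return urlunsplit((parts.scheme, parts.netloc, parts.path.rstrip("/"), "", ""))


def endpoint_route_url(base_url: str, route: str) -> str:
    """Append a route such as "/generate" to the normalized base URL."""
    return f"{normalize_endpoint_url(base_url)}/{route.lstrip('/')}"
```

With this shape, a caller only ever builds `<base>/<route>` and never queries worker-management routes, which matches the "opaque endpoint" intent.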

Version-pinned rollout requests (--rollout-weight-version-policy exact-rollout-id)

  • /generate payloads gain weight_version={"exact_version": rollout_id}, scoped per rollout via rollout_weight_version_context.
  • Retries while the target version is unavailable are tunable via --rollout-weight-version-retry-attempts/-sleep; slime/utils/http_utils.py post() gains a retry_sleep parameter.
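One way to picture per-rollout pinning plus retry is the sketch below. It is an assumption-laden illustration: `weight_version_scope` stands in for `rollout_weight_version_context`, `VersionUnavailable` and the injected `send` callable are hypothetical, and the real `post()` in `slime/utils/http_utils.py` may differ in signature. Only the `weight_version={"exact_version": rollout_id}` payload field comes from this PR.

```python
import contextvars
import time
from contextlib import contextmanager

# Holds the rollout id for the current rollout; None means "do not pin".
_rollout_id = contextvars.ContextVar("rollout_id", default=None)


@contextmanager
def weight_version_scope(rollout_id: int):
    """Pin every payload built inside this block to one weight version."""
    token = _rollout_id.set(rollout_id)
    try:
        yield
    finally:
        _rollout_id.reset(token)


def build_generate_payload(text: str, sampling_params: dict) -> dict:
    """Build a /generate payload, adding the version pin when one is active."""
    payload = {"text": text, "sampling_params": sampling_params}
    rollout_id = _rollout_id.get()
    if rollout_id is not None:
        payload["weight_version"] = {"exact_version": rollout_id}
    return payload


class VersionUnavailable(Exception):
    """Hypothetical signal that no worker serves the pinned version yet."""


def post_with_retry(send, payload: dict, attempts: int = 3, retry_sleep: float = 0.0):
    """Call send(payload), retrying while the pinned version is unavailable."""
    for attempt in range(1, attempts + 1):
        try:
            return send(payload)
        except VersionUnavailable:
            if attempt == attempts:
                raise
            time.sleep(retry_sleep)
```

Using a context variable keeps the pin scoped to one rollout even when several generation tasks run concurrently, since each task sees the value that was set when it was created.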

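Publish-only disk delta sync (--update-weight-delta-publish-only)

  • The trainer publishes each complete delta version through the custom hook instead of issuing update_weights_from_disk RPCs, and skips version-dir cleanup (--update-weight-delta-keep-files is required).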
  • The dispatched publish intentionally stays in flight across the training step and is drained at the start of the next sync (or on disconnect_rollout_engines), so publish latency overlaps training; a failed publish surfaces one sync late on rank 0.
  • Argument validation enforces delta mode + disk transport + publish path + keep-files.
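The deferred-drain behavior above can be modeled with a small sketch. It uses a thread pool as a stand-in for the asynchronous publish refs, and `DeferredPublisher` with its method names is hypothetical rather than the PR's implementation; the point is only the ordering: drain the previous publish first, then dispatch the new one without waiting.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Optional


class DeferredPublisher:
    """Keeps at most one publish in flight and drains it at the next sync."""

    def __init__(self, publish_hook: Callable[[int], None]):
        self._hook = publish_hook
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._in_flight: Optional[Future] = None

    def drain(self) -> None:
        """Wait for the outstanding publish and re-raise its failure, if any."""
        future, self._in_flight = self._in_flight, None
        if future is not None:
            future.result()

    def sync(self, version: int) -> None:
        # A failure of the previous version surfaces here, one sync late.
        self.drain()
        # (Writing the delta files for `version` would happen at this point.)
        # Dispatch without waiting so the publish overlaps the next training step.
        self._in_flight = self._pool.submit(self._hook, version)

    def disconnect(self) -> None:
        """Drain the last publish before shutting down."""
        try:
            self.drain()
        finally:
            self._pool.shutdown()
```

In this model, calling `sync(1)`, `sync(2)`, `sync(3)` with a hook that fails on version 2 raises during `sync(3)`, which mirrors the "surfaces one sync late" trade-off noted above.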

Tests

  • New: tests/test_rollout_http_endpoint.py (8 tests: URL validation, endpoint routing, payload pinning, retry-until-version-available, cancel-only abort, no-engine server), tests/test_delta_publish_only.py (4 tests: hook invocation without engine RPCs or cleanup, no-op version publish, drain-on-disconnect, publish deferred to finalize).
  • Extended: tests/test_placement_group.py (http-endpoint layouts), tests/test_megatron_argument_validation.py.

Checklist

  • Format your code
  • Add unit tests (12 new tests + 3 extended parametrizations, all passing)
  • Update documentation
  • Provide accuracy and speed benchmark results

jvmncs added 5 commits June 12, 2026 12:44
In publish-only mode, _finalize_sync now dispatches the publish hook
without awaiting its refs; the start of the next sync (or
disconnect_rollout_engines) drains them, so the publish overlaps a full
training step with at most one version outstanding. Failures surface
one sync late. Direct disk transport still drains before cleanup and
resume.
@jvmncs jvmncs changed the title Implement external rollouts endpoint with publish-only weight sync (feat) external rollouts endpoint with publish-only weight sync Jun 12, 2026
@jvmncs jvmncs changed the title (feat) external rollouts endpoint with publish-only weight sync feat(rollouts) external rollouts endpoint with publish-only weight sync Jun 12, 2026