feat(rollouts) external rollouts endpoint with publish-only weight sync #2071
Open
jvmncs wants to merge 5 commits into
In publish-only mode, _finalize_sync now dispatches the publish hook without awaiting its refs; the start of the next sync (or disconnect_rollout_engines) drains them, so the publish overlaps a full training step with at most one version outstanding. Failures surface one sync late. Direct disk transport still drains before cleanup and resume.
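The deferred-drain behavior described above can be sketched with the standard library alone. This is a minimal illustration, not the PR's implementation: `DeferredPublisher`, `begin_sync`, `finalize_sync`, and `disconnect` are hypothetical names, and a single-worker thread pool stands in for the refs that the real `_finalize_sync` hands off.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Optional


class DeferredPublisher:
    """Hypothetical helper that keeps at most one publish in flight."""

    def __init__(self, publish_hook: Callable[[int], None]) -> None:
        self._hook = publish_hook
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._pending: Optional[Future] = None

    def drain(self) -> None:
        """Wait for the outstanding publish and re-raise its failure here."""
        pending, self._pending = self._pending, None
        if pending is not None:
            pending.result()

    def begin_sync(self) -> None:
        # Start of the next sync: the previous version must be fully published.
        self.drain()

    def finalize_sync(self, version: int) -> None:
        # Dispatch without waiting, so the publish overlaps the next training step.
        self._pending = self._pool.submit(self._hook, version)

    def disconnect(self) -> None:
        # Final drain, so the last publish is never silently dropped.
        try:
            self.drain()
        finally:
            self._pool.shutdown()


if __name__ == "__main__":
    def hook(version: int) -> None:
        if version == 2:
            raise RuntimeError(f"publish of version {version} failed")

    publisher = DeferredPublisher(hook)
    publisher.begin_sync()
    publisher.finalize_sync(1)
    publisher.begin_sync()
    publisher.finalize_sync(2)  # the failure is not raised here
    try:
        publisher.begin_sync()  # it surfaces at the start of the next sync
    except RuntimeError as exc:
        print("surfaced one sync late:", exc)
    publisher.disconnect()
```

Clearing the pending handle before waiting on it means a failed publish is reported once, and later drains do not re-raise the same error.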
Motivation
slime currently assumes it owns the rollout backend: it launches or registers SGLang engines (or fixed --rollout-external-engine-addrs) and pushes weight updates to each engine via per-engine RPCs. This PR lets slime train against an elastic, externally managed inference fleet behind a single HTTP endpoint: one with no stable per-engine handles and no SGLang-router worker-management APIs, where workers may scale up or down mid-run. Three opt-in pieces compose to support this:

- Opaque HTTP rollout endpoint: sends /generate requests to a base URL without launching or registering any SGLang workers.
- Version-pinned rollout requests: each /generate payload pins the weight version to the rollout ID.
- Publish-only disk delta sync: instead of per-engine update_weights_from_disk RPCs, the trainer publishes each complete delta version through a custom hook (e.g. to shared storage that the endpoint's workers consume), and the publish overlaps the next training step.

All features are off by default; existing behavior is unchanged when the new flags are unset.
Modifications
Opaque HTTP rollout endpoint (--rollout-http-endpoint-url)

- slime/backends/sglang_utils/http_endpoint.py: URL normalization/validation and HttpEndpointRolloutServer, a no-engine rollout server stub (no offload/onload, no fault-tolerance recover).
- slime/ray/rollout.py returns it from the server-startup path; get_model_url() in sglang_rollout.py returns the endpoint URL (with the requested route appended) and never assumes router APIs exist.
- slime/ray/placement_group.py allocates no rollout GPUs in this mode.
- [...] --rollout-external-engine-addrs (validated).
- --rollout-http-endpoint-abort-strategy {cancel-only,router-workers}: cancel-only (the default when an endpoint is set) cancels local pending tasks without calling router /list_workers; the existing router-based abort is refactored into _drain_aborted_pending_tasks and remains the default otherwise.

Version-pinned rollout requests (--rollout-weight-version-policy exact-rollout-id)

- /generate payloads gain weight_version={"exact_version": rollout_id}, scoped per rollout via rollout_weight_version_context.
- Retry until the pinned version is available: --rollout-weight-version-retry-attempts/-sleep; slime/utils/http_utils.py post() gains a retry_sleep parameter.

Publish-only disk delta sync (--update-weight-delta-publish-only)

- The trainer publishes each complete delta version through the custom hook without per-engine update_weights_from_disk RPCs, and skips version-dir cleanup (--update-weight-delta-keep-files is required).
- The publish is dispatched without awaiting its refs and drained at the start of the next sync (or in disconnect_rollout_engines), so publish latency overlaps training; a failed publish surfaces one sync late on rank 0.

Tests

- tests/test_rollout_http_endpoint.py (8 tests: URL validation, endpoint routing, payload pinning, retry-until-version-available, cancel-only abort, no-engine server), tests/test_delta_publish_only.py (4 tests: hook invocation without engine RPCs or cleanup, no-op version publish, drain-on-disconnect, publish deferred to finalize).
- tests/test_placement_group.py (http-endpoint layouts), tests/test_megatron_argument_validation.py.
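For reviewers, the endpoint routing and payload pinning covered by the tests above might look roughly like the sketch below. Only the payload shape `weight_version={"exact_version": rollout_id}` comes from this PR. The helper names (`normalize_endpoint_url`, `endpoint_route`, `weight_version_context`, `build_generate_payload`) and the specific validation rules are assumptions for illustration and may differ from what `http_endpoint.py` and `rollout_weight_version_context` actually do.

```python
import contextlib
import contextvars
from typing import Any, Dict, Iterator
from urllib.parse import urlsplit

# Hypothetical per-rollout scope; None means "no version pinning".
_rollout_id = contextvars.ContextVar("rollout_id", default=None)


def normalize_endpoint_url(url: str) -> str:
    """Validate an http(s) base URL and strip any trailing slash.

    Assumption: query strings and fragments are simply discarded here.
    """
    parts = urlsplit(url.strip())
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"expected an http(s) base URL, got {url!r}")
    return f"{parts.scheme}://{parts.netloc}{parts.path.rstrip('/')}"


def endpoint_route(base_url: str, route: str) -> str:
    """Append the requested route to the normalized base URL."""
    return f"{normalize_endpoint_url(base_url)}/{route.lstrip('/')}"


@contextlib.contextmanager
def weight_version_context(rollout_id: int) -> Iterator[None]:
    """Pin every payload built inside the block to one rollout ID."""
    token = _rollout_id.set(rollout_id)
    try:
        yield
    finally:
        _rollout_id.reset(token)


def build_generate_payload(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the payload, pinned if a rollout scope is active."""
    pinned = dict(payload)
    rollout_id = _rollout_id.get()
    if rollout_id is not None:
        pinned["weight_version"] = {"exact_version": rollout_id}
    return pinned


if __name__ == "__main__":
    print(endpoint_route("http://fleet.example:8000/v1/", "/generate"))
    with weight_version_context(7):
        print(build_generate_payload({"text": "hello"}))
```

A context variable keeps the pin scoped to the rollout that set it, including across `asyncio` tasks started inside the block, so concurrent rollouts do not overwrite each other's version.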