Slime RL Training is 4× Slower Than VERL on the Same Model #1072

@chenyuxin1999

Hi, thank you for releasing this great framework!

I am running Slime GRPO training on Qwen3-1.7B-Base, and Slime is consistently ~4× slower than VERL, even with the same dataset, the same rollout size (128 prompts × 32 samples per prompt), and the same hardware (8×H800).

Even in the early stage of training, when the average response length is only ~1000 tokens, a single training step still takes more than 10 minutes, which is significantly slower than VERL under the same setup.

The slowdown seems to come mainly from two metrics:

  1. High Wait Time (~50% of step time)

WandB shows perf/wait_time_ratio ≈ 0.5, meaning half of the time is spent waiting rather than computing.

  2. Rollout Throughput Much Lower Than VERL

perf/rollout_time is significantly higher than the corresponding rollout time in VERL.

This makes total training throughput roughly one quarter of VERL on the same machine.
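
For scale, the figures above imply roughly the following per-step workload. This is only a back-of-the-envelope sketch: the ~1000-token average and the 600-second step time are the rounded values quoted above, and the variable names are illustrative, not Slime options.

```shell
#!/usr/bin/env bash
# Rough per-step workload implied by the numbers in this report.
prompts_per_step=128        # --rollout-batch-size
samples_per_prompt=32       # --n-samples-per-prompt
avg_response_tokens=1000    # approximate early-training average (rounded)
step_seconds=600            # "more than 10 minutes", used as a lower bound

samples_per_step=$((prompts_per_step * samples_per_prompt))
tokens_per_step=$((samples_per_step * avg_response_tokens))
# Integer division. This is an upper bound on end-to-end throughput,
# because the real step takes at least step_seconds.
tokens_per_second=$((tokens_per_step / step_seconds))

echo "samples per step: ${samples_per_step}"
echo "response tokens per step: ${tokens_per_step}"
echo "end-to-end tokens/s (upper bound): ${tokens_per_second}"
```

That is about 4.1M response tokens per step, generated at no more than roughly 6.8k tokens per second end to end on the 8 GPUs.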

Below is the launch script I modified based on the official example.
It is possible that I misconfigured certain parts, causing the framework to not reach its expected performance.
I would really appreciate it if you could take a look and let me know whether this setup is reasonable, and what configuration you would recommend for Qwen3-1.7B-Base under this setting.

Thank you very much for your help!

ROLLOUT_ARGS=(
   --prompt-data ./datasets/dapo-math-17k.jsonl
   --input-key prompt
   --label-key label
   --apply-chat-template
   --rollout-shuffle

   --rm-type deepscaler

   --num-rollout 3000
   --rollout-batch-size 128
   --n-samples-per-prompt 32
   --rollout-max-response-len 16384
   --rollout-temperature 0.8

   --global-batch-size 4096
   --balance-data
)

EVAL_ARGS=(
   --eval-interval 10
   --eval-prompt-data aime ./datasets/aime-2024.jsonl
   --n-samples-per-eval-prompt 8
   --eval-max-response-len 16384
   --eval-top-p 0.7
)

PERF_ARGS=(
   --tensor-model-parallel-size 2
   --sequence-parallel
   --pipeline-model-parallel-size 1
   --context-parallel-size 1
   --expert-model-parallel-size 1
   --expert-tensor-parallel-size 1

   --recompute-granularity full
   --recompute-method uniform
   --recompute-num-layers 1

   --use-dynamic-batch-size
   --max-tokens-per-gpu 9216
)

GRPO_ARGS=(
   --advantage-estimator grpo
   --use-kl-loss
   --kl-loss-coef 0.00
   --kl-loss-type low_var_kl
   --entropy-coef 0.00
   --eps-clip 0.2
   --eps-clip-high 0.28
)

OPTIMIZER_ARGS=(
   --optimizer adam
   --lr 1e-6
   --lr-decay-style constant
   --weight-decay 0.1
   --adam-beta1 0.9
   --adam-beta2 0.98
)

WANDB_ARGS=(
   --use-wandb
   --wandb-project Math-RL
   --wandb-group Qwen3-1.7B-8xgpu-128bs-32n
   --wandb-key ${WANDB_API_KEY}
)

SGLANG_ARGS=(
   --rollout-num-gpus-per-engine 2
   --use-slime-router
   --sglang-mem-fraction-static 0.8
)

MISC_ARGS=(
   --attention-dropout 0.0
   --hidden-dropout 0.0
   --accumulate-allreduce-grads-in-fp32
   --attention-softmax-in-fp32
   --attention-backend flash
)

ray start --head --node-ip-address ${MASTER_ADDR} --num-gpus 8 --disable-usage-stats --num-cpus 32 --dashboard-host=0.0.0.0 --dashboard-port=8265

RUNTIME_ENV_JSON="{
  \"env_vars\": {
    \"PYTHONPATH\": \"/slime-build/Megatron-LM/\",
    \"CUDA_DEVICE_MAX_CONNECTIONS\": \"1\",
    \"NCCL_NVLS_ENABLE\": \"${HAS_NVLINK}\"
  }
}"

ray job submit --address="xxxxx:8265" \
   --runtime-env-json="${RUNTIME_ENV_JSON}" \
   --no-wait \
   -- python3 train.py \
   --actor-num-nodes 1 \
   --actor-num-gpus-per-node 2 \
   --rollout-num-gpus 6 \
   "${MODEL_ARGS[@]}" \
   "${CKPT_ARGS[@]}" \
   "${ROLLOUT_ARGS[@]}" \
   "${OPTIMIZER_ARGS[@]}" \
   "${GRPO_ARGS[@]}" \
   "${WANDB_ARGS[@]}" \
   "${PERF_ARGS[@]}" \
   "${EVAL_ARGS[@]}" \
   "${SGLANG_ARGS[@]}" \
   "${MISC_ARGS[@]}"
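
To make the resource split in this launch command easier to review, the arithmetic behind it can be sketched as below. The values are copied from the flags above. The variable names are illustrative, and the data-parallel figure assumes the usual Megatron relation (GPUs divided by the tensor-parallel size, with pipeline and context parallelism at 1 as configured).

```shell
#!/usr/bin/env bash
# GPU layout implied by the flags in the launch command above.
total_gpus=8                 # ray start --num-gpus 8
actor_gpus=$((1 * 2))        # --actor-num-nodes 1 x --actor-num-gpus-per-node 2
rollout_gpus=6               # --rollout-num-gpus 6
gpus_per_engine=2            # --rollout-num-gpus-per-engine 2
tensor_parallel=2            # --tensor-model-parallel-size 2

rollout_engines=$((rollout_gpus / gpus_per_engine))
actor_data_parallel=$((actor_gpus / tensor_parallel))
unassigned_gpus=$((total_gpus - actor_gpus - rollout_gpus))

# Samples produced per rollout versus samples consumed per optimizer step.
samples_per_rollout=$((128 * 32))   # --rollout-batch-size x --n-samples-per-prompt
global_batch_size=4096              # --global-batch-size
updates_per_rollout=$((samples_per_rollout / global_batch_size))

echo "rollout engines: ${rollout_engines}"
echo "actor data-parallel size: ${actor_data_parallel}"
echo "unassigned GPUs: ${unassigned_gpus}"
echo "optimizer updates per rollout: ${updates_per_rollout}"
```

Under these flags the training side is a single 2-GPU tensor-parallel group (data-parallel size 1) that processes all 4096 samples in one optimizer update, while three 2-GPU SGLang engines handle generation. If training and rollout alternate in this layout, one group of GPUs is idle while the other works, which might account for part of the reported wait ratio. This is an assumption and has not been verified against Slime's scheduling.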
