Hi, thank you for releasing this great framework!
I am running Slime GRPO training on Qwen3-1.7B-Base, and I observe that Slime is consistently ~4× slower than VERL, even when using the same dataset, same rollout size (128×32), and same number of GPUs (8×H800).
Even in the early stage of training, when the average response length is only ~1000 tokens, a single training step still takes more than 10 minutes, which is significantly slower than VERL under the same setup.
The slowdown seems to come mainly from two metrics:
- High Wait Time (~50% of step time)
WandB shows perf/wait_time_ratio ≈ 0.5, meaning half of the time is spent waiting rather than computing.
- Rollout Throughput Much Lower Than VERL
perf/rollout_time is significantly higher than verl.
This makes total training throughput roughly one quarter of VERL on the same machine.
Below is the launch script I modified based on the official example.
It is possible that I misconfigured certain parts, causing the framework to not reach its expected performance.
I would really appreciate it if you could take a look and let me know whether this setup is reasonable, and what configuration you would recommend for Qwen3-1.7B-Base under this setting.
Thank you very much for your help!
ROLLOUT_ARGS=(
--prompt-data ./datasets/dapo-math-17k.jsonl
--input-key prompt
--label-key label
--apply-chat-template
--rollout-shuffle
--rm-type deepscaler
--num-rollout 3000
--rollout-batch-size 128
--n-samples-per-prompt 32
--rollout-max-response-len 16384
--rollout-temperature 0.8
--global-batch-size 4096
--balance-data
)
EVAL_ARGS=(
--eval-interval 10
--eval-prompt-data aime ./datasets/aime-2024.jsonl
--n-samples-per-eval-prompt 8
--eval-max-response-len 16384
--eval-top-p 0.7
)
PERF_ARGS=(
--tensor-model-parallel-size 2
--sequence-parallel
--pipeline-model-parallel-size 1
--context-parallel-size 1
--expert-model-parallel-size 1
--expert-tensor-parallel-size 1
--recompute-granularity full
--recompute-method uniform
--recompute-num-layers 1
--use-dynamic-batch-size
--max-tokens-per-gpu 9216
)
GRPO_ARGS=(
--advantage-estimator grpo
--use-kl-loss
--kl-loss-coef 0.00
--kl-loss-type low_var_kl
--entropy-coef 0.00
--eps-clip 0.2
--eps-clip-high 0.28
)
OPTIMIZER_ARGS=(
--optimizer adam
--lr 1e-6
--lr-decay-style constant
--weight-decay 0.1
--adam-beta1 0.9
--adam-beta2 0.98
)
WANDB_ARGS=(
--use-wandb
--wandb-project Math-RL
--wandb-group Qwen3-1.7B-8xgpu-128bs-32n
--wandb-key ${WANDB_API_KEY}
)
SGLANG_ARGS=(
--rollout-num-gpus-per-engine 2
--use-slime-router
--sglang-mem-fraction-static 0.8
)
MISC_ARGS=(
--attention-dropout 0.0
--hidden-dropout 0.0
--accumulate-allreduce-grads-in-fp32
--attention-softmax-in-fp32
--attention-backend flash
)
ray start --head --node-ip-address ${MASTER_ADDR} --num-gpus 8 --disable-usage-stats --num-cpus 32 --dashboard-host=0.0.0.0 --dashboard-port=8265
RUNTIME_ENV_JSON="{
\"env_vars\": {
\"PYTHONPATH\": \"/slime-build/Megatron-LM/\",
\"CUDA_DEVICE_MAX_CONNECTIONS\": \"1\",
\"NCCL_NVLS_ENABLE\": \"${HAS_NVLINK}\"
}
}"
ray job submit --address="xxxxx:8265" \
--runtime-env-json="${RUNTIME_ENV_JSON}" \
--no-wait \
-- python3 train.py \
--actor-num-nodes 1 \
--actor-num-gpus-per-node 2 \
--rollout-num-gpus 6 \
${MODEL_ARGS[@]} \
${CKPT_ARGS[@]} \
${ROLLOUT_ARGS[@]} \
${OPTIMIZER_ARGS[@]} \
${GRPO_ARGS[@]} \
${WANDB_ARGS[@]} \
${PERF_ARGS[@]} \
${EVAL_ARGS[@]} \
${SGLANG_ARGS[@]} \
${MISC_ARGS[@]}
Hi, thank you for releasing this great framework!
I am running Slime GRPO training on Qwen3-1.7B-Base, and I observe that Slime is consistently ~4× slower than VERL, even when using the same dataset, same rollout size (128×32), and same number of GPUs (8×H800).
Even in the early stage of training, when the average response length is only ~1000 tokens, a single training step still takes more than 10 minutes, which is significantly slower than VERL under the same setup.
The slowdown seems to come mainly from two metrics:
WandB shows perf/wait_time_ratio ≈ 0.5, meaning half of the time is spent waiting rather than computing.
perf/rollout_time is significantly higher than verl.
This makes total training throughput roughly one quarter of VERL on the same machine.
Below is the launch script I modified based on the official example.
It is possible that I misconfigured certain parts, causing the framework to not reach its expected performance.
I would really appreciate it if you could take a look and let me know whether this setup is reasonable, and what configuration you would recommend for Qwen3-1.7B-Base under this setting.
Thank you very much for your help!