feat(model): thread weight_dtype through HF export for plain-dtype DeepSeek-V4 output #4301
Reworked per reviewer feedback (offline discussion): the hook serves two export consumers — online weight streaming to rollout engines and on-disk checkpoints — so the bridge-level boolean is gone. Now |
|
Full-model E2E validation (DeepSeek-V4-Flash, 43 layers, real weights, TP1/PP4/EP8 on 8×GB300; same imported Megatron checkpoint for both runs):
35,020 + 34,167 = 69,187: the bf16 artifact contains exactly every weight with no scale companions (I32 =
Two notes from the run: (1) the smoke caught a real bug in the first version of this PR
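The tensor-count check above can be reproduced with a few lines of Python. This is a minimal sketch, assuming the export writes a standard Hugging Face sharded-checkpoint index whose `weight_map` maps tensor names to shard files, and that scale companions are exactly the keys ending in `.scale`. The helper name and the toy maps are hypothetical and are not taken from the PR.

```python
from typing import Dict, Tuple


def count_weights_and_scales(weight_map: Dict[str, str]) -> Tuple[int, int]:
    """Return (plain weight count, scale companion count) for an index weight_map."""
    scales = sum(1 for name in weight_map if name.endswith(".scale"))
    return len(weight_map) - scales, scales


# Toy stand-ins for the "weight_map" of two exported safetensors indexes.
# Not every weight has a scale companion (the norm weight below does not).
quantized_map = {
    "layers.0.mlp.weight": "model-00001.safetensors",
    "layers.0.mlp.scale": "model-00001.safetensors",
    "layers.0.norm.weight": "model-00001.safetensors",
}
bf16_map = {
    "layers.0.mlp.weight": "model-00001.safetensors",
    "layers.0.norm.weight": "model-00001.safetensors",
}

q_weights, q_scales = count_weights_and_scales(quantized_map)
p_weights, p_scales = count_weights_and_scales(bf16_map)

# The plain-dtype export should hold the same weights and no scale companions.
same_weights = set(bf16_map) == {n for n in quantized_map if not n.endswith(".scale")}

# The totals reported in this PR follow the same identity.
reported_total = 35_020 + 34_167
```

On a real export, the two maps would come from `json.load(...)["weight_map"]` on each artifact's index file; the check is that the weight counts match and the plain-dtype scale count is zero.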
ce66e82 to 1b93e3c
|
/ok to test 1b93e3c |
cuichenx
left a comment
please check comments above
f1bcb78 to 0fe66bf
0fe66bf to f7c0987
f7c0987 to b577be3
|
/ok to test b577be3 |
…tput
Export has two consumers: online weight sync for RL rollout (export_hf_weights) and on-disk checkpoints (save_hf_pretrained). Each gains an optional weight_dtype that flows through WeightConversionTask into the export stream.
Per review (HollowMan6): the plain-dtype cast is now generic, not DSv4-only. build_conversion_tasks stamps weight_dtype onto each task (no post-hoc dataclasses.replace except for caller-supplied tasks), and the cast lives in the shared stream path covering both the standard and grouped-export branches. The DSv4 hook simply skips requantization when weight_dtype is set and returns the converted weights unchanged, letting the generic path cast the dtype, which keeps plain-dtype export identical across bridges.
Adds --export-weight-dtype to the multi-gpu convert example.
Validated end-to-end on 32x GB300: bf16 export = 35020 tensors / 0 scales; quantized export = 69187 / 34167.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Lingrui Mei <lmei@nvidia.com>
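The flow described in this commit message can be illustrated with a toy model. This is a sketch under stated assumptions, not the Megatron-Bridge implementation: `ToyTensor`, `ToyConversionTask`, `build_tasks`, `toy_quantizing_hook`, and `stream_weights` are hypothetical stand-ins, dtypes are plain strings so that no torch import is needed, and the `.weight` to `.scale` naming is assumed for illustration only.

```python
from dataclasses import dataclass, replace
from typing import Dict, Iterable, Iterator, List, Optional, Tuple


@dataclass(frozen=True)
class ToyTensor:
    """Stand-in for a tensor; only the dtype label is modelled."""
    dtype: str

    def to(self, dtype: str) -> "ToyTensor":
        return ToyTensor(dtype=dtype)


@dataclass(frozen=True)
class ToyConversionTask:
    """Hypothetical stand-in for WeightConversionTask."""
    param_name: str
    weight_dtype: Optional[str] = None  # None keeps the bridge's default layout


def build_tasks(names: Iterable[str], weight_dtype: Optional[str] = None) -> List[ToyConversionTask]:
    # The dtype is stamped on each task at build time.
    return [ToyConversionTask(name, weight_dtype) for name in names]


def toy_quantizing_hook(task: ToyConversionTask, converted: Dict[str, ToyTensor]) -> Dict[str, ToyTensor]:
    """Toy bridge hook: re-creates a weight/scale layout unless a dtype is requested."""
    if task.weight_dtype is not None:
        return converted  # skip requantization; the shared path casts below
    out: Dict[str, ToyTensor] = {}
    for name, tensor in converted.items():
        out[name] = tensor.to("quantized")
        out[name.removesuffix(".weight") + ".scale"] = ToyTensor("fp32")
    return out


def stream_weights(tasks: Iterable[ToyConversionTask], source: Dict[str, ToyTensor]) -> Iterator[Tuple[str, ToyTensor]]:
    """Shared stream path: run the hook, then apply the generic plain-dtype cast."""
    for task in tasks:
        converted = toy_quantizing_hook(task, {task.param_name: source[task.param_name]})
        for name, tensor in converted.items():
            if task.weight_dtype is not None:
                tensor = tensor.to(task.weight_dtype)
            yield name, tensor


source = {"layers.0.mlp.weight": ToyTensor("bf16"), "layers.1.mlp.weight": ToyTensor("bf16")}

quantized = dict(stream_weights(build_tasks(source), source))      # default: weight + scale pairs
plain = dict(stream_weights(build_tasks(source, "bf16"), source))  # plain dtype, no scales

# Caller-supplied tasks are the one case where the dtype is applied after the fact.
caller_tasks = [replace(task, weight_dtype="fp16") for task in build_tasks(source)]
half = dict(stream_weights(caller_tasks, source))
```

Because the dtype rides on the task rather than on the hook signature, the two export consumers can pass different values through the same stream function, and a hook that ignores the field keeps its existing behavior.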
|
/ok to test 5a97742 |
HollowMan6
left a comment
Thank you @Meirtz ! Now the code logic looks much better and I don't have more comments
What
Thread weight_dtype: Optional[torch.dtype] = None through the HF export path (export_hf_weights / save_hf_pretrained / stream_weights_megatron_to_hf), carried per-task via a new optional WeightConversionTask.weight_dtype field. When set, the DeepSeek-V4 bridge emits plain weights in that dtype (no *.scale companions) instead of re-creating the source repo's quantized layout. Default (None) keeps today's behavior. CLI: --export-weight-dtype on the export subcommand.
Why
DSv4 HF export unconditionally re-creates the source repo's quantized weight/scale layout (maybe_modify_converted_hf_weight → requantize_hf_weight_scale_pairs, from #3969). That's right for checkpoint conversion, but bf16-SFT'd weights get silently post-hoc quantized: a user found *.scale tensors in their SFT export and asked about train/inference parity.
Design (revised after reviewer feedback): the requantize hook runs on both export consumers, online weight streaming to rollout engines (export_hf_weights, e.g. verl RL weight sync) and on-disk checkpoints (save_hf_pretrained), so a bridge-level boolean cannot configure them independently. A dtype-typed parameter on each public API lets callers choose per path (e.g. bf16 to rollout for RL parity, quantized to disk for serving-format artifacts, or vice versa). Hook signatures are unchanged (the dtype rides on the task), so the other bridges overriding this hook (dsv3, gemma4, kimi, mimo, flux) are unaffected; DSv3 can adopt the same field later.
Verified
.scale keys is safe); exported config.json is built fresh (torch_dtype: bfloat16, no quantization fields); safetensors index regenerated from written tensors;
Notes
🤖 Generated with Claude Code