Skip to content

feat(model): thread weight_dtype through HF export for plain-dtype DeepSeek-V4 output#4301

Merged
Meirtz merged 2 commits into
NVIDIA-NeMo:mainfrom
Meirtz:fix/dsv4-bf16-export
Jun 22, 2026
Merged

feat(model): thread weight_dtype through HF export for plain-dtype DeepSeek-V4 output#4301
Meirtz merged 2 commits into
NVIDIA-NeMo:mainfrom
Meirtz:fix/dsv4-bf16-export

Conversation

@Meirtz

@Meirtz Meirtz commented Jun 11, 2026

Copy link
Copy Markdown
Member

What

Thread weight_dtype: Optional[torch.dtype] = None through the HF export path — export_hf_weights / save_hf_pretrained / stream_weights_megatron_to_hf — carried per-task via a new optional WeightConversionTask.weight_dtype field. When set, the DeepSeek-V4 bridge emits plain weights in that dtype (no *.scale companions) instead of re-creating the source repo's quantized layout. Default (None) keeps today's behavior. CLI: --export-weight-dtype on the export subcommand.

Why

DSv4 HF export unconditionally re-creates the source repo's quantized weight/scale layout (maybe_modify_converted_hf_weightrequantize_hf_weight_scale_pairs, from #3969). That's right for checkpoint conversion, but bf16-SFT'd weights get silently post-hoc quantized — a user found *.scale tensors in their SFT export and asked about train/inference parity.

Design (revised after reviewer feedback): the requantize hook runs on both export consumers — online weight streaming to rollout engines (export_hf_weights, e.g. verl RL weight sync) and on-disk checkpoints (save_hf_pretrained) — so a bridge-level boolean cannot configure them independently. A dtype-typed parameter on each public API lets callers choose per path (e.g. bf16 to rollout for RL parity, quantized to disk for serving-format artifacts, or vice versa). Hook signatures are unchanged (the dtype rides on the task), so the other bridges overriding this hook (dsv3, gemma4, kimi, mimo, flux) are unaffected; DSv3 can adopt the same field later.

Verified

  • the saver streams only yielded tensors (omitting .scale keys is safe); exported config.json is built fresh (torch_dtype: bfloat16, no quantization fields); safetensors index regenerated from written tensors;
  • unit tests cover dtype-set (plain weights, non-float tensors untouched) and default (requantize) paths.

Notes

  • AI-assisted (Claude); analysis, validation and review by the human author.

🤖 Generated with Claude Code

@Meirtz Meirtz added feature New capabilities, enhancements, or enablement work area:model Model implementations and HF bridge logic labels Jun 11, 2026
@yaoyu-33 yaoyu-33 added the needs-review PR is ready for code review and waiting on a reviewer label Jun 11, 2026
@Meirtz Meirtz changed the title feat(model): optional bf16 (non-quantized) HF export for DeepSeek-V4 feat(model): thread weight_dtype through HF export for plain-dtype DeepSeek-V4 output Jun 11, 2026
@Meirtz

Meirtz commented Jun 11, 2026

Copy link
Copy Markdown
Member Author

Reworked per reviewer feedback (offline discussion): the hook serves two export consumers — online weight streaming to rollout engines and on-disk checkpoints — so the bridge-level boolean is gone. Now weight_dtype: Optional[torch.dtype] on export_hf_weights / save_hf_pretrained, carried per-task (WeightConversionTask.weight_dtype), hook signatures unchanged. d10a8e7e.

@Meirtz

Meirtz commented Jun 11, 2026

Copy link
Copy Markdown
Member Author

Full-model E2E validation (DeepSeek-V4-Flash, 43 layers, real weights, TP1/PP4/EP8 on 8×GB300; same imported Megatron checkpoint for both runs):

export tensors .scale tensors dtypes
default (quantized) 69,187 34,167 F8_E4M3, F8_E8M0, I8 (MXFP4), BF16, F32, I32
--export-weight-dtype bfloat16 35,020 0 BF16, I32

35,020 + 34,167 = 69,187 — the bf16 artifact contains exactly every weight with no scale companions (I32 = tid2eid routing buffers, correctly left untouched).

Two notes from the run: (1) the smoke caught a real bug in the first version of this PR — WeightConversionTask is a frozen dataclass, so the dtype is now applied via dataclasses.replace (ce66e82), and the unit tests now use real task instances instead of mocks; (2) with weight_dtype set, the saver's completeness check counts the source's .scale entries as "not written" (export proceeds with --not-strict) — cosmetic, can be polished in a follow-up by excluding scale keys from the expected set.

@Meirtz

Meirtz commented Jun 11, 2026

Copy link
Copy Markdown
Member Author

/ok to test 1b93e3c

@cuichenx cuichenx left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Comment thread src/megatron/bridge/models/deepseek/deepseek_v4_bridge.py Outdated
Comment thread src/megatron/bridge/models/conversion/model_bridge.py Outdated

@cuichenx cuichenx left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please check comments above

@Meirtz

Meirtz commented Jun 16, 2026

Copy link
Copy Markdown
Member Author

/ok to test b577be3

…tput

Export has two consumers — online weight sync for RL rollout
(export_hf_weights) and on-disk checkpoints (save_hf_pretrained). Each
gains an optional weight_dtype that flows through WeightConversionTask
into the export stream.

Per review (HollowMan6): the plain-dtype cast is now generic, not
DSv4-only. build_conversion_tasks stamps weight_dtype onto each task
(no post-hoc dataclasses.replace except for caller-supplied tasks), and
the cast lives in the shared stream path covering both the standard and
grouped-export branches. The DSv4 hook simply skips requantization when
weight_dtype is set and returns the converted weights unchanged, letting
the generic path cast the dtype — keeping plain-dtype export identical
across bridges. Adds --export-weight-dtype to the multi-gpu convert
example.

Validated end-to-end on 32x GB300: bf16 export = 35020 tensors / 0
scales; quantized export = 69187 / 34167.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Lingrui Mei <lmei@nvidia.com>
@Meirtz

Meirtz commented Jun 17, 2026

Copy link
Copy Markdown
Member Author

/ok to test 5a97742

@HollowMan6 HollowMan6 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Meirtz ! Now the code logic looks much better and I don't have more comments

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:model Model implementations and HF bridge logic feature New capabilities, enhancements, or enablement work needs-review PR is ready for code review and waiting on a reviewer

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants