DFlash for MiniMax-M3 (WIP): synthesis thinking-mode mix by yeyu-nvidia · Pull Request #1749 · NVIDIA/Model-Optimizer

yeyu-nvidia · 2026-06-16T18:04:16Z

Tracking PR for DFlash on MiniMax-M3, reusing the M2.7 DFlash work merged in #1621. Opening early with the first piece — the data-synthesis tooling — more to follow (offline recipe folder, mask-token / YaRN config for M3).

This PR so far

Per-conversation thinking-mode mix in data synthesis (server_generate.py + distributed_generate/worker.sh):

M3 has three thinking modes (enabled / disabled / adaptive). To train a DFlash draft that generalizes across all of them, synthetic conversations should span the mix.
server_generate.py gains --thinking-modes enabled,disabled,adaptive: conversation i uses modes[i % len(modes)] (even split), passed via chat_template_kwargs, and the chosen mode is recorded on each output record. Default empty → unchanged for models without thinking modes.
We intentionally do not enable the server's --reasoning-parser for synthesis, so the full <mm:think>+answer lands in content (the draft must learn to draft the entire generated sequence).
worker.sh threads THINKING_MODES through and adds VLLM_SERVE_EXTRA_ARGS / SGLANG_SERVE_EXTRA_ARGS passthroughs for model-specific serve flags.

M3 serving notes (validated on H100)

MiniMaxAI/MiniMax-M3-MXFP8 serves single-node TP8 on H100 with image vllm/vllm-openai:minimax-m3, --block-size 128 (mandatory for MSA sparse attention) and --language-model-only (text-only). KV cache must stay bf16 — M3's MSA fused kernel rejects fp8 KV.

To come

examples/.../MiniMax/MiniMax-M3-DFlash/ offline recipe (mirroring M2.7, offline path).
M3 dflash_mask_token_id=200061 (M3's 200054 is now a real special token) + YaRN export config.

🤖 Generated with Claude Code

…niMax-M3) Adds a --thinking-modes cycle to server_generate.py so synthetic conversations are generated across a mix of thinking modes — e.g. MiniMax-M3's enabled/disabled/adaptive, passed via chat_template_kwargs — so a DFlash/EAGLE draft trained on the data generalizes across modes. Conversation i uses modes[i % len(modes)] for an even split; the mode is recorded on each output record. Empty (default) sends no thinking_mode, unchanged for models without it. distributed_generate/worker.sh: pass THINKING_MODES through to server_generate.py, and add VLLM_SERVE_EXTRA_ARGS / SGLANG_SERVE_EXTRA_ARGS passthroughs for model-specific serve flags (M3 needs --block-size 128 for MSA sparse attention and --language-model-only for text-only synthesis; KV cache stays bf16 — M3's MSA fused kernel rejects fp8 KV). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Ye Yu <yeyu@nvidia.com>

copy-pr-bot · 2026-06-16T18:04:20Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

coderabbitai · 2026-06-16T18:04:51Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 406af45f-c98a-4e64-8674-f10cd7bb2f49

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch yeyu/dflash-minimax-m3

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

codecov · 2026-06-16T18:20:43Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 75.75%. Comparing base (e6790ef) to head (0aa2880).
⚠️ Report is 24 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1749      +/-   ##
==========================================
- Coverage   77.12%   75.75%   -1.37%     
==========================================
  Files         511      511              
  Lines       56273    58061    +1788     
==========================================
+ Hits        43399    43985     +586     
- Misses      12874    14076    +1202

Flag	Coverage Δ
unit	`54.39% <ø> (-0.01%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

…tput While wiring MiniMax-M3 synthesis (prompt set Speculative-Decoding-Dataset-v2 is OAI-format): - Accept both 'conversations' (ShareGPT) and 'messages' (OAI) prompt datasets on input (previously KeyError'd on 'messages'). - Add --output-format {oai,sharegpt} (default oai): emit the OpenAI standard {'messages': [{role, content}, ...]} instead of the legacy {'conversations': [...]}. Pass --output-format sharegpt for the old key. Validated end-to-end on MiniMax-M3-MXFP8 (single-node TP8 H100): the 3-way thinking-mode mix renders correctly (disabled -> direct answer; adaptive -> <mm:think> reasoning captured in content since no reasoning-parser), OAI in/out flows into the vLLM hidden-state dump. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Ye Yu <yeyu@nvidia.com>

…747) Validated extract_hidden_states on MiniMax-M3-MXFP8 (single-node TP8 H100). Required M3 enablement in compute_hidden_states_vllm.py: - --block-size (M3's MSA sparse attention mandates 128; default None elsewhere). - --enforce-eager: M3's MSA Triton kernel (_gqa_sparse_fwd_kernel) JIT-recompiles per input shape; under cudagraph capture a recompile blows the executor RPC timeout and hangs the engine (sample_tokens timeout). Eager mode + a long VLLM_RPC_TIMEOUT fixes it. - --language-model-only: skip the vision encoder for text-only dumps (M3 is VL). - Read num_hidden_layers from text_config/llm_config for wrapped VL configs (MiniMaxM3VLConfig nests it; previously raised 'no num_hidden_layers attribute'). Output verified: per-conv .pt with input_ids / hidden_states (T,6144) / aux_hidden_states / loss_mask (length-matched) / conversation_id. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Ye Yu <yeyu@nvidia.com>

…e (OMNIML-4747) 2-step offline DFlash recipe for MiniMax-M3 (427B VL-MoE), mirroring MiniMax-M2.7-DFlash: - hf_offline_dflash.yaml: dump (vLLM extract_hidden_states, MXFP8 single-node TP8) + train (FakeBaseModel on bf16). M3-specific: --block-size 128 (MSA), --language-model-only, --enforce-eager + VLLM_RPC_TIMEOUT=1800000 (avoid MSA Triton-kernel JIT RPC-timeout hang), seq-len 8192 end-to-end, mask token 200061 (200054 is a real special token in M3), OVERRIDE_TRANSFORMERS 4.52.4, export-YaRN original_max_position 8192 / factor 24 (tunable; 128 for full 1M). - chat_template_train.jinja: M3 chat template with {% generation %} wrapping the assistant turn (think + content + tool_calls) for answer_only_loss; header + eos sit outside the span, matching the M2.7 convention. Thinking-mode handling preserved verbatim. Validated: generation spans cover exactly the assistant outputs across multi-turn + no-think turns. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Ye Yu <yeyu@nvidia.com>

…ctor 128) original_max_position_embeddings=8192 (training seq-len) x factor 128 = 1048576 = M3's full 1M context. Export-time tunable (factor 24 -> 196608 for the M2.7-equivalent target). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Ye Yu <yeyu@nvidia.com>

yeyu-nvidia and others added 4 commits June 16, 2026 11:51

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DFlash for MiniMax-M3 (WIP): synthesis thinking-mode mix#1749

DFlash for MiniMax-M3 (WIP): synthesis thinking-mode mix#1749
yeyu-nvidia wants to merge 5 commits into
mainfrom
yeyu/dflash-minimax-m3

yeyu-nvidia commented Jun 16, 2026

Uh oh!

copy-pr-bot Bot commented Jun 16, 2026

Uh oh!

coderabbitai Bot commented Jun 16, 2026 •

edited

Loading

Review skipped

Uh oh!

codecov Bot commented Jun 16, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

yeyu-nvidia commented Jun 16, 2026

This PR so far

M3 serving notes (validated on H100)

To come

Uh oh!

copy-pr-bot Bot commented Jun 16, 2026

Uh oh!

coderabbitai Bot commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Uh oh!

codecov Bot commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

coderabbitai Bot commented Jun 16, 2026 •

edited

Loading

codecov Bot commented Jun 16, 2026 •

edited

Loading