DFlash for MiniMax-M3 (WIP): synthesis thinking-mode mix#1749
DFlash for MiniMax-M3 (WIP): synthesis thinking-mode mix#1749yeyu-nvidia wants to merge 5 commits into
Conversation
…niMax-M3) Adds a --thinking-modes cycle to server_generate.py so synthetic conversations are generated across a mix of thinking modes — e.g. MiniMax-M3's enabled/disabled/adaptive, passed via chat_template_kwargs — so a DFlash/EAGLE draft trained on the data generalizes across modes. Conversation i uses modes[i % len(modes)] for an even split; the mode is recorded on each output record. Empty (default) sends no thinking_mode, unchanged for models without it. distributed_generate/worker.sh: pass THINKING_MODES through to server_generate.py, and add VLLM_SERVE_EXTRA_ARGS / SGLANG_SERVE_EXTRA_ARGS passthroughs for model-specific serve flags (M3 needs --block-size 128 for MSA sparse attention and --language-model-only for text-only synthesis; KV cache stays bf16 — M3's MSA fused kernel rejects fp8 KV). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Ye Yu <yeyu@nvidia.com>
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #1749 +/- ##
==========================================
- Coverage 77.12% 75.75% -1.37%
==========================================
Files 511 511
Lines 56273 58061 +1788
==========================================
+ Hits 43399 43985 +586
- Misses 12874 14076 +1202
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
…tput
While wiring MiniMax-M3 synthesis (prompt set Speculative-Decoding-Dataset-v2 is OAI-format):
- Accept both 'conversations' (ShareGPT) and 'messages' (OAI) prompt datasets on input
(previously KeyError'd on 'messages').
- Add --output-format {oai,sharegpt} (default oai): emit the OpenAI standard
{'messages': [{role, content}, ...]} instead of the legacy {'conversations': [...]}.
Pass --output-format sharegpt for the old key.
Validated end-to-end on MiniMax-M3-MXFP8 (single-node TP8 H100): the 3-way thinking-mode
mix renders correctly (disabled -> direct answer; adaptive -> <mm:think> reasoning captured
in content since no reasoning-parser), OAI in/out flows into the vLLM hidden-state dump.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Ye Yu <yeyu@nvidia.com>
…747) Validated extract_hidden_states on MiniMax-M3-MXFP8 (single-node TP8 H100). Required M3 enablement in compute_hidden_states_vllm.py: - --block-size (M3's MSA sparse attention mandates 128; default None elsewhere). - --enforce-eager: M3's MSA Triton kernel (_gqa_sparse_fwd_kernel) JIT-recompiles per input shape; under cudagraph capture a recompile blows the executor RPC timeout and hangs the engine (sample_tokens timeout). Eager mode + a long VLLM_RPC_TIMEOUT fixes it. - --language-model-only: skip the vision encoder for text-only dumps (M3 is VL). - Read num_hidden_layers from text_config/llm_config for wrapped VL configs (MiniMaxM3VLConfig nests it; previously raised 'no num_hidden_layers attribute'). Output verified: per-conv .pt with input_ids / hidden_states (T,6144) / aux_hidden_states / loss_mask (length-matched) / conversation_id. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Ye Yu <yeyu@nvidia.com>
…e (OMNIML-4747)
2-step offline DFlash recipe for MiniMax-M3 (427B VL-MoE), mirroring MiniMax-M2.7-DFlash:
- hf_offline_dflash.yaml: dump (vLLM extract_hidden_states, MXFP8 single-node TP8) + train
(FakeBaseModel on bf16). M3-specific: --block-size 128 (MSA), --language-model-only,
--enforce-eager + VLLM_RPC_TIMEOUT=1800000 (avoid MSA Triton-kernel JIT RPC-timeout hang),
seq-len 8192 end-to-end, mask token 200061 (200054 is a real special token in M3),
OVERRIDE_TRANSFORMERS 4.52.4, export-YaRN original_max_position 8192 / factor 24
(tunable; 128 for full 1M).
- chat_template_train.jinja: M3 chat template with {% generation %} wrapping the assistant
turn (think + content + tool_calls) for answer_only_loss; header + eos sit outside the
span, matching the M2.7 convention. Thinking-mode handling preserved verbatim. Validated:
generation spans cover exactly the assistant outputs across multi-turn + no-think turns.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Ye Yu <yeyu@nvidia.com>
…ctor 128) original_max_position_embeddings=8192 (training seq-len) x factor 128 = 1048576 = M3's full 1M context. Export-time tunable (factor 24 -> 196608 for the M2.7-equivalent target). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Ye Yu <yeyu@nvidia.com>
Tracking PR for DFlash on MiniMax-M3, reusing the M2.7 DFlash work merged in #1621. Opening early with the first piece — the data-synthesis tooling — more to follow (offline recipe folder, mask-token / YaRN config for M3).
This PR so far
Per-conversation thinking-mode mix in data synthesis (
server_generate.py+distributed_generate/worker.sh):enabled/disabled/adaptive). To train a DFlash draft that generalizes across all of them, synthetic conversations should span the mix.server_generate.pygains--thinking-modes enabled,disabled,adaptive: conversationiusesmodes[i % len(modes)](even split), passed viachat_template_kwargs, and the chosen mode is recorded on each output record. Default empty → unchanged for models without thinking modes.--reasoning-parserfor synthesis, so the full<mm:think>+answer lands incontent(the draft must learn to draft the entire generated sequence).worker.shthreadsTHINKING_MODESthrough and addsVLLM_SERVE_EXTRA_ARGS/SGLANG_SERVE_EXTRA_ARGSpassthroughs for model-specific serve flags.M3 serving notes (validated on H100)
MiniMaxAI/MiniMax-M3-MXFP8serves single-node TP8 on H100 with imagevllm/vllm-openai:minimax-m3,--block-size 128(mandatory for MSA sparse attention) and--language-model-only(text-only). KV cache must stay bf16 — M3's MSA fused kernel rejects fp8 KV.To come
examples/.../MiniMax/MiniMax-M3-DFlash/offline recipe (mirroring M2.7, offline path).dflash_mask_token_id=200061(M3's 200054 is now a real special token) + YaRN export config.🤖 Generated with Claude Code