Skip to content

DFlash for MiniMax-M3 (WIP): synthesis thinking-mode mix#1749

Draft
yeyu-nvidia wants to merge 5 commits into
mainfrom
yeyu/dflash-minimax-m3
Draft

DFlash for MiniMax-M3 (WIP): synthesis thinking-mode mix#1749
yeyu-nvidia wants to merge 5 commits into
mainfrom
yeyu/dflash-minimax-m3

Conversation

@yeyu-nvidia

Copy link
Copy Markdown
Contributor

Tracking PR for DFlash on MiniMax-M3, reusing the M2.7 DFlash work merged in #1621. Opening early with the first piece — the data-synthesis tooling — more to follow (offline recipe folder, mask-token / YaRN config for M3).

This PR so far

Per-conversation thinking-mode mix in data synthesis (server_generate.py + distributed_generate/worker.sh):

  • M3 has three thinking modes (enabled / disabled / adaptive). To train a DFlash draft that generalizes across all of them, synthetic conversations should span the mix.
  • server_generate.py gains --thinking-modes enabled,disabled,adaptive: conversation i uses modes[i % len(modes)] (even split), passed via chat_template_kwargs, and the chosen mode is recorded on each output record. Default empty → unchanged for models without thinking modes.
  • We intentionally do not enable the server's --reasoning-parser for synthesis, so the full <mm:think>+answer lands in content (the draft must learn to draft the entire generated sequence).
  • worker.sh threads THINKING_MODES through and adds VLLM_SERVE_EXTRA_ARGS / SGLANG_SERVE_EXTRA_ARGS passthroughs for model-specific serve flags.

M3 serving notes (validated on H100)

MiniMaxAI/MiniMax-M3-MXFP8 serves single-node TP8 on H100 with image vllm/vllm-openai:minimax-m3, --block-size 128 (mandatory for MSA sparse attention) and --language-model-only (text-only). KV cache must stay bf16 — M3's MSA fused kernel rejects fp8 KV.

To come

  • examples/.../MiniMax/MiniMax-M3-DFlash/ offline recipe (mirroring M2.7, offline path).
  • M3 dflash_mask_token_id=200061 (M3's 200054 is now a real special token) + YaRN export config.

🤖 Generated with Claude Code

…niMax-M3)

Adds a --thinking-modes cycle to server_generate.py so synthetic conversations are
generated across a mix of thinking modes — e.g. MiniMax-M3's enabled/disabled/adaptive,
passed via chat_template_kwargs — so a DFlash/EAGLE draft trained on the data generalizes
across modes. Conversation i uses modes[i % len(modes)] for an even split; the mode is
recorded on each output record. Empty (default) sends no thinking_mode, unchanged for
models without it.

distributed_generate/worker.sh: pass THINKING_MODES through to server_generate.py, and add
VLLM_SERVE_EXTRA_ARGS / SGLANG_SERVE_EXTRA_ARGS passthroughs for model-specific serve flags
(M3 needs --block-size 128 for MSA sparse attention and --language-model-only for text-only
synthesis; KV cache stays bf16 — M3's MSA fused kernel rejects fp8 KV).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Ye Yu <yeyu@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented Jun 16, 2026

Copy link
Copy Markdown

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 406af45f-c98a-4e64-8674-f10cd7bb2f49

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch yeyu/dflash-minimax-m3

Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov

codecov Bot commented Jun 16, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 75.75%. Comparing base (e6790ef) to head (0aa2880).
⚠️ Report is 24 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1749      +/-   ##
==========================================
- Coverage   77.12%   75.75%   -1.37%     
==========================================
  Files         511      511              
  Lines       56273    58061    +1788     
==========================================
+ Hits        43399    43985     +586     
- Misses      12874    14076    +1202     
Flag Coverage Δ
unit 54.39% <ø> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

yeyu-nvidia and others added 4 commits June 16, 2026 11:51
…tput

While wiring MiniMax-M3 synthesis (prompt set Speculative-Decoding-Dataset-v2 is OAI-format):
- Accept both 'conversations' (ShareGPT) and 'messages' (OAI) prompt datasets on input
  (previously KeyError'd on 'messages').
- Add --output-format {oai,sharegpt} (default oai): emit the OpenAI standard
  {'messages': [{role, content}, ...]} instead of the legacy {'conversations': [...]}.
  Pass --output-format sharegpt for the old key.

Validated end-to-end on MiniMax-M3-MXFP8 (single-node TP8 H100): the 3-way thinking-mode
mix renders correctly (disabled -> direct answer; adaptive -> <mm:think> reasoning captured
in content since no reasoning-parser), OAI in/out flows into the vLLM hidden-state dump.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Ye Yu <yeyu@nvidia.com>
…747)

Validated extract_hidden_states on MiniMax-M3-MXFP8 (single-node TP8 H100). Required M3
enablement in compute_hidden_states_vllm.py:
- --block-size (M3's MSA sparse attention mandates 128; default None elsewhere).
- --enforce-eager: M3's MSA Triton kernel (_gqa_sparse_fwd_kernel) JIT-recompiles per
  input shape; under cudagraph capture a recompile blows the executor RPC timeout and
  hangs the engine (sample_tokens timeout). Eager mode + a long VLLM_RPC_TIMEOUT fixes it.
- --language-model-only: skip the vision encoder for text-only dumps (M3 is VL).
- Read num_hidden_layers from text_config/llm_config for wrapped VL configs
  (MiniMaxM3VLConfig nests it; previously raised 'no num_hidden_layers attribute').

Output verified: per-conv .pt with input_ids / hidden_states (T,6144) / aux_hidden_states /
loss_mask (length-matched) / conversation_id.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Ye Yu <yeyu@nvidia.com>
…e (OMNIML-4747)

2-step offline DFlash recipe for MiniMax-M3 (427B VL-MoE), mirroring MiniMax-M2.7-DFlash:
- hf_offline_dflash.yaml: dump (vLLM extract_hidden_states, MXFP8 single-node TP8) + train
  (FakeBaseModel on bf16). M3-specific: --block-size 128 (MSA), --language-model-only,
  --enforce-eager + VLLM_RPC_TIMEOUT=1800000 (avoid MSA Triton-kernel JIT RPC-timeout hang),
  seq-len 8192 end-to-end, mask token 200061 (200054 is a real special token in M3),
  OVERRIDE_TRANSFORMERS 4.52.4, export-YaRN original_max_position 8192 / factor 24
  (tunable; 128 for full 1M).
- chat_template_train.jinja: M3 chat template with {% generation %} wrapping the assistant
  turn (think + content + tool_calls) for answer_only_loss; header + eos sit outside the
  span, matching the M2.7 convention. Thinking-mode handling preserved verbatim. Validated:
  generation spans cover exactly the assistant outputs across multi-turn + no-think turns.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Ye Yu <yeyu@nvidia.com>
…ctor 128)

original_max_position_embeddings=8192 (training seq-len) x factor 128 = 1048576 = M3's
full 1M context. Export-time tunable (factor 24 -> 196608 for the M2.7-equivalent target).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Ye Yu <yeyu@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant