Background
CPU MTP foundation (#25, commit a28bac4) ships a smoke test that asserts finite + non-degenerate MTP head logits, but does NOT verify the actual emitted tokens match llama.cpp's --spec-type draft-mtp --spec-draft-n-max 2 output. Without parity, we don't know if the MTP head is correctly wired (concat order, eh_proj orientation, partial-RoPE on the MTP attn block, etc.).
Scope
- Capture reference output from llama-cli (or llama-server greedy) on the Qwen3.6-27B-MTP file with --spec-type draft-mtp --spec-draft-n-max 2, greedy, fixed seed, fixed prompt, >=60 tokens.
- Capture our output from sharpi-cli (when CLI MTP routing lands, which depends on #(cli-routing-issue), or via a test that drives InferenceEngine.GenerateAsync directly) on the same model + prompt + greedy. Confirm SHARPI_TRACE_MTP shows MTP was actually used.
- Compare token-for-token. Expect:
  - Bit-identical when MTP is disabled (SHARPI_DISABLE_MTP=1) vs llama.cpp with --spec-draft-n-max 0.
  - Bit-identical or within Q4_K_M roundoff when MTP is enabled vs llama.cpp with --spec-draft-n-max 2. Both use greedy correction, so divergences in the MTP draft path don't affect emitted tokens.
- If mismatched, bisect using SHARPI_TRACE_LAYERS=1 and llama.cpp's eval-callback dump. Likely culprits documented in docs/qwen35moe-plan.md Phase 5 (concat order, eh_proj orientation, partial RoPE on MTP attn).
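The reasoning behind the expectations above (greedy correction makes the emitted tokens independent of draft quality) and the token-for-token comparison can be illustrated with a toy sketch. Everything here is hypothetical and for illustration only: `toy_target`, `good_draft`, `bad_draft`, and the helper names are not part of sharpi or llama.cpp, and the models are plain deterministic functions rather than real networks. Real implementations verify drafts in one batched target pass; the per-token loop below is only the logical equivalent.

```python
from typing import Callable, List, Optional, Sequence

# A "model" here is just a function: context tokens -> greedy next token.
Model = Callable[[Sequence[int]], int]


def greedy_decode(target: Model, prompt: Sequence[int], n_tokens: int) -> List[int]:
    """Plain greedy decoding: one target call per emitted token."""
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(target(out))
    return out[len(prompt):]


def speculative_greedy_decode(target: Model, draft: Model, prompt: Sequence[int],
                              n_tokens: int, n_draft: int = 2) -> List[int]:
    """Draft n_draft tokens, then emit only what the target itself would pick."""
    out = list(prompt)
    limit = len(prompt) + n_tokens
    while len(out) < limit:
        # Draft phase: propose n_draft tokens autoregressively.
        proposed: List[int] = []
        for _ in range(n_draft):
            proposed.append(draft(out + proposed))
        # Verify phase: the emitted token is always the target's greedy choice.
        # A wrong draft token only ends the accepted run early.
        for tok in proposed:
            if len(out) >= limit:
                break
            expected = target(out)
            out.append(expected)
            if tok != expected:
                break
    return out[len(prompt):]


def first_mismatch(a: Sequence[int], b: Sequence[int]) -> Optional[int]:
    """Index of the first differing token, or None if the sequences are equal."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


def toy_target(ctx: Sequence[int]) -> int:
    # Hypothetical stand-in for the main model.
    return (sum(ctx) * 31 + len(ctx)) % 97


def good_draft(ctx: Sequence[int]) -> int:
    # A draft head that always agrees with the target.
    return toy_target(ctx)


def bad_draft(ctx: Sequence[int]) -> int:
    # A mis-wired draft head that always disagrees with the target.
    return (toy_target(ctx) + 1) % 97


prompt = [1, 2, 3]
reference = greedy_decode(toy_target, prompt, 60)
with_good = speculative_greedy_decode(toy_target, good_draft, prompt, 60)
with_bad = speculative_greedy_decode(toy_target, bad_draft, prompt, 60)
print(first_mismatch(reference, with_good), first_mismatch(reference, with_bad))
```

In this toy setting even a draft that is always wrong leaves the emitted sequence unchanged, so a token mismatch against llama.cpp would point at the main forward pass or the verification logic rather than at draft accuracy alone. A mis-wired MTP head would instead show up as a low acceptance rate, which is why confirming MTP usage via the trace is a separate check.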
Acceptance criteria
- [...] tests/fixtures/mtp_parity_27b.txt (or similar).
- [...] Tests.ForwardPass or a new Tests.Mtp project: MtpDecoder_GreedyParity_LlamaCpp reads the fixture and asserts >= 60 byte-identical tokens.
- [...] HybridGdnForwardPassTests pattern).
Out of scope
- Speedup measurement: separate issue.
- Multimodal / vision parity: model has text-only fixture.
Related