MTP greedy parity vs llama.cpp --spec-type draft-mtp #31

## Background

The CPU MTP foundation (#25, commit a28bac4) ships a smoke test that asserts the MTP head logits are finite and non-degenerate, but it does NOT verify that the emitted tokens match llama.cpp's `--spec-type draft-mtp --spec-draft-n-max 2` output. Without parity, we don't know whether the MTP head is *correctly* wired (concat order, eh_proj orientation, partial RoPE on the MTP attention block, etc.).

## Scope

1. **Capture reference output** from `llama-cli` (or `llama-server` greedy) on the Qwen3.6-27B-MTP file with `--spec-type draft-mtp --spec-draft-n-max 2`, greedy, fixed seed, fixed prompt, >=60 tokens.

2. **Capture our output** from `sharpi-cli` once CLI MTP routing lands (depends on #(cli-routing-issue)), or from a test that drives `InferenceEngine.GenerateAsync` directly, using the same model, prompt, and greedy settings. Confirm that `SHARPI_TRACE_MTP` shows MTP was actually used.

3. **Compare token-for-token**. Expect:
   - **Bit-identical** when MTP is disabled (`SHARPI_DISABLE_MTP=1`) vs llama.cpp with `--spec-draft-n-max 0`.
   - **Bit-identical or within Q4_K_M roundoff** when MTP is enabled vs llama.cpp with `--spec-draft-n-max 2`. Both sides apply greedy correction, so divergences in the MTP draft path don't affect the emitted tokens.

4. **If mismatched**, bisect using `SHARPI_TRACE_LAYERS=1` and llama.cpp's eval-callback dump. The likely culprits are documented in `docs/qwen35moe-plan.md`, Phase 5 (concat order, eh_proj orientation, partial RoPE on the MTP attention block).
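
The token-for-token comparison in step 3 can be sketched as below. This is only an illustration in Python, not the project's test code; the function name `first_divergence` and the idea of passing token IDs as plain integer lists are assumptions made for the example. Reporting the first differing index gives a starting position for the bisect in step 4.

```python
from typing import Optional, Sequence


def first_divergence(ref: Sequence[int], ours: Sequence[int]) -> Optional[int]:
    """Return the index of the first differing token, or None if the runs match.

    A length difference counts as a divergence at the end of the shorter run.
    (Hypothetical helper: the name and the list-of-ints input are assumptions.)
    """
    for i, (r, o) in enumerate(zip(ref, ours)):
        if r != o:
            return i
    if len(ref) != len(ours):
        return min(len(ref), len(ours))
    return None
```

A `None` result means the two runs are identical; any integer result is the token position to inspect in the layer traces.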

## Acceptance criteria

- [ ] Reference dump from llama.cpp captured + checked in under `tests/fixtures/mtp_parity_27b.txt` (or similar).
- [ ] New test in `Tests.ForwardPass` or a new `Tests.Mtp` project: `MtpDecoder_GreedyParity_LlamaCpp` reads the fixture and asserts >= 60 byte-identical tokens.
- [ ] Test silently skips when the 27B-MTP file isn't on disk (mirror existing `HybridGdnForwardPassTests` pattern).
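
A rough sketch of what the fixture-driven check could look like, written in Python purely for illustration (the real test would live in `Tests.ForwardPass` or `Tests.Mtp`). The fixture layout (whitespace-separated token IDs, with `#` comment lines) and the helper names `parse_fixture`, `check_parity`, and `require_model` are assumptions, not an existing format or API.

```python
import os
import unittest
from typing import List

MIN_TOKENS = 60  # threshold taken from the acceptance criteria


def parse_fixture(text: str) -> List[int]:
    """Parse whitespace-separated token IDs; '#' lines are comments (assumed format)."""
    ids: List[int] = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        ids.extend(int(tok) for tok in line.split())
    return ids


def check_parity(ref: List[int], ours: List[int], min_tokens: int = MIN_TOKENS) -> None:
    """Raise AssertionError unless the first min_tokens tokens are identical."""
    if len(ref) < min_tokens:
        raise ValueError(f"fixture has only {len(ref)} tokens, need {min_tokens}")
    if ours[:min_tokens] != ref[:min_tokens]:
        raise AssertionError(f"first {min_tokens} tokens differ from the reference")


def require_model(path: str) -> None:
    """Skip (rather than fail) when the model file is not on disk."""
    if not os.path.isfile(path):
        raise unittest.SkipTest(f"model not found: {path}")
```

Checking the fixture length separately from the comparison keeps a truncated reference dump from passing as a short but "matching" run.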

## Out of scope

- Speedup measurement: separate issue.
- Multimodal / vision parity: the fixture is text-only.

## Related

- Parent: #25
- Depends on: #29 (CUDA hybrid MTP) only for parity of the GPU path; CPU parity is testable today.
