Background
InferenceEngine.GenerateChunksAsync routes greedy non-thinking decode through MtpDecoder (commit a28bac4, #25). The CLI (src/SharpInference.Cli/RunCommand.cs) uses its OWN decode loop and never calls InferenceEngine, so MTP is silently bypassed when using sharpi-cli directly against an MTP-enabled model.
Today only the server (SharpInference.Server/Program.cs) goes through InferenceEngine and exercises MTP.
Two intertwined gaps to close
(1) CLI integration
Either:
- (a) Route RunCommand.DecodeLoop through MtpDecoder when _fwd.HasMtpHead && temperature == 0. Adds branching inside DecodeLoop similar to what landed in InferenceEngine.
- (b) Replace RunCommand.DecodeLoop with InferenceEngine.GenerateChunksAsync so both surfaces share the same decode plumbing (long-term cleaner).
Approach (a) is the shorter path; approach (b) consolidates feature support (prefix cache, snapshot capture, MTP) into one code path.
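A minimal sketch of the gate that approach (a) would add, assuming it mirrors the engine's current behaviour. DecodePath and DecodePathSelector are hypothetical names used for illustration only; the inputs correspond to _fwd.HasMtpHead, the sampling temperature, and the engine's thinkingEnabled flag described in this issue.

```csharp
using System;

// Hypothetical types for illustration; not existing SharpInference code.
public enum DecodePath { Baseline, Mtp }

public static class DecodePathSelector
{
    // MtpDecoder is greedy-only, so MTP is eligible only when the model has
    // an MTP head and sampling is disabled (temperature == 0).
    public static DecodePath Select(bool hasMtpHead, float temperature, bool thinkingEnabled)
    {
        bool greedy = temperature == 0f;

        // Assumption: keep the same thinking gate the engine applies today,
        // so thinking-capable models fall back to the baseline loop.
        if (hasMtpHead && greedy && !thinkingEnabled)
            return DecodePath.Mtp;

        return DecodePath.Baseline;
    }
}
```

With this shape, DecodeLoop would branch once on the result of Select before entering either loop, and lifting the thinkingEnabled condition later is a one-line change.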
(2) Thinking-mode + MTP coexistence
The current gate in InferenceEngine disables MTP when thinkingEnabled is set (i.e. the model exposes <think>/</think> tokens, which is true for the Qwen3.6 family). The --no-thinking CLI flag is only a partial workaround: it sets enable_thinking=false in the chat template so the model doesn't emit think tokens, but the engine's thinkingEnabled flag is still set, so MTP stays gated off.
To enable MTP with thinking-capable models:
- Thread thinkId / endThinkId through MtpDecoder.Decode as additional state, OR
- Have the engine's emit-callback wrap chunks with state-flipping logic (the current InferenceEngine MTP path already uses a callback; extend it to detect think boundaries before writing to the channel).
- Decide what to do about MaxThinkingTokens enforcement: forcing </think> mid-decode interacts oddly with MTP's two-token-per-iter cadence. Simplest v1: skip MaxThinkingTokens for MTP and document as a limitation.
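The callback-wrapping option could look roughly like the sketch below. ThinkBoundaryTracker and its members are hypothetical names, not existing SharpInference types, and the real emit callback signature may differ. The sketch assumes the callback sees accepted tokens one at a time and in order, so an MTP iteration that accepts two tokens invokes it twice, and it assumes the boundary tokens themselves are not forwarded.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch; names and signatures are illustrative assumptions.
public sealed class ThinkBoundaryTracker
{
    private readonly int _thinkId;
    private readonly int _endThinkId;
    private readonly Action<int, bool> _emit; // (tokenId, isThinking)

    public bool InThinking { get; private set; }
    public int ThinkingTokenCount { get; private set; }

    public ThinkBoundaryTracker(int thinkId, int endThinkId, Action<int, bool> emit)
    {
        _thinkId = thinkId;
        _endThinkId = endThinkId;
        _emit = emit;
    }

    // Call once per accepted token, in decode order.
    public void OnToken(int tokenId)
    {
        // Boundary tokens only flip state; they are not forwarded (assumption).
        if (tokenId == _thinkId) { InThinking = true; return; }
        if (tokenId == _endThinkId) { InThinking = false; return; }

        if (InThinking) ThinkingTokenCount++;
        _emit(tokenId, InThinking);
    }
}
```

Because state flips per token rather than per iteration, a </think> that arrives as the first of two accepted tokens still tags the second token as non-thinking. ThinkingTokenCount is tracked only so a later MaxThinkingTokens policy has something to read; this sketch does not enforce a limit.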
Acceptance criteria
- sharpi-cli ... -m models/Qwen3.6-27B-MTP-Q4_K_M.gguf --temp 0 -p "..." (CPU) engages MTP, verifiable via SHARPI_TRACE_MTP=1.
- Running with --no-thinking AND with default (thinking on) on the same model produces correct output through MTP.
Out of scope
- MtpDecoder is greedy-only; temperature > 0 will continue to use the baseline path.
Related