Wire MtpDecoder into RunCommand (CLI) decode loop + thinking-mode coexistence #32

@pekkah

Description

Background

InferenceEngine.GenerateChunksAsync routes greedy non-thinking decode through MtpDecoder (commit a28bac4, #25). The CLI (src/SharpInference.Cli/RunCommand.cs) uses its OWN decode loop and never calls InferenceEngine, so MTP is silently bypassed when using sharpi-cli directly against an MTP-enabled model.

Today only the server (SharpInference.Server/Program.cs) goes through InferenceEngine and exercises MTP.

Two intertwined gaps to close

(1) CLI integration

Either:

  • (a) Route RunCommand.DecodeLoop through MtpDecoder when _fwd.HasMtpHead && temperature == 0. Adds branching inside DecodeLoop similar to what landed in InferenceEngine.
  • (b) Replace RunCommand.DecodeLoop with InferenceEngine.GenerateChunksAsync so both surfaces share the same decode plumbing (long-term cleaner).

Approach (a) is the shorter path; approach (b) consolidates feature support (prefix cache, snapshot capture, MTP) into one code path.
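The branching in approach (a) might look roughly like the sketch below. It is a language-neutral illustration in Python, not the project's C# code: `ForwardPass`, `baseline_step`, `mtp_step`, and `decode_loop` are hypothetical stand-ins, and only the `HasMtpHead && temperature == 0` condition is taken from the description above.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ForwardPass:
    """Hypothetical stand-in for the CLI's `_fwd` object."""
    has_mtp_head: bool


def use_mtp(fwd: ForwardPass, temperature: float) -> bool:
    # Condition from approach (a): MTP head present and greedy decoding.
    return fwd.has_mtp_head and temperature == 0


def decode_loop(
    fwd: ForwardPass,
    temperature: float,
    max_tokens: int,
    baseline_step: Callable[[List[int]], List[int]],
    mtp_step: Callable[[List[int]], List[int]],
    eos_id: int,
) -> List[int]:
    """Pick the step function once, then run one shared loop.

    Each step function receives the tokens generated so far and returns the
    tokens accepted in that iteration: one for the baseline path, possibly
    more for the MTP path.
    """
    step = mtp_step if use_mtp(fwd, temperature) else baseline_step
    out: List[int] = []
    while len(out) < max_tokens:
        accepted = step(out)
        if not accepted:
            break  # defensive: a step that yields nothing would loop forever
        for tok in accepted:
            out.append(tok)
            if tok == eos_id or len(out) >= max_tokens:
                return out
    return out
```

Selecting the step function up front keeps the stop conditions (EOS, token budget) in one place, so a non-MTP model follows exactly the same loop as before.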

(2) Thinking-mode + MTP coexistence

The current gate in InferenceEngine disables MTP when thinkingEnabled is set, i.e. when the model exposes <think>/</think> tokens (true for the Qwen3.6 family). The attempted workaround, --no-thinking on the CLI, does not help: it sets enable_thinking=false in the chat template, so the model doesn't emit think tokens, but the engine's thinkingEnabled flag is still set and MTP stays gated off.
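To make the mismatch concrete, here is a small Python model of the gate as this issue describes it. The function names and the vocabulary-as-set representation are assumptions for illustration; the real check lives in InferenceEngine and may be structured differently.

```python
def thinking_enabled(vocab: set) -> bool:
    # Per the description: the flag reflects whether the model exposes
    # think tokens at all, not whether this request will use them.
    return "<think>" in vocab and "</think>" in vocab


def chat_template_options(no_thinking: bool) -> dict:
    # --no-thinking only changes the chat template input.
    return {"enable_thinking": not no_thinking}


def mtp_gate(has_mtp_head: bool, temperature: float, vocab: set) -> bool:
    # Greedy, non-thinking decode only. The chat template options are not
    # consulted, which is why --no-thinking does not re-enable MTP.
    return has_mtp_head and temperature == 0 and not thinking_enabled(vocab)
```

In this model, `mtp_gate` returns the same value whether or not `--no-thinking` was passed, because nothing in `chat_template_options` feeds into it.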

To enable MTP with thinking-capable models:

  • Thread thinkId / endThinkId through MtpDecoder.Decode as additional state, OR
  • Have the engine's emit callback wrap chunks with state-flipping logic. The current InferenceEngine MTP path already uses a callback; extend it to detect think boundaries before writing to the channel.
  • In either case, decide what to do about MaxThinkingTokens enforcement: forcing </think> mid-decode interacts oddly with MTP's two-token-per-iteration cadence. Simplest v1: skip MaxThinkingTokens for MTP and document it as a limitation.
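The callback-wrapping option could be sketched as below. This is a Python illustration under assumptions, not the engine's API: `ThinkBoundaryEmitter`, the `sink(kind, tokens)` signature, and the token ids in the usage note are all hypothetical. The point it shows is that a think boundary can fall on either token of an MTP iteration, so the wrapper has to split each batch at the boundary instead of classifying the batch as a whole.

```python
from typing import Callable, Iterable, List


class ThinkBoundaryEmitter:
    """Wraps an emit callback and tracks whether decode is inside a think span."""

    def __init__(
        self,
        think_id: int,
        end_think_id: int,
        sink: Callable[[str, List[int]], None],
        in_thinking: bool = False,
    ) -> None:
        self.think_id = think_id
        self.end_think_id = end_think_id
        self.sink = sink  # called as sink("thinking" | "content", tokens)
        self.in_thinking = in_thinking

    def emit(self, tokens: Iterable[int]) -> None:
        """Accept all tokens from one decode iteration (MTP may deliver two)."""
        run: List[int] = []
        for tok in tokens:
            if tok == self.think_id or tok == self.end_think_id:
                # Flush what came before the boundary under the old state,
                # then flip. The boundary token itself is not forwarded.
                self._flush(run)
                run = []
                self.in_thinking = tok == self.think_id
            else:
                run.append(tok)
        self._flush(run)

    def _flush(self, run: List[int]) -> None:
        if run:
            kind = "thinking" if self.in_thinking else "content"
            self.sink(kind, list(run))
```

For example, with hypothetical ids `think_id=100` and `end_think_id=101`, calling `emit([2, 101])` forwards token 2 as thinking and leaves the emitter in content mode for the next iteration. This sketch does not count thinking tokens, in line with the simplest-v1 suggestion of skipping MaxThinkingTokens on the MTP path.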

Acceptance criteria

  • sharpi-cli ... -m models/Qwen3.6-27B-MTP-Q4_K_M.gguf --temp 0 -p "..." (CPU) engages MTP, verifiable via SHARPI_TRACE_MTP=1.
  • The same command, both with --no-thinking and with the default (thinking on), produces correct output through MTP on the same model.
  • No regression on non-MTP models (they still use the baseline decode loop).
  • Existing CLI tests (if any cover decode flow) still pass.

Out of scope

Related
