Wire MtpDecoder into RunCommand (CLI) decode loop + thinking-mode coexistence #32

## Background

`InferenceEngine.GenerateChunksAsync` routes greedy non-thinking decode through `MtpDecoder` (commit a28bac4, #25). The CLI (`src/SharpInference.Cli/RunCommand.cs`) uses its OWN decode loop and never calls `InferenceEngine`, so MTP is silently bypassed when using `sharpi-cli` directly against an MTP-enabled model.

Today only the server (`SharpInference.Server/Program.cs`) goes through `InferenceEngine` and exercises MTP.

## Two intertwined gaps to close

### (1) CLI integration

Either:
- (a) Route `RunCommand.DecodeLoop` through `MtpDecoder` when `_fwd.HasMtpHead && temperature == 0`. Adds branching inside `DecodeLoop` similar to what landed in `InferenceEngine`.
- (b) Replace `RunCommand.DecodeLoop` with `InferenceEngine.GenerateChunksAsync` so both surfaces share the same decode plumbing (long-term cleaner).

Approach (a) is the shorter path; approach (b) consolidates feature support (prefix cache, snapshot capture, MTP) into one code path.
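The branch in option (a) can be modeled in a few lines. This is a language-neutral sketch in Python, not the project's C# code: `select_decode_path`, `DecodePath`, and the parameter names are hypothetical stand-ins for `_fwd.HasMtpHead` and the sampling temperature.

```python
from enum import Enum


class DecodePath(Enum):
    BASELINE = "baseline"
    MTP = "mtp"


def select_decode_path(has_mtp_head: bool, temperature: float) -> DecodePath:
    """Pick the decode path for one generation request.

    Mirrors the condition named in option (a): MTP is used only when the
    model has an MTP head and decoding is greedy (temperature == 0).
    Every other combination falls back to the baseline loop.
    """
    if has_mtp_head and temperature == 0:
        return DecodePath.MTP
    return DecodePath.BASELINE
```

Under option (b) the same decision would presumably live only inside `InferenceEngine`, so the CLI would not need its own copy of this check.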

### (2) Thinking-mode + MTP coexistence

The current gate in `InferenceEngine` disables MTP when `thinkingEnabled` is set, i.e. when the model exposes `<think>`/`</think>` tokens (true for the Qwen3.6 family). Passing `--no-thinking` on the CLI does not work around this: it sets `enable_thinking=false` in the chat template so the model doesn't emit think tokens, but the engine's `thinkingEnabled` flag is still set, so MTP stays gated off.
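The mismatch described above can be modeled as follows. This is a Python sketch, assuming the gate keys on whether the model's vocabulary contains think tokens rather than on the request's `enable_thinking` setting; the function and parameter names are hypothetical, not the engine's API.

```python
def mtp_allowed_today(
    has_mtp_head: bool,
    temperature: float,
    model_has_think_tokens: bool,
    enable_thinking: bool,
) -> bool:
    """Model of the gate as this issue describes it (assumed shape).

    `enable_thinking` is what `--no-thinking` toggles in the chat template.
    It is accepted here only to show that the gate never consults it.
    """
    # Assumption: the flag is derived from the vocabulary, not the request.
    thinking_enabled = model_has_think_tokens
    return has_mtp_head and temperature == 0 and not thinking_enabled
```

With a thinking-capable model, `mtp_allowed_today(True, 0.0, True, enable_thinking=False)` is still `False`, which is the behavior this section aims to change.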

To enable MTP with thinking-capable models, do one of the following:
- Thread `thinkId` / `endThinkId` through `MtpDecoder.Decode` as additional state.
- Have the engine's emit callback wrap chunks with state-flipping logic. The current `InferenceEngine` MTP path already uses a callback, so it can be extended to detect think boundaries before writing to the channel.

Separately, decide how to handle `MaxThinkingTokens` enforcement: forcing `</think>` mid-decode interacts oddly with MTP's two-tokens-per-iteration cadence. Simplest v1: skip `MaxThinkingTokens` for MTP and document it as a limitation.
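The callback-wrapping option can be sketched as a small state machine. This is a hypothetical Python model, not the project's C# code: `ThinkBoundaryEmitter`, its `emit(token, in_thinking)` signature, and the token IDs are assumptions for illustration. It handles each token accepted in one iteration in order, so a think boundary that falls between the two tokens of a single MTP step is still detected. In line with the "simplest v1" above, it counts thinking tokens but does not enforce `MaxThinkingTokens`.

```python
from typing import Callable, Iterable


class ThinkBoundaryEmitter:
    """Wraps an emit callback and tracks whether output is inside a think block."""

    def __init__(self, think_id: int, end_think_id: int,
                 emit: Callable[[int, bool], None]) -> None:
        self.think_id = think_id
        self.end_think_id = end_think_id
        self.emit = emit
        self.in_thinking = False
        self.thinking_tokens = 0  # counted only; no budget is enforced here

    def on_tokens(self, tokens: Iterable[int]) -> None:
        """Handle every token accepted in one decode iteration, in order."""
        for tok in tokens:
            if tok == self.think_id:
                self.in_thinking = True
            elif tok == self.end_think_id:
                self.in_thinking = False
            else:
                if self.in_thinking:
                    self.thinking_tokens += 1
                # Assumption: marker tokens are swallowed, content is forwarded.
                self.emit(tok, self.in_thinking)


# Usage: three iterations of two tokens each, with made-up token IDs.
out = []
emitter = ThinkBoundaryEmitter(think_id=100, end_think_id=101,
                               emit=lambda tok, thinking: out.append((tok, thinking)))
emitter.on_tokens([100, 7])   # <think> arrives together with the first content token
emitter.on_tokens([8, 101])   # </think> is the second token of the pair
emitter.on_tokens([9, 10])    # ordinary answer tokens
```

Keeping the state in the wrapper leaves `MtpDecoder.Decode` unchanged, at the cost of the decoder not knowing about think boundaries if budget enforcement is added later.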

## Acceptance criteria

- [ ] `sharpi-cli ... -m models/Qwen3.6-27B-MTP-Q4_K_M.gguf --temp 0 -p "..."` (CPU) engages MTP, verifiable via `SHARPI_TRACE_MTP=1`.
- [ ] The same command, both with `--no-thinking` and with the default (thinking on), produces correct output through MTP on the same model.
- [ ] No regression on non-MTP models (they still use the baseline decode loop).
- [ ] Existing CLI tests (if any cover decode flow) still pass.

## Out of scope

- Speedup measurement (depends on #30).
- Non-greedy MTP sampling: `MtpDecoder` is greedy-only, so temperature > 0 will continue to use the baseline path.

## Related

- Parent: #25
