Bf16 KV cache for hybrid GDN models (qwen35, qwen35moe) #27

## Background

The hybrid GDN forward passes allocate the KV cache for the attention layers fully upfront at `_maxSeqLen` (`CudaHybridGdnForwardPass.cs:534-535`):

```csharp
_gpuKCache[i] = gpu.Allocate(TensorShape.D1((long)_maxSeqLen * kvDim));  // F32
_gpuVCache[i] = gpu.Allocate(TensorShape.D1((long)_maxSeqLen * kvDim));  // F32
```

At the default ctx=4096 with the qwen35 layout (16 attention layers, 4 KV heads × 256 head dim, so kvDim = 1024), that's `16 × 4096 × 1024 × 4 B × 2` = **512 MiB** of VRAM, all in F32. The final `× 2` covers K and V.
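The arithmetic above can be checked with a small standalone sketch. The helper name `KvCacheBytes` and its parameters are illustrative and do not come from the codebase; only the layer count, head layout, and context length are taken from the text.

```csharp
using System;

// Bytes needed for the K and V caches across all attention layers.
// Hypothetical helper: names are for illustration only.
static long KvCacheBytes(int attnLayers, int maxSeqLen, int kvHeads, int headDim, int bytesPerElement)
{
    long kvDim = (long)kvHeads * headDim;                       // 4 × 256 = 1024 for qwen35
    long perTensor = (long)maxSeqLen * kvDim * bytesPerElement; // one K or V buffer for one layer
    return attnLayers * perTensor * 2;                          // × 2 for K and V
}

const long MiB = 1024 * 1024;
long f32Bytes = KvCacheBytes(16, 4096, 4, 256, 4);   // F32: 4 bytes per element
long bf16Bytes = KvCacheBytes(16, 4096, 4, 256, 2);  // Bf16: 2 bytes per element

Console.WriteLine($"F32 KV cache:  {f32Bytes / MiB} MiB");
Console.WriteLine($"Bf16 KV cache: {bf16Bytes / MiB} MiB");
Console.WriteLine($"Reclaimed:     {(f32Bytes - bf16Bytes) / MiB} MiB");
```

With these inputs the F32 cache is 512 MiB and the Bf16 cache is 256 MiB, so the switch reclaims 256 MiB at ctx=4096. The saving scales linearly with `_maxSeqLen`.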

## Why this matters now

After the `exact: true` weight upload patch (under #25), the qwen35 27B-MTP path on a 12 GB RTX 4070 Ti has 21 of 64 FFN layers on GPU and decodes at 6.3 t/s. The next FFN layer would need ~210 MiB (165 MiB tensor + 50 MiB allocator overhead). Reclaiming 256 MiB by halving the KV cache to Bf16 would fit **1–2 more FFN layers** on GPU.

Expected decode impact: each additional FFN layer on GPU removes ~1.5 % of CPU FFN work. Two layers would give a ~3 % decode speedup, or ~6.5 t/s on the 27B. Modest, but cumulative with other optimizations.
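As a rough check on these numbers, the sketch below plugs the quoted figures into a simple linear model. The model is an assumption made here for illustration (each extra GPU layer is taken to contribute the same ~1.5 % speedup), and `EstimateTokensPerSec` is a hypothetical helper, not project code.

```csharp
using System;
using System.Globalization;

// Figures quoted in the issue text; the linear model itself is an assumption.
const double baselineTokensPerSec = 6.3;
const double speedupPerLayer = 0.015;  // ~1.5 % per FFN layer moved to GPU
const int reclaimedMiB = 256;          // F32 -> Bf16 KV cache at ctx=4096
const int perLayerMiB = 210;           // tensor plus allocator overhead

static double EstimateTokensPerSec(double baseline, double perLayer, int extraLayers)
    => baseline * (1.0 + perLayer * extraLayers);

int layersFromReclaimOnly = reclaimedMiB / perLayerMiB;      // integer division
int extraFreeNeededForTwo = 2 * perLayerMiB - reclaimedMiB;  // shortfall for a second layer

Console.WriteLine($"Layers from reclaimed KV memory alone: {layersFromReclaimOnly}");
Console.WriteLine($"Extra free VRAM needed for a second layer: {extraFreeNeededForTwo} MiB");
for (int n = 1; n <= 2; n++)
{
    double tps = EstimateTokensPerSec(baselineTokensPerSec, speedupPerLayer, n);
    string formatted = tps.ToString("F2", CultureInfo.InvariantCulture);
    Console.WriteLine($"+{n} layer(s): {formatted} t/s");
}
```

Under these figures the reclaimed 256 MiB covers one layer by itself; a second layer fits only if about 164 MiB of VRAM is already free, which is the difference between the low and high end of the 1–2 layer estimate.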

The same calculation applies to qwen35moe (10 attention layers × the same headDim). Assuming the same kvDim of 1024, the F32 cache there is ~320 MiB, so Bf16 reclaims ~160 MiB: a smaller absolute saving, but the same proportional unlock.

## Scope

1. Add a `KvCacheDType` field to `CudaHybridGdnForwardPass` (default Bf16 for hybrid GDN, F32 for legacy compatibility).
2. Change the `_gpuKCache[i]` / `_gpuVCache[i]` allocations to Bf16 when configured.
3. Update the KV append / attention kernels to accept Bf16 KV reads. CudaBackend already supports Bf16 GEMM (`SgemmPrecision`); the attention kernel may need a small variant.
4. Validate parity against F32 KV with a logit-checked test (existing test pattern in `Tests.ForwardPass`).
5. Add an env var `SHARPI_KV_DTYPE=fp32|bf16` for bisecting in case Bf16 turns out to be lossy at long context.
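To make items 3 to 5 more concrete, here is a minimal CPU-side sketch of the env var selection and of the F32-to-Bf16 rounding the KV append path would have to perform. Both `ParseKvElementBytes` and the two conversion helpers are hypothetical: the real conversion would live in a CUDA kernel, and the fallback to Bf16 when the variable is unset is an assumption based on the proposed default.

```csharp
using System;

// Hypothetical parser for SHARPI_KV_DTYPE; returns bytes per KV element.
// Unset or empty falls back to Bf16 (assumed default for hybrid GDN).
static int ParseKvElementBytes(string value) => value?.Trim().ToLowerInvariant() switch
{
    null or "" or "bf16" => 2,
    "fp32" => 4,
    _ => throw new ArgumentException($"Unsupported SHARPI_KV_DTYPE: '{value}'"),
};

// F32 -> Bf16 with round-to-nearest-even: keep the top 16 bits of the binary32 pattern.
static ushort FloatToBf16(float x)
{
    uint bits = unchecked((uint)BitConverter.SingleToInt32Bits(x));
    if (float.IsNaN(x)) return (ushort)((bits >> 16) | 0x0040u);  // force a quiet NaN
    uint roundingBias = 0x7FFFu + ((bits >> 16) & 1u);            // ties go to the even result
    return (ushort)((bits + roundingBias) >> 16);
}

// Bf16 -> F32 is exact: the 16 stored bits become the high half of the float.
static float Bf16ToFloat(ushort h)
    => BitConverter.Int32BitsToSingle(unchecked((int)((uint)h << 16)));

int elementBytes = ParseKvElementBytes(Environment.GetEnvironmentVariable("SHARPI_KV_DTYPE"));
Console.WriteLine($"KV element size: {elementBytes} bytes");

// Round-trip a few values; for normal floats the relative error stays within 2^-8.
foreach (float v in new[] { 1.0f, 0.1f, 3.14159f, -123.456f })
{
    float roundTripped = Bf16ToFloat(FloatToBf16(v));
    double relErr = Math.Abs(roundTripped - v) / Math.Abs(v);
    Console.WriteLine($"{v} -> {roundTripped} (relative error {relErr:E2})");
}
```

Bf16 keeps the F32 exponent range but only 8 significant bits, so each stored K/V element carries a relative rounding error of up to 2^-8 (about 0.4 %). That per-element bound is why the parity test compares logits against the F32 path rather than expecting bit-exact output.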

## Out of scope

- Dense-only paths (`CudaForwardPass`, `CudaHybridForwardPass`): separate measurement needed; the gain might not justify the change there, since they have less KV-cache pressure.
- INT8 or 4-bit KV (TurboQuant already covers that direction for non-GDN models).
