Background
The hybrid GDN forward passes allocate the KV cache fully upfront at `_maxSeqLen` (`CudaHybridGdnForwardPass.cs:534-535`) for the attention layers:
```csharp
_gpuKCache[i] = gpu.Allocate(TensorShape.D1((long)_maxSeqLen * kvDim)); // F32
_gpuVCache[i] = gpu.Allocate(TensorShape.D1((long)_maxSeqLen * kvDim)); // F32
```
At default ctx=4096 with the qwen35 layout (16 attention layers, 4 KV heads × 256 head dim), that's `16 × 4096 × 1024 × 4 B × 2` = 512 MiB of VRAM, all in F32.
Why this matters now
After the `exact: true` weight upload patch (under #25), the qwen35 27B-MTP path on a 12 GB RTX 4070 Ti has 21 of 64 FFN layers on GPU and decodes at 6.3 t/s. The next FFN layer would need ~210 MiB (165 MiB tensor + 50 MiB allocator overhead). Reclaiming 256 MiB by halving the KV cache to Bf16 would fit *1–2 more FFN layers* on GPU.
Expected decode impact: each additional FFN layer on GPU removes ~1.5 % of CPU FFN work. 2 layers = ~3 % decode speedup → ~6.5 t/s on the 27B. Modest, but cumulative with other optimizations.
Same calculation applies to qwen35moe (10 attention layers × the same headDim) — smaller absolute saving (~320 MiB) but same proportional unlock.
Scope
- Add a `KvCacheDType` field to `CudaHybridGdnForwardPass` (default Bf16 for hybrid GDN, F32 for legacy compat).
- Change `_gpuKCache[i]` / `_gpuVCache[i]` allocations to Bf16 when configured.
- Update KV append / attention kernels to accept Bf16 KV reads. CudaBackend already supports Bf16 GEMM (`SgemmPrecision`); the attention kernel may need a small variant.
- Validate parity vs F32 KV on a logit-checked test (existing test pattern in `Tests.ForwardPass`).
- Env var `SHARPI_KV_DTYPE=fp32|bf16` for bisecting if Bf16 turns out lossy for long context.
Out of scope
- Dense-only paths (`CudaForwardPass`, `CudaHybridForwardPass`) — separate measurement needed; perf might not justify there since they have less KV-cache pressure.
- INT8 or 4-bit KV (TurboQuant already covers that direction for non-GDN models).
Background
The hybrid GDN forward passes allocate the KV cache fully upfront at `_maxSeqLen` (`CudaHybridGdnForwardPass.cs:534-535`) for the attention layers:
```csharp
_gpuKCache[i] = gpu.Allocate(TensorShape.D1((long)_maxSeqLen * kvDim)); // F32
_gpuVCache[i] = gpu.Allocate(TensorShape.D1((long)_maxSeqLen * kvDim)); // F32
```
At default ctx=4096 with the qwen35 layout (16 attention layers, 4 KV heads × 256 head dim), that's `16 × 4096 × 1024 × 4 B × 2` = 512 MiB of VRAM, all in F32.
Why this matters now
After the `exact: true` weight upload patch (under #25), the qwen35 27B-MTP path on a 12 GB RTX 4070 Ti has 21 of 64 FFN layers on GPU and decodes at 6.3 t/s. The next FFN layer would need ~210 MiB (165 MiB tensor + 50 MiB allocator overhead). Reclaiming 256 MiB by halving the KV cache to Bf16 would fit *1–2 more FFN layers* on GPU.
Expected decode impact: each additional FFN layer on GPU removes ~1.5 % of CPU FFN work. 2 layers = ~3 % decode speedup → ~6.5 t/s on the 27B. Modest, but cumulative with other optimizations.
Same calculation applies to qwen35moe (10 attention layers × the same headDim) — smaller absolute saving (~320 MiB) but same proportional unlock.
Scope
Out of scope