Bf16 KV cache for hybrid GDN models (qwen35, qwen35moe) #27

@pekkah

Background

The hybrid GDN forward passes allocate the attention layers' KV cache fully upfront, sized for `_maxSeqLen` (`CudaHybridGdnForwardPass.cs:534-535`):

```csharp
_gpuKCache[i] = gpu.Allocate(TensorShape.D1((long)_maxSeqLen * kvDim)); // F32
_gpuVCache[i] = gpu.Allocate(TensorShape.D1((long)_maxSeqLen * kvDim)); // F32
```

At the default ctx=4096 with the qwen35 layout (16 attention layers, 4 KV heads × 256 head dim, so kvDim = 1024), that's `16 × 4096 × 1024 × 4 B × 2` (the final × 2 covers K and V) = 512 MiB of VRAM, all in F32.
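The arithmetic above can be reproduced with a small helper. This is only a sketch: `KvCacheFootprint` and its parameter names are hypothetical and do not exist in the codebase; the layer count, head layout, and context length are the figures quoted above.

```csharp
using System;

// Hypothetical helper (not part of the codebase): reproduces the footprint arithmetic.
public static class KvCacheFootprint
{
    // Total bytes for the K and V caches across all attention layers.
    public static long Bytes(int attnLayers, int maxSeqLen, int kvHeads, int headDim, int bytesPerElement)
    {
        long kvDim = (long)kvHeads * headDim;                        // 4 × 256 = 1024 for qwen35
        long perTensor = (long)maxSeqLen * kvDim * bytesPerElement;  // one K or V tensor for one layer
        return perTensor * attnLayers * 2;                           // × 2 for K and V
    }

    // Converts a byte count to MiB.
    public static double MiB(long bytes) => bytes / (1024.0 * 1024.0);
}
```

With the qwen35 figures, `KvCacheFootprint.Bytes(16, 4096, 4, 256, 4)` gives 536,870,912 bytes (512 MiB) for F32, and the same call with `bytesPerElement = 2` gives 268,435,456 bytes (256 MiB) for Bf16.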

Why this matters now

After the `exact: true` weight upload patch (under #25), the qwen35 27B-MTP path on a 12 GB RTX 4070 Ti has 21 of 64 FFN layers on GPU and decodes at 6.3 t/s. The next FFN layer would need ~215 MiB (165 MiB tensor + 50 MiB allocator overhead). Halving the KV cache to Bf16 reclaims 256 MiB, which would fit *1–2 more FFN layers* on GPU.

Expected decode impact: each additional FFN layer on GPU removes ~1.5% of CPU FFN work, so 2 layers give a ~3% decode speedup, i.e. ~6.5 t/s on the 27B. Modest, but cumulative with other optimizations.
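As a rough illustration of that estimate, the sketch below applies the same linear model (about 1.5% per additional GPU-resident FFN layer). `DecodeEstimate` is a hypothetical name, and the model is the issue's back-of-envelope assumption, not a measurement.

```csharp
// Hypothetical helper: linear speedup model taken from the estimate above.
public static class DecodeEstimate
{
    // baseline: current decode rate in t/s; gainPerLayer: fractional speedup per extra GPU FFN layer.
    public static double TokensPerSecond(double baseline, int extraGpuLayers, double gainPerLayer = 0.015)
        => baseline * (1.0 + extraGpuLayers * gainPerLayer);
}
```

`DecodeEstimate.TokensPerSecond(6.3, 2)` evaluates to roughly 6.49 t/s, consistent with the ~6.5 t/s figure.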

The same calculation applies to qwen35moe (10 attention layers × the same headDim). Assuming the same kvDim, the F32 cache is ~320 MiB, so the absolute saving is smaller (~160 MiB), but the proportional unlock is the same.

Scope

  1. Add a `KvCacheDType` field to `CudaHybridGdnForwardPass` (default Bf16 for hybrid GDN, F32 for legacy compat).
  2. Change `_gpuKCache[i]` / `_gpuVCache[i]` allocations to Bf16 when configured.
  3. Update KV append / attention kernels to accept Bf16 KV reads. CudaBackend already supports Bf16 GEMM (`SgemmPrecision`); the attention kernel may need a small variant.
  4. Validate parity vs F32 KV with a logit-checked test (existing test pattern in `Tests.ForwardPass`).
  5. Add an env var `SHARPI_KV_DTYPE=fp32|bf16` so regressions can be bisected if Bf16 turns out to be lossy at long context.
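A minimal sketch of how the dtype selection and the env var override from the steps above might fit together. Every type and method name here is hypothetical; only `SHARPI_KV_DTYPE`, its `fp32|bf16` values, and the `KvCacheDType` field name come from this issue. The host-side Bf16 round trip is included only as one possible way to gauge rounding error in a parity test; the real conversion would happen in the CUDA kernels.

```csharp
using System;

// Hypothetical enum backing the proposed KvCacheDType field.
public enum KvCacheDType { F32, Bf16 }

public static class KvCacheDTypeSketch
{
    // Maps an env var value to a dtype; null or blank falls back to the given default.
    public static KvCacheDType Parse(string value, KvCacheDType fallback)
    {
        if (string.IsNullOrWhiteSpace(value)) return fallback;
        switch (value.Trim().ToLowerInvariant())
        {
            case "fp32": return KvCacheDType.F32;
            case "bf16": return KvCacheDType.Bf16;
            default: throw new ArgumentException("Unsupported SHARPI_KV_DTYPE value: " + value);
        }
    }

    // Reads SHARPI_KV_DTYPE from the process environment.
    public static KvCacheDType FromEnvironment(KvCacheDType fallback)
        => Parse(Environment.GetEnvironmentVariable("SHARPI_KV_DTYPE"), fallback);

    // Element size used when sizing the K/V allocations.
    public static int BytesPerElement(KvCacheDType dtype)
        => dtype == KvCacheDType.Bf16 ? 2 : 4;

    // F32 -> Bf16: keep the top 16 bits of the IEEE-754 pattern, rounding to nearest even.
    public static ushort ToBf16(float x)
    {
        uint bits = (uint)BitConverter.SingleToInt32Bits(x);
        if (float.IsNaN(x)) return (ushort)((bits >> 16) | 0x0040u); // force a quiet NaN
        uint roundingBias = 0x7FFFu + ((bits >> 16) & 1u);
        return (ushort)((bits + roundingBias) >> 16);
    }

    // Bf16 -> F32: the 16 stored bits become the high half of the float pattern.
    public static float FromBf16(ushort h)
        => BitConverter.Int32BitsToSingle((int)((uint)h << 16));
}
```

Bf16 keeps 8 significant bits, so a single round trip of a normal value changes it by at most a relative 2^-8 (about 0.4%). A logit parity test would therefore need a tolerance rather than exact equality against the F32 KV path.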

Out of scope

  • Dense-only paths (`CudaForwardPass`, `CudaHybridForwardPass`): these need separate measurement; the gain may not justify the change there, since they have less KV-cache pressure.
  • INT8 or 4-bit KV (TurboQuant already covers that direction for non-GDN models).
