Background
During work on #25 (qwen35 27B-MTP loading), I discovered that GpuBufferPool.RoundUp (CudaBackend.cs:2120-2127) rounds every device allocation to the next power of two. For session-lifetime weight uploads — which never get freed/realloc'd during decode — this is pure VRAM waste. A 17 MiB `attn_gate` tensor rounds to 32 MiB; aggregated across 64 layers it was ~3 GiB of phantom VRAM that nobody could account for on the 27B path.
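The arithmetic behind the waste can be sketched as follows. This is a minimal illustration, not the actual `GpuBufferPool.RoundUp` code: `PoolRounding`, `RoundUpPow2`, and `Waste` are hypothetical names, and the real implementation in CudaBackend.cs may differ in detail.

```csharp
// Hypothetical stand-in for GpuBufferPool.RoundUp (assumed behavior:
// round every request up to the next power of two).
public static class PoolRounding
{
    public const long MiB = 1024L * 1024L;

    // Smallest power of two that is >= byteSize.
    // Sizes that are already a power of two are returned unchanged.
    public static long RoundUpPow2(long byteSize)
    {
        long p = 1;
        while (p < byteSize) p <<= 1;
        return p;
    }

    // Bytes reserved but never used for a single allocation.
    public static long Waste(long byteSize) => RoundUpPow2(byteSize) - byteSize;
}
```

With these definitions, `PoolRounding.RoundUpPow2(17 * PoolRounding.MiB)` is 32 MiB and `PoolRounding.Waste(17 * PoolRounding.MiB)` is 15 MiB. Repeated once per layer over 64 layers, that single tensor shape accounts for 960 MiB, so the ~3 GiB total presumably includes the rounding overhead of the other weight tensors as well.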
What landed for CudaHybridGdnForwardPass (under #25)
- IComputeBackend.Allocate / Upload / UploadRaw now take `bool exact = false`.
- CudaBackend exact path: bypass `_pool.Rent`, allocate via `cudaMalloc(byteSize)` with no rounding; track in `_exactHandles` for direct `cudaFree` on `Free()`.
- CudaHybridGdnForwardPass.UploadWeight and UploadEmbeddingWeight pass `exact: true` for every weight tensor.
- Measured effect on Qwen3.6-27B-MTP Q4_K_M: free VRAM after weight load went from 525 MiB → 3576 MiB (3 GiB reclaimed). Enabled 21 of 64 FFN layers on GPU instead of 2; decode 4.0 → 6.3 t/s (+58 %).
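The bookkeeping described in the list above can be modeled roughly as below. This is a toy sketch that makes no CUDA calls; only the `exact` flag and the `_exactHandles` name come from the change itself, while `ToyAllocator`, `ReservedBytes`, and the integer handles are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;

// Toy model of the pooled vs. exact allocation paths. No real device memory
// is touched; handles are plain counters.
public sealed class ToyAllocator
{
    private readonly HashSet<long> _exactHandles = new();
    private readonly Dictionary<long, long> _reserved = new(); // handle -> bytes reserved
    private long _nextHandle = 1;

    public long ReservedBytes { get; private set; }

    public long Allocate(long byteSize, bool exact = false)
    {
        // Exact path: reserve exactly what was asked for.
        // Pooled path: assumed to round up to the next power of two.
        long reserved = exact ? byteSize : RoundUpPow2(byteSize);
        long handle = _nextHandle++;
        _reserved[handle] = reserved;
        if (exact) _exactHandles.Add(handle);
        ReservedBytes += reserved;
        return handle;
    }

    // Returns true when the handle took the direct-free route (the analog of
    // cudaFree), false when it took the pooled route. A real pool would
    // likely keep the block for reuse; this toy simply drops the reservation.
    public bool Free(long handle)
    {
        if (!_reserved.Remove(handle, out long reserved))
            throw new ArgumentException("unknown handle", nameof(handle));
        bool wasExact = _exactHandles.Remove(handle);
        ReservedBytes -= reserved;
        return wasExact;
    }

    private static long RoundUpPow2(long byteSize)
    {
        long p = 1;
        while (p < byteSize) p <<= 1;
        return p;
    }
}
```

Allocating the same 17 MiB tensor once through each path reserves 32 MiB on the pooled path and 17 MiB on the exact path, and `Free` reports which route a handle took.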
What's left for this issue
The same VRAM waste exists in the dense-only forward passes that haven't been migrated:
- src/SharpInference.Engine/CudaForwardPass.cs: all-GPU dense path (Qwen3-8B, smollm2, etc.). Find its `UploadWeight`/`UploadEmbeddingWeight` analogs and add `exact: true`.
- src/SharpInference.Engine/CudaHybridForwardPass.cs: GPU/CPU split dense path (used by larger non-hybrid models). Same change.
Expected impact:
- Qwen3-8B at 16K+ context (where KV cache pressure matters): more headroom for larger contexts before falling back to hybrid.
- Qwen3-Coder-30B-A3B and similar mid-size MoE: potentially more SLRU capacity (each expert is ~1.8 MiB, so reclaimed VRAM means more slots and a higher hit rate).
- Llama-4-Scout (60 GB at Q4_K_M): the existing path probably hits CudaHybridForwardPass; reclaiming VRAM lets more layers stay GPU-resident.
Verification
- Run with `SHARPI_TRACE_VRAM=1` (already added to CudaHybridGdnForwardPass this session) on each forward pass class. The trace is gated on the env var and prints free VRAM at constructor entry, after embedding, and after per-layer upload.
- Add the equivalent trace to the two target classes; compare before/after exact-size.
- Validate that Free-on-Dispose correctly cudaFrees the exact allocations (check `_exactHandles` membership).
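A trace gated on the env var could look roughly like the sketch below. Only the `SHARPI_TRACE_VRAM` variable name comes from this issue; the `VramTrace` class, the `queryFreeBytes` callback (which in the real backend would presumably wrap a free-memory query such as `cudaMemGetInfo`), and the output format are assumptions.

```csharp
using System;
using System.IO;

// Hypothetical helper: prints free VRAM at named checkpoints, but only when
// SHARPI_TRACE_VRAM=1 is set (or when explicitly enabled, e.g. in tests).
public sealed class VramTrace
{
    private readonly bool _enabled;
    private readonly Func<long> _queryFreeBytes;
    private readonly TextWriter _output;

    public VramTrace(Func<long> queryFreeBytes, TextWriter output, bool? enabled = null)
    {
        _queryFreeBytes = queryFreeBytes;
        _output = output;
        // Fall back to the environment variable when no override is given.
        _enabled = enabled ?? Environment.GetEnvironmentVariable("SHARPI_TRACE_VRAM") == "1";
    }

    public void Checkpoint(string label)
    {
        if (!_enabled) return; // no query and no output when tracing is off
        long freeMiB = _queryFreeBytes() / (1024L * 1024L);
        _output.WriteLine($"[vram] {label}: {freeMiB} MiB free");
    }
}
```

Calling `Checkpoint("ctor entry")`, `Checkpoint("after embedding")`, and `Checkpoint("after layer N")` from each forward pass class would give comparable numbers before and after switching the uploads to `exact: true`.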
Scope
Code change is mostly mechanical (pass exact: true at the call sites). The interface change already landed under #25. Tests should pass unchanged.