Apply exact-size weight uploads to other CUDA forward passes (CudaForwardPass, CudaHybridForwardPass) #26

@pekkah

Background

During work on #25 (qwen35 27B-MTP loading), I discovered that GpuBufferPool.RoundUp (CudaBackend.cs:2120-2127) rounds every device allocation to the next power of two. For session-lifetime weight uploads — which never get freed/realloc'd during decode — this is pure VRAM waste. A 17 MiB `attn_gate` tensor rounds to 32 MiB; aggregated across 64 layers it was ~3 GiB of phantom VRAM that nobody could account for on the 27B path.
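To make the rounding cost concrete, here is a small Python sketch of the arithmetic. It assumes only what is stated above, that sizes are rounded to the next power of two; the actual body of `GpuBufferPool.RoundUp` is not reproduced here, and `round_up_pow2` is a hypothetical stand-in.

```python
MIB = 1024 * 1024


def round_up_pow2(n: int) -> int:
    """Smallest power of two >= n (stand-in for the pool's rounding rule)."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


tensor_bytes = 17 * MIB                       # the attn_gate example above
pooled_bytes = round_up_pow2(tensor_bytes)    # what the pool would reserve
waste_per_tensor = pooled_bytes - tensor_bytes
waste_64_layers = 64 * waste_per_tensor       # one such tensor per layer

print(pooled_bytes // MIB)      # 32
print(waste_per_tensor // MIB)  # 15
print(waste_64_layers // MIB)   # 960
```

This one tensor type alone accounts for 960 MiB over 64 layers; the ~3 GiB total would presumably include the rounding overhead of the other per-layer tensors as well.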

What landed for CudaHybridGdnForwardPass (under #25)

  • IComputeBackend.Allocate / Upload / UploadRaw now take `bool exact = false`.
  • CudaBackend exact path: bypass `_pool.Rent`, allocate via `cudaMalloc(byteSize)` with no rounding; track in `_exactHandles` for direct `cudaFree` on `Free()`.
  • CudaHybridGdnForwardPass.UploadWeight and UploadEmbeddingWeight pass `exact: true` for every weight tensor.
  • Measured effect on Qwen3.6-27B-MTP Q4_K_M: free VRAM after weight load went from 525 MiB to 3576 MiB (about 3 GiB reclaimed). This allowed 21 of 64 FFN layers to run on GPU instead of 2; decode went from 4.0 to 6.3 t/s (+58 %).
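The difference between the pooled and exact paths can be illustrated with a toy byte-accounting model in Python. This is not the C# implementation: `ToyBackend`, its method names, and the handle scheme are hypothetical, and no GPU calls are made. It only mirrors the idea of an `exact` flag that skips rounding and tracks handles separately so they can be released directly.

```python
MIB = 1024 * 1024


class ToyBackend:
    """Byte-accounting model only; hypothetical, makes no real GPU calls."""

    def __init__(self) -> None:
        self._next_handle = 1
        self._pooled = {}  # handle -> reserved bytes (rounded to a power of two)
        self._exact = {}   # handle -> reserved bytes (exactly as requested)

    @staticmethod
    def _round_up_pow2(n: int) -> int:
        return 1 if n <= 1 else 1 << (n - 1).bit_length()

    def allocate(self, byte_size: int, exact: bool = False) -> int:
        handle = self._next_handle
        self._next_handle += 1
        if exact:
            self._exact[handle] = byte_size  # no rounding on the exact path
        else:
            self._pooled[handle] = self._round_up_pow2(byte_size)
        return handle

    def free(self, handle: int) -> None:
        # Exact handles are released directly; everything else is a pool handle.
        if handle in self._exact:
            del self._exact[handle]
        else:
            del self._pooled[handle]

    def reserved_bytes(self) -> int:
        return sum(self._pooled.values()) + sum(self._exact.values())


pooled, exact = ToyBackend(), ToyBackend()
exact_handles = []
for _ in range(64):
    pooled.allocate(17 * MIB)
    exact_handles.append(exact.allocate(17 * MIB, exact=True))

print(pooled.reserved_bytes() // MIB)  # 2048
print(exact.reserved_bytes() // MIB)   # 1088

for h in exact_handles:
    exact.free(h)
print(exact.reserved_bytes())          # 0
```

Keeping exact handles in their own map is what lets `free` decide which release path applies without any extra argument from the caller.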

What's left for this issue

The same VRAM waste exists in the dense-only forward passes that haven't been migrated:

  • src/SharpInference.Engine/CudaForwardPass.cs — all-GPU dense path (Qwen3-8B, smollm2, etc.). Find its `UploadWeight`/`UploadEmbeddingWeight` analogs and add `exact: true`.
  • src/SharpInference.Engine/CudaHybridForwardPass.cs — GPU/CPU split dense path (used by larger non-hybrid models). Same change.

Expected impact:

  • Qwen3-8B at 16K+ context (where KV cache pressure matters): more headroom for larger contexts before falling back to hybrid.
  • Qwen3-Coder-30B-A3B and similar mid-size MoE: potentially fits SLRU with more capacity (each expert is ~1.8 MiB; reclaimed VRAM = more slots = higher hit rate).
  • Llama-4-Scout (60 GB at Q4_K_M): the existing path probably hits CudaHybridForwardPass; reclaiming VRAM lets more layers stay GPU-resident.

Verification

  • Run each forward pass class with SHARPI_TRACE_VRAM=1 (already added to CudaHybridGdnForwardPass this session). The trace is gated on that env var and prints free VRAM at constructor entry, after the embedding upload, and after each per-layer upload.
  • Add equivalent trace to the two target classes; compare before/after exact-size.
  • Validate that Free-on-Dispose correctly cudaFrees the exact allocations (check _exactHandles membership).
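As a rough sketch of the env-var gating described in the first step, here is a Python version. The log format, the function names, and the assumption that only the value `1` enables the trace are all hypothetical; the real trace lives in the C# forward pass classes and queries free VRAM from the CUDA backend.

```python
import os
from typing import Mapping, Optional

MIB = 1024 * 1024


def vram_trace_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """True when SHARPI_TRACE_VRAM is set to "1" (assumed enabling value)."""
    env = os.environ if env is None else env
    return env.get("SHARPI_TRACE_VRAM") == "1"


def trace_vram(stage: str, free_bytes: int,
               env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Print one trace line for a checkpoint; do nothing when tracing is off."""
    if not vram_trace_enabled(env):
        return None
    line = f"[vram] {stage}: free={free_bytes // MIB} MiB"  # hypothetical format
    print(line)
    return line


# Checkpoints would be: constructor entry, after embedding, after each layer.
on = {"SHARPI_TRACE_VRAM": "1"}
trace_vram("after weight load", 3576 * MIB, env=on)
# prints "[vram] after weight load: free=3576 MiB"
```

Comparing these lines between a build without `exact: true` and one with it gives the before/after numbers for each class.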

Scope

Code change is mostly mechanical (pass exact: true at the call sites). The interface change already landed under #25. Tests should pass unchanged.
