Apply exact-size weight uploads to other CUDA forward passes (CudaForwardPass, CudaHybridForwardPass) #26

## Background

During work on #25 (qwen35 27B-MTP loading), I discovered that `GpuBufferPool.RoundUp` (`CudaBackend.cs:2120-2127`) rounds every device allocation up to the next power of two. For session-lifetime weight uploads, which are never freed or reallocated during decode, this is pure VRAM waste. A 17 MiB `attn_gate` tensor rounds to 32 MiB; aggregated across 64 layers, the padding came to **~3 GiB of phantom VRAM** that nobody could account for on the 27B path.
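
To make the rounding cost concrete, here is a minimal sketch of the arithmetic. It assumes the pool behaves like a plain round-up-to-next-power-of-two; the `PoolRounding` class and its method names are hypothetical and are not the actual `GpuBufferPool.RoundUp` code.

```csharp
using System;
using System.Numerics;

// Hypothetical model of the pool's rounding; not the real GpuBufferPool.
static class PoolRounding
{
    public const ulong MiB = 1024UL * 1024UL;

    // Assumption: a request is rounded up to the next power of two.
    // A size that is already a power of two is returned unchanged.
    public static ulong RoundUpPow2(ulong byteSize) =>
        BitOperations.RoundUpToPowerOf2(byteSize);

    // Bytes reserved on the device but never used by the tensor.
    public static ulong Padding(ulong byteSize) =>
        RoundUpPow2(byteSize) - byteSize;

    static void Main()
    {
        ulong request = 17 * MiB;
        Console.WriteLine(
            $"requested {request / MiB} MiB, " +
            $"pooled {RoundUpPow2(request) / MiB} MiB, " +
            $"padding {Padding(request) / MiB} MiB");
        // prints "requested 17 MiB, pooled 32 MiB, padding 15 MiB"
    }
}
```

Under that model, a 17 MiB request carries 15 MiB of padding. If each of the 64 layers held one such tensor, that tensor type alone would account for 960 MiB, with the rest of the ~3 GiB presumably coming from other weight tensors rounded the same way.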

## What landed for `CudaHybridGdnForwardPass` (under #25)

- `IComputeBackend.Allocate` / `Upload` / `UploadRaw` now take `bool exact = false`.
- `CudaBackend` exact path: bypass `_pool.Rent`, allocate via `cudaMalloc(byteSize)` with no rounding; track in `_exactHandles` for direct `cudaFree` on `Free()`.
- `CudaHybridGdnForwardPass.UploadWeight` and `UploadEmbeddingWeight` pass `exact: true` for every weight tensor.
- Measured effect on Qwen3.6-27B-MTP Q4_K_M: free VRAM after weight load went from **525 MiB → 3576 MiB** (3 GiB reclaimed). Enabled 21 of 64 FFN layers on GPU instead of 2; decode 4.0 → 6.3 t/s (+58 %).
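
The shape of the exact path described above can be sketched with a host-only stand-in. Everything here is an assumption for illustration: `SketchBackend`, its `long` handles, and `BytesReserved` are hypothetical, no real CUDA calls are made, and the sketch only models byte accounting (a real pool would keep freed pooled buffers around for reuse rather than releasing them).

```csharp
using System;
using System.Collections.Generic;
using System.Numerics;

// Hypothetical stand-in for a device backend; it only tracks byte counts.
sealed class SketchBackend : IDisposable
{
    private readonly HashSet<long> _exactHandles = new HashSet<long>();
    private readonly Dictionary<long, ulong> _sizes = new Dictionary<long, ulong>();
    private long _nextHandle = 1;

    public ulong BytesReserved { get; private set; }

    // exact: false models the pooled path (rounded to a power of two);
    // exact: true models a direct allocation of exactly byteSize bytes.
    public long Allocate(ulong byteSize, bool exact = false)
    {
        ulong reserved = exact ? byteSize : BitOperations.RoundUpToPowerOf2(byteSize);
        long handle = _nextHandle++;
        _sizes[handle] = reserved;
        if (exact) _exactHandles.Add(handle);
        BytesReserved += reserved;
        return handle;
    }

    public bool IsExact(long handle) => _exactHandles.Contains(handle);

    public void Free(long handle)
    {
        if (!_sizes.Remove(handle, out ulong reserved)) return; // unknown handle
        _exactHandles.Remove(handle); // no-op for pooled handles
        BytesReserved -= reserved;
    }

    public void Dispose()
    {
        // Copy the keys first so Free can mutate the dictionary safely.
        foreach (long handle in new List<long>(_sizes.Keys)) Free(handle);
    }
}

static class Program
{
    static void Main()
    {
        const ulong MiB = 1024UL * 1024UL;
        using var backend = new SketchBackend();

        long pooled = backend.Allocate(17 * MiB);
        Console.WriteLine($"pooled: {backend.BytesReserved / MiB} MiB reserved"); // 32
        backend.Free(pooled);

        long exact = backend.Allocate(17 * MiB, exact: true);
        Console.WriteLine($"exact: {backend.BytesReserved / MiB} MiB reserved");  // 17
        Console.WriteLine($"tracked as exact: {backend.IsExact(exact)}");         // True
    }
}
```

Keeping `exact` as an optional parameter defaulting to `false` means existing call sites keep the pooled behavior, and only weight uploads need to opt in.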

## What's left for this issue

The same VRAM waste exists in the dense-only forward passes that haven't been migrated:

- `src/SharpInference.Engine/CudaForwardPass.cs`: all-GPU dense path (Qwen3-8B, smollm2, etc.). Find its `UploadWeight`/`UploadEmbeddingWeight` analogs and add `exact: true`.
- `src/SharpInference.Engine/CudaHybridForwardPass.cs`: GPU/CPU split dense path (used by larger non-hybrid models). Same change.

Expected impact:
- **Qwen3-8B** at 16K+ context (where KV cache pressure matters): more headroom for larger contexts before falling back to hybrid.
- **Qwen3-Coder-30B-A3B** and similar mid-size MoE: potentially fits SLRU with more capacity (each expert is ~1.8 MiB; reclaimed VRAM = more slots = higher hit rate).
- **Llama-4-Scout** (60 GB at Q4_K_M): the existing path probably hits CudaHybridForwardPass; reclaiming VRAM lets more layers stay GPU-resident.

## Verification

- Set `SHARPI_TRACE_VRAM=1` (support already added to `CudaHybridGdnForwardPass` this session) when running each forward pass class. The trace is gated on that environment variable and prints free VRAM at constructor entry, after the embedding upload, and after the per-layer uploads.
- Add an equivalent trace to the two target classes, then compare free VRAM before and after the exact-size change.
- Validate that Free-on-Dispose correctly calls `cudaFree` on the exact allocations (check `_exactHandles` membership).
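
An env-gated trace along these lines might look like the sketch below. The `VramTrace` class, the `[vram]` log format, and the injected `queryFreeBytes` delegate are all hypothetical; in a real backend the delegate would presumably wrap a device query such as the CUDA runtime's `cudaMemGetInfo`. The readings in `Main` are made-up values, not measurements.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical trace helper; the real trace in CudaHybridGdnForwardPass may differ.
sealed class VramTrace
{
    private const ulong MiB = 1024UL * 1024UL;
    private readonly bool _enabled;
    private readonly Func<ulong> _queryFreeBytes;
    private readonly Action<string> _log;

    public VramTrace(bool enabled, Func<ulong> queryFreeBytes, Action<string> log)
    {
        _enabled = enabled;
        _queryFreeBytes = queryFreeBytes;
        _log = log;
    }

    // Enabled only when SHARPI_TRACE_VRAM is set to "1".
    public static VramTrace FromEnvironment(Func<ulong> queryFreeBytes, Action<string> log) =>
        new VramTrace(
            Environment.GetEnvironmentVariable("SHARPI_TRACE_VRAM") == "1",
            queryFreeBytes,
            log);

    // Logs free VRAM at a named point; skips the device query when disabled.
    public void Checkpoint(string label)
    {
        if (!_enabled) return;
        ulong freeBytes = _queryFreeBytes();
        _log($"[vram] {label}: free {freeBytes / MiB} MiB");
    }
}

static class Program
{
    static void Main()
    {
        const ulong MiB = 1024UL * 1024UL;

        // Made-up readings standing in for a real device query.
        var readings = new Queue<ulong>(new ulong[] { 8000 * MiB, 7000 * MiB, 3000 * MiB });

        var trace = VramTrace.FromEnvironment(() => readings.Dequeue(), Console.Error.WriteLine);
        trace.Checkpoint("constructor entry");
        trace.Checkpoint("after embedding");
        trace.Checkpoint("after per-layer upload");
    }
}
```

Running the same checkpoints before and after switching the call sites to `exact: true` gives directly comparable numbers for each target class.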

## Scope

Code change is mostly mechanical (pass `exact: true` at the call sites). The interface change already landed under #25. Tests should pass unchanged.
