A high-performance LLM inference engine and image generation pipeline written in C# 14 / .NET 10. Runs GGUF models on CPU (AVX2/AVX-512 SIMD) and GPU (Vulkan compute shaders or CUDA cuBLAS). Includes an OpenAI- and Anthropic-compatible API server and native pipelines for Z-Image-Turbo and FLUX.1.
Requirements: .NET 10 SDK, x86-64 CPU with AVX2.
Optional: Vulkan-capable GPU (drivers), CUDA Toolkit 11.x/12.x for NVIDIA paths,
OpenBLAS in tools/openblas/ for faster batched GEMM. Build with dotnet build -c Release.
Supported architectures: llama, llama4, qwen3, qwen3moe, qwen35moe
(hybrid Gated-DeltaNet + attention + MoE). Benchmarked on
AMD Zen 4 (12c/24t, DDR4-3200) + RTX 4070 Ti (12 GB), Q4_K_M, --temp 0,
-n 80, prompt "Write a Python function that sorts a list using the quicksort algorithm:".
Decode rate is forward-pass iterations / decode time, so it counts
thinking-mode tokens too. Outputs spot-checked for coherence
(scripts/bench-all.ps1); MoE on Vulkan hybrid is currently a known
broken row — see ⚠ note. Cross-engine top-1 parity vs llama.cpp b8585
verified on Qwen3-8B (byte-identical 60-token greedy decode with
matching chat template).
| Model | Repo | Size | Backend | Prefill t/s | Decode t/s | Notes |
|---|---|---|---|---|---|---|
| SmolLM2 1.7B Instruct | HuggingFaceTB | 1 GB | CPU | 16.6 | 38.9 | AVX2 fused dequant-matvec |
| SmolLM2 1.7B Instruct | (same) | 1 GB | Vulkan -g -1 |
42.0 | 139.7 | GLSL subgroupAdd reduce |
| SmolLM2 1.7B Instruct | (same) | 1 GB | CUDA -g -1 |
181.1 | 158.1 | NVRTC __dp4a + Q8_1 |
| Qwen3 8B | Qwen | 5 GB | Vulkan -g -1 |
23.0 | 45.8 | 11.4K auto-ctx |
| Qwen3 8B | (same) | 5 GB | Vulkan -g -1 --tq |
21.7 | 45.5 | 3-bit KV → 40 960 ctx |
| Qwen3 8B | (same) | 5 GB | CUDA -g -1 |
65.9 | 58.6 | ~2.8× Vulkan prefill |
| Qwen3 8B | (same) | 5 GB | CUDA -g -1 --no-thinking |
66.0 | 58.2 | Same per-token rate; reasoning suppressed in chat template, so all decoded tokens are visible answer |
| Qwen3 8B | (same) | 5 GB | CUDA -g -1 --tq |
65.9 | 58.4 | 3-bit KV → 40 960 ctx; 17 t/s @ 8K, 10 t/s @ 16K |
| Qwen3 8B | (same) | 5 GB | CUDA -g -1 --tq --no-thinking |
66.1 | 58.1 | Same per-token rate as --tq alone; reasoning suppressed |
| Qwen3-Coder 30B-A3B (MoE) | Qwen | 17 GB | CPU | 15.1 | 21.2 | 128 experts / 8 active |
| Qwen3-Coder 30B-A3B (MoE) | (same) | 17 GB | CPU --tq |
12.0 | 21.1 | 3-bit KV |
| Qwen3-Coder 30B-A3B (MoE) | (same) | 17 GB | Vulkan -g -1 (hybrid) |
1.0 | 10.2 | ⚠ output incoherent on this path — under investigation |
| Qwen3-Coder 30B-A3B (MoE) | (same) | 17 GB | CUDA -g -1 (hybrid) |
13.9 | 22.7 | 29 GPU + 19 CPU layers (auto), ~2.2× Vulkan decode |
| Llama-4 Scout 17B-16E (MoE) | meta-llama | 61 GB | CPU | 1.9 | 3.9 | 48 layers, 17B active params; split GGUF (Q4_K_M) |
| Llama-4 Scout 17B-16E (MoE) | (same) | 61 GB | CUDA -g -1 (hybrid) |
0.9 | 2.1 | 7 GPU + 41 CPU layers — model dwarfs the 12 GB card, PCIe cost > GPU speedup so CPU-only wins here |
| Qwen3.6-35B-A3B (GDN+MoE) | unsloth | 22 GB | CPU | 4.3 | 7.8 | hybrid GDN/attn, 256 experts / 8 active |
| Qwen3.6-35B-A3B (GDN+MoE) | (same) | 22 GB | CUDA -g -1 (hybrid) |
11.2 | 23.8 | 10 attn + 30 GDN on GPU; MoE auto-routed to CPU, batched-expert dispatch (8 experts × 3 ops into 2 Parallel.For sweeps), shared expert kept on GPU and overlapped with the CPU routed loop |
--backend auto (default) picks CUDA when available, sizing the GPU/CPU split from
VRAM via TierPlanner; falls through to Vulkan only when CUDA isn't present.
--tq enables 3-bit TurboQuant KV compression (CPU, Vulkan, CUDA; requires
headDim ∈ {128, 256}). MoE runs on GPU (full-offload or partial hybrid) on
both Vulkan and CUDA backends.
For hybrid SSM/attention models (qwen35moe), the CUDA backend keeps the
attention KV cache, the 30 Gated-DeltaNet layers (conv1d + rank-1 outer-product
recurrence), and the shared expert resident in VRAM; routed-expert dispatch
auto-selects between an SLRU GPU cache and CPU mmap reads based on what
fraction of experts can be cached at boot. Override with SHARPI_CPU_MOE=0|1.
Models that emit <think>...</think> (Qwen3, DeepSeek-R1, SmolLM3, …) are
detected automatically from their special tokens — no flag needed. The CLI
dims the reasoning stream as it generates. Use --no-thinking to disable
reasoning at the chat-template level, --hide-thinking to keep it on but
hide the stream, and --max-thinking-tokens N to force-close runaway
reasoning. Greedy decoding (--temp 0) on these models often loops, so
the CLI warns and recommends --temp 0.6 --top-p 0.95 --top-k 20.
The API server surfaces reasoning per each protocol's convention: Anthropic
/v1/messages emits a thinking content block before text; OpenAI
/v1/chat/completions exposes reasoning_content alongside content
(vLLM / DeepSeek style). Anthropic's thinking.budget_tokens and an OpenAI
extension max_thinking_tokens both map to the same engine-side budget.
Prior assistant turns in chat history have their <think> blocks stripped
before templating (Qwen3 and friends are trained without them).
# CPU, single-turn, greedy
dotnet run --project src/SharpInference.Cli -c Release -- \
-m models/SmolLM2-1.7B-Instruct-Q4_K_M.gguf -p "What is 2+2?" --temp 0
# Full GPU offload (auto-picks CUDA on dense + full offload)
dotnet run --project src/SharpInference.Cli -c Release -- \
-m models/Qwen3-8B-Q4_K_M.gguf -p "Write a quicksort in Python" --temp 0 -g -1
# MoE on CPU with 3-bit KV compression (5× less VRAM, full ctx)
dotnet run --project src/SharpInference.Cli -c Release -- \
-m models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf --tq -p "Implement a BST in C#" --temp 0
# Interactive chat (no -p)
dotnet run --project src/SharpInference.Cli -c Release -- \
-m models/SmolLM2-1.7B-Instruct-Q4_K_M.gguf
# Speculative decoding (~2× faster at temp 0)
dotnet run --project src/SharpInference.Cli -c Release -- \
-m models/Qwen3-8B-Q4_K_M.gguf --draft-model models/SmolLM2-1.7B-Instruct-Q4_K_M.gguf \
-p "Write a binary search in Rust" --temp 0
# Reasoning model: stream shows dimmed <think>...</think>, then the answer
dotnet run --project src/SharpInference.Cli -c Release -- \
-m models/Qwen3-8B-Q4_K_M.gguf -g -1 --temp 0.6 --top-p 0.95 --top-k 20 \
-p "What's 17 × 23?" --max-thinking-tokens 1024
# API server (OpenAI /v1/chat/completions + Anthropic /v1/messages, port 5000)
SHARPI_MODEL=models/SmolLM2-1.7B-Instruct-Q4_K_M.gguf \
dotnet run --project src/SharpInference.Server -c ReleaseTwo pipelines, auto-detected from model filename. Benchmarked on AMD Zen 4
- RTX 4070 Ti (CUDA backend, 4 denoising steps, 512×512 output). The CLI is a one-shot binary, so each invocation pays the full load + text-encoder warmup. The "cached" column is the steady-state cost when the same encoder weights stay resident — e.g., re-rendering inside the server or interactive loop after the first prompt.
| Pipeline | Components (repo • file • size) | Per-run | Cached prompt | Notes |
|---|---|---|---|---|
| Z-Image-Turbo | DiT: jayn7/Z-Image-Turbo-GGUF z_image_turbo-Q5_K_M.gguf 5.5 GBEncoder: BennyDaBall/...-AbliteratedV1 Z-Image-AbliteratedV1.Q5_K_M.gguf 2.9 GBVAE + tokenizer: Tongyi-MAI/Z-Image-Turbo vae/ tokenizer/ |
~108 s | ~30 s | Most of the per-run cost is text-encoder warmup (~90 s); DiT ~4 s, VAE ~18 s once weights are hot. Output verified visually. |
| FLUX.1-schnell | DiT: city96/FLUX.1-schnell-gguf flux1-schnell-Q4_K_S.gguf ~7 GBEncoders + VAE: comfyanonymous/flux_text_encoders clip_l.safetensors + t5xxl_fp16.safetensors + ae.safetensors |
— | — | 4-step distilled; model not on this benchmark machine |
Optional 4× upscale via Real-ESRGAN (RealESRGAN_x4plus.safetensors):
runs on CUDA when available, falls back to bicubic.
# Z-Image-Turbo (auto-detects pipeline from filename containing "z_image")
dotnet run --project src/SharpInference.Cli -c Release -- image \
-m models/z_image_turbo-Q5_K_M.gguf \
--vae models/z-image-turbo/vae \
--qwen-encoder models/Z-Image-AbliteratedV1.Q5_K_M.gguf \
--qwen-tokenizer models/z-image-turbo/tokenizer/tokenizer.json \
-p "a serene mountain lake at sunrise" -W 1024 -H 1024 --steps 4 -o landscape.png
# FLUX.1-schnell
dotnet run --project src/SharpInference.Cli -c Release -- image \
-m models/flux1-schnell-Q4_K_S.gguf \
--vae models/flux/ae.safetensors \
--clip-l models/flux/clip_l.safetensors --clip-tokenizer models/flux/tokenizer_clip.json \
--t5xxl models/flux/t5xxl_fp16.safetensors --t5-tokenizer models/flux/tokenizer_t5.json \
-p "a cinematic photograph of a mountain lake" -W 512 -H 512 --steps 4 -o out.png
# With 4× Real-ESRGAN upscale + blend
dotnet run --project src/SharpInference.Cli -c Release -- image \
-m models/z_image_turbo-Q5_K_M.gguf \
--vae models/z-image-turbo/vae \
--qwen-encoder models/Z-Image-AbliteratedV1.Q5_K_M.gguf \
--qwen-tokenizer models/z-image-turbo/tokenizer/tokenizer.json \
--upscaler models/RealESRGAN_x4plus.safetensors --upscale-blend 0.8 \
-p "a fox in autumn forest" -W 512 -H 512 --steps 4 -o fox.png- Architecture & algorithms: docs/SharpInference-Design.md
- All CLI flags:
sharpi-cli --help,sharpi-cli image --help - Model downloads:
scripts/download-model.ps1 -Model <smollm2|qwen3-8b|qwen3-coder-30b-a3b|llama4-scout|z-image-turbo|realesrgan-x4|…> - Tests:
dotnet test - NativeAOT publish:
dotnet publish src/SharpInference.Cli -c Release -r win-x64
Released under the MIT License.