SharpInference

A high-performance LLM inference engine and image generation pipeline written in C# 14 / .NET 10. Runs GGUF models on CPU (AVX2/AVX-512 SIMD) and GPU (Vulkan compute shaders or CUDA cuBLAS). Includes an OpenAI- and Anthropic-compatible API server and native pipelines for Z-Image-Turbo and FLUX.1.

Requirements: .NET 10 SDK, x86-64 CPU with AVX2. Optional: Vulkan-capable GPU (drivers), CUDA Toolkit 11.x/12.x for NVIDIA paths, OpenBLAS in tools/openblas/ for faster batched GEMM. Build with dotnet build -c Release.

Text generation

Supported architectures: llama, llama4, qwen3, qwen3moe, qwen35moe (hybrid Gated-DeltaNet + attention + MoE). Benchmarked on AMD Zen 4 (12c/24t, DDR4-3200) + RTX 4070 Ti (12 GB), Q4_K_M, --temp 0, -n 80, prompt "Write a Python function that sorts a list using the quicksort algorithm:". Decode rate is forward-pass iterations / decode time, so it counts thinking-mode tokens too. Outputs spot-checked for coherence (scripts/bench-all.ps1); MoE on Vulkan hybrid is currently a known broken row — see ⚠ note. Cross-engine top-1 parity vs llama.cpp b8585 verified on Qwen3-8B (byte-identical 60-token greedy decode with matching chat template).

Model	Repo	Size	Backend	Prefill t/s	Decode t/s	Notes
SmolLM2 1.7B Instruct	HuggingFaceTB	1 GB	CPU	16.6	38.9	AVX2 fused dequant-matvec
SmolLM2 1.7B Instruct	(same)	1 GB	Vulkan `-g -1`	42.0	139.7	GLSL `subgroupAdd` reduce
SmolLM2 1.7B Instruct	(same)	1 GB	CUDA `-g -1`	181.1	158.1	NVRTC `__dp4a` + Q8_1
Qwen3 8B	Qwen	5 GB	Vulkan `-g -1`	23.0	45.8	11.4K auto-ctx
Qwen3 8B	(same)	5 GB	Vulkan `-g -1 --tq`	21.7	45.5	3-bit KV → 40 960 ctx
Qwen3 8B	(same)	5 GB	CUDA `-g -1`	65.9	58.6	~2.8× Vulkan prefill
Qwen3 8B	(same)	5 GB	CUDA `-g -1 --no-thinking`	66.0	58.2	Same per-token rate; reasoning suppressed in chat template, so all decoded tokens are visible answer
Qwen3 8B	(same)	5 GB	CUDA `-g -1 --tq`	65.9	58.4	3-bit KV → 40 960 ctx; 17 t/s @ 8K, 10 t/s @ 16K
Qwen3 8B	(same)	5 GB	CUDA `-g -1 --tq --no-thinking`	66.1	58.1	Same per-token rate as `--tq` alone; reasoning suppressed
Qwen3-Coder 30B-A3B (MoE)	Qwen	17 GB	CPU	15.1	21.2	128 experts / 8 active
Qwen3-Coder 30B-A3B (MoE)	(same)	17 GB	CPU `--tq`	12.0	21.1	3-bit KV
Qwen3-Coder 30B-A3B (MoE)	(same)	17 GB	Vulkan `-g -1` (hybrid)	1.0	10.2	⚠ output incoherent on this path — under investigation
Qwen3-Coder 30B-A3B (MoE)	(same)	17 GB	CUDA `-g -1` (hybrid)	13.9	22.7	29 GPU + 19 CPU layers (auto), ~2.2× Vulkan decode
Llama-4 Scout 17B-16E (MoE)	meta-llama	61 GB	CPU	1.9	3.9	48 layers, 17B active params; split GGUF (Q4_K_M)
Llama-4 Scout 17B-16E (MoE)	(same)	61 GB	CUDA `-g -1` (hybrid)	0.9	2.1	7 GPU + 41 CPU layers — model dwarfs the 12 GB card, PCIe cost > GPU speedup so CPU-only wins here
Qwen3.6-35B-A3B (GDN+MoE)	unsloth	22 GB	CPU	4.3	7.8	hybrid GDN/attn, 256 experts / 8 active
Qwen3.6-35B-A3B (GDN+MoE)	(same)	22 GB	CUDA `-g -1` (hybrid)	11.2	23.8	10 attn + 30 GDN on GPU; MoE auto-routed to CPU, batched-expert dispatch (8 experts × 3 ops into 2 Parallel.For sweeps), shared expert kept on GPU and overlapped with the CPU routed loop

--backend auto (default) picks CUDA when available, sizing the GPU/CPU split from VRAM via TierPlanner; falls through to Vulkan only when CUDA isn't present. --tq enables 3-bit TurboQuant KV compression (CPU, Vulkan, CUDA; requires headDim ∈ {128, 256}). MoE runs on GPU (full-offload or partial hybrid) on both Vulkan and CUDA backends.

For hybrid SSM/attention models (qwen35moe), the CUDA backend keeps the attention KV cache, the 30 Gated-DeltaNet layers (conv1d + rank-1 outer-product recurrence), and the shared expert resident in VRAM; routed-expert dispatch auto-selects between an SLRU GPU cache and CPU mmap reads based on what fraction of experts can be cached at boot. Override with SHARPI_CPU_MOE=0|1.

Reasoning models

Models that emit <think>...</think> (Qwen3, DeepSeek-R1, SmolLM3, …) are detected automatically from their special tokens — no flag needed. The CLI dims the reasoning stream as it generates. Use --no-thinking to disable reasoning at the chat-template level, --hide-thinking to keep it on but hide the stream, and --max-thinking-tokens N to force-close runaway reasoning. Greedy decoding (--temp 0) on these models often loops, so the CLI warns and recommends --temp 0.6 --top-p 0.95 --top-k 20.

The API server surfaces reasoning per each protocol's convention: Anthropic /v1/messages emits a thinking content block before text; OpenAI /v1/chat/completions exposes reasoning_content alongside content (vLLM / DeepSeek style). Anthropic's thinking.budget_tokens and an OpenAI extension max_thinking_tokens both map to the same engine-side budget. Prior assistant turns in chat history have their <think> blocks stripped before templating (Qwen3 and friends are trained without them).

CLI examples

# CPU, single-turn, greedy
dotnet run --project src/SharpInference.Cli -c Release -- \
  -m models/SmolLM2-1.7B-Instruct-Q4_K_M.gguf -p "What is 2+2?" --temp 0

# Full GPU offload (auto-picks CUDA on dense + full offload)
dotnet run --project src/SharpInference.Cli -c Release -- \
  -m models/Qwen3-8B-Q4_K_M.gguf -p "Write a quicksort in Python" --temp 0 -g -1

# MoE on CPU with 3-bit KV compression (5× less VRAM, full ctx)
dotnet run --project src/SharpInference.Cli -c Release -- \
  -m models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf --tq -p "Implement a BST in C#" --temp 0

# Interactive chat (no -p)
dotnet run --project src/SharpInference.Cli -c Release -- \
  -m models/SmolLM2-1.7B-Instruct-Q4_K_M.gguf

# Speculative decoding (~2× faster at temp 0)
dotnet run --project src/SharpInference.Cli -c Release -- \
  -m models/Qwen3-8B-Q4_K_M.gguf --draft-model models/SmolLM2-1.7B-Instruct-Q4_K_M.gguf \
  -p "Write a binary search in Rust" --temp 0

# Reasoning model: stream shows dimmed <think>...</think>, then the answer
dotnet run --project src/SharpInference.Cli -c Release -- \
  -m models/Qwen3-8B-Q4_K_M.gguf -g -1 --temp 0.6 --top-p 0.95 --top-k 20 \
  -p "What's 17 × 23?" --max-thinking-tokens 1024

# API server (OpenAI /v1/chat/completions + Anthropic /v1/messages, port 5000)
SHARPI_MODEL=models/SmolLM2-1.7B-Instruct-Q4_K_M.gguf \
  dotnet run --project src/SharpInference.Server -c Release

Image generation

Two pipelines, auto-detected from model filename. Benchmarked on AMD Zen 4

RTX 4070 Ti (CUDA backend, 4 denoising steps, 512×512 output). The CLI is a one-shot binary, so each invocation pays the full load + text-encoder warmup. The "cached" column is the steady-state cost when the same encoder weights stay resident — e.g., re-rendering inside the server or interactive loop after the first prompt.

Pipeline	Components (repo • file • size)	Per-run	Cached prompt	Notes
Z-Image-Turbo	DiT: jayn7/Z-Image-Turbo-GGUF `z_image_turbo-Q5_K_M.gguf` 5.5 GB Encoder: BennyDaBall/...-AbliteratedV1 `Z-Image-AbliteratedV1.Q5_K_M.gguf` 2.9 GB VAE + tokenizer: Tongyi-MAI/Z-Image-Turbo `vae/` `tokenizer/`	~108 s	~30 s	Most of the per-run cost is text-encoder warmup (~90 s); DiT ~4 s, VAE ~18 s once weights are hot. Output verified visually.
FLUX.1-schnell	DiT: city96/FLUX.1-schnell-gguf `flux1-schnell-Q4_K_S.gguf` ~7 GB Encoders + VAE: comfyanonymous/flux_text_encoders `clip_l.safetensors` + `t5xxl_fp16.safetensors` + `ae.safetensors`	—	—	4-step distilled; model not on this benchmark machine

Optional 4× upscale via Real-ESRGAN (RealESRGAN_x4plus.safetensors): runs on CUDA when available, falls back to bicubic.

CLI examples

# Z-Image-Turbo (auto-detects pipeline from filename containing "z_image")
dotnet run --project src/SharpInference.Cli -c Release -- image \
  -m models/z_image_turbo-Q5_K_M.gguf \
  --vae models/z-image-turbo/vae \
  --qwen-encoder models/Z-Image-AbliteratedV1.Q5_K_M.gguf \
  --qwen-tokenizer models/z-image-turbo/tokenizer/tokenizer.json \
  -p "a serene mountain lake at sunrise" -W 1024 -H 1024 --steps 4 -o landscape.png

# FLUX.1-schnell
dotnet run --project src/SharpInference.Cli -c Release -- image \
  -m models/flux1-schnell-Q4_K_S.gguf \
  --vae models/flux/ae.safetensors \
  --clip-l models/flux/clip_l.safetensors --clip-tokenizer models/flux/tokenizer_clip.json \
  --t5xxl models/flux/t5xxl_fp16.safetensors --t5-tokenizer models/flux/tokenizer_t5.json \
  -p "a cinematic photograph of a mountain lake" -W 512 -H 512 --steps 4 -o out.png

# With 4× Real-ESRGAN upscale + blend
dotnet run --project src/SharpInference.Cli -c Release -- image \
  -m models/z_image_turbo-Q5_K_M.gguf \
  --vae models/z-image-turbo/vae \
  --qwen-encoder models/Z-Image-AbliteratedV1.Q5_K_M.gguf \
  --qwen-tokenizer models/z-image-turbo/tokenizer/tokenizer.json \
  --upscaler models/RealESRGAN_x4plus.safetensors --upscale-blend 0.8 \
  -p "a fox in autumn forest" -W 512 -H 512 --steps 4 -o fox.png

More

Architecture & algorithms: docs/SharpInference-Design.md
All CLI flags: sharpi-cli --help, sharpi-cli image --help
Model downloads: scripts/download-model.ps1 -Model <smollm2|qwen3-8b|qwen3-coder-30b-a3b|llama4-scout|z-image-turbo|realesrgan-x4|…>
Tests: dotnet test
NativeAOT publish: dotnet publish src/SharpInference.Cli -c Release -r win-x64

License

Released under the MIT License.

Name		Name	Last commit message	Last commit date
Latest commit History 70 Commits
.github/workflows		.github/workflows
benchmarks		benchmarks
codebooks		codebooks
docs		docs
samples		samples
scripts		scripts
shaders		shaders
src		src
tests		tests
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
Directory.Build.props		Directory.Build.props
LICENSE		LICENSE
README.md		README.md
SharpInference.slnx		SharpInference.slnx

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SharpInference

Text generation

Reasoning models

CLI examples

Image generation

CLI examples

More

License

About

Uh oh!

Releases 5

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

SharpInference

Text generation

Reasoning models

CLI examples

Image generation

CLI examples

More

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 5

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages