Get local AI running with guided setup instead of manual runtime wiring.
Ignite is a local runtime manager for GGUF models and llama.cpp. It handles the work between "I have a GPU" and "I want an OpenAI-compatible local endpoint": engine setup, hardware-aware model discovery, downloads, model configuration, process management, swapping, and runtime visibility. No Docker. No Python runtime. One Go binary with the web UI embedded.
Ignite v2 is a complete rewrite in Go. The original Python/Docker version is archived on the v1-archive branch.
Running local models today means picking a llama.cpp build, compiling it with the right CUDA flags, finding a GGUF that fits your VRAM, writing config, managing processes, and figuring out model swapping. If you have done it before, it takes an afternoon. If you have not, it can take a weekend.
Build Ignite, run it, and open the local web UI. Ignite detects your GPUs, helps build llama.cpp, shows GGUF models that fit your hardware, downloads files from Hugging Face, and exposes an OpenAI-compatible API at localhost:8091/v1.
After setup, Ignite manages your inference stack:
- OpenAI-compatible API - drop-in endpoint for apps that speak OpenAI-style chat, completions, and embeddings
- Automatic model loading and swapping - request a configured model by ID or alias and Ignite loads it if it is not already running
- Multi-GPU support - assign models to specific GPUs and run models on different GPUs concurrently
- Model discovery - search Hugging Face GGUF repos and see hardware fit badges before downloading
- Engine management - clone, build, update, and switch llama.cpp backends from the UI
- Web dashboard - GPU monitoring, loaded model status, recent activity, endpoint snippets, runtime traffic, config, and playground
Everything runs locally. Ignite uses native llama.cpp subprocesses and keeps config, logs, state, model files, and backend checkouts on your machine.
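The hardware fit badges mentioned above can be pictured with a rough rule of thumb. The sketch below is not Ignite's actual algorithm; the function name `estimate_fit`, the 2 GiB overhead allowance for KV cache and buffers, and the 10% "tight" margin are all assumptions chosen for illustration.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class FitEstimate:
    label: str         # "fits", "tight", or "too large"
    needed_bytes: int  # file size plus the assumed overhead


def estimate_fit(
    gguf_bytes: int,
    free_vram_bytes: int,
    overhead_bytes: int = 2 * GIB,  # assumed allowance for KV cache and buffers
    tight_margin: float = 0.10,     # assumed: within 10% of free VRAM is "tight"
) -> FitEstimate:
    """Classify a GGUF file against free VRAM using a rough heuristic."""
    needed = gguf_bytes + overhead_bytes
    if needed > free_vram_bytes:
        return FitEstimate("too large", needed)
    if needed > free_vram_bytes * (1 - tight_margin):
        return FitEstimate("tight", needed)
    return FitEstimate("fits", needed)


# Compare three file sizes against a GPU with 24 GiB of free VRAM.
for size_gib in (4, 20, 30):
    result = estimate_fit(size_gib * GIB, 24 * GIB)
    print(f"{size_gib} GiB file: {result.label}")
```

Real memory use also depends on context length, cache quantization, and offload settings, so a fixed overhead is only a starting point.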
- NVIDIA GPU
- CUDA toolkit for building/running CUDA llama.cpp builds
- Go 1.24+
- Node.js 20+ and npm for building the embedded web UI from source

Build Ignite from source and start it:

    git clone https://github.com/Spadav/Ignite.git
    cd Ignite
    make build
    ./ignite

First launch walks you through setup. After that, open:
http://localhost:8091
The OpenAI-compatible endpoint is:
http://localhost:8091/v1

To run Ignite as a systemd service:

    sudo cp ignite /usr/local/bin/
    sudo tee /etc/systemd/system/ignite.service << 'EOF'
    [Unit]
    Description=Ignite
    After=network.target
    [Service]
    ExecStart=/usr/local/bin/ignite --config /path/to/ignite.yaml
    Restart=always
    User=your-username
    [Install]
    WantedBy=multi-user.target
    EOF
    sudo systemctl enable --now ignite

Ignite manages llama.cpp as native subprocesses. When a request comes in for a configured model, Ignite:
- Resolves the model ID or alias from config
- Loads the model if it is not already running
- Applies runtime-group swap rules for models in the same group
- Starts llama-server with the configured flags, GPU assignment, and model file
- Waits for health checks, proxies the request, and records request/response timing
- Tracks idle time and unloads models after the configured TTL
Models on different GPUs can run concurrently. Models in the same runtime group can swap each other out depending on group settings.
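The request flow above can be sketched as a small in-memory simulation. This is an illustrative model only, not Ignite's Go implementation: the `Runtime` class and its method names are hypothetical, process start-up and health checks are reduced to comments, and the TTL is assumed to be in seconds.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Runtime:
    models: dict   # model ID -> list of aliases
    groups: dict   # group name -> {"swap": bool, "members": [model IDs]}
    ttl_seconds: float = 600.0
    loaded: dict = field(default_factory=dict)  # model ID -> last-used time

    def resolve(self, name: str) -> str:
        """Map a model ID or alias to a configured model ID."""
        if name in self.models:
            return name
        for model_id, aliases in self.models.items():
            if name in aliases:
                return model_id
        raise KeyError(f"unknown model or alias: {name}")

    def _swap_peers(self, model_id: str) -> set:
        """Other members of any swap-enabled group containing model_id."""
        peers = set()
        for group in self.groups.values():
            members = group.get("members", [])
            if group.get("swap") and model_id in members:
                peers.update(m for m in members if m != model_id)
        return peers

    def handle_request(self, name: str, now: Optional[float] = None) -> str:
        now = time.monotonic() if now is None else now
        model_id = self.resolve(name)
        if model_id not in self.loaded:
            for other in self._swap_peers(model_id):
                self.loaded.pop(other, None)  # stand-in for stopping a process
            # A real manager would start llama-server here and wait for health.
        self.loaded[model_id] = now  # refresh the idle timer
        return model_id

    def unload_idle(self, now: Optional[float] = None) -> list:
        """Unload every model idle for at least ttl_seconds."""
        now = time.monotonic() if now is None else now
        idle = [m for m, last in self.loaded.items()
                if now - last >= self.ttl_seconds]
        for model_id in idle:
            del self.loaded[model_id]
        return idle
```

In this sketch, a request for a model that shares a swap-enabled group with a loaded model evicts that model first, while a model outside the group stays loaded alongside it.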
Ignite uses a single YAML config file. The web UI reads and writes it, and you can also edit it by hand.

    listen: "0.0.0.0:8091"
    backends:
      mainline:
        path: ./llama-backends/mainline
        binary: build-ignite/bin/llama-server
        buildDir: build-ignite
        repo: https://github.com/ggml-org/llama.cpp
    activeBackend: mainline
    modelsPath: ./models
    mmprojectsPath: ./models/mmproj
    ttl:
      global: 600
    models:
      my-model:
        family: My Models
        profile: Default
        file: some-model-Q4_K_M.gguf
        gpu: GPU-abc123
        args: >-
          --jinja -ngl 99 -c 32768 -fa on
          --cache-type-k q4_0 --cache-type-v q4_0
          --split-mode none --main-gpu 0
        aliases:
          - default
    groups:
      main:
        swap: true
        persistent: true
        members:
          - my-model

See ignite.example.yaml for a complete safe default.
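When editing the config by hand, two easy mistakes are a group member that names no configured model and an alias that collides with another model's ID or alias. The sketch below checks for both on an already-parsed config dictionary (for example, the result of PyYAML's `yaml.safe_load`). The function `check_config` is hypothetical, and whether Ignite performs the same checks at startup is not something this example assumes.

```python
def check_config(config: dict) -> list:
    """Return human-readable problems found in a parsed config dict."""
    problems = []
    models = config.get("models") or {}

    # Every model ID owns its own name; aliases must not claim a taken name.
    owners = {model_id: model_id for model_id in models}
    for model_id, cfg in models.items():
        for alias in (cfg or {}).get("aliases") or []:
            owner = owners.setdefault(alias, model_id)
            if owner != model_id:
                problems.append(
                    f"alias '{alias}' on '{model_id}' collides with '{owner}'"
                )

    # Every group member must refer to a configured model.
    for group, group_cfg in (config.get("groups") or {}).items():
        for member in (group_cfg or {}).get("members") or []:
            if member not in models:
                problems.append(
                    f"group '{group}' lists unknown model '{member}'"
                )
    return problems


# The models and groups sections of the example config, as a dict.
example = {
    "models": {"my-model": {"aliases": ["default"]}},
    "groups": {"main": {"swap": True, "members": ["my-model"]}},
}
print(check_config(example))
```

An empty list means neither problem was found.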
| Page | Purpose |
|---|---|
| Dashboard | Live GPU stats, loaded models, recent requests, and endpoint info |
| Models | Local GGUF library, Hugging Face discovery with hardware fit badges, and downloads |
| Config | Per-model settings, launch args, GPU assignment, aliases, and runtime groups |
| Runtime | Live request/response traffic viewer with timing and token counts |
| Engines | llama.cpp backend management: clone, build, update, and inspect |
| Playground | Test configured models with request options, image input, response parsing, and expert JSON |
| Settings | Global paths, ports, TTL, health checks, about links, and update notification |

    curl http://localhost:8091/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "my-model", "messages": [{"role": "user", "content": "hello"}]}'
    curl http://localhost:8091/v1/models

Request any configured model by model ID or alias. Ignite loads the model if needed, then forwards the request to the appropriate llama.cpp server.
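The same chat request can be sent from any HTTP client. Below is a minimal sketch using only Python's standard library. The helper names `build_chat_request` and `chat` are hypothetical, the 300-second timeout is an assumed allowance for a model that must load first, and the response parsing assumes the usual OpenAI-style `choices[0].message.content` shape.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8091/v1"


def build_chat_request(
    model: str, prompt: str, base_url: str = BASE_URL
) -> urllib.request.Request:
    """Build a POST request for the chat completions endpoint."""
    body = {
        "model": model,  # a configured model ID or alias
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model: str, prompt: str, timeout: float = 300.0) -> str:
    """Send one user message and return the assistant's reply text."""
    request = build_chat_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    # Assumes an OpenAI-style response body.
    return payload["choices"][0]["message"]["content"]
```

With Ignite running, `chat("my-model", "hello")` would return the reply text; passing an alias such as `default` in place of the model ID should reach the same model.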
Check out Timbre - a local voice gateway with OpenAI-compatible endpoints and swappable backends.

    make backend
    make frontend
    make test
    make build

The make build target builds the web UI first, embeds it into the Go binary, and writes ./ignite.
Ignite v2 is a complete rewrite. The original version is archived on the v1-archive branch.

