A from-scratch LLM inference engine and chat application. Built to understand how large language models actually work at the hardware level — using Metal/MLX directly rather than wrapping llama.cpp or Ollama.
Neurons is a full-stack local AI system:
- `compute/` — C++23 inference library. Implements the transformer forward pass from first principles: quantized matmul, RoPE, RMSNorm, KV cache, sampling. Pluggable backends (`ComputeBackend` interface).
- `service/` — gRPC inference server (`neurons-service`) + OpenAI-compatible HTTP endpoint. Runs on any machine on your network.
- `cli/` — Terminal interface. Chat, download models, manage nodes, start a server.
- `gui/` — Flutter macOS app. Chat UI, model browser, multi-node management, live tok/s stats.
The GUI never links C++ directly. Locally it calls libneurons_core.dylib over dart:ffi; against remote machines it uses gRPC. The same NeuronsClient interface covers both.
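Of the forward-pass pieces listed above, sampling is the easiest to see end to end. A minimal Python sketch of nucleus (top-p) sampling with temperature, purely illustrative and not the engine's actual C++ implementation:

```python
import math
import random

def top_p_sample(logits, top_p=0.9, temperature=0.8, rng=random.Random(0)):
    """Nucleus sampling: keep the smallest set of tokens whose cumulative
    probability reaches top_p, renormalize, then sample from that set."""
    # Softmax with temperature (numerically stabilized by subtracting the max)
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Token ids sorted by probability, descending
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break

    # Sample from the kept set, renormalized over its total mass
    mass = sum(probs[i] for i in kept)
    r = rng.random() * mass
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]
```

Temperature 0 (greedy) and top-k truncation follow the same shape: reduce the candidate set, then pick.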
| Feature | GUI | CLI | gRPC |
|---|---|---|---|
| Multi-turn chat | ✅ | ✅ | ✅ |
| Streaming generation | ✅ | ✅ | ✅ |
| Live tok/s + token counts | ✅ | ✅ | ✅ |
| Model download from HuggingFace | ✅ | ✅ | ✅ |
| Model search + browser | ✅ | ✅ | ✅ |
| HuggingFace auth (gated models) | ✅ | ✅ | ✅ |
| Sampling params (temp, top-p, top-k, rep-penalty) | ✅ | ✅ | ✅ |
| Multi-session chat history (JSON persistence) | ✅ | ✅ | — |
| Multi-node management | ✅ | ✅ | — |
| OpenAI-compatible HTTP endpoint | — | ✅ | — |
| Remote log streaming | ✅ | — | ✅ |
| MCP server management (add/remove/list/push) | 🚧 | 🚧 | ✅ |
| MCP permission rules (global/session/chat scopes) | 🚧 | 🚧 | ✅ |
| MCP tool approval flow (always_ask / always_allow / always_deny) | 🚧 | 🚧 | ✅ |
| Family | Example repos | Backend |
|---|---|---|
| Llama 2/3, TinyLlama | mlx-community/Llama-3.2-3B-Instruct-4bit | MLX |
| Mistral | mlx-community/Mistral-7B-Instruct-v0.3-4bit | MLX |
| Qwen2 / Qwen2.5 / Qwen3 | mlx-community/Qwen2.5-7B-Instruct-4bit | MLX |
| Qwen3 MoE | mlx-community/Qwen3-30B-A3B-4bit, mlx-community/Qwen3.6-35B-A3B-4bit | MLX |
| Gemma / Gemma2 / Gemma3 | mlx-community/gemma-3-1b-it-qat-4bit, mlx-community/gemma-3-4b-it-qat-4bit | MLX |
| fp16 / bf16 unquantized | any base HuggingFace safetensors repo | MLX |
All models are downloaded directly from HuggingFace in their mlx-community MLX-quantized variants for Apple Silicon. CUDA and ROCm backends are on the roadmap.
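When choosing a model for a given machine, a back-of-the-envelope weight-memory estimate is useful. This assumes roughly bits/8 bytes per weight and ignores quantization scales, the KV cache, and activations, so treat it as a lower bound:

```python
def approx_weights_gb(n_params_billion, bits=4):
    """Rough weight memory in GB: params * (bits / 8) bytes.
    Ignores quantization scale/zero-point overhead, KV cache, activations."""
    return n_params_billion * 1e9 * bits / 8 / 1e9

# A 7B model at 4-bit needs roughly 3.5 GB for weights alone;
# the same model at fp16 needs roughly 14 GB.
print(round(approx_weights_gb(7, bits=4), 1))   # 3.5
print(round(approx_weights_gb(7, bits=16), 1))  # 14.0
```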
```mermaid
graph TD
    GUI["Flutter GUI (macOS) — dart:ffi · gRPC"]
    Core["libneurons_core — C FFI · NeuronsServiceImpl"]
    LM["LanguageModel::load()"]
    Llama["LlamaModel — Llama 2/3 · Mistral · Qwen2/2.5/3"]
    Gemma["GemmaModelMLX — Gemma / Gemma2 / Gemma3"]
    Qwen3Moe["Qwen3MoeModelMLX — Qwen3 MoE"]
    Backend["ComputeBackend (interface)"]
    MLX["MLXBackend — Apple Silicon · Metal · mx::compile"]
    Roadmap["CUDA / ROCm (roadmap)"]
    GUI --> Core
    Core --> LM
    LM --> Llama
    LM --> Gemma
    LM --> Qwen3Moe
    Llama & Gemma & Qwen3Moe --> Backend
    Backend --> MLX
    Backend -.-> Roadmap
```
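Conceptually, the `LanguageModel::load()` factory in the diagram dispatches on the model family, which a HuggingFace repo declares in the `model_type` field of its `config.json`. The mapping below is a hypothetical Python sketch of that dispatch, not the project's C++ factory; the class names come from the diagram, but which `model_type` strings map to which class is an assumption:

```python
# Hypothetical model_type -> model class mapping (class names from the
# architecture diagram; the exact dispatch logic lives in compute/).
FAMILY = {
    "llama": "LlamaModel",
    "mistral": "LlamaModel",    # Mistral shares the Llama architecture path
    "qwen2": "LlamaModel",
    "qwen3": "LlamaModel",
    "gemma3": "GemmaModelMLX",
    "qwen3_moe": "Qwen3MoeModelMLX",
}

def pick_model_class(config: dict) -> str:
    """Resolve a config.json dict to a model class name."""
    model_type = config["model_type"]
    try:
        return FAMILY[model_type]
    except KeyError:
        raise ValueError(f"unsupported model_type: {model_type}")

print(pick_model_class({"model_type": "gemma3"}))  # GemmaModelMLX
```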
macOS (Apple Silicon) — primary platform
```shell
# Xcode command line tools
xcode-select --install

# Homebrew dependencies
brew install cmake grpc protobuf

# Flutter SDK
# https://docs.flutter.dev/get-started/install/macos
```

Linux / Windows — CUDA/ROCm backends are on the roadmap. The gRPC service builds today; MLX inference requires Apple Silicon.
Pre-built .dmg files for Apple Silicon are attached to each GitHub Release.
1. Download `Neurons-<version>-arm64.dmg` from the latest release.
2. Open the DMG and drag `Neurons.app` to `/Applications`.
3. On first launch macOS will block the app (unsigned binary). To allow it:
   - Right-click `Neurons.app` → Open → Open in the dialog, or
   - Run once in Terminal: `xattr -dr com.apple.quarantine /Applications/Neurons.app`
The app requires macOS 14 (Sonoma) or later on Apple Silicon.
```shell
git clone https://github.com/dexwritescode/neurons.git
cd neurons
```

All C++ + Flutter targets are driven from the root Makefile:
Integration tests require model files and skip automatically when absent — see docs/models.md for the full list and download commands.
```shell
make help          # list all targets
make all           # build compute + CLI + service
make cli           # CLI only
make service       # gRPC service only
make dylib         # libneurons_core.dylib (Flutter FFI dependency)
make tests         # build and run all C++ tests
make flutter-test  # run Flutter widget + unit tests
make run           # build dylib + launch Flutter app (debug)
make gui           # build dylib + Flutter macOS release app
```

Measured on Apple Silicon (M2 Max 64 GB), greedy decoding (temperature=0), release build:
| Model | Params | Active params | tok/s |
|---|---|---|---|
| TinyLlama 1.1B 4-bit | 1.1B | 1.1B | ~265 |
| Gemma 3 1B 4-bit | 1B | 1B | ~190 |
| Gemma 3 4B 4-bit (QAT) | 4B | 4B | ~60 |
| Llama-3.1 8B 4-bit | 8B | 8B | ~61 |
| Mistral 7B 4-bit | 7B | 7B | ~57 |
| Qwen3.6 35B-A3B 4-bit | 35B | 3.6B | ~77 |
MoE models run near the speed of a dense 3-4B model because only a small fraction of parameters are active per token. Decode uses GPU-pipelined generation with mx::compile — the first generation per session incurs a one-time compilation cost.
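Single-stream decode is memory-bandwidth bound: each generated token streams the active weights through the GPU once, so active parameters, not total parameters, set the speed ceiling. A rough roofline sketch (the ~400 GB/s usable bandwidth for an M2 Max and 0.5 bytes/weight at 4-bit are assumptions, not measurements):

```python
def decode_ceiling_tok_s(active_params_billion, bandwidth_gb_s=400, bits=4):
    """Upper bound on decode tok/s: memory bandwidth / bytes read per token.
    Real throughput is lower: attention, KV cache reads, kernel overhead."""
    bytes_per_token = active_params_billion * 1e9 * bits / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

# Dense 8B vs a 35B MoE with only 3.6B active params per token:
print(round(decode_ceiling_tok_s(8)))    # 100
print(round(decode_ceiling_tok_s(3.6)))  # 222
```

The measured numbers sit well below these ceilings, but the ordering matches: the 35B MoE with 3.6B active params out-decodes the dense 7-8B models.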
```shell
# Build the CLI
make cli

# Search for models
./build/bin/neurons search "qwen 3b"

# Download one
./build/bin/neurons download mlx-community/Qwen2.5-3B-Instruct-4bit

# Chat
./build/bin/neurons chat mlx-community/Qwen2.5-3B-Instruct-4bit
```

```shell
make run
```

The app opens on the Chats screen. Go to Browse to search HuggingFace, download a model, then return to Chats — the model loads automatically when selected.
```shell
# Start with an HTTP endpoint on port 8080
./build/bin/neurons server --http-port 8080 --model mlx-community/Qwen2.5-3B-Instruct-4bit

# Point any OpenAI client at it
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"local","messages":[{"role":"user","content":"Hello"}],"stream":true}'
```

Works with Cursor's "local model" setting, Continue.dev, and any client that supports the OpenAI chat completions API.
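For a programmatic client, the endpoint speaks the standard OpenAI chat-completions wire format. A minimal Python sketch of building the request body and parsing one server-sent-events line from the stream (the URL and model name follow the curl example; nothing is actually sent here):

```python
import json

URL = "http://localhost:8080/v1/chat/completions"  # from the curl example

def build_request(prompt, stream=True):
    """Body for POST to URL with Content-Type: application/json."""
    return {
        "model": "local",
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }

def parse_sse_line(line):
    """Extract the delta text from one 'data: {...}' streaming line,
    returning None for non-data lines and the terminal [DONE] marker."""
    if not line.startswith("data: ") or line == "data: [DONE]":
        return None
    chunk = json.loads(line[len("data: "):])
    return chunk["choices"][0]["delta"].get("content")

# Example streamed chunk in the OpenAI format:
sample = 'data: {"choices":[{"delta":{"content":"Hi"}}]}'
print(parse_sse_line(sample))  # Hi
```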
```text
neurons chat <model>              Interactive multi-turn chat
neurons load <model>              One-shot inference with --prompt
neurons search <query>            Search HuggingFace
neurons download <repo-id>        Download a model
neurons list                      List local models
neurons server [--http-port N]    Start gRPC + HTTP server
neurons node add/remove/list      Manage remote nodes
neurons token set/clear           HuggingFace auth token
neurons config show/set           Configuration
```
Neurons supports connecting multiple machines as inference nodes. Each node runs neurons-service; the GUI and CLI connect to all of them and route requests.
```shell
# On the remote machine
neurons server --grpc-port 50051 --http-port 8080

# On your laptop — add the node in the GUI (Nodes tab)
# or via CLI:
neurons node add my-server grpc://192.168.1.10:50051
neurons node use my-server
```

```text
Neurons/
  compute/   C++ inference library (backends, models, tokenizer, sampler)
  cli/       CLI binary — links compute directly
  service/   gRPC server + OpenAI HTTP server + C FFI surface
  gui/       Flutter macOS app
  models/    HuggingFace client (search, download, metadata)
  Makefile   All build targets
```
| Phase | Status | Description |
|---|---|---|
| A–E | ✅ | MLX backend, KV cache, sampling, Llama/Gemma/Qwen/Mistral |
| F | ✅ | Model family support (fp16/bf16, Gemma3, Qwen2.5, Qwen3, Qwen3 MoE) |
| G–I | ✅ | gRPC service, Flutter GUI, CLI, OpenAI HTTP, logging |
| O | ✅ | MLX performance — GPU-pipelined decode, mx::compile, batched prefill |
| J | 🚧 | File attach + RAG (embeddings, sqlite-vec) |
| K | 🚧 | Multi-node: routing, speculative decoding, failover |
| L.1–2 | ✅ | MCP client runtime — stdio/SSE transport, JSON-RPC 2.0, McpManager |
| L.3 | ✅ | MCP gRPC extensions — server/permission RPCs, tool approval flow |
| L.4–6 | 🚧 | MCP GUI — settings, permissions table, live approval prompt |
| L.8 | 🚧 | Built-in MCP servers (filesystem, shell) |
| B/C | 🚧 | CUDA and ROCm backends |
The project is structured so each layer can be understood and modified independently:
- Add a new model family — implement `LanguageModel` in `compute/`, add to the `load()` factory, write an integration test.
- Add a new backend — implement `ComputeBackend`, wire into `BackendFactory`.
- Add a new CLI command — add a command file in `cli/src/cli/commands/`, register in `main.cpp`.
- Extend the GUI — `gui/lib/` is a standard Flutter project; `NeuronsClient` is the interface to mock for tests.
All three interfaces (GUI, CLI, gRPC) must be updated together for any user-facing feature.
MIT — see LICENSE.


