A from-scratch LLM inference engine and chat application. Built to understand how large language models actually work at the hardware level — using Metal/MLX directly rather than wrapping llama.cpp or Ollama.
Neurons is a full-stack local AI system:
- `compute/` — C++23 inference library. Implements the transformer forward pass from first principles: quantized matmul, RoPE, RMSNorm, KV cache, sampling. Pluggable backends (`ComputeBackend` interface).
- `service/` — gRPC inference server (`neurons-service`) + OpenAI-compatible HTTP endpoint. Runs on any machine on your network.
- `cli/` — Terminal interface. Chat, download models, manage nodes, start a server.
- `gui/` — Flutter macOS app. Chat UI, model browser, multi-node management, live tok/s stats.
The GUI never links C++ directly. Locally it calls libneurons_core.dylib over dart:ffi; against remote machines it uses gRPC. The same NeuronsClient interface covers both.
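Of the forward-pass pieces listed above, sampling is the easiest to see end to end. A minimal Python sketch of nucleus (top-p) sampling with temperature, purely illustrative and not the engine's actual C++ implementation:

```python
import math
import random

def top_p_sample(logits, top_p=0.9, temperature=0.8, rng=random.Random(0)):
    """Nucleus sampling: keep the smallest set of tokens whose cumulative
    probability reaches top_p, renormalize, then sample from that set."""
    # Softmax with temperature (numerically stabilized by subtracting the max)
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Token ids sorted by probability, descending
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break

    # Sample from the kept set, renormalized over its total mass
    mass = sum(probs[i] for i in kept)
    r = rng.random() * mass
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]
```

Temperature 0 (greedy) and top-k truncation follow the same shape: reduce the candidate set, then pick.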
| Feature | GUI | CLI | gRPC |
|---|---|---|---|
| Multi-turn chat | ✅ | ✅ | ✅ |
| Streaming generation | ✅ | ✅ | ✅ |
| Live tok/s + token counts | ✅ | ✅ | ✅ |
| Model download from HuggingFace | ✅ | ✅ | ✅ |
| Model search + browser | ✅ | ✅ | ✅ |
| HuggingFace auth (gated models) | ✅ | ✅ | ✅ |
| Sampling params (temp, top-p, top-k, rep-penalty) | ✅ | ✅ | ✅ |
| Multi-session chat history (JSON persistence) | ✅ | ✅ | — |
| Multi-node management | ✅ | ✅ | — |
| OpenAI-compatible HTTP endpoint | — | ✅ | — |
| Remote log streaming | ✅ | — | ✅ |
| MCP server management (add/remove/list/push) | 🚧 | 🚧 | ✅ |
| MCP permission rules (global/session/chat scopes) | 🚧 | 🚧 | ✅ |
| MCP tool approval flow (always_ask / always_allow / always_deny) | 🚧 | 🚧 | ✅ |
| Family | Example repos | Backend |
|---|---|---|
| Llama 2/3, TinyLlama | mlx-community/Llama-3.2-3B-Instruct-4bit | MLX |
| Mistral | mlx-community/Mistral-7B-Instruct-v0.3-4bit | MLX |
| Qwen2 / Qwen2.5 / Qwen3 | mlx-community/Qwen2.5-7B-Instruct-4bit | MLX |
| Qwen3 MoE | mlx-community/Qwen3-30B-A3B-4bit, mlx-community/Qwen3.6-35B-A3B-4bit | MLX |
| Gemma / Gemma2 / Gemma3 | mlx-community/gemma-3-1b-it-qat-4bit, mlx-community/gemma-3-4b-it-qat-4bit | MLX |
| fp16 / bf16 unquantized | any base HuggingFace safetensors repo | MLX |
All models are downloaded directly from HuggingFace in their mlx-community MLX-quantized variants for Apple Silicon. CUDA and ROCm backends are on the roadmap.
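When choosing a model for a given machine, a back-of-the-envelope weight-memory estimate is useful. This assumes roughly bits/8 bytes per weight and ignores quantization scales, the KV cache, and activations, so treat it as a lower bound:

```python
def approx_weights_gb(n_params_billion, bits=4):
    """Rough weight memory in GB: params * (bits / 8) bytes.
    Ignores quantization scale/zero-point overhead, KV cache, activations."""
    return n_params_billion * 1e9 * bits / 8 / 1e9

# A 7B model at 4-bit needs roughly 3.5 GB for weights alone;
# the same model at fp16 needs roughly 14 GB.
print(round(approx_weights_gb(7, bits=4), 1))   # 3.5
print(round(approx_weights_gb(7, bits=16), 1))  # 14.0
```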
```mermaid
graph TD
    GUI["Flutter GUI (macOS) — dart:ffi · gRPC"]
    Core["libneurons_core — C FFI · NeuronsServiceImpl"]
    LM["LanguageModel::load()"]
    Llama["LlamaModel — Llama 2/3 · Mistral · Qwen2/2.5/3"]
    Gemma["GemmaModelMLX — Gemma / Gemma2 / Gemma3"]
    Qwen3Moe["Qwen3MoeModelMLX — Qwen3 MoE"]
    Backend["ComputeBackend (interface)"]
    MLX["MLXBackend — Apple Silicon · Metal · mx::compile"]
    Roadmap["CUDA / ROCm (roadmap)"]
    GUI --> Core
    Core --> LM
    LM --> Llama
    LM --> Gemma
    LM --> Qwen3Moe
    Llama & Gemma & Qwen3Moe --> Backend
    Backend --> MLX
    Backend -.-> Roadmap
```
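Conceptually, the `LanguageModel::load()` factory in the diagram dispatches on the model family, which a HuggingFace repo declares in the `model_type` field of its `config.json`. The mapping below is a hypothetical Python sketch of that dispatch, not the project's C++ factory; the class names come from the diagram, but which `model_type` strings map to which class is an assumption:

```python
# Hypothetical model_type -> model class mapping (class names from the
# architecture diagram; the exact dispatch logic lives in compute/).
FAMILY = {
    "llama": "LlamaModel",
    "mistral": "LlamaModel",    # Mistral shares the Llama architecture path
    "qwen2": "LlamaModel",
    "qwen3": "LlamaModel",
    "gemma3": "GemmaModelMLX",
    "qwen3_moe": "Qwen3MoeModelMLX",
}

def pick_model_class(config: dict) -> str:
    """Resolve a config.json dict to a model class name."""
    model_type = config["model_type"]
    try:
        return FAMILY[model_type]
    except KeyError:
        raise ValueError(f"unsupported model_type: {model_type}")

print(pick_model_class({"model_type": "gemma3"}))  # GemmaModelMLX
```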
macOS (Apple Silicon) — primary platform
```shell
# Xcode command line tools
xcode-select --install

# Homebrew dependencies
brew install cmake grpc protobuf

# Flutter SDK
# https://docs.flutter.dev/get-started/install/macos
```

Linux / Windows — CUDA/ROCm backends are on the roadmap. The gRPC service builds today; MLX inference requires Apple Silicon.
Pre-built .dmg files for Apple Silicon are attached to each GitHub Release.
1. Download `Neurons-<version>-arm64.dmg` from the latest release.
2. Open the DMG and drag `Neurons.app` to `/Applications`.
3. On first launch macOS will block the app (unsigned binary). To allow it:
   - Right-click `Neurons.app` → Open → Open in the dialog, or
   - Run once in Terminal: `xattr -dr com.apple.quarantine /Applications/Neurons.app`
The app requires macOS 14 (Sonoma) or later on Apple Silicon.
```shell
git clone https://github.com/dexwritescode/neurons.git
cd neurons
```

All C++ + Flutter targets are driven from the root Makefile:
Integration tests require model files and skip automatically when absent — see docs/models.md for the full list and download commands.
```shell
make help          # list all targets
make all           # build compute + CLI + service
make cli           # CLI only
make service       # gRPC service only
make dylib         # libneurons_core.dylib (Flutter FFI dependency)
make tests         # build and run all C++ tests
make flutter-test  # run Flutter widget + unit tests
make run           # build dylib + launch Flutter app (debug)
make gui           # build dylib + Flutter macOS release app
```

Measured on Apple Silicon (M2 Max 64 GB), greedy decoding (temperature=0), release build:
| Model | Params | Active params | tok/s |
|---|---|---|---|
| TinyLlama 1.1B 4-bit | 1.1B | 1.1B | ~265 |
| Gemma 3 1B 4-bit | 1B | 1B | ~190 |
| Gemma 3 4B 4-bit (QAT) | 4B | 4B | ~60 |
| Llama-3.1 8B 4-bit | 8B | 8B | ~61 |
| Mistral 7B 4-bit | 7B | 7B | ~57 |
| Qwen3.6 35B-A3B 4-bit | 35B | 3.6B | ~77 |
MoE models run near the speed of a dense 3-4B model because only a small fraction of parameters are active per token. Decode uses GPU-pipelined generation with mx::compile — the first generation per session incurs a one-time compilation cost.
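Single-stream decode is memory-bandwidth bound: each generated token streams the active weights through the GPU once, so active parameters, not total parameters, set the speed ceiling. A rough roofline sketch (the ~400 GB/s usable bandwidth for an M2 Max and 0.5 bytes/weight at 4-bit are assumptions, not measurements):

```python
def decode_ceiling_tok_s(active_params_billion, bandwidth_gb_s=400, bits=4):
    """Upper bound on decode tok/s: memory bandwidth / bytes read per token.
    Real throughput is lower: attention, KV cache reads, kernel overhead."""
    bytes_per_token = active_params_billion * 1e9 * bits / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

# Dense 8B vs a 35B MoE with only 3.6B active params per token:
print(round(decode_ceiling_tok_s(8)))    # 100
print(round(decode_ceiling_tok_s(3.6)))  # 222
```

The measured numbers sit well below these ceilings, but the ordering matches: the 35B MoE with 3.6B active params out-decodes the dense 7-8B models.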
```shell
# Build the CLI
make cli

# Search for models
./build/bin/neurons search "qwen 3b"

# Download one
./build/bin/neurons download mlx-community/Qwen2.5-3B-Instruct-4bit

# Chat
./build/bin/neurons chat mlx-community/Qwen2.5-3B-Instruct-4bit
```

```shell
make run
```

The app opens on the Chats screen. Go to Browse to search HuggingFace, download a model, then return to Chats — the model loads automatically when selected.
```shell
# Start with an HTTP endpoint on port 8080
./build/bin/neurons server --http-port 8080 --model mlx-community/Qwen2.5-3B-Instruct-4bit

# Point any OpenAI client at it
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"local","messages":[{"role":"user","content":"Hello"}],"stream":true}'
```

Works with Cursor's "local model" setting, Continue.dev, and any client that supports the OpenAI chat completions API.
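For a programmatic client, the endpoint speaks the standard OpenAI chat-completions wire format. A minimal Python sketch of building the request body and parsing one server-sent-events line from the stream (the URL and model name follow the curl example; nothing is actually sent here):

```python
import json

URL = "http://localhost:8080/v1/chat/completions"  # from the curl example

def build_request(prompt, stream=True):
    """Body for POST to URL with Content-Type: application/json."""
    return {
        "model": "local",
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }

def parse_sse_line(line):
    """Extract the delta text from one 'data: {...}' streaming line,
    returning None for non-data lines and the terminal [DONE] marker."""
    if not line.startswith("data: ") or line == "data: [DONE]":
        return None
    chunk = json.loads(line[len("data: "):])
    return chunk["choices"][0]["delta"].get("content")

# Example streamed chunk in the OpenAI format:
sample = 'data: {"choices":[{"delta":{"content":"Hi"}}]}'
print(parse_sse_line(sample))  # Hi
```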
```text
neurons chat <model>              Interactive multi-turn chat
neurons load <model>              One-shot inference with --prompt
neurons search <query>            Search HuggingFace
neurons download <repo-id>        Download a model
neurons list                      List local models
neurons server [--http-port N]    Start gRPC + HTTP server
neurons node add/remove/list      Manage remote nodes
neurons token set/clear           HuggingFace auth token
neurons config show/set           Configuration
```
Neurons supports connecting multiple machines as inference nodes. Each node runs neurons-service; the GUI and CLI connect to all of them and route requests.
```shell
# On the remote machine
neurons server --grpc-port 50051 --http-port 8080

# On your laptop — add the node in the GUI (Nodes tab)
# or via CLI:
neurons node add my-server grpc://192.168.1.10:50051
neurons node use my-server
```

```text
Neurons/
  compute/   C++ inference library (backends, models, tokenizer, sampler)
  cli/       CLI binary — links compute directly
  service/   gRPC server + OpenAI HTTP server + C FFI surface
  gui/       Flutter macOS app
  models/    HuggingFace client (search, download, metadata)
  Makefile   All build targets
```
| Phase | Status | Description |
|---|---|---|
| A–E | ✅ | MLX backend, KV cache, sampling, Llama/Gemma/Qwen/Mistral |
| F | ✅ | Model family support (fp16/bf16, Gemma3, Qwen2.5, Qwen3, Qwen3 MoE) |
| G–I | ✅ | gRPC service, Flutter GUI, CLI, OpenAI HTTP, logging |
| O | ✅ | MLX performance — GPU-pipelined decode, mx::compile, batched prefill |
| J | 🚧 | File attach + RAG (embeddings, sqlite-vec) |
| K | 🚧 | Multi-node: routing, speculative decoding, failover |
| L.1–2 | ✅ | MCP client runtime — stdio/SSE transport, JSON-RPC 2.0, McpManager |
| L.3 | ✅ | MCP gRPC extensions — server/permission RPCs, tool approval flow |
| L.4–6 | 🚧 | MCP GUI — settings, permissions table, live approval prompt |
| L.8 | 🚧 | Built-in MCP servers (filesystem, shell) |
| B/C | 🚧 | CUDA and ROCm backends |
The project is structured so each layer can be understood and modified independently:
- Add a new model family — implement `LanguageModel` in `compute/`, add to the `load()` factory, write an integration test.
- Add a new backend — implement `ComputeBackend`, wire into `BackendFactory`.
- Add a new CLI command — add a command file in `cli/src/cli/commands/`, register in `main.cpp`.
- Extend the GUI — `gui/lib/` is a standard Flutter project; `NeuronsClient` is the interface to mock for tests.
All three interfaces (GUI, CLI, gRPC) must be updated together for any user-facing feature.
MIT — see LICENSE.


