Lattice

Pure Rust inference engine for transformer models on Apple Silicon, with a native macOS app.

No ONNX. No Python. No CUDA. No external ML runtime. Lattice implements the full compute graph in Rust: weight loading, tokenization, forward pass, vector operations, quantization, and LoRA training. Hand-written Metal shaders accelerate inference on Apple Silicon. SIMD kernels (AVX2 on x86, NEON on ARM) handle the CPU path.

What is Lattice

Lattice is two things in one repo.

A Rust inference library. Five published crates covering embeddings, generation, quantization, LoRA fine-tuning, and optimal transport. Use `lattice-embed` as a library dependency, or run the `lattice` binary for interactive chat and an OpenAI-compatible HTTP server.

Lattice Studio: a native macOS app. A SwiftUI instrument panel that drives the Rust engine via CLI subprocesses. Train LoRA adapters with a live loss oscilloscope, quantize models with a before/after comparison, hot-swap adapters in chat with zero reload, and manage your model library from a single window.

Capabilities

Capability	Details
Pure Rust compute	Hand-written SIMD kernels (AVX2/NEON). No C++, no ONNX, no CUDA.
Metal GPU backend	Native Apple Silicon acceleration via Metal Shading Language (MSL) shaders. WGPU fallback for cross-platform.
Generation models	Qwen3.5-0.8B / 2B, Qwen3.6-35B-A3B (MoE), Qwen3.6-27B. Hybrid GatedDeltaNet + GQA architecture.
Embedding models	9 models: BGE, E5, MiniLM, Qwen3-Embedding families. Auto-download for 7 BERT-family variants.
Three tokenizers	WordPiece, SentencePiece, BPE. No Hugging Face tokenizers C extension.
Quantization	Q8, Q4, and QuaRot (rotation-based 4-bit). No other engine runs Q4 + LoRA hot-swap on Qwen3.5.
LoRA	Inference hook, hot-swap with no reload, PEFT safetensors format, training via `lattice-tune`.
HTTP API	OpenAI-compatible `/v1/chat/completions` via `lattice serve`.
Safetensors native	Memory-mapped weight loading. Single-file and sharded checkpoints.
KV cache	Incremental decoding with key-value caching.
Speculative decoding	Draft-model acceleration on the CPU path.
Grammar decoding	Constrained output via a pushdown automaton, plus OpenAI-style string-level stop sequences.
MRL support	Matryoshka truncation for Qwen3-Embedding models (output dimension >= 32).
LRU cache	`CachedEmbeddingService` with sharded in-memory cache and hit/miss stats.
Knowledge distillation	Train small models from Claude/GPT/Gemini teacher soft labels via `lattice-tune`.
Optimal transport	Sinkhorn-Knopp solver for embedding drift detection via `lattice-transport`.
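
The LRU cache row can be pictured with a small standalone sketch. This is not the `CachedEmbeddingService` implementation: `LruEmbeddingCache` and `CacheStats` are hypothetical names, the cache has a single shard, and recency is tracked with a simple queue. A sharded variant would presumably hash the key to pick one of several such caches.

```rust
use std::collections::{HashMap, VecDeque};

/// Hit/miss counters (hypothetical type, for illustration only).
#[derive(Debug, Default, Clone, Copy, PartialEq)]
pub struct CacheStats {
    pub hits: u64,
    pub misses: u64,
}

/// Single-shard LRU cache keyed by input text (hypothetical, not Lattice's type).
pub struct LruEmbeddingCache {
    capacity: usize,
    map: HashMap<String, Vec<f32>>,
    order: VecDeque<String>, // front = least recently used
    stats: CacheStats,
}

impl LruEmbeddingCache {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be positive");
        Self {
            capacity,
            map: HashMap::new(),
            order: VecDeque::new(),
            stats: CacheStats::default(),
        }
    }

    /// Returns the cached vector, or computes, stores, and returns it.
    pub fn get_or_insert_with<F>(&mut self, text: &str, compute: F) -> Vec<f32>
    where
        F: FnOnce(&str) -> Vec<f32>,
    {
        if let Some(found) = self.map.get(text) {
            let hit = found.clone();
            self.stats.hits += 1;
            self.touch(text);
            return hit;
        }
        self.stats.misses += 1;
        let value = compute(text);
        // Evict the least recently used entry when the cache is full.
        if self.map.len() == self.capacity {
            if let Some(oldest) = self.order.pop_front() {
                self.map.remove(&oldest);
            }
        }
        self.map.insert(text.to_string(), value.clone());
        self.order.push_back(text.to_string());
        value
    }

    /// Moves a key to the most-recently-used end of the queue.
    fn touch(&mut self, text: &str) {
        if let Some(pos) = self.order.iter().position(|k| k == text) {
            if let Some(key) = self.order.remove(pos) {
                self.order.push_back(key);
            }
        }
    }

    pub fn stats(&self) -> CacheStats {
        self.stats
    }

    pub fn contains(&self, text: &str) -> bool {
        self.map.contains_key(text)
    }
}
```

The queue-based `touch` is O(n) per hit, which keeps the sketch short; a production cache would more likely use an intrusive linked list or a generation counter.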

Benchmarks

Measured on Apple M2 Max, Qwen3.5-0.8B, slope method (token throughput excluding prompt prefill and model load). Greedy decoding, median of 5 runs.

Context	Lattice (Q8, f16 head)	Ollama (Q8_0)	MLX (Q8 g64, AMX)	Lattice vs Ollama
64 tok	187 tok/s	93 tok/s	265 tok/s	2.0x
128 tok	171 tok/s	92 tok/s	263 tok/s	1.9x
256 tok	146 tok/s	88 tok/s	260 tok/s	1.6x
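
One way to read the "slope method" is sketched below, assuming it means fitting wall time against the number of generated tokens so that fixed costs (model load, prompt prefill) land in the intercept and only the per-token cost remains in the slope. The function names are hypothetical and the benchmark script may compute this differently.

```rust
/// Least-squares slope of wall time (seconds) against generated-token count.
/// Each sample is (tokens_generated, total_seconds).
pub fn slope_seconds_per_token(samples: &[(f64, f64)]) -> Option<f64> {
    if samples.len() < 2 {
        return None;
    }
    let n = samples.len() as f64;
    let mean_x = samples.iter().map(|s| s.0).sum::<f64>() / n;
    let mean_y = samples.iter().map(|s| s.1).sum::<f64>() / n;
    let sxx: f64 = samples.iter().map(|s| (s.0 - mean_x).powi(2)).sum();
    let sxy: f64 = samples
        .iter()
        .map(|s| (s.0 - mean_x) * (s.1 - mean_y))
        .sum();
    if sxx == 0.0 {
        return None; // all runs generated the same number of tokens
    }
    Some(sxy / sxx)
}

/// Decode throughput implied by the fitted slope; fixed overhead drops out.
pub fn decode_tokens_per_second(samples: &[(f64, f64)]) -> Option<f64> {
    slope_seconds_per_token(samples)
        .filter(|s| *s > 0.0)
        .map(|s| 1.0 / s)
}

/// Median of repeated measurements (the table reports a median of 5 runs).
pub fn median(values: &[f64]) -> Option<f64> {
    if values.is_empty() {
        return None;
    }
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    Some(if sorted.len() % 2 == 1 {
        sorted[mid]
    } else {
        (sorted[mid - 1] + sorted[mid]) / 2.0
    })
}
```

With 0.5 s of fixed overhead and 5 ms per token, the fit recovers 200 tok/s, whereas dividing 256 tokens by the 1.78 s total would report about 144 tok/s.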

MLX uses Apple's private MPS/AMX matrix engines. Lattice uses the public Metal compute API, the same tier as Ollama. MLX decodes faster than Lattice at raw throughput. Lattice's edge is portability (pure Rust, zero Python, zero framework) plus capabilities neither Ollama nor MLX provide for this model family:

Capability	Lattice	MLX	Ollama
QuaRot 4-bit (rotation-based quant)	yes	no	no
Q4 + LoRA hot-swap (no reload)	yes	no	no
Pure Rust, zero Python or framework	yes	no	no
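
The general idea behind rotation-based 4-bit quantization can be sketched as follows. This is an illustration of the technique family, not Lattice's QuaRot code: it assumes a normalized Hadamard rotation and plain symmetric absmax quantization with one scale per vector, and all function names are hypothetical. The rotation spreads a single outlier across every coordinate, so the shared scale is no longer dominated by one large value.

```rust
/// In-place normalized fast Walsh-Hadamard transform.
/// With the 1/sqrt(n) scaling the transform is orthogonal and its own inverse.
pub fn hadamard_rotate(x: &mut [f32]) {
    let n = x.len();
    assert!(n.is_power_of_two(), "length must be a power of two");
    let mut h = 1;
    while h < n {
        for block in (0..n).step_by(2 * h) {
            for i in block..block + h {
                let (a, b) = (x[i], x[i + h]);
                x[i] = a + b;
                x[i + h] = a - b;
            }
        }
        h *= 2;
    }
    let scale = 1.0 / (n as f32).sqrt();
    for v in x.iter_mut() {
        *v *= scale;
    }
}

/// Largest absolute value in the slice.
pub fn max_abs(x: &[f32]) -> f32 {
    x.iter().fold(0.0f32, |m, v| m.max(v.abs()))
}

/// Symmetric absmax 4-bit quantization: integer codes in [-7, 7] plus one scale.
pub fn quantize_q4(x: &[f32]) -> (Vec<i8>, f32) {
    let peak = max_abs(x);
    let scale = if peak == 0.0 { 1.0 } else { peak / 7.0 };
    let codes = x
        .iter()
        .map(|v| (v / scale).round().clamp(-7.0, 7.0) as i8)
        .collect();
    (codes, scale)
}

/// Maps integer codes back to floats using the stored scale.
pub fn dequantize_q4(codes: &[i8], scale: f32) -> Vec<f32> {
    codes.iter().map(|&c| c as f32 * scale).collect()
}
```

To recover values in the original basis, dequantize and apply `hadamard_rotate` again. For the vector `[10.0, 0.1, -0.2, 0.3, -0.1, 0.2, -0.3, 0.1]` the rotation lowers the peak magnitude from 10 to roughly 4, which shrinks the quantization step accordingly.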

Perplexity (PPL) parity confirmed: Lattice 20.60, MLX 20.67 on wikitext-2 (2048 tokens). Reproduce: `./scripts/bench_context_scaling.sh`

Crates

Crate	Description
`lattice-embed`	Embedding service: `EmbeddingService` trait, `NativeEmbeddingService`, `CachedEmbeddingService`, SIMD cosine/dot/euclidean, backfill
`lattice-inference`	Transformer kernel: safetensors loading, BERT/BGE/Qwen3 forward pass, three tokenizers, Metal/WGPU backends, LoRA hooks, KV cache, speculative decoding, quantization
`lattice-fann`	Fast neural network primitives: `NetworkBuilder`, pre-allocated layers, zero-alloc forward pass, backprop trainer
`lattice-tune`	Training: knowledge distillation pipeline, dataset management, LoRA adapter management, model registry
`lattice-transport`	Optimal transport: Sinkhorn-Knopp (balanced + unbalanced), Wasserstein barycenters, embedding drift detection

The three leaf crates (inference, fann, transport) have zero intra-workspace dependencies and can be used standalone.
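
For readers unfamiliar with the Sinkhorn-Knopp solver listed for `lattice-transport`, here is a minimal balanced-case sketch in plain Rust. It does not use the crate's API; the function names and the dense `Vec<Vec<f64>>` layout are assumptions made for illustration. It alternately rescales the rows and columns of the Gibbs kernel until the plan's marginals match the two input distributions.

```rust
/// Entropic-regularized optimal transport via Sinkhorn-Knopp scaling.
/// `cost[i][j]` is the cost of moving mass from source i to target j.
/// Returns the transport plan P = diag(u) * K * diag(v).
pub fn sinkhorn(
    a: &[f64],
    b: &[f64],
    cost: &[Vec<f64>],
    epsilon: f64,
    iters: usize,
) -> Vec<Vec<f64>> {
    let (n, m) = (a.len(), b.len());
    // Gibbs kernel K = exp(-C / epsilon).
    let k: Vec<Vec<f64>> = cost
        .iter()
        .map(|row| row.iter().map(|c| (-c / epsilon).exp()).collect())
        .collect();
    let mut u = vec![1.0; n];
    let mut v = vec![1.0; m];
    for _ in 0..iters {
        // Row update: make row sums of the plan match `a`.
        for i in 0..n {
            let kv: f64 = (0..m).map(|j| k[i][j] * v[j]).sum();
            u[i] = a[i] / kv;
        }
        // Column update: make column sums of the plan match `b`.
        for j in 0..m {
            let ktu: f64 = (0..n).map(|i| k[i][j] * u[i]).sum();
            v[j] = b[j] / ktu;
        }
    }
    (0..n)
        .map(|i| (0..m).map(|j| u[i] * k[i][j] * v[j]).collect())
        .collect()
}

/// Total cost of a plan: sum over i, j of P[i][j] * C[i][j].
pub fn transport_cost(plan: &[Vec<f64>], cost: &[Vec<f64>]) -> f64 {
    plan.iter()
        .zip(cost)
        .map(|(p, c)| p.iter().zip(c).map(|(x, y)| x * y).sum::<f64>())
        .sum()
}
```

A drift detector could compare the transport cost between two batches of embeddings against a threshold; how `lattice-transport` actually scores drift is not shown here.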

Quick Start: Embeddings

[dependencies]
lattice-embed = "0.4"
tokio = { version = "1", features = ["full"] }

use lattice_embed::{EmbeddingService, EmbeddingModel, NativeEmbeddingService};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let service = NativeEmbeddingService::default();

    // Single embedding (BGE-small-en-v1.5, 384 dimensions)
    let embedding = service
        .embed_one("The quick brown fox jumps over the lazy dog", EmbeddingModel::default())
        .await?;

    println!("Dimensions: {}", embedding.len()); // 384

    // Batch
    let texts = vec![
        "First document".to_string(),
        "Second document".to_string(),
    ];
    let embeddings = service.embed(&texts, EmbeddingModel::BgeSmallEnV15).await?;

    // SIMD-accelerated similarity
    let similarity = lattice_embed::utils::cosine_similarity(&embeddings[0], &embeddings[1]);
    println!("Similarity: {:.4}", similarity);

    Ok(())
}

Model weights are downloaded from HuggingFace on first use and cached at ~/.lattice/models (or $LATTICE_MODEL_CACHE).
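
The cache location rule can be sketched as a small pure function. This assumes the environment variable takes precedence over the home-directory default and that an empty value counts as unset; both are assumptions, and `resolve_model_cache` is a hypothetical helper, not part of the crate.

```rust
use std::path::PathBuf;

/// Picks the model cache directory: the override if set and non-empty,
/// otherwise `<home>/.lattice/models`. Returns None if neither is known.
pub fn resolve_model_cache(override_dir: Option<&str>, home_dir: Option<&str>) -> Option<PathBuf> {
    match override_dir {
        Some(dir) if !dir.is_empty() => Some(PathBuf::from(dir)),
        _ => home_dir.map(|home| PathBuf::from(home).join(".lattice").join("models")),
    }
}

/// Reads LATTICE_MODEL_CACHE and HOME from the process environment.
pub fn model_cache_from_env() -> Option<PathBuf> {
    let override_dir = std::env::var("LATTICE_MODEL_CACHE").ok();
    let home_dir = std::env::var("HOME").ok();
    resolve_model_cache(override_dir.as_deref(), home_dir.as_deref())
}
```

Keeping the resolution logic separate from the environment lookup makes it easy to test without mutating process-wide state.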

GPU acceleration (macOS)

lattice-embed = { version = "0.4", features = ["metal-gpu"] }

Cross-platform GPU

lattice-embed = { version = "0.4", features = ["wgpu-gpu"] }

Quick Start: CLI

Build from source (requires Rust 1.80+ and, for Metal, macOS 14+):

git clone https://github.com/ohdearquant/lattice
cd lattice

# CLI binary (chat + serve)
cargo build --release -p lattice-inference --bin lattice

# Interactive chat
./target/release/lattice chat --model ~/.lattice/models/qwen3.5-0.8b

# OpenAI-compatible HTTP server
./target/release/lattice serve --model ~/.lattice/models/qwen3.5-0.8b --port 8080

With Metal GPU (macOS only):

cargo build --release -p lattice-inference --bin lattice --features metal-gpu,f16

HTTP API

`lattice serve` exposes an OpenAI-compatible endpoint:

curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-0.8b",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 128,
    "temperature": 0.7
  }'
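
The same request can be assembled from Rust with only the standard library. The sketch below builds the JSON body and the raw HTTP/1.1 request text; the endpoint path and the four fields come from the curl example, while the helper names are hypothetical. A real client would more likely use an HTTP crate and a JSON serializer.

```rust
/// Escapes a string for use inside a JSON string literal.
pub fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for ch in s.chars() {
        match ch {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Builds the JSON body for a single-turn chat completion request.
pub fn chat_request_body(model: &str, user_message: &str, max_tokens: u32, temperature: f32) -> String {
    format!(
        "{{\"model\":\"{}\",\"messages\":[{{\"role\":\"user\",\"content\":\"{}\"}}],\"max_tokens\":{},\"temperature\":{}}}",
        json_escape(model),
        json_escape(user_message),
        max_tokens,
        temperature
    )
}

/// Wraps a JSON body in a minimal HTTP/1.1 POST to /v1/chat/completions.
pub fn chat_http_request(host: &str, port: u16, body: &str) -> String {
    format!(
        "POST /v1/chat/completions HTTP/1.1\r\nHost: {host}:{port}\r\nContent-Type: application/json\r\nContent-Length: {}\r\nConnection: close\r\n\r\n{body}",
        body.len()
    )
}
```

To send it, open a `std::net::TcpStream` to `127.0.0.1:8080`, write the request bytes, and read the response until the server closes the connection.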

Lattice Studio (macOS App)

Lattice Studio is a native SwiftUI app for macOS 14+. It wraps the Rust engine in an instrument-panel interface: live loss curves, before/after quantization comparisons, LoRA hot-swap in chat, and a model library manager.

Build and package

# Requires: Xcode (Swift 6.3+), Rust toolchain
./apps/macos/scripts/package-app.sh

This builds the Swift frontend, compiles all Rust engine binaries in release mode, and assembles a self-contained LatticeStudio.app bundle at apps/macos/dist/. A .dmg and .zip are also produced. The packaged app needs no Rust toolchain on the recipient machine.

# Skip Swift rebuild (use existing build output)
./apps/macos/scripts/package-app.sh --skip-build

# Skip Cargo rebuild
./apps/macos/scripts/package-app.sh --skip-cargo

Install

Drag LatticeStudio.app from the .dmg to /Applications.

The app is ad-hoc signed. On first launch, right-click and choose "Open" to bypass Gatekeeper, then click "Open" in the dialog. macOS remembers the exception for subsequent launches. Alternatively:

xattr -dr com.apple.quarantine /Applications/LatticeStudio.app

What is in Lattice Studio

Models (cmd-1). A table of all local models and adapters under ~/.lattice/models, with file manifests, config details (18 GatedDeltaNet + 6 GQA layers called out explicitly), and Download/Verify/Reveal in Finder actions.

Train (cmd-2). LoRA fine-tuning with a live loss oscilloscope. A 56-point hero loss numeral ticks digit-by-digit as the run progresses. Scrub the loss curve to freeze all metric readouts (lr, grad-norm, tok/s, ETA) to any step. Results stream as line-delimited JSON from the engine.

Quantize (cmd-3). Q4 or QuaRot quantization. A side-by-side comparison shows size, bits, and estimated PPL before and after. True-scale bars animate to show the compression ratio. A gate pill states the result: PASS, WARN, or FAIL.

Chat (cmd-4). Side-by-side base vs. base+adapter columns streaming the same prompt in parallel. The adapter selector is a console fader: slide it and the engine swaps the adapter with no reload. A "0 ms reload" stamp confirms it. Each column shows live tok/s and time-to-first-token.

Data (cmd-5). Paste raw text or load files. Preview the derived {prompt, completion} pairs as a JSONL table, validate token counts (using the same tokenizer the engine uses), and export straight into the Train dataset field.

Runs (cmd-6). Archive of every training and quantization run. Select a row to reopen its exact config and view the frozen loss curve.

The command bar (cmd-K) runs everything from a single keyboard shortcut: typing `train qwen3.5 r8` or `quantize quarot` parses the text into argument chips and fires a run.
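
Why the adapter swap in the Chat pane needs no reload can be seen from the standard LoRA formulation, sketched below with hypothetical types. The base weight matrix is loaded once and never modified; the adapter contributes a separate low-rank term, so replacing it only swaps two small matrices. This is an illustration of the general technique, not Lattice's inference hook, and it uses dense f32 weights where the engine would use quantized ones.

```rust
/// Low-rank adapter: delta = (alpha / rank) * B * (A * x).
pub struct LoraAdapter {
    pub a: Vec<Vec<f32>>, // rank x in_dim
    pub b: Vec<Vec<f32>>, // out_dim x rank
    pub alpha: f32,
}

/// A linear layer whose base weights stay fixed while adapters come and go.
pub struct Linear {
    pub weight: Vec<Vec<f32>>, // out_dim x in_dim, loaded once
    pub adapter: Option<LoraAdapter>,
}

/// Dense matrix-vector product over row-major rows.
fn matvec(m: &[Vec<f32>], x: &[f32]) -> Vec<f32> {
    m.iter()
        .map(|row| row.iter().zip(x).map(|(w, v)| w * v).sum::<f32>())
        .collect()
}

impl Linear {
    /// y = W x, plus the scaled low-rank correction when an adapter is attached.
    pub fn forward(&self, x: &[f32]) -> Vec<f32> {
        let mut y = matvec(&self.weight, x);
        if let Some(adapter) = &self.adapter {
            let rank = adapter.a.len() as f32;
            let low = matvec(&adapter.a, x);
            let delta = matvec(&adapter.b, &low);
            for (yi, di) in y.iter_mut().zip(&delta) {
                *yi += (adapter.alpha / rank) * di;
            }
        }
        y
    }

    /// Replaces the adapter without touching the base weights.
    /// Returns the previously attached adapter, if any.
    pub fn swap_adapter(&mut self, adapter: Option<LoraAdapter>) -> Option<LoraAdapter> {
        std::mem::replace(&mut self.adapter, adapter)
    }
}
```

Merging the adapter into the base weights would make each forward pass slightly cheaper, but it would also make a swap require recomputing or reloading those weights, which is the cost the unmerged form avoids.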

Supported Models

Embedding models

Variant	HuggingFace ID	Dims	Max tokens	Auto-download
`BgeSmallEnV15`	`BAAI/bge-small-en-v1.5`	384	512	yes
`BgeBaseEnV15`	`BAAI/bge-base-en-v1.5`	768	512	yes
`BgeLargeEnV15`	`BAAI/bge-large-en-v1.5`	1024	512	yes
`MultilingualE5Small`	`intfloat/multilingual-e5-small`	384	512	yes
`MultilingualE5Base`	`intfloat/multilingual-e5-base`	768	512	yes
`AllMiniLmL6V2`	`sentence-transformers/all-MiniLM-L6-v2`	384	256	yes
`ParaphraseMultilingualMiniLmL12V2`	`sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`	384	128	yes
`Qwen3Embedding0_6B`	`Qwen/Qwen3-Embedding-0.6B`	1024	8192	local dir only
`Qwen3Embedding4B`	`Qwen/Qwen3-Embedding-4B`	2560*	8192	local dir only

*Qwen3-Embedding-4B and Qwen3-Embedding-0.6B support MRL truncation to any dimension >= 32. BGE v1.5 uses CLS pooling. E5 and MiniLM use mean pooling. For E5 models, `embed_passage()` applies the "passage: " prefix automatically.
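
The two pooling strategies named above differ only in how per-token hidden states are reduced to one vector. A minimal sketch, with hypothetical function names and a plain `Vec<Vec<f32>>` standing in for the model's output tensor:

```rust
/// CLS pooling: take the hidden state of the first token.
pub fn cls_pool(hidden: &[Vec<f32>]) -> Option<Vec<f32>> {
    hidden.first().cloned()
}

/// Mean pooling: average hidden states over tokens whose mask entry is non-zero,
/// so padding tokens do not dilute the result.
pub fn mean_pool(hidden: &[Vec<f32>], mask: &[u8]) -> Option<Vec<f32>> {
    let dim = hidden.first()?.len();
    let mut sum = vec![0.0f32; dim];
    let mut count = 0usize;
    for (row, &m) in hidden.iter().zip(mask) {
        if m != 0 {
            for (s, v) in sum.iter_mut().zip(row) {
                *s += v;
            }
            count += 1;
        }
    }
    if count == 0 {
        return None;
    }
    for s in &mut sum {
        *s /= count as f32;
    }
    Some(sum)
}

/// Prepends the E5 passage prefix quoted in the note above.
pub fn with_passage_prefix(text: &str) -> String {
    format!("passage: {text}")
}
```

Both functions return `None` for empty input rather than panicking, which is a choice made for this sketch.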

Generation models (local files required)

Config preset	Description
`Qwen35Config::qwen35_0_8b`	24 layers, 1024 hidden, 1 MTP layer. Base decode shipped; MTP experimental.
`Qwen35Config::qwen35_2b`	24 layers, 2048 hidden, dense FFN, tied embeddings.
`Qwen35Config::qwen36_35b_a3b`	40 layers, MoE 256 experts top-8. Config and weight loader supported.
`Qwen35Config::qwen36_27b`	64 layers, 5120 hidden, dense FFN.

In the 24-layer presets, the Qwen3.5 architecture uses a hybrid of 18 GatedDeltaNet layers and 6 GQA attention layers. To our knowledge, Lattice is the only open-source engine that correctly runs this hybrid recurrence at Q4 on Apple Silicon.
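
The 18 + 6 split corresponds to a 3:1 ratio. The sketch below assumes the layers are interleaved so that every fourth layer is GQA attention; that placement is an assumption for illustration, and the authoritative layout is whatever the model config specifies. `LayerKind` and `layer_schedule` are hypothetical names.

```rust
/// The two layer types in the hybrid stack.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LayerKind {
    GatedDeltaNet,
    Gqa,
}

/// Builds a schedule where every `interval`-th layer is GQA attention
/// and the rest are GatedDeltaNet (assumed interleaving).
pub fn layer_schedule(num_layers: usize, interval: usize) -> Vec<LayerKind> {
    assert!(interval > 0, "interval must be positive");
    (0..num_layers)
        .map(|i| {
            if (i + 1) % interval == 0 {
                LayerKind::Gqa
            } else {
                LayerKind::GatedDeltaNet
            }
        })
        .collect()
}
```

With 24 layers and an interval of 4 this yields 18 GatedDeltaNet layers and 6 GQA layers. Only the GQA layers need a growing key-value cache; the recurrent layers carry a fixed-size state.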

Model Selection (Embeddings)

use lattice_embed::EmbeddingModel;

// Fast English retrieval
let model = EmbeddingModel::BgeSmallEnV15;   // 384-dim, fastest, auto-download

// Balanced quality/speed
let model = EmbeddingModel::BgeBaseEnV15;    // 768-dim, auto-download

// Best quality, English
let model = EmbeddingModel::BgeLargeEnV15;   // 1024-dim, auto-download

// Multilingual retrieval
let model = EmbeddingModel::MultilingualE5Base;  // 768-dim, 100+ languages

// Long context + multilingual (local files required)
let model = EmbeddingModel::Qwen3Embedding0_6B;  // 1024-dim, 8K context

// MRL: variable output dimension
use lattice_embed::ModelConfig;
let config = ModelConfig::try_new(EmbeddingModel::Qwen3Embedding4B, Some(512))?;
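
What MRL truncation does to a vector can be sketched without the crate. The usual Matryoshka recipe is to keep the leading dimensions and re-normalize; whether Lattice re-normalizes internally is an assumption here, and `mrl_truncate`, `MrlError`, and `MIN_MRL_DIM` are hypothetical names that mirror the documented lower bound of 32.

```rust
/// Smallest output dimension allowed, per the model table note.
pub const MIN_MRL_DIM: usize = 32;

/// Reasons a requested dimension is rejected.
#[derive(Debug, PartialEq)]
pub enum MrlError {
    TooSmall(usize),
    ExceedsNative { requested: usize, native: usize },
}

/// Keeps the first `dim` components and rescales the result to unit L2 norm.
pub fn mrl_truncate(embedding: &[f32], dim: usize) -> Result<Vec<f32>, MrlError> {
    if dim < MIN_MRL_DIM {
        return Err(MrlError::TooSmall(dim));
    }
    if dim > embedding.len() {
        return Err(MrlError::ExceedsNative {
            requested: dim,
            native: embedding.len(),
        });
    }
    let mut out = embedding[..dim].to_vec();
    let norm = out.iter().map(|v| v * v).sum::<f32>().sqrt();
    if norm > 0.0 {
        for v in &mut out {
            *v /= norm;
        }
    }
    Ok(out)
}
```

Re-normalizing matters when downstream code treats the dot product as cosine similarity, since a truncated unit vector is no longer unit length.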

Vector Operations

`lattice-embed` exposes SIMD-accelerated vector utilities as a stable public API:

use lattice_embed::utils;

// Runtime dispatch: AVX2 on x86_64, NEON on aarch64, scalar fallback elsewhere
let sim = utils::cosine_similarity(&a, &b);
let dot = utils::dot_product(&a, &b);
let dist = utils::euclidean_distance(&a, &b);

utils::normalize(&mut vector);  // in-place L2 normalization

// Batch operations
let sims = utils::batch_cosine_similarity(&pairs);

Measured performance on normalized vectors (internal benchmarks; results vary with hardware):

Operation	Scalar	SIMD
cosine similarity (384-dim)	~650 ns	~90 ns
cosine similarity (768-dim)	~1300 ns	~180 ns
cosine similarity (1024-dim)	~1700 ns	~240 ns
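
For reference, the scalar definitions these kernels accelerate are short. The functions below are a plain-Rust baseline with `_scalar` suffixes to make clear they are not the `lattice_embed::utils` functions; the zero-vector handling is a choice made for this sketch.

```rust
/// Sum of element-wise products.
pub fn dot_product_scalar(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have equal length");
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Straight-line (L2) distance between two vectors.
pub fn euclidean_distance_scalar(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have equal length");
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y) * (x - y))
        .sum::<f32>()
        .sqrt()
}

/// In-place L2 normalization; a zero vector is left unchanged.
pub fn normalize_scalar(v: &mut [f32]) {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in v.iter_mut() {
            *x /= norm;
        }
    }
}

/// Cosine of the angle between two vectors; 0.0 if either has zero norm.
pub fn cosine_similarity_scalar(a: &[f32], b: &[f32]) -> f32 {
    let denom = dot_product_scalar(a, a).sqrt() * dot_product_scalar(b, b).sqrt();
    if denom == 0.0 {
        0.0
    } else {
        dot_product_scalar(a, b) / denom
    }
}
```

For vectors that are already unit length, cosine similarity reduces to the dot product, so the two norm computations can be skipped.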

Feature Flags

lattice-embed

Feature	Default	Description
`native`	yes	Pure Rust inference via `lattice-inference`
`metal-gpu`	no	Metal GPU acceleration (macOS)
`avx512`	no	AVX-512 SIMD kernels (requires nightly)

lattice-inference

Feature	Default	Description
`f16`	no	Half-precision weights
`metal-gpu`	no	Metal compute backend
`wgpu-gpu`	no	WGPU cross-platform GPU backend
`download`	yes	HuggingFace weight download with checksum verification
`backfill`	no	Re-embedding coordinator (requires `rusqlite`)

Architecture

Application
    |
    v
lattice-embed          (public API: embedding service, SIMD distance ops, LRU cache)
    |
    v
lattice-inference      (transformer kernel: BERT/Qwen3 forward pass, tokenizers, weights)
    |
    +---> CPU (primary)      Metal (macOS)     WGPU (fallback)
          AVX2/NEON kernels   Apple Silicon      Vulkan/DX12


lattice-fann           (standalone: tiny network primitives, <5ms CPU inference)
lattice-transport      (standalone: optimal transport math, Wasserstein distances)
lattice-tune           (depends on fann + inference: LoRA, distillation, model registry)

Running Benchmarks

Embedding throughput

cargo bench --package lattice-embed

Metal GPU decode (macOS only, requires model weights)

cargo bench -p lattice-inference --features metal-gpu,f16 -- metal_decode

Context scaling (Qwen3.5-0.8B vs Ollama vs MLX)

./scripts/bench_context_scaling.sh

Performance depends on hardware, model size, batch size, and sequence length. Run benchmarks on your target hardware for representative numbers.

Documentation

Architecture: crate dependency graph, design decisions, stability tiers
Models: full model support matrix, attention variants, inference features
Getting started: step-by-step setup guide
Examples: code samples for common tasks
ADR directory: docs/adr/

Publishing

Leaf crates publish first. The order is inference, fann, transport, then (after a 30-second wait) embed, then (after another 30-second wait) tune.

make publish
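
The ordering constraint is a topological sort of the workspace dependency graph. A minimal sketch follows, using the dependencies stated in the Architecture section (embed on inference, tune on fann and inference); `publish_order` is a hypothetical helper and not what the Makefile runs.

```rust
/// Returns an order in which every crate comes after all of its dependencies.
/// Each entry is (crate name, names of workspace crates it depends on).
/// Returns None if there is a cycle or a dependency that is not in the list.
pub fn publish_order<'a>(crates: &[(&'a str, Vec<&'a str>)]) -> Option<Vec<&'a str>> {
    let mut order: Vec<&'a str> = Vec::with_capacity(crates.len());
    while order.len() < crates.len() {
        let before = order.len();
        for (name, deps) in crates {
            let ready = deps.iter().all(|d| order.contains(d));
            if ready && !order.contains(name) {
                order.push(*name);
            }
        }
        if order.len() == before {
            return None; // no progress: cycle or unknown dependency
        }
    }
    Some(order)
}
```

Ties are broken by the order of the input list, so listing the crates as inference, fann, transport, embed, tune reproduces the sequence given above. The waits between steps are not modeled.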

License

Apache-2.0. See LICENSE.

Built by Ocean (HaiyangLi). Powers khive, a cognitive infrastructure for AI agents.
