Pure Rust inference engine for transformer models on Apple Silicon, with a native macOS app.
No ONNX. No Python. No CUDA. No external ML runtime. Lattice implements the full compute graph in Rust: weight loading, tokenization, forward pass, vector operations, quantization, and LoRA training. Hand-written Metal shaders accelerate inference on Apple Silicon. SIMD kernels (AVX2 on x86, NEON on ARM) handle the CPU path.
Lattice is two things in one repo.
A Rust inference library. Five published crates covering embeddings, generation, quantization,
LoRA fine-tuning, and optimal transport. Use lattice-embed as a library dependency, or run
the lattice binary for interactive chat and an OpenAI-compatible HTTP server.
Lattice Studio: a native macOS app. A SwiftUI instrument panel that drives the Rust engine via CLI subprocesses. Train LoRA adapters with a live loss oscilloscope, quantize models with a before/after comparison, hot-swap adapters in chat with zero reload, and manage your model library from a single window.
| Feature | Details |
|---|---|
| Pure Rust compute | Hand-written SIMD kernels (AVX2/NEON). No C++, no ONNX, no CUDA. |
| Metal GPU backend | Native Apple Silicon acceleration via Metal MSL shaders. WGPU fallback for cross-platform. |
| Generation models | Qwen3.5-0.8B / 2B, Qwen3.6-35B-A3B (MoE), Qwen3.6-27B. Hybrid GatedDeltaNet + GQA architecture. |
| Embedding models | 9 models: BGE, E5, MiniLM, Qwen3-Embedding families. Auto-download for 7 BERT-family variants. |
| Three tokenizers | WordPiece, SentencePiece, BPE. No Hugging Face tokenizers C extension. |
| Quantization | Q8, Q4, and QuaRot (rotation-based 4-bit). No other engine runs Q4 + LoRA hot-swap on Qwen3.5. |
| LoRA | Inference hook, hot-swap with no reload, PEFT safetensors format, training via lattice-tune. |
| HTTP API | OpenAI-compatible /v1/chat/completions via lattice serve. |
| Safetensors native | Memory-mapped weight loading. Single-file and sharded checkpoints. |
| KV cache | Incremental decoding with key-value caching. |
| Speculative decoding | Draft-model acceleration on the CPU path. |
| Grammar decoding | Constrained output via a pushdown automaton. OpenAI-style string-level stop sequences. |
| MRL support | Matryoshka truncation for Qwen3-Embedding models (output dimension >= 32). |
| LRU cache | CachedEmbeddingService with sharded in-memory cache and hit/miss stats. |
| Knowledge distillation | Train small models from Claude/GPT/Gemini teacher soft labels via lattice-tune. |
| Optimal transport | Sinkhorn-Knopp solver for embedding drift detection via lattice-transport. |
Measured on Apple M2 Max, Qwen3.5-0.8B, slope method (token throughput excluding prompt prefill and model load). Greedy decoding, median of 5 runs.
| Context | Lattice (Q8, f16 head) | Ollama (Q8_0) | MLX (Q8 g64, AMX) | Lattice vs Ollama |
|---|---|---|---|---|
| 64 tok | 187 tok/s | 93 | 265 | 2.0x |
| 128 tok | 171 tok/s | 92 | 263 | 1.9x |
| 256 tok | 146 tok/s | 88 | 260 | 1.6x |
MLX uses Apple's private MPS/AMX matrix engines. Lattice uses the public Metal compute API, the same tier as Ollama. MLX decodes faster than Lattice at raw throughput. Lattice's edge is portability (pure Rust, zero Python, zero framework) plus capabilities neither Ollama nor MLX provide for this model family:
| Capability | Lattice | MLX | Ollama |
|---|---|---|---|
| QuaRot 4-bit (rotation-based quant) | yes | no | no |
| Q4 + LoRA hot-swap (no reload) | yes | no | no |
| Pure Rust, zero Python or framework | yes | no | no |
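The throughput table above is reported with a "slope method". One plausible reading, sketched here as an assumption rather than as the contents of the benchmark script, is to time several generation lengths, fit a straight line to elapsed time against tokens generated, and take the inverse of the slope as tok/s. The intercept then absorbs fixed costs such as prompt prefill and model load. The names `linear_fit`, `decode_tok_per_sec`, and `median` are hypothetical.

```rust
/// Least-squares slope and intercept of `y` against `x`.
/// Returns None for mismatched lengths, fewer than two points, or constant `x`.
pub fn linear_fit(x: &[f64], y: &[f64]) -> Option<(f64, f64)> {
    if x.len() != y.len() || x.len() < 2 {
        return None;
    }
    let n = x.len() as f64;
    let mean_x = x.iter().sum::<f64>() / n;
    let mean_y = y.iter().sum::<f64>() / n;
    let sxx: f64 = x.iter().map(|xi| (xi - mean_x).powi(2)).sum();
    if sxx == 0.0 {
        return None;
    }
    let sxy: f64 = x
        .iter()
        .zip(y)
        .map(|(xi, yi)| (xi - mean_x) * (yi - mean_y))
        .sum();
    let slope = sxy / sxx;
    Some((slope, mean_y - slope * mean_x))
}

/// Decode throughput (tokens per second) from samples of
/// (tokens generated, elapsed seconds). The fitted intercept, which holds
/// the fixed prefill and load cost, is discarded.
pub fn decode_tok_per_sec(tokens: &[f64], seconds: &[f64]) -> Option<f64> {
    let (slope, _intercept) = linear_fit(tokens, seconds)?;
    if slope > 0.0 {
        Some(1.0 / slope)
    } else {
        None
    }
}

/// Median of a set of per-run throughput values (sorts the slice in place).
pub fn median(values: &mut [f64]) -> Option<f64> {
    if values.is_empty() {
        return None;
    }
    values.sort_by(|a, b| a.total_cmp(b));
    let mid = values.len() / 2;
    Some(if values.len() % 2 == 1 {
        values[mid]
    } else {
        (values[mid - 1] + values[mid]) / 2.0
    })
}
```

Fitting a line instead of dividing total tokens by total time keeps a slow first token from dragging the reported decode rate down.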
PPL parity confirmed: Lattice 20.60, MLX 20.67 on wikitext-2 (2048 tokens). Reproduce:

```bash
./scripts/bench_context_scaling.sh
```
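For readers unfamiliar with the metric, perplexity is the exponential of the mean negative log-likelihood the model assigns to the reference tokens. The sketch below shows that definition only. How Lattice or MLX window the 2048 wikitext-2 tokens is not specified here, and the function name `perplexity` is hypothetical.

```rust
/// Perplexity from the natural-log probabilities the model assigned to each
/// reference token. Returns None for an empty input.
pub fn perplexity(token_log_probs: &[f64]) -> Option<f64> {
    if token_log_probs.is_empty() {
        return None;
    }
    // Mean negative log-likelihood per token.
    let mean_nll = -token_log_probs.iter().sum::<f64>() / token_log_probs.len() as f64;
    Some(mean_nll.exp())
}
```

A model that assigns probability 1/4 to every reference token has a perplexity of 4, so lower is better and two engines running the same weights should land close together.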
| Crate | Description |
|---|---|
| lattice-embed | Embedding service: EmbeddingService trait, NativeEmbeddingService, CachedEmbeddingService, SIMD cosine/dot/euclidean, backfill |
| lattice-inference | Transformer kernel: safetensors loading, BERT/BGE/Qwen3 forward pass, three tokenizers, Metal/WGPU backends, LoRA hooks, KV cache, speculative decoding, quantization |
| lattice-fann | Fast neural network primitives: NetworkBuilder, pre-allocated layers, zero-alloc forward pass, backprop trainer |
| lattice-tune | Training: knowledge distillation pipeline, dataset management, LoRA adapter management, model registry |
| lattice-transport | Optimal transport: Sinkhorn-Knopp (balanced + unbalanced), Wasserstein barycenters, embedding drift detection |
The three leaf crates (inference, fann, transport) have zero intra-workspace dependencies
and can be used standalone.
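To give a feel for what the Sinkhorn-Knopp solver in lattice-transport computes, here is a minimal, self-contained sketch of the balanced case. It is not the crate's API: the names `sinkhorn` and `transport_cost`, the dense `Vec<Vec<f64>>` layout, and the fixed iteration count are assumptions made for illustration.

```rust
/// Entropy-regularized optimal transport between histograms `a` and `b`
/// (each summing to 1) under `cost`, an a.len() x b.len() matrix.
/// Returns the transport plan; larger `epsilon` means a smoother plan.
pub fn sinkhorn(
    a: &[f64],
    b: &[f64],
    cost: &[Vec<f64>],
    epsilon: f64,
    iterations: usize,
) -> Vec<Vec<f64>> {
    let (n, m) = (a.len(), b.len());
    assert!(cost.len() == n && cost.iter().all(|row| row.len() == m));

    // Gibbs kernel K = exp(-C / epsilon).
    let k: Vec<Vec<f64>> = cost
        .iter()
        .map(|row| row.iter().map(|c| (-c / epsilon).exp()).collect())
        .collect();

    let mut u = vec![1.0; n];
    let mut v = vec![1.0; m];
    for _ in 0..iterations {
        // Row scaling: u = a / (K v).
        for i in 0..n {
            let kv: f64 = (0..m).map(|j| k[i][j] * v[j]).sum();
            u[i] = a[i] / kv;
        }
        // Column scaling: v = b / (K^T u).
        for j in 0..m {
            let ktu: f64 = (0..n).map(|i| k[i][j] * u[i]).sum();
            v[j] = b[j] / ktu;
        }
    }

    // Plan P = diag(u) K diag(v).
    (0..n)
        .map(|i| (0..m).map(|j| u[i] * k[i][j] * v[j]).collect())
        .collect()
}

/// Total cost of a plan: the sum over i, j of P[i][j] * C[i][j].
pub fn transport_cost(plan: &[Vec<f64>], cost: &[Vec<f64>]) -> f64 {
    plan.iter()
        .zip(cost)
        .map(|(p, c)| p.iter().zip(c).map(|(x, y)| x * y).sum::<f64>())
        .sum()
}
```

For drift detection, `a` and `b` would be weights over two sets of embeddings and `cost` their pairwise distances. A rising transport cost between snapshots suggests the embedding distribution has moved.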
```toml
[dependencies]
lattice-embed = "0.3"
tokio = { version = "1", features = ["full"] }
```

```rust
use lattice_embed::{EmbeddingService, EmbeddingModel, NativeEmbeddingService};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let service = NativeEmbeddingService::default();

    // Single embedding (BGE-small-en-v1.5, 384 dimensions)
    let embedding = service
        .embed_one("The quick brown fox jumps over the lazy dog", EmbeddingModel::default())
        .await?;
    println!("Dimensions: {}", embedding.len()); // 384

    // Batch
    let texts = vec![
        "First document".to_string(),
        "Second document".to_string(),
    ];
    let embeddings = service.embed(&texts, EmbeddingModel::BgeSmallEnV15).await?;

    // SIMD-accelerated similarity
    let similarity = lattice_embed::utils::cosine_similarity(&embeddings[0], &embeddings[1]);
    println!("Similarity: {:.4}", similarity);

    Ok(())
}
```

Model weights are downloaded from HuggingFace on first use and cached at `~/.lattice/models`
(or `$LATTICE_MODEL_CACHE`).
```toml
# Metal GPU acceleration (macOS)
lattice-embed = { version = "0.4", features = ["metal-gpu"] }

# Or: WGPU cross-platform GPU backend
lattice-embed = { version = "0.4", features = ["wgpu-gpu"] }
```

Build from source (requires Rust 1.80+ and, for Metal, macOS 14+):

```bash
git clone https://github.com/ohdearquant/lattice
cd lattice

# CLI binary (chat + serve)
cargo build --release -p lattice-inference --bin lattice

# Interactive chat
./target/release/lattice chat --model ~/.lattice/models/qwen3.5-0.8b

# OpenAI-compatible HTTP server
./target/release/lattice serve --model ~/.lattice/models/qwen3.5-0.8b --port 8080
```

With Metal GPU (macOS only):

```bash
cargo build --release -p lattice-inference --bin lattice --features metal-gpu,f16
```

`lattice serve` exposes an OpenAI-compatible endpoint:
```bash
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-0.8b",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 128,
    "temperature": 0.7
  }'
```

Lattice Studio is a native SwiftUI app for macOS 14+. It wraps the Rust engine in an instrument-panel interface: live loss curves, before/after quantization comparisons, LoRA hot-swap in chat, and a model library manager.
```bash
# Requires: Xcode (Swift 6.3+), Rust toolchain
./apps/macos/scripts/package-app.sh
```

This builds the Swift frontend, compiles all Rust engine binaries in release mode, and
assembles a self-contained LatticeStudio.app bundle at `apps/macos/dist/`. A .dmg
and .zip are also produced. The packaged app needs no Rust toolchain on the recipient
machine.
```bash
# Skip Swift rebuild (use existing build output)
./apps/macos/scripts/package-app.sh --skip-build

# Skip Cargo rebuild
./apps/macos/scripts/package-app.sh --skip-cargo
```

Drag LatticeStudio.app from the .dmg to /Applications.

The app is ad-hoc signed. On first launch, right-click and choose "Open" to bypass Gatekeeper, then click "Open" in the dialog. macOS remembers the exception for subsequent launches. Alternatively:

```bash
xattr -dr com.apple.quarantine /Applications/LatticeStudio.app
```

Models (cmd-1). A table of all local models and adapters under `~/.lattice/models`, with
file manifests, config details (18 GatedDeltaNet + 6 GQA layers called out explicitly), and
Download/Verify/Reveal in Finder actions.
Train (cmd-2). LoRA fine-tuning with a live loss oscilloscope. A 56-point hero loss numeral ticks digit-by-digit as the run progresses. Scrub the loss curve to freeze all metric readouts (lr, grad-norm, tok/s, ETA) to any step. Results stream as line-delimited JSON from the engine.
Quantize (cmd-3). Q4 or QuaRot quantization. A side-by-side comparison shows size, bits, and estimated PPL before and after. True-scale bars animate to show the compression ratio. A gate pill states the result: PASS, WARN, or FAIL.
Chat (cmd-4). Side-by-side base vs. base+adapter columns streaming the same prompt in parallel. The adapter selector is a console fader: slide it and the engine swaps the adapter with no reload. A "0 ms reload" stamp confirms it. Each column shows live tok/s and time-to-first-token.
Data (cmd-5). Paste raw text or load files. Preview the derived {prompt, completion} pairs
as a JSONL table, validate token counts (using the same tokenizer the engine uses), and export
straight into the Train dataset field.
Runs (cmd-6). Archive of every training and quantization run. Select a row to reopen its exact config and view the frozen loss curve.
The command bar (cmd-K) runs everything from a single keyboard shortcut:
typing `train qwen3.5 r8` or `quantize quarot` parses the text into argument chips and fires a run.
| Variant | HuggingFace ID | Dims | Max tokens | Auto-download |
|---|---|---|---|---|
| BgeSmallEnV15 | BAAI/bge-small-en-v1.5 | 384 | 512 | yes |
| BgeBaseEnV15 | BAAI/bge-base-en-v1.5 | 768 | 512 | yes |
| BgeLargeEnV15 | BAAI/bge-large-en-v1.5 | 1024 | 512 | yes |
| MultilingualE5Small | intfloat/multilingual-e5-small | 384 | 512 | yes |
| MultilingualE5Base | intfloat/multilingual-e5-base | 768 | 512 | yes |
| AllMiniLmL6V2 | sentence-transformers/all-MiniLM-L6-v2 | 384 | 256 | yes |
| ParaphraseMultilingualMiniLmL12V2 | sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 | 384 | 128 | yes |
| Qwen3Embedding0_6B | Qwen/Qwen3-Embedding-0.6B | 1024 | 8192 | local dir only |
| Qwen3Embedding4B | Qwen/Qwen3-Embedding-4B | 2560* | 8192 | local dir only |
*Qwen3-Embedding-4B and 0.6B support MRL truncation to any dimension >= 32.
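Matryoshka (MRL) truncation is conceptually simple: keep the leading `dim` components of the full embedding and re-normalize. The sketch below illustrates that idea with the same lower bound of 32 stated above. `mrl_truncate` is a hypothetical helper, not part of the lattice-embed API, and re-normalizing to unit length is an assumption suited to cosine-similarity use.

```rust
/// Keep the first `dim` components of an MRL-trained embedding and rescale
/// the result to unit L2 norm. Returns None if `dim` is below 32 or larger
/// than the input length.
pub fn mrl_truncate(embedding: &[f32], dim: usize) -> Option<Vec<f32>> {
    if dim < 32 || dim > embedding.len() {
        return None;
    }
    let mut out = embedding[..dim].to_vec();
    let norm = out.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in &mut out {
            *x /= norm;
        }
    }
    Some(out)
}
```

Truncating a 2560-dimensional vector to 512 cuts index storage by 5x. This only works for models trained with an MRL objective, which is why the option is limited to the Qwen3-Embedding family.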
BGE v1.5 uses CLS pooling. E5 and MiniLM use mean pooling.
E5 embed_passage() applies the "passage: " prefix automatically.
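The two pooling strategies named above differ only in how per-token hidden states are reduced to a single vector. The following is a minimal sketch under assumed shapes (one `Vec<f32>` per token, the CLS token at position 0, and a 0/1 attention mask). The function names are hypothetical and do not mirror lattice-inference internals.

```rust
/// CLS pooling: use the hidden state of the first token as the embedding.
pub fn cls_pool(token_embeddings: &[Vec<f32>]) -> Option<Vec<f32>> {
    token_embeddings.first().cloned()
}

/// Mean pooling: average the hidden states of all non-padding tokens.
/// `attention_mask[i]` is 1 for a real token and 0 for padding.
pub fn mean_pool(token_embeddings: &[Vec<f32>], attention_mask: &[u8]) -> Option<Vec<f32>> {
    if token_embeddings.is_empty() || token_embeddings.len() != attention_mask.len() {
        return None;
    }
    let dim = token_embeddings[0].len();
    let mut sum = vec![0.0f32; dim];
    let mut count = 0usize;
    for (token, &mask) in token_embeddings.iter().zip(attention_mask) {
        if mask == 0 {
            continue; // skip padding positions
        }
        for (s, x) in sum.iter_mut().zip(token) {
            *s += x;
        }
        count += 1;
    }
    if count == 0 {
        return None;
    }
    for s in &mut sum {
        *s /= count as f32;
    }
    Some(sum)
}
```

Using the wrong pooling for a model family usually degrades retrieval quality without producing an error, so the choice is tied to the model variant instead of being left to the caller.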
| Config preset | Description |
|---|---|
| Qwen35Config::qwen35_0_8b | 24 layers, 1024 hidden, 1 MTP layer. Base decode shipped; MTP experimental. |
| Qwen35Config::qwen35_2b | 24 layers, 2048 hidden, dense FFN, tied embeddings. |
| Qwen35Config::qwen36_35b_a3b | 40 layers, MoE 256 experts top-8. Config and weight loader supported. |
| Qwen35Config::qwen36_27b | 64 layers, 5120 hidden, dense FFN. |
The Qwen3.5 architecture is a hybrid of GatedDeltaNet and GQA attention layers: 18 GatedDeltaNet and 6 GQA in the 24-layer presets. Lattice is the only open-source engine that correctly runs this hybrid recurrence at Q4 on Apple Silicon.
```rust
use lattice_embed::EmbeddingModel;

// Fast English retrieval
let model = EmbeddingModel::BgeSmallEnV15; // 384-dim, fastest, auto-download

// Balanced quality/speed
let model = EmbeddingModel::BgeBaseEnV15; // 768-dim, auto-download

// Best quality, English
let model = EmbeddingModel::BgeLargeEnV15; // 1024-dim, auto-download

// Multilingual retrieval
let model = EmbeddingModel::MultilingualE5Base; // 768-dim, 100+ languages

// Long context + multilingual (local files required)
let model = EmbeddingModel::Qwen3Embedding0_6B; // 1024-dim, 8K context

// MRL: variable output dimension
use lattice_embed::ModelConfig;
let config = ModelConfig::try_new(EmbeddingModel::Qwen3Embedding4B, Some(512))?;
```

`lattice-embed` exposes SIMD-accelerated vector utilities as a stable public API:

```rust
use lattice_embed::utils;

// Runtime dispatch: AVX2 on x86_64, NEON on aarch64, scalar fallback elsewhere
let sim = utils::cosine_similarity(&a, &b);
let dot = utils::dot_product(&a, &b);
let dist = utils::euclidean_distance(&a, &b);
utils::normalize(&mut vector); // in-place L2 normalization

// Batch operations
let sims = utils::batch_cosine_similarity(&pairs);
```

Measured performance on normalized vectors (internal benchmarks, subject to hardware):
| Operation | Scalar | SIMD |
|---|---|---|
| cosine similarity (384-dim) | ~650 ns | ~90 ns |
| cosine similarity (768-dim) | ~1300 ns | ~180 ns |
| cosine similarity (1024-dim) | ~1700 ns | ~240 ns |
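For reference, the "Scalar" column corresponds to a straightforward single-pass loop like the one below. This is an illustrative baseline written for this README, not the crate's actual fallback code, and the name `cosine_similarity_scalar` is hypothetical.

```rust
/// Scalar cosine similarity: dot(a, b) / (|a| * |b|).
/// Returns 0.0 if either vector has zero norm. Panics on length mismatch.
pub fn cosine_similarity_scalar(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same length");
    let (mut dot, mut norm_a, mut norm_b) = (0.0f32, 0.0f32, 0.0f32);
    // One pass accumulates the dot product and both squared norms.
    for (x, y) in a.iter().zip(b) {
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    let denom = norm_a.sqrt() * norm_b.sqrt();
    if denom == 0.0 {
        0.0
    } else {
        dot / denom
    }
}
```

A SIMD kernel computes the same three accumulations but processes several lanes per instruction. That is consistent with the roughly 7x gap in the table.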
| Feature | Default | Description |
|---|---|---|
| native | yes | Pure Rust inference via lattice-inference |
| metal-gpu | no | Metal GPU acceleration (macOS) |
| avx512 | no | AVX-512 SIMD kernels (requires nightly) |

| Feature | Default | Description |
|---|---|---|
| f16 | no | Half-precision weights |
| metal-gpu | no | Metal compute backend |
| wgpu-gpu | no | WGPU cross-platform GPU backend |
| download | yes | HuggingFace weight download with checksum verification |
| backfill | no | Re-embedding coordinator (requires rusqlite) |
```text
Application
    |
    v
lattice-embed      (public API: embedding service, SIMD distance ops, LRU cache)
    |
    v
lattice-inference  (transformer kernel: BERT/Qwen3 forward pass, tokenizers, weights)
    |
    +---> CPU (primary)        Metal (macOS)     WGPU (fallback)
          AVX2/NEON kernels    Apple Silicon     Vulkan/DX12

lattice-fann       (standalone: tiny network primitives, <5ms CPU inference)
lattice-transport  (standalone: optimal transport math, Wasserstein distances)
lattice-tune       (depends on fann + inference: LoRA, distillation, model registry)
```
```bash
cargo bench --package lattice-embed
cargo bench -p lattice-inference --features metal-gpu,f16 -- metal_decode
./scripts/bench_context_scaling.sh
```

Performance depends on hardware, model size, batch size, and sequence length. Run benchmarks on your target hardware for representative numbers.
- Architecture: crate dependency graph, design decisions, stability tiers
- Models: full model support matrix, attention variants, inference features
- Getting started: step-by-step setup guide
- Examples: code samples for common tasks
- ADR directory: `docs/adr/`
Leaf crates publish first: inference, then fann, then transport (wait 30s), then embed
(wait 30s), then tune.

```bash
make publish
```

Apache-2.0. See LICENSE.
Built by Ocean (HaiyangLi). Powers khive, a cognitive infrastructure for AI agents.