Skip to content

ohdearquant/lattice

Repository files navigation

Lattice

Pure Rust inference engine for transformer models on Apple Silicon, with a native macOS app.

License Crates.io CI

No ONNX. No Python. No CUDA. No external ML runtime. Lattice implements the full compute graph in Rust: weight loading, tokenization, forward pass, vector operations, quantization, and LoRA training. Hand-written Metal shaders accelerate inference on Apple Silicon. SIMD kernels (AVX2 on x86, NEON on ARM) handle the CPU path.


What is Lattice

Lattice is two things in one repo.

A Rust inference library. Five published crates covering embeddings, generation, quantization, LoRA fine-tuning, and optimal transport. Use lattice-embed as a library dependency, or run the lattice binary for interactive chat and an OpenAI-compatible HTTP server.

Lattice Studio: a native macOS app. A SwiftUI instrument panel that drives the Rust engine via CLI subprocesses. Train LoRA adapters with a live loss oscilloscope, quantize models with a before/after comparison, hot-swap adapters in chat with zero reload, and manage your model library from a single window.


Capabilities

Pure Rust compute Hand-written SIMD kernels (AVX2/NEON). No C++, no ONNX, no CUDA.
Metal GPU backend Native Apple Silicon acceleration via Metal MSL shaders. WGPU fallback for cross-platform.
Generation models Qwen3.5-0.8B / 2B, Qwen3.6-35B-A3B (MoE), Qwen3.6-27B. Hybrid GatedDeltaNet + GQA architecture.
Embedding models 9 models: BGE, E5, MiniLM, Qwen3-Embedding families. Auto-download for 7 BERT-family variants.
Three tokenizers WordPiece, SentencePiece, BPE. No Hugging Face tokenizers C extension.
Quantization Q8, Q4, and QuaRot (rotation-based 4-bit). No other engine runs Q4 + LoRA hot-swap on Qwen3.5.
LoRA Inference hook, hot-swap with no reload, PEFT safetensors format, training via lattice-tune.
HTTP API OpenAI-compatible /v1/chat/completions via lattice serve.
Safetensors native Memory-mapped weight loading. Single-file and sharded checkpoints.
KV cache Incremental decoding with key-value caching.
Speculative decoding Draft-model acceleration on the CPU path.
Grammar decoding Constrained output via a pushdown automaton. OpenAI string-level stop sequences.
MRL support Matryoshka truncation for Qwen3-Embedding models (output dimension >= 32).
LRU cache CachedEmbeddingService with sharded in-memory cache and hit/miss stats.
Knowledge distillation Train small models from Claude/GPT/Gemini teacher soft labels via lattice-tune.
Optimal transport Sinkhorn-Knopp solver for embedding drift detection via lattice-transport.

Benchmarks

Measured on Apple M2 Max, Qwen3.5-0.8B, slope method (token throughput excluding prompt prefill and model load). Greedy decoding, median of 5 runs.

Context Lattice (Q8, f16 head) Ollama (Q8_0) MLX (Q8 g64, AMX) Lattice vs Ollama
64 tok 187 tok/s 93 265 2.0x
128 tok 171 tok/s 92 263 1.9x
256 tok 146 tok/s 88 260 1.6x

MLX uses Apple's private MPS/AMX matrix engines. Lattice uses the public Metal compute API, the same tier as Ollama. MLX decodes faster than Lattice at raw throughput. Lattice's edge is portability (pure Rust, zero Python, zero framework) plus capabilities neither Ollama nor MLX provide for this model family:

Capability Lattice MLX Ollama
QuaRot 4-bit (rotation-based quant) yes no no
Q4 + LoRA hot-swap (no reload) yes no no
Pure Rust, zero Python or framework yes no no

PPL parity confirmed: Lattice 20.60, MLX 20.67 on wikitext-2 (2048 tokens). Reproduce: ./scripts/bench_context_scaling.sh


Crates

Crate Description crates.io
lattice-embed Embedding service: EmbeddingService trait, NativeEmbeddingService, CachedEmbeddingService, SIMD cosine/dot/euclidean, backfill
lattice-inference Transformer kernel: safetensors loading, BERT/BGE/Qwen3 forward pass, three tokenizers, Metal/WGPU backends, LoRA hooks, KV cache, speculative decoding, quantization
lattice-fann Fast neural network primitives: NetworkBuilder, pre-allocated layers, zero-alloc forward pass, backprop trainer
lattice-tune Training: knowledge distillation pipeline, dataset management, LoRA adapter management, model registry
lattice-transport Optimal transport: Sinkhorn-Knopp (balanced + unbalanced), Wasserstein barycenters, embedding drift detection

The three leaf crates (inference, fann, transport) have zero intra-workspace dependencies and can be used standalone.


Quick Start: Embeddings

[dependencies]
lattice-embed = "0.3"
tokio = { version = "1", features = ["full"] }
use lattice_embed::{EmbeddingService, EmbeddingModel, NativeEmbeddingService};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let service = NativeEmbeddingService::default();

    // Single embedding (BGE-small-en-v1.5, 384 dimensions)
    let embedding = service
        .embed_one("The quick brown fox jumps over the lazy dog", EmbeddingModel::default())
        .await?;

    println!("Dimensions: {}", embedding.len()); // 384

    // Batch
    let texts = vec![
        "First document".to_string(),
        "Second document".to_string(),
    ];
    let embeddings = service.embed(&texts, EmbeddingModel::BgeSmallEnV15).await?;

    // SIMD-accelerated similarity
    let similarity = lattice_embed::utils::cosine_similarity(&embeddings[0], &embeddings[1]);
    println!("Similarity: {:.4}", similarity);

    Ok(())
}

Model weights are downloaded from HuggingFace on first use and cached at ~/.lattice/models (or $LATTICE_MODEL_CACHE).

GPU acceleration (macOS)

lattice-embed = { version = "0.4", features = ["metal-gpu"] }

Cross-platform GPU

lattice-embed = { version = "0.4", features = ["wgpu-gpu"] }

Quick Start: CLI

Build from source (requires Rust 1.80+ and, for Metal, macOS 14+):

git clone https://github.com/ohdearquant/lattice
cd lattice

# CLI binary (chat + serve)
cargo build --release -p lattice-inference --bin lattice

# Interactive chat
./target/release/lattice chat --model ~/.lattice/models/qwen3.5-0.8b

# OpenAI-compatible HTTP server
./target/release/lattice serve --model ~/.lattice/models/qwen3.5-0.8b --port 8080

With Metal GPU (macOS only):

cargo build --release -p lattice-inference --bin lattice --features metal-gpu,f16

HTTP API

lattice serve exposes an OpenAI-compatible endpoint:

curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-0.8b",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 128,
    "temperature": 0.7
  }'

Lattice Studio (macOS App)

Lattice Studio is a native SwiftUI app for macOS 14+. It wraps the Rust engine in an instrument-panel interface: live loss curves, before/after quantization comparisons, LoRA hot-swap in chat, and a model library manager.

Build and package

# Requires: Xcode (Swift 6.3+), Rust toolchain
./apps/macos/scripts/package-app.sh

This builds the Swift frontend, compiles all Rust engine binaries in release mode, and assembles a self-contained LatticeStudio.app bundle at apps/macos/dist/. A .dmg and .zip are also produced. The packaged app needs no Rust toolchain on the recipient machine.

# Skip Swift rebuild (use existing build output)
./apps/macos/scripts/package-app.sh --skip-build

# Skip Cargo rebuild
./apps/macos/scripts/package-app.sh --skip-cargo

Install

Drag LatticeStudio.app from the .dmg to /Applications.

The app is ad-hoc signed. On first launch, right-click and choose "Open" to bypass Gatekeeper, then click "Open" in the dialog. macOS remembers the exception for subsequent launches. Alternatively:

xattr -dr com.apple.quarantine /Applications/LatticeStudio.app

What is in Lattice Studio

Models (cmd-1). A table of all local models and adapters under ~/.lattice/models, with file manifests, config details (18 GatedDeltaNet + 6 GQA layers called out explicitly), and Download/Verify/Reveal in Finder actions.

Train (cmd-2). LoRA fine-tuning with a live loss oscilloscope. A 56-point hero loss numeral ticks digit-by-digit as the run progresses. Scrub the loss curve to freeze all metric readouts (lr, grad-norm, tok/s, ETA) to any step. Results stream as line-delimited JSON from the engine.

Quantize (cmd-3). Q4 or QuaRot quantization. A side-by-side comparison shows size, bits, and estimated PPL before and after. True-scale bars animate to show the compression ratio. A gate pill states the result: PASS, WARN, or FAIL.

Chat (cmd-4). Side-by-side base vs. base+adapter columns streaming the same prompt in parallel. The adapter selector is a console fader: slide it and the engine swaps the adapter with no reload. A "0 ms reload" stamp confirms it. Each column shows live tok/s and time-to-first-token.

Data (cmd-5). Paste raw text or load files. Preview the derived {prompt, completion} pairs as a JSONL table, validate token counts (using the same tokenizer the engine uses), and export straight into the Train dataset field.

Runs (cmd-6). Archive of every training and quantization run. Select a row to reopen its exact config and view the frozen loss curve.

The command bar (cmd-K) runs everything from a single keyboard shortcut: train qwen3.5 r8 or quantize quarot parse into argument chips and fire a run.


Supported Models

Embedding models

Variant HuggingFace ID Dims Max tokens Auto-download
BgeSmallEnV15 BAAI/bge-small-en-v1.5 384 512 yes
BgeBaseEnV15 BAAI/bge-base-en-v1.5 768 512 yes
BgeLargeEnV15 BAAI/bge-large-en-v1.5 1024 512 yes
MultilingualE5Small intfloat/multilingual-e5-small 384 512 yes
MultilingualE5Base intfloat/multilingual-e5-base 768 512 yes
AllMiniLmL6V2 sentence-transformers/all-MiniLM-L6-v2 384 256 yes
ParaphraseMultilingualMiniLmL12V2 sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 384 128 yes
Qwen3Embedding0_6B Qwen/Qwen3-Embedding-0.6B 1024 8192 local dir only
Qwen3Embedding4B Qwen/Qwen3-Embedding-4B 2560* 8192 local dir only

*Qwen3-Embedding-4B and 0.6B support MRL truncation to any dimension >= 32. BGE v1.5 uses CLS pooling. E5 and MiniLM use mean pooling. E5 embed_passage() applies the "passage: " prefix automatically.

Generation models (local files required)

Config preset Description
Qwen35Config::qwen35_0_8b 24 layers, 1024 hidden, 1 MTP layer. Base decode shipped; MTP experimental.
Qwen35Config::qwen35_2b 24 layers, 2048 hidden, dense FFN, tied embeddings.
Qwen35Config::qwen36_35b_a3b 40 layers, MoE 256 experts top-8. Config and weight loader supported.
Qwen35Config::qwen36_27b 64 layers, 5120 hidden, dense FFN.

The Qwen3.5 architecture uses a hybrid of 18 GatedDeltaNet layers and 6 GQA attention layers. Lattice is the only open-source engine that correctly runs this hybrid recurrence at Q4 on Apple Silicon.


Model Selection (Embeddings)

use lattice_embed::EmbeddingModel;

// Fast English retrieval
let model = EmbeddingModel::BgeSmallEnV15;   // 384-dim, fastest, auto-download

// Balanced quality/speed
let model = EmbeddingModel::BgeBaseEnV15;    // 768-dim, auto-download

// Best quality, English
let model = EmbeddingModel::BgeLargeEnV15;   // 1024-dim, auto-download

// Multilingual retrieval
let model = EmbeddingModel::MultilingualE5Base;  // 768-dim, 100+ languages

// Long context + multilingual (local files required)
let model = EmbeddingModel::Qwen3Embedding0_6B;  // 1024-dim, 8K context

// MRL: variable output dimension
use lattice_embed::ModelConfig;
let config = ModelConfig::try_new(EmbeddingModel::Qwen3Embedding4B, Some(512))?;

Vector Operations

lattice-embed exposes SIMD-accelerated vector utilities as a stable public API:

use lattice_embed::utils;

// Runtime dispatch: AVX2 on x86_64, NEON on aarch64, scalar fallback elsewhere
let sim = utils::cosine_similarity(&a, &b);
let dot = utils::dot_product(&a, &b);
let dist = utils::euclidean_distance(&a, &b);

utils::normalize(&mut vector);  // in-place L2 normalization

// Batch operations
let sims = utils::batch_cosine_similarity(&pairs);

Measured performance on normalized vectors (internal benchmarks, subject to hardware):

Operation Scalar SIMD
cosine similarity (384-dim) ~650 ns ~90 ns
cosine similarity (768-dim) ~1300 ns ~180 ns
cosine similarity (1024-dim) ~1700 ns ~240 ns

Feature Flags

lattice-embed

Feature Default Description
native yes Pure Rust inference via lattice-inference
metal-gpu no Metal GPU acceleration (macOS)
avx512 no AVX-512 SIMD kernels (requires nightly)

lattice-inference

Feature Default Description
f16 no Half-precision weights
metal-gpu no Metal compute backend
wgpu-gpu no WGPU cross-platform GPU backend
download yes HuggingFace weight download with checksum verification
backfill no Re-embedding coordinator (requires rusqlite)

Architecture

Application
    |
    v
lattice-embed          (public API: embedding service, SIMD distance ops, LRU cache)
    |
    v
lattice-inference      (transformer kernel: BERT/Qwen3 forward pass, tokenizers, weights)
    |
    +---> CPU (primary)      Metal (macOS)     WGPU (fallback)
          AVX2/NEON kernels   Apple Silicon      Vulkan/DX12


lattice-fann           (standalone: tiny network primitives, <5ms CPU inference)
lattice-transport      (standalone: optimal transport math, Wasserstein distances)
lattice-tune           (depends on fann + inference: LoRA, distillation, model registry)

Running Benchmarks

Embedding throughput

cargo bench --package lattice-embed

Metal GPU decode (macOS only, requires model weights)

cargo bench -p lattice-inference --features metal-gpu,f16 -- metal_decode

Context scaling (Qwen3.5-0.8B vs Ollama vs MLX)

./scripts/bench_context_scaling.sh

Performance depends on hardware, model size, batch size, and sequence length. Run benchmarks on your target hardware for representative numbers.


Documentation

  • Architecture: crate dependency graph, design decisions, stability tiers
  • Models: full model support matrix, attention variants, inference features
  • Getting started: step-by-step setup guide
  • Examples: code samples for common tasks
  • ADR directory: docs/adr/

Publishing

Leaf crates publish first: inference then fann then transport (wait 30s) then embed (wait 30s) then tune.

make publish

License

Apache-2.0. See LICENSE.

Built by Ocean (HaiyangLi). Powers khive, a cognitive infrastructure for AI agents.

About

pure rust inference engine

Resources

License

Contributing

Stars

Watchers

Forks

Packages

 
 
 

Contributors