A comprehensive experimental framework for testing novel LLM optimization techniques on small models (<2GB).
This project implements and benchmarks 5 cutting-edge techniques for efficient LLM inference and adaptation:
Location: experiments/exp1_ttt_lora/
Novel approach that adapts the model to each input at inference time:
- Attaches small LoRA adapter (rank 4-8) to base model
- At inference, runs gradient steps on input using self-supervised loss
- Only updates LoRA weights (~50KB), not full model
- Shows measurable improvement after 5-20 gradient steps
Key Innovation: Real parametric adaptation at inference time - not just prompt engineering.
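A minimal sketch of the test-time adaptation loop is shown below. It is illustrative only: the model ID, LoRA target modules, step count, and learning rate are assumptions, not the experiment's exact configuration.

```python
# Illustrative test-time training with LoRA (a sketch, not the exact experiment code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Attach a small LoRA adapter; only these weights are updated at test time.
lora_cfg = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora_cfg)

def adapt_to_input(model, input_ids, steps=10, lr=1e-4):
    """Run a few gradient steps on the input itself using the causal-LM loss."""
    optimizer = torch.optim.AdamW(
        [p for p in model.parameters() if p.requires_grad], lr=lr
    )
    model.train()
    for _ in range(steps):
        out = model(input_ids=input_ids, labels=input_ids)  # self-supervised loss
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    model.eval()

prompt = "Explain KV cache compression in one paragraph."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
adapt_to_input(model, input_ids)             # adapt the LoRA weights to this input
generated = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```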
Location: experiments/exp2_kv_compression/
Implements multiple compression strategies:
- H2O (Heavy-Hitter Oracle): Keeps tokens with highest attention scores
- SnapKV: Uses observation window to identify important tokens
- Sliding Window: Simple baseline keeping recent tokens
Achieves 50-80% memory reduction with minimal quality loss.
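The sketch below illustrates an H2O-style eviction policy in isolation: keep a recent window plus the heavy-hitter tokens with the largest accumulated attention, and drop the rest of the cache. The function name, tensor shapes, and pooling over batch/heads are assumptions rather than the experiment's exact implementation.

```python
# Illustrative H2O-style KV cache eviction (a simplified sketch).
import torch

def h2o_evict(keys, values, attn_scores, budget=256, recent=32):
    """
    keys, values: [batch, heads, seq_len, head_dim] cached tensors
    attn_scores:  [batch, heads, seq_len] accumulated attention each token received
    budget:       total number of cached tokens to keep
    """
    seq_len = keys.shape[2]
    if seq_len <= budget:
        return keys, values

    # Always keep the most recent tokens (sliding-window component).
    recent_idx = torch.arange(seq_len - recent, seq_len, device=keys.device)

    # Among older tokens, keep the heavy hitters with the highest accumulated attention.
    old_scores = attn_scores[..., : seq_len - recent].mean(dim=(0, 1))  # pool over batch/heads
    heavy_idx = old_scores.topk(budget - recent).indices

    keep = torch.cat([heavy_idx.sort().values, recent_idx])
    return keys[:, :, keep, :], values[:, :, keep, :]
```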
Location: experiments/exp3_speculative_decoding/
Uses early layer exits as a draft model:
- Exit at layer 8 (of 22) for fast draft generation
- Verify with full model in single forward pass
- No separate draft model needed
- Target: 2-3x speedup with zero quality loss
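Below is a simplified sketch of the draft-then-verify loop with an early exit as the draft model. It uses greedy acceptance rather than the full rejection-sampling rule, assumes batch size 1, and the attribute paths (`model.lm_head`, `model.model.norm`) assume a LLaMA-style architecture such as TinyLlama; the exit layer and helper names are illustrative.

```python
# Simplified self-speculative decoding sketch (greedy accept/reject).
import torch

@torch.no_grad()
def draft_with_early_exit(model, input_ids, exit_layer=8, n_draft=4):
    """Greedily draft tokens from the hidden state at an early layer,
    projected through the shared final norm and lm_head (LLaMA-style model assumed)."""
    ids = input_ids
    for _ in range(n_draft):
        out = model(ids, output_hidden_states=True, use_cache=False)
        h = out.hidden_states[exit_layer][:, -1]          # early-exit hidden state
        logits = model.lm_head(model.model.norm(h))        # reuse final norm + head
        ids = torch.cat([ids, logits.argmax(-1, keepdim=True)], dim=-1)
    return ids[:, input_ids.shape[1]:]                     # the drafted tokens

@torch.no_grad()
def verify_with_full_model(model, input_ids, draft_tokens):
    """Score prompt + draft in one full forward pass and keep the longest
    prefix of draft tokens that the full model also prefers greedily."""
    full = torch.cat([input_ids, draft_tokens], dim=-1)
    logits = model(full).logits
    preds = logits[:, input_ids.shape[1] - 1 : -1].argmax(-1)   # full-model choices
    match = (preds == draft_tokens).long().cumprod(dim=-1)
    n_accept = int(match.sum())
    return draft_tokens[:, :n_accept]
```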
Location: experiments/exp4_mixed_precision/
Intelligent per-layer bit allocation:
- Sensitivity analysis: measure each layer's response to INT8/4/3/2
- Optimization: knapsack algorithm for optimal bit distribution
- Result: Near-INT4 size with near-INT8 quality
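The bit-allocation step can be framed as a multiple-choice knapsack: each layer picks exactly one bit width, the cost is that layer's quantized size, and the objective is the total measured sensitivity. A small dynamic-programming sketch follows; the FP16 baseline sizing and the sensitivity-table format are assumptions, not the experiment's exact code.

```python
# Illustrative multiple-choice knapsack for per-layer bit allocation.
def allocate_bits(layer_sizes_mb, sensitivity, budget_mb, widths=(8, 4, 3, 2)):
    """
    layer_sizes_mb: FP16 size of each layer in MB
    sensitivity:    sensitivity[i][bits] = measured quality drop for layer i at that width
    budget_mb:      total quantized-model size budget in MB
    Returns (per-layer bit widths, total size) minimizing total sensitivity within budget.
    """
    budget = int(budget_mb)
    # dp maps total size so far -> (total sensitivity, bit assignment so far)
    dp = {0: (0.0, [])}
    for i in range(len(layer_sizes_mb)):
        nxt = {}
        for cost, (sens, assign) in dp.items():
            for bits in widths:
                size = int(round(layer_sizes_mb[i] * bits / 16))  # scale from FP16 baseline
                c = cost + size
                if c > budget:
                    continue
                s = sens + sensitivity[i][bits]
                if c not in nxt or s < nxt[c][0]:
                    nxt[c] = (s, assign + [bits])
        dp = nxt
    if not dp:
        return None  # budget too tight even at the lowest width
    best_cost = min(dp, key=lambda c: dp[c][0])
    return dp[best_cost][1], best_cost
```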
Location: experiments/exp5_full_stack/
Combines multiple optimizations:
- KV cache with use_cache=True
- Continuous batching (vLLM-style)
- Efficient sampling
- Memory management
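As a baseline for this stack, the snippet below sketches incremental decoding with the KV cache (use_cache=True): the prompt is prefilled once, then each new token reuses the cached keys/values instead of re-encoding the prefix. Batching and sampling are intentionally simplified here.

```python
# Minimal sketch of KV-cached incremental decoding (greedy, batch size 1).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

input_ids = tokenizer("The fastest way to serve small LLMs is", return_tensors="pt").input_ids

with torch.no_grad():
    out = model(input_ids, use_cache=True)           # prefill: cache KV for the whole prompt
    past = out.past_key_values
    next_id = out.logits[:, -1].argmax(-1, keepdim=True)
    generated = [next_id]
    for _ in range(32):
        out = model(next_id, past_key_values=past, use_cache=True)  # decode one token
        past = out.past_key_values
        next_id = out.logits[:, -1].argmax(-1, keepdim=True)
        generated.append(next_id)

print(tokenizer.decode(torch.cat(generated, dim=-1)[0], skip_special_tokens=True))
```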
# Install dependencies
pip install -r requirements.txt
# Run all experiments
python scripts/run_all_experiments.py
# Run specific experiment
python scripts/run_all_experiments.py -e ttt_lora
# Use a different model
python scripts/run_all_experiments.py -m qwen2-0.5b

| Model | Size | HuggingFace ID |
|---|---|---|
| TinyLlama (default) | ~1.1GB | TinyLlama/TinyLlama-1.1B-Chat-v1.0 |
| Qwen2-0.5B | ~500MB | Qwen/Qwen2-0.5B-Instruct |
| Qwen2-1.5B | ~1.5GB | Qwen/Qwen2-1.5B-Instruct |
| SmolLM2-1.7B | ~1.7GB | HuggingFaceTB/SmolLM2-1.7B-Instruct |
efficient_llm_experiments/
├── shared/ # Shared utilities
│ ├── model_loader.py # Model loading with MPS/CUDA support
│ ├── logging_utils.py # Structured .txt logging
│ ├── profiling_utils.py # Memory and time profiling
│ └── metrics.py # Perplexity, speedup metrics
├── experiments/
│ ├── exp1_ttt_lora/ # Test-Time Training
│ ├── exp2_kv_compression/ # KV Cache Compression
│ ├── exp3_speculative_decoding/ # Speculative Decoding
│ ├── exp4_mixed_precision/ # Quantization Analysis
│ └── exp5_full_stack/ # Full Optimization Stack
├── logs/ # Experiment outputs (.txt)
├── scripts/
│ └── run_all_experiments.py # Master runner
└── requirements.txt
Results are logged to logs/ directory in structured .txt format, including:
- Perplexity before/after optimizations
- Speedup ratios vs baseline
- Memory usage and compression ratios
- Per-layer sensitivity analysis
- Acceptance rates for speculative decoding
- TTT: Sun et al., "Learning to (Learn at Test Time): RNNs with Expressive Hidden States"
- H2O: Zhang et al., "H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models"
- SnapKV: Li et al., "SnapKV: LLM Knows What You Are Looking for Before Generation"
- Speculative Decoding: Leviathan et al., "Fast Inference from Transformers via Speculative Decoding"
- LayerSkip: Elhoushi et al., "LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding"
- Python 3.10+
- PyTorch 2.1+
- transformers 4.36+
- peft 0.7+ (for LoRA)
- Apple MPS or NVIDIA CUDA
- Apple Silicon (M1/M2/M3/M4): Uses MPS backend
- NVIDIA GPU: Uses CUDA
- CPU: Fallback (slower but works)
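A small device-selection sketch matching the fallback order above (the helper name is illustrative; the actual logic in shared/model_loader.py may differ):

```python
# Pick CUDA if available, then Apple MPS, then fall back to CPU.
import torch

def pick_device() -> torch.device:
    if torch.cuda.is_available():
        return torch.device("cuda")   # NVIDIA GPU
    if torch.backends.mps.is_available():
        return torch.device("mps")    # Apple Silicon (M1/M2/M3/M4)
    return torch.device("cpu")        # slower, but always works

device = pick_device()
```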