A comprehensive experimental framework for testing novel LLM optimization techniques on small models (<2GB).
This project implements and benchmarks 5 cutting-edge techniques for efficient LLM inference and adaptation:
Location: experiments/exp1_ttt_lora/
Novel approach that adapts the model to each input at inference time:
- Attaches small LoRA adapter (rank 4-8) to base model
- At inference, runs gradient steps on input using self-supervised loss
- Only updates LoRA weights (~50KB), not full model
- Shows measurable improvement after 5-20 gradient steps
Key Innovation: Real parametric adaptation at inference time - not just prompt engineering.
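A minimal sketch of the test-time adaptation loop is shown below. It is illustrative only: the model ID, LoRA target modules, step count, and learning rate are assumptions, not the experiment's exact configuration.

```python
# Illustrative test-time training with LoRA (a sketch, not the exact experiment code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Attach a small LoRA adapter; only these weights are updated at test time.
lora_cfg = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora_cfg)

def adapt_to_input(model, input_ids, steps=10, lr=1e-4):
    """Run a few gradient steps on the input itself using the causal-LM loss."""
    optimizer = torch.optim.AdamW(
        [p for p in model.parameters() if p.requires_grad], lr=lr
    )
    model.train()
    for _ in range(steps):
        out = model(input_ids=input_ids, labels=input_ids)  # self-supervised loss
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    model.eval()

prompt = "Explain KV cache compression in one paragraph."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
adapt_to_input(model, input_ids)             # adapt the LoRA weights to this input
generated = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```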
Location: experiments/exp2_kv_compression/
Implements multiple compression strategies:
- H2O (Heavy-Hitter Oracle): Keeps tokens with highest attention scores
- SnapKV: Uses observation window to identify important tokens
- Sliding Window: Simple baseline keeping recent tokens
Achieves 50-80% memory reduction with minimal quality loss.
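The sketch below illustrates an H2O-style eviction policy in isolation: keep a recent window plus the heavy-hitter tokens with the largest accumulated attention, and drop the rest of the cache. The function name, tensor shapes, and pooling over batch/heads are assumptions rather than the experiment's exact implementation.

```python
# Illustrative H2O-style KV cache eviction (a simplified sketch).
import torch

def h2o_evict(keys, values, attn_scores, budget=256, recent=32):
    """
    keys, values: [batch, heads, seq_len, head_dim] cached tensors
    attn_scores:  [batch, heads, seq_len] accumulated attention each token received
    budget:       total number of cached tokens to keep
    """
    seq_len = keys.shape[2]
    if seq_len <= budget:
        return keys, values

    # Always keep the most recent tokens (sliding-window component).
    recent_idx = torch.arange(seq_len - recent, seq_len, device=keys.device)

    # Among older tokens, keep the heavy hitters with the highest accumulated attention.
    old_scores = attn_scores[..., : seq_len - recent].mean(dim=(0, 1))  # pool over batch/heads
    heavy_idx = old_scores.topk(budget - recent).indices

    keep = torch.cat([heavy_idx.sort().values, recent_idx])
    return keys[:, :, keep, :], values[:, :, keep, :]
```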
Location: experiments/exp3_speculative_decoding/
Uses early layer exits as a draft model:
- Exit at layer 8 (of 22) for fast draft generation
- Verify with full model in single forward pass
- No separate draft model needed
- Target: 2-3x speedup with zero quality loss
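Below is a simplified sketch of the draft-then-verify loop with an early exit as the draft model. It uses greedy acceptance rather than the full rejection-sampling rule, assumes batch size 1, and the attribute paths (`model.lm_head`, `model.model.norm`) assume a LLaMA-style architecture such as TinyLlama; the exit layer and helper names are illustrative.

```python
# Simplified self-speculative decoding sketch (greedy accept/reject).
import torch

@torch.no_grad()
def draft_with_early_exit(model, input_ids, exit_layer=8, n_draft=4):
    """Greedily draft tokens from the hidden state at an early layer,
    projected through the shared final norm and lm_head (LLaMA-style model assumed)."""
    ids = input_ids
    for _ in range(n_draft):
        out = model(ids, output_hidden_states=True, use_cache=False)
        h = out.hidden_states[exit_layer][:, -1]          # early-exit hidden state
        logits = model.lm_head(model.model.norm(h))        # reuse final norm + head
        ids = torch.cat([ids, logits.argmax(-1, keepdim=True)], dim=-1)
    return ids[:, input_ids.shape[1]:]                     # the drafted tokens

@torch.no_grad()
def verify_with_full_model(model, input_ids, draft_tokens):
    """Score prompt + draft in one full forward pass and keep the longest
    prefix of draft tokens that the full model also prefers greedily."""
    full = torch.cat([input_ids, draft_tokens], dim=-1)
    logits = model(full).logits
    preds = logits[:, input_ids.shape[1] - 1 : -1].argmax(-1)   # full-model choices
    match = (preds == draft_tokens).long().cumprod(dim=-1)
    n_accept = int(match.sum())
    return draft_tokens[:, :n_accept]
```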
Location: experiments/exp4_mixed_precision/
Intelligent per-layer bit allocation:
- Sensitivity analysis: measure each layer's response to INT8/4/3/2
- Optimization: knapsack algorithm for optimal bit distribution
- Result: Near-INT4 size with near-INT8 quality
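The bit-allocation step can be framed as a multiple-choice knapsack: each layer picks exactly one bit width, the cost is that layer's quantized size, and the objective is the total measured sensitivity. A small dynamic-programming sketch follows; the FP16 baseline sizing and the sensitivity-table format are assumptions, not the experiment's exact code.

```python
# Illustrative multiple-choice knapsack for per-layer bit allocation.
def allocate_bits(layer_sizes_mb, sensitivity, budget_mb, widths=(8, 4, 3, 2)):
    """
    layer_sizes_mb: FP16 size of each layer in MB
    sensitivity:    sensitivity[i][bits] = measured quality drop for layer i at that width
    budget_mb:      total quantized-model size budget in MB
    Returns (per-layer bit widths, total size) minimizing total sensitivity within budget.
    """
    budget = int(budget_mb)
    # dp maps total size so far -> (total sensitivity, bit assignment so far)
    dp = {0: (0.0, [])}
    for i in range(len(layer_sizes_mb)):
        nxt = {}
        for cost, (sens, assign) in dp.items():
            for bits in widths:
                size = int(round(layer_sizes_mb[i] * bits / 16))  # scale from FP16 baseline
                c = cost + size
                if c > budget:
                    continue
                s = sens + sensitivity[i][bits]
                if c not in nxt or s < nxt[c][0]:
                    nxt[c] = (s, assign + [bits])
        dp = nxt
    if not dp:
        return None  # budget too tight even at the lowest width
    best_cost = min(dp, key=lambda c: dp[c][0])
    return dp[best_cost][1], best_cost
```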
Location: experiments/exp5_full_stack/
Combines multiple optimizations:
- KV cache with use_cache=True
- Continuous batching (vLLM-style)
- Efficient sampling
- Memory management
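As a baseline for this stack, the snippet below sketches incremental decoding with the KV cache (use_cache=True): the prompt is prefilled once, then each new token reuses the cached keys/values instead of re-encoding the prefix. Batching and sampling are intentionally simplified here.

```python
# Minimal sketch of KV-cached incremental decoding (greedy, batch size 1).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

input_ids = tokenizer("The fastest way to serve small LLMs is", return_tensors="pt").input_ids

with torch.no_grad():
    out = model(input_ids, use_cache=True)           # prefill: cache KV for the whole prompt
    past = out.past_key_values
    next_id = out.logits[:, -1].argmax(-1, keepdim=True)
    generated = [next_id]
    for _ in range(32):
        out = model(next_id, past_key_values=past, use_cache=True)  # decode one token
        past = out.past_key_values
        next_id = out.logits[:, -1].argmax(-1, keepdim=True)
        generated.append(next_id)

print(tokenizer.decode(torch.cat(generated, dim=-1)[0], skip_special_tokens=True))
```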
# Install dependencies
pip install -r requirements.txt
# Run all experiments
python scripts/run_all_experiments.py
# Run specific experiment
python scripts/run_all_experiments.py -e ttt_lora
# Use a different model
python scripts/run_all_experiments.py -m qwen2-0.5b

| Model | Size | HuggingFace ID |
|---|---|---|
| TinyLlama (default) | ~1.1GB | TinyLlama/TinyLlama-1.1B-Chat-v1.0 |
| Qwen2-0.5B | ~500MB | Qwen/Qwen2-0.5B-Instruct |
| Qwen2-1.5B | ~1.5GB | Qwen/Qwen2-1.5B-Instruct |
| SmolLM2-1.7B | ~1.7GB | HuggingFaceTB/SmolLM2-1.7B-Instruct |
efficient_llm_experiments/
├── shared/ # Shared utilities
│ ├── model_loader.py # Model loading with MPS/CUDA support
│ ├── logging_utils.py # Structured .txt logging
│ ├── profiling_utils.py # Memory and time profiling
│ └── metrics.py # Perplexity, speedup metrics
├── experiments/
│ ├── exp1_ttt_lora/ # Test-Time Training
│ ├── exp2_kv_compression/ # KV Cache Compression
│ ├── exp3_speculative_decoding/ # Speculative Decoding
│ ├── exp4_mixed_precision/ # Quantization Analysis
│ └── exp5_full_stack/ # Full Optimization Stack
├── logs/ # Experiment outputs (.txt)
├── scripts/
│ └── run_all_experiments.py # Master runner
└── requirements.txt
Results are logged to logs/ directory in structured .txt format, including:
- Perplexity before/after optimizations
- Speedup ratios vs baseline
- Memory usage and compression ratios
- Per-layer sensitivity analysis
- Acceptance rates for speculative decoding
- TTT: Sun et al., "Learning to (Learn at Test Time): RNNs with Expressive Hidden States"
- H2O: Zhang et al., "H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models"
- SnapKV: Li et al., "SnapKV: LLM Knows What You Are Looking for Before Generation"
- Speculative Decoding: Leviathan et al., "Fast Inference from Transformers via Speculative Decoding"
- LayerSkip: Elhoushi et al., "LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding"
- Python 3.10+
- PyTorch 2.1+
- transformers 4.36+
- peft 0.7+ (for LoRA)
- Apple MPS or NVIDIA CUDA
- Apple Silicon (M1/M2/M3/M4): Uses MPS backend
- NVIDIA GPU: Uses CUDA
- CPU: Fallback (slower but works)
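A small device-selection sketch matching the fallback order above (the helper name is illustrative; the actual logic in shared/model_loader.py may differ):

```python
# Pick CUDA if available, then Apple MPS, then fall back to CPU.
import torch

def pick_device() -> torch.device:
    if torch.cuda.is_available():
        return torch.device("cuda")   # NVIDIA GPU
    if torch.backends.mps.is_available():
        return torch.device("mps")    # Apple Silicon (M1/M2/M3/M4)
    return torch.device("cpu")        # slower, but always works

device = pick_device()
```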