中文 | English
Field notes from an AI infrastructure engineer — covering GPU kernel internals, LLM inference/training optimization, and the full stack from hardware architecture to DSL compilers and LLM-driven kernel agents.
Built over several years of hands-on work shipping AI infrastructure across NVIDIA and AMD platforms. This is not a textbook — it's a practitioner's working reference with source-code-level analysis, cross-platform insights, and real-world optimization notes.
Most AI infra resources are either paper summaries or high-level overviews. This repo goes deeper:
- Kernel-level code reviews — line-by-line analysis of FlashMLA, MoE GroupGemm, DeepGemm, the SGLang TBO pipeline, and more
- Cross-platform perspective — NVIDIA (Volta → Blackwell) and AMD (MI300/CDNA3), CUDA and HIP, cuBLAS and hipBLAS side by side
- End-to-end coverage — from GPU microarchitecture → kernel programming → model architecture → training frameworks → DSL compilers → LLM kernel agents, all connected with cross-references
- Frontier topics — Triton compilation pipeline, TileLang, CuTeDSL comparison, LLM-guided auto-tuning, kernel agent architectures
Who this is for:
- Kernel engineers writing CUDA/HIP/Triton kernels for AI workloads
- AI infra engineers optimizing LLM training and inference systems
- System architects designing GPU clusters and parallel computing frameworks
- Researchers exploring DSL compilers, auto-tuning, and LLM-assisted kernel development
Prerequisites:
- Familiarity with C/C++ and Python
- Basic understanding of GPU programming concepts (threads, warps, shared memory)
- Experience with PyTorch or similar deep learning frameworks
| # | Section | Highlights |
|---|---|---|
| 01 | GPU Architecture & AI Systems | Volta → Ampere → Hopper → Blackwell, AMD MI300/CDNA3, DGX best practices |
| 02 | Profiling & Benchmarking | Nsight Systems/Compute, PyTorch Profiler, roofline model, NCCL tuning |
| 03 | Kernel Programming | CUDA/HIP, cuBLAS, CUTLASS deep dive, CuTe layout/MMA, Triton, TransformerEngine |
| 04 | GEMM & Precision | Efficient GEMM pipeline, FP8/INT8, mixed precision, TensorRT |
| 05 | Attention Optimization | FlashAttention v1/v2/v3, FlashMLA code review, MLA, KV cache, SageAttn |
| 06 | MoE Optimization | GroupGemm code review, DeepEP, dispatch/combine, EPLB |
| 07 | Parallelism | TP/EP/SP, compute-communication overlap (TBO), DualPipe |
| 08 | Inference Optimization | Speculative decoding (EAGLE/Medusa/MTP), DeepGemm analysis, continuous batching, serving architecture (vLLM/SGLang) |
| 09 | ElementWise Kernels | Efficient Softmax, LayerNorm, GELU/SiLU, fused MLP |
| 10 | Quantization | AWQ, SmoothQuant, GPTQ, FP8 PTQ, quantization theory |
| 11 | Applications | Diffusion model acceleration |
| 12 | RL & Alignment | PPO, GRPO, DPO, veRL code review, MCTS |
| 13 | Model Architectures | DeepSeek V3/V3.2 full walkthrough, Qwen3 MoE/Dense |
| 14 | Training Frameworks | Megatron-LM 3D parallelism, DeepSpeed ZeRO, FSDP |
| 15 | DSL & Compiler | Triton compilation pipeline, TileLang, CuTeDSL, MLIR, auto-tuning |
| 16 | Kernel Agent | LLM-driven kernel generation, verification, agent architectures |
| A | Interview Prep | AI infra interview topics and cross-references |
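As a taste of section 02, the roofline model it covers boils down to one formula: attainable throughput is the minimum of the compute roof and arithmetic intensity times memory bandwidth. Here is a minimal sketch in Python; the hardware numbers are illustrative placeholders, not vendor specs:

```python
# Roofline model: a kernel is memory-bound when its arithmetic intensity
# (FLOPs per byte moved) falls below the machine's ridge point.

def attainable_gflops(intensity, peak_gflops, peak_bw_gbs):
    """Attainable throughput (GFLOP/s) under the roofline model."""
    return min(peak_gflops, intensity * peak_bw_gbs)

# Example: FP32 SAXPY (y = a*x + y) does 2 FLOPs per element while moving
# 12 bytes (read x, read y, write y), so intensity = 2/12 FLOP/byte.
saxpy_intensity = 2 / 12

# Illustrative accelerator: 19,500 GFLOP/s peak compute, 1,555 GB/s bandwidth.
peak_gflops, peak_bw = 19500.0, 1555.0
ridge_point = peak_gflops / peak_bw  # intensity where the two roofs meet

print(f"SAXPY attainable: "
      f"{attainable_gflops(saxpy_intensity, peak_gflops, peak_bw):.0f} GFLOP/s")
print(f"ridge point: {ridge_point:.1f} FLOP/byte; "
      f"SAXPY memory-bound: {saxpy_intensity < ridge_point}")
```

Plugging a kernel's measured intensity into this formula (section 02 shows how to extract it from Nsight Compute counters) tells you immediately whether to chase FLOPs or bytes.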
Sections marked [WIP] in sub-pages are under development — contributions welcome.
Each section contains a README.md with an overview and links to sub-topics. Bold items in the table above are sections with the deepest original analysis.
Recommended reading paths:
- Kernel development: 01 → 03 → 04 → 09 → 05 → 15
- Inference optimization: 01 → 05 → 06 → 08 → 07 → 10
- Training optimization: 01 → 04 → 07 → 14 → 12
- Frontier topics: 15 (DSL & Compiler) → 16 (Kernel Agent)
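As a preview of the kernel-level material in sections 05 and 09: the online (single-pass) softmax that FlashAttention's tiling builds on can be sketched in a few lines of Python. This is a pedagogical sketch of the algorithm, not code from any of the reviewed kernels:

```python
import math

def online_softmax(xs):
    """Single-pass softmax: maintain a running max m and a running sum s of
    exp(x - m), rescaling s whenever a new max appears. This rescaling trick
    is what lets FlashAttention process attention rows tile by tile without
    ever materializing the full row of scores."""
    m, s = float("-inf"), 0.0
    for x in xs:
        m_new = max(m, x)
        # exp(m - m_new) rescales the old sum into the new max's frame;
        # on the first element exp(-inf) == 0.0, so s starts cleanly.
        s = s * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return [math.exp(x - m) / s for x in xs]
```

The same recurrence generalizes to blocks instead of single elements, which is exactly the per-tile update analyzed in the FlashAttention walkthrough.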
All documents are available in both English and Chinese (中文). Use the language toggle at the top of each page to switch.
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
This project is licensed under CC BY 4.0.