中文 | English
Field notes from an AI infrastructure engineer — covering GPU kernel internals, LLM inference/training optimization, and the full stack from hardware architecture to DSL compilers and LLM-driven kernel agents.
Built over several years of hands-on work shipping AI infrastructure across NVIDIA and AMD platforms. This is not a textbook — it's a practitioner's working reference with source-code-level analysis, cross-platform insights, and real-world optimization notes.
Most AI infra resources are either paper summaries or high-level overviews. This repo goes deeper:
- Kernel-level code reviews — line-by-line analysis of FlashMLA, MoE GroupGemm, DeepGemm, the SGLang TBO pipeline, and more
- Cross-platform perspective — NVIDIA (Volta → Blackwell) and AMD (MI300/CDNA3), CUDA and HIP, cuBLAS and hipBLAS side by side
- End-to-end coverage — from GPU microarchitecture → kernel programming → model architecture → training frameworks → DSL compilers → LLM kernel agents, all connected with cross-references
- Frontier topics — Triton compilation pipeline, TileLang, CuTeDSL comparison, LLM-guided auto-tuning, kernel agent architectures
Who this is for:
- Kernel engineers writing CUDA/HIP/Triton kernels for AI workloads
- AI infra engineers optimizing LLM training and inference systems
- System architects designing GPU clusters and parallel computing frameworks
- Researchers exploring DSL compilers, auto-tuning, and LLM-assisted kernel development
Prerequisites:
- Familiarity with C/C++ and Python
- Basic understanding of GPU programming concepts (threads, warps, shared memory)
- Experience with PyTorch or similar deep learning frameworks
| # | Section | Highlights |
|---|---|---|
| 01 | GPU Architecture & AI Systems | Volta → Ampere → Hopper → Blackwell, AMD MI300/CDNA3, DGX best practices |
| 02 | Profiling & Benchmarking | Nsight Systems/Compute, PyTorch Profiler, roofline model, NCCL tuning |
| 03 | Kernel Programming | CUDA/HIP, cuBLAS, CUTLASS deep dive, CuTe layout/MMA, Triton, TransformerEngine |
| 04 | GEMM & Precision | Efficient GEMM pipeline, FP8/INT8, mixed precision, TensorRT |
| 05 | Attention Optimization | FlashAttention v1/v2/v3, FlashMLA code review, MLA, KV cache, SageAttn |
| 06 | MoE Optimization | GroupGemm code review, DeepEP, dispatch/combine, EPLB |
| 07 | Parallelism | TP/EP/SP, compute-communication overlap (TBO), DualPipe |
| 08 | Inference Optimization | Speculative decoding (EAGLE/Medusa/MTP), DeepGemm analysis, continuous batching, serving architecture (vLLM/SGLang) |
| 09 | ElementWise Kernels | Efficient Softmax, LayerNorm, GELU/SiLU, fused MLP |
| 10 | Quantization | AWQ, SmoothQuant, GPTQ, FP8 PTQ, quantization theory |
| 11 | Applications | Diffusion model acceleration |
| 12 | RL & Alignment | PPO, GRPO, DPO, veRL code review, MCTS |
| 13 | Model Architectures | DeepSeek V3/V3.2 full walkthrough, Qwen3 MoE/Dense |
| 14 | Training Frameworks | Megatron-LM 3D parallelism, DeepSpeed ZeRO, FSDP |
| 15 | DSL & Compiler | Triton compilation pipeline, TileLang, CuTeDSL, MLIR, auto-tuning |
| 16 | Kernel Agent | LLM-driven kernel generation, verification, agent architectures |
| A | Interview Prep | AI infra interview topics and cross-references |
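As a taste of section 02, the roofline model it covers boils down to one formula: attainable throughput is the minimum of the compute roof and arithmetic intensity times memory bandwidth. Here is a minimal sketch in Python; the hardware numbers are illustrative placeholders, not vendor specs:

```python
# Roofline model: a kernel is memory-bound when its arithmetic intensity
# (FLOPs per byte moved) falls below the machine's ridge point.

def attainable_gflops(intensity, peak_gflops, peak_bw_gbs):
    """Attainable throughput (GFLOP/s) under the roofline model."""
    return min(peak_gflops, intensity * peak_bw_gbs)

# Example: FP32 SAXPY (y = a*x + y) does 2 FLOPs per element while moving
# 12 bytes (read x, read y, write y), so intensity = 2/12 FLOP/byte.
saxpy_intensity = 2 / 12

# Illustrative accelerator: 19,500 GFLOP/s peak compute, 1,555 GB/s bandwidth.
peak_gflops, peak_bw = 19500.0, 1555.0
ridge_point = peak_gflops / peak_bw  # intensity where the two roofs meet

print(f"SAXPY attainable: "
      f"{attainable_gflops(saxpy_intensity, peak_gflops, peak_bw):.0f} GFLOP/s")
print(f"ridge point: {ridge_point:.1f} FLOP/byte; "
      f"SAXPY memory-bound: {saxpy_intensity < ridge_point}")
```

Plugging a kernel's measured intensity into this formula (section 02 shows how to extract it from Nsight Compute counters) tells you immediately whether to chase FLOPs or bytes.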
Sections marked [WIP] in sub-pages are under development — contributions welcome.
Each section contains a README.md with an overview and links to sub-topics. Bold items in the table above are sections with the deepest original analysis.
Recommended reading paths:
- Kernel development: 01 → 03 → 04 → 09 → 05 → 15
- Inference optimization: 01 → 05 → 06 → 08 → 07 → 10
- Training optimization: 01 → 04 → 07 → 14 → 12
- Frontier topics: 15 (DSL & Compiler) → 16 (Kernel Agent)
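As a preview of the kernel-level material in sections 05 and 09: the online (single-pass) softmax that FlashAttention's tiling builds on can be sketched in a few lines of Python. This is a pedagogical sketch of the algorithm, not code from any of the reviewed kernels:

```python
import math

def online_softmax(xs):
    """Single-pass softmax: maintain a running max m and a running sum s of
    exp(x - m), rescaling s whenever a new max appears. This rescaling trick
    is what lets FlashAttention process attention rows tile by tile without
    ever materializing the full row of scores."""
    m, s = float("-inf"), 0.0
    for x in xs:
        m_new = max(m, x)
        # exp(m - m_new) rescales the old sum into the new max's frame;
        # on the first element exp(-inf) == 0.0, so s starts cleanly.
        s = s * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return [math.exp(x - m) / s for x in xs]
```

The same recurrence generalizes to blocks instead of single elements, which is exactly the per-tile update analyzed in the FlashAttention walkthrough.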
All documents are available in both English and Chinese (中文). Use the language toggle at the top of each page to switch.
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
This project is licensed under CC BY 4.0.