ZJLi2013/accel_computing_notes


中文 | English

Accelerated Computing Notes

Field notes from an AI infrastructure engineer — covering GPU kernel internals, LLM inference/training optimization, and the full stack from hardware architecture to DSL compilers and LLM-driven kernel agents.

Built over several years of hands-on work shipping AI infrastructure across NVIDIA and AMD platforms. This is not a textbook — it's a practitioner's working reference with source-code-level analysis, cross-platform insights, and real-world optimization notes.

What Makes This Different

Most AI infra resources are either paper summaries or high-level overviews. This repo goes deeper:

  • Kernel-level code reviews — line-by-line analysis of FlashMLA, MoE GroupGemm, DeepGemm, sglang TBO pipeline, and more
  • Cross-platform perspective — NVIDIA (Volta → Blackwell) and AMD (MI300/CDNA3), CUDA and HIP, cuBLAS and hipBLAS side by side
  • End-to-end coverage — from GPU microarchitecture → kernel programming → model architecture → training frameworks → DSL compilers → LLM kernel agents, all connected with cross-references
  • Frontier topics — Triton compilation pipeline, TileLang, CuTeDSL comparison, LLM-guided auto-tuning, kernel agent architectures

Who Is This For

  • Kernel engineers writing CUDA/HIP/Triton kernels for AI workloads
  • AI infra engineers optimizing LLM training and inference systems
  • System architects designing GPU clusters and parallel computing frameworks
  • Researchers exploring DSL compilers, auto-tuning, and LLM-assisted kernel development

Prerequisites

  • Familiarity with C/C++ and Python
  • Basic understanding of GPU programming concepts (threads, warps, shared memory)
  • Experience with PyTorch or similar deep learning frameworks
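As a quick self-check of the warp/shared-memory background assumed above, the core idiom is the tree reduction a warp performs with shuffle-down intrinsics. A plain-Python sketch (names and structure are illustrative, not any library's API):

```python
# Pure-Python sketch of a warp-level tree reduction: 32 lanes, and each
# step halves the active width, mimicking the __shfl_down_sync pattern
# where lane i reads the value held by lane i + offset.

WARP_SIZE = 32

def warp_reduce_sum(lanes):
    """Sum 32 per-lane values via log2(32) = 5 halving steps."""
    assert len(lanes) == WARP_SIZE
    vals = list(lanes)
    offset = WARP_SIZE // 2
    while offset > 0:
        for lane in range(offset):
            # In CUDA this is one shuffle per lane, executed in lockstep.
            vals[lane] += vals[lane + offset]
        offset //= 2
    return vals[0]  # lane 0 ends up holding the full sum

total = warp_reduce_sum(list(range(WARP_SIZE)))  # sum of 0..31 → 496
```

If this pattern is unfamiliar, Section 03 (Kernel Programming) covers the real CUDA/HIP versions.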

Table of Contents

| # | Section | Highlights |
|----|---------|------------|
| 01 | GPU Architecture & AI Systems | Volta → Ampere → Hopper → Blackwell, AMD MI300/CDNA3, DGX best practices |
| 02 | Profiling & Benchmarking | Nsight Systems/Compute, PyTorch Profiler, roofline model, NCCL tuning |
| 03 | Kernel Programming | CUDA/HIP, cuBLAS, CUTLASS deep dive, CuTe layout/MMA, Triton, TransformerEngine |
| 04 | GEMM & Precision | Efficient GEMM pipeline, FP8/INT8, mixed precision, TensorRT |
| 05 | Attention Optimization | FlashAttention v1/v2/v3, FlashMLA code review, MLA, KV cache, SageAttn |
| 06 | MoE Optimization | GroupGemm code review, DeepEP, dispatch/combine, EPLB |
| 07 | Parallelism | TP/EP/SP, compute-communication overlap (TBO), dual-pipe |
| 08 | Inference Optimization | Speculative decoding (EAGLE/Medusa/MTP), DeepGemm analysis, continuous batching, serving architecture (vLLM/SGLang) |
| 09 | ElementWise Kernels | Efficient Softmax, LayerNorm, GELU/SiLU, fused MLP |
| 10 | Quantization | AWQ, SmoothQuant, GPTQ, FP8 PTQ, quantization theory |
| 11 | Applications | Diffusion model acceleration |
| 12 | RL & Alignment | PPO, GRPO, DPO, veRL code review, MCTS |
| 13 | Model Architectures | DeepSeek V3/V3.2 full walkthrough, Qwen3 MoE/Dense |
| 14 | Training Frameworks | Megatron-LM 3D parallelism, DeepSpeed ZeRO, FSDP |
| 15 | DSL & Compiler | Triton compilation pipeline, TileLang, CuTeDSL, MLIR, auto-tuning |
| 16 | Kernel Agent | LLM-driven kernel generation, verification, agent architectures |
| A | Interview Prep | AI infra interview topics and cross-references |

Sections marked [WIP] in sub-pages are under development — contributions welcome.
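The roofline model mentioned under Section 02 is compact enough to sketch here: attainable throughput is the minimum of peak compute and memory bandwidth times arithmetic intensity. The hardware numbers below are hypothetical, not tied to any specific GPU:

```python
# Roofline model sketch: a kernel is memory-bound below the ridge point
# (peak_flops / mem_bw, in FLOPs/byte) and compute-bound above it.

def roofline(peak_flops: float, mem_bw: float, arithmetic_intensity: float) -> float:
    """Attainable FLOP/s = min(peak compute, bandwidth x FLOPs-per-byte)."""
    return min(peak_flops, mem_bw * arithmetic_intensity)

# Hypothetical accelerator: 100 TFLOP/s peak, 2 TB/s HBM bandwidth.
PEAK = 100e12
BW = 2e12

ridge = PEAK / BW  # 50 FLOPs/byte: the compute/memory-bound crossover

low_ai = roofline(PEAK, BW, 4.0)     # memory-bound: 8 TFLOP/s attainable
high_ai = roofline(PEAK, BW, 128.0)  # compute-bound: capped at peak
```

Section 02 applies this model with measured numbers from Nsight Compute rather than datasheet peaks.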

How to Navigate

Each section contains a README.md with an overview and links to sub-topics. Bold items in the table above are sections with the deepest original analysis.

Recommended reading paths:

  • Kernel development: 01 → 03 → 04 → 09 → 05 → 15
  • Inference optimization: 01 → 05 → 06 → 08 → 07 → 10
  • Training optimization: 01 → 04 → 07 → 14 → 12
  • Frontier topics: 15 (DSL & Compiler) → 16 (Kernel Agent)

Language

All documents are available in both English and Chinese (中文). Use the language toggle at the top of each page to switch.

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

License

This project is licensed under the CC BY 4.0 license.
