
Karnbir Khera

GPU Programming · MLSys 2026 FlashInfer Competitor

Currently competing in the MLSys 2026 NVIDIA FlashInfer Sparse Attention track on Blackwell B200, and writing weekly on LinkedIn about each kernel that forces the framework to grow another layer. Building a derivation-based mental model of GPU computation, the Two Tree Framework, where every kernel decision (memory binding, FSM phase, sync point, indexing) traces back to either the problem geometry or a hardware constraint, rather than being memorized from existing implementations.

Profiled architectures: Ada Lovelace (RTX 4060) · Blackwell (B200, sm_100a)

Next learning arc: the polyhedral model, dataflow analysis, and abstract algebra, to encode the patterns from 9 weeks of kernel work into MLIR and to learn how modern compilers produce optimized kernels.


Featured Projects

Submission to the NVIDIA MLSys 2026 FlashInfer AI Kernel Generation Contest on B200. A derivation-first agentic pipeline (GitHub) takes a kernel spec and a hardware target, derives the structure (geometry → algorithm class → access pattern → FSM phases → lifetime tables → indexing) before any code is written, and binds those decisions to sm_100a only at the end, so that every optimization is auditable back to a problem-space, hardware-space, or empirically derived reason.

A weekly progression of CUDA kernels building toward sparse attention: shared memory → GEMM → softmax (FP8) → dense attention → paged KV (MLA) → top-k indexer → sparse attention → optimization. Each week's kernel is the artifact that forced a new layer of the framework. For anyone learning CUDA, I recommend this repo: every kernel is commented line by line, especially the later ones, and the comments explain what each line does and how it contributes to the overall structure of the kernel.

The Two Tree Framework started as a struggle to reconstruct a tiled GEMM kernel the day after reading it, and grew into a derivation system that builds kernels up from problem geometry rather than pattern-matching from existing code. V1 builds every index from first principles. Each reduces to Coordinate × Stride + Offset, found by mapping the execution tree (Grid → Block → Thread) onto the memory tree (Global → Shared → Register). The goal is pedagogical, making the path to a correct kernel feel learnable rather than memorized.
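As a concrete illustration, here is how that decomposition looks for the global load of one tile of A in a tiled GEMM. This is a minimal sketch with illustrative names and tile size, not code from the framework repo:

```cuda
// Minimal sketch: deriving the global-memory load index for one tile of A
// in a row-major GEMM (C = A * B, A is M x K).  Names and the tile size
// are illustrative, not taken from CUDA-TwoTreeFramework.
#define TILE 32

__global__ void load_tile_of_A(const float *A, int M, int K, int kTile) {
    __shared__ float As[TILE][TILE];

    // Execution tree (Grid -> Block -> Thread) supplies the coordinates:
    //   index = Coordinate * Stride + Offset at each level.
    int row = blockIdx.y * TILE + threadIdx.y;  // coordinate in M
    int col = kTile      * TILE + threadIdx.x;  // coordinate in K

    // Memory tree (Global -> Shared): the global index is again
    // Coordinate * Stride + Offset, here row * K + col.
    As[threadIdx.y][threadIdx.x] =
        (row < M && col < K) ? A[row * K + col] : 0.0f;

    __syncthreads();
    // ... the compute and store phases of the kernel would follow ...
}
```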

Each kernel that followed forced a new layer.

  • Softmax surfaced algorithm classification, because row-max-before-normalization is an algorithm constraint, not a shape one.
  • Dense attention surfaced FSM phases, because Load → Compute → Store breaks once softmax sits in the middle.
  • Paged KV surfaced access patterns, because paged and contiguous caches cross the affine / non-affine boundary at the same shape (sketched below, after this list).
  • Top-K surfaced a refinement to the contraction classification, splitting REDUCE (associative and commutative) from GATE (associative but not commutative, where order is preserved, like scan-class reductions).
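To make the affine / non-affine boundary from the paged-KV bullet concrete, here is a hedged sketch contrasting the two index computations. The names (page_table, PAGE_SIZE, head_dim) are illustrative, not from the contest kernels:

```cuda
// Contiguous vs. paged KV indexing at the same logical shape.
#define PAGE_SIZE 16  // tokens per KV page (illustrative)

// Contiguous cache: the index is affine in (token, d):
//   index = token * head_dim + d   (Coordinate * Stride + Offset)
__device__ float load_contiguous(const float *kv, int token, int d,
                                 int head_dim) {
    return kv[token * head_dim + d];
}

// Paged cache: the physical page comes from a runtime table lookup, so the
// map from token to address is no longer affine.
__device__ float load_paged(const float *kv, const int *page_table,
                            int token, int d, int head_dim) {
    int page   = page_table[token / PAGE_SIZE];  // indirection breaks affinity
    int offset = token % PAGE_SIZE;
    return kv[(page * PAGE_SIZE + offset) * head_dim + d];
}
```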

My first CUDA project: vector addition with six implementations (naive → grid-stride → float4 vectorization → ILP=2 → ILP=4), profiled on an RTX 4060 with NVIDIA Nsight Compute, comparing measured throughput against the 272 GB/s theoretical bandwidth across 10M / 100M / 200M element runs.
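For readers who have not seen these variants, a sketch of the grid-stride + float4 combination looks roughly like the following (illustrative, not the exact code from the repo; it assumes the element count is divisible by 4):

```cuda
// Grid-stride vector add over float4 elements (n4 = n / 4).
__global__ void vec_add_f4(const float4 *a, const float4 *b, float4 *c,
                           size_t n4) {
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < n4; i += stride) {
        // One float4 load/store moves 16 bytes per thread per iteration,
        // which helps a purely memory-bound kernel approach peak bandwidth.
        float4 va = a[i], vb = b[i];
        c[i] = make_float4(va.x + vb.x, va.y + vb.y,
                           va.z + vb.z, va.w + vb.w);
    }
}
```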

The deeper investigation began with an unexpected ~31% L2 hit rate on a fully streaming kernel. That led to isolated micro-benchmarks (read-only, write-only, coalesced vs. uncoalesced), which revealed that the L2 hits surfaced only during writes, not reads. The result seemed odd until I found an arXiv paper describing a write-validate policy on the Volta architecture. I re-created the same testing environment on Ada and found a strong correlation with the paper's results, then ran the same test on the Blackwell B200 and saw very similar behavior, strongly suggesting the same write-validate policy is present there. This is covered in two LinkedIn posts: Post 1 and Post 2.
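A sketch of the write-only micro-benchmark style described above (illustrative kernel; the exact Nsight Compute metric flag is my assumption, not taken from the writeup):

```cuda
// Write-only streaming kernel: it performs no reads, so any L2 "hits"
// reported by the profiler must originate on the write path.
// Profiled with something like:
//   ncu --metrics lts__t_sector_hit_rate.pct ./bench
__global__ void write_only(float *out, size_t n, float v) {
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < n; i += stride)
        out[i] = v;
}
```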


Now

  • Wrapping up the FlashInfer competition with extension tracks: MoE (sparse dispatch with runtime routing the geometry can't fully predict) and GDN (gated delta-net for Qwen3-Next, sitting outside semiring/monoid classification).
  • Starting the next arc, MLIR + Compiler Theory, picking up the polyhedral model, dataflow analysis, and abstract algebra alongside MLIR so the patterns from 9 weeks of kernel work have a home in modern compilers.

Pinned Repositories

  1. CUDA-Vector-Addition-40-Page-Insight (Cuda)

  2. MLSys2026-9Week-LearningPlan (Cuda)

  3. CUDA-TwoTreeFramework (HTML): A systematic and pedagogical way to derive the correctness structure of 2D Register Allocated GEMM before coding.

  4. KarnbirKhera-MLSys2026-dsa_sparse_attention (Cuda): MLSys 2026 FlashInfer Contest — DSA Sparse Attention kernel

  5. KarnbirKhera-MLSys2026-dsa_topk_indexer (Cuda): MLSys 2026 FlashInfer Contest — DSA TopK Indexer kernel

  6. MLSys2026-Kernel-Agent-Framework-Templates (Python)