ayghri/quantkit


QuantKit

Optimal quantization methods for block-scaled formats.

Formats

  • NVFP4 — FP4 E2M1 with FP8 E4M3 scales (block sizes 16, 32)
  • MXFP4 — FP4 E2M1 with UE8M0 (power-of-2) scales (block sizes 16, 32)
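To make the element format concrete, here is a minimal pure-Python sketch of rounding to the FP4 E2M1 grid, whose representable magnitudes are 0, 0.5, 1, 1.5, 2, 3, 4 and 6. The helper names (round_to_e2m1, quantize_block, dequantize_block) are illustrative only and are not part of the quantkit API. Ties resolve toward the smaller magnitude here, which may differ from the rounding mode the library or hardware uses.

```python
# Magnitudes representable in FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
Q_MAX = E2M1_GRID[-1]  # 6.0, the largest representable magnitude


def round_to_e2m1(x: float) -> float:
    """Round x to the nearest E2M1 value, saturating at +/-Q_MAX."""
    mag = min(abs(x), Q_MAX)
    # min() keeps the first of two equally close grid points (smaller magnitude).
    nearest = min(E2M1_GRID, key=lambda g: abs(g - mag))
    return -nearest if x < 0 else nearest


def quantize_block(block, scale):
    """Hypothetical helper: quantize one block that shares a single scale."""
    return [round_to_e2m1(v / scale) for v in block]


def dequantize_block(quants, scale):
    """Hypothetical helper: map quantized values back to the original range."""
    return [q * scale for q in quants]
```

Every value in a block is divided by the same scale before rounding, so the choice of scale decides how much of the block saturates at 6 and how coarsely the small values are resolved.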

Methods

Each format supports multiple scale selection strategies:

Method          | Description
Naive           | Standard heuristic: s = snap(amax / Q_MAX)
SSE-Optimal     | Bounded search minimizing the sum of squared quantization error
Hessian-Optimal | Bounded search minimizing the Hessian-weighted error r^T H r, using activations
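The difference between the naive and SSE-optimal strategies can be sketched for a single block with power-of-two scales (MXFP4-style). This is an illustration under stated assumptions, not the library's implementation: snap is assumed to round up to the next power of two, the search window (radius) is a made-up parameter, and all function names are hypothetical.

```python
import math

E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
Q_MAX = 6.0


def fake_quant(block, scale):
    """Quantize to E2M1 with a shared scale, then dequantize."""
    out = []
    for v in block:
        mag = min(abs(v) / scale, Q_MAX)
        q = min(E2M1_GRID, key=lambda g: abs(g - mag))
        out.append(math.copysign(q * scale, v))
    return out


def sse(block, scale):
    """Sum of squared quantization error for one block and one scale."""
    return sum((v - d) ** 2 for v, d in zip(block, fake_quant(block, scale)))


def naive_scale(block):
    """s = snap(amax / Q_MAX); snapping up to a power of two is an assumption."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 1.0
    return 2.0 ** math.ceil(math.log2(amax / Q_MAX))


def sse_optimal_scale(block, radius=2):
    """Try power-of-two scales within `radius` exponents of the naive one."""
    e0 = int(math.log2(naive_scale(block)))
    candidates = [2.0 ** e for e in range(e0 - radius, e0 + radius + 1)]
    return min(candidates, key=lambda s: sse(block, s))
```

Because the naive scale is always among the candidates, the searched scale can never give a larger squared error than the naive one on the same block. A smaller scale clips the largest value but resolves the remaining values more finely, which is where the gain comes from.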

All methods have both pure-PyTorch (reference) and Triton (GPU-accelerated) implementations.

Install

pip install -e .

Requires PyTorch, plus Triton for the GPU kernels.

Usage

from quantkit import nvfp4_naive, nvfp4_optimal, nvfp4_dequantize, compute_metrics

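# W is a 2-D weight of shape (M, K). The quantizers expect the blocks on the
# quantized dim, i.e. shape (..., block_size) with block_size 16 or 32.
M, K = W.shape
W_blocked = W.reshape(M, K // 32, 32)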

# Quantize: returns (scales, quants)
scales, quants = nvfp4_optimal(W_blocked, dim=-1)

# Dequantize
W_dq = nvfp4_dequantize(scales, quants, dim=-1)

# Or get dequantized output directly
scales, quants, W_dq = nvfp4_optimal(W_blocked, dim=-1, return_dequant=True)

# Compute metrics: weight error ||Q(W)-W||/||W|| and output error
# ||XW_q^T - XW^T||/||XW^T||, where X holds the activations
metrics = compute_metrics(W, W_dq.reshape(M, K), X)
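The two ratios in the comment above can be written out in a few lines of pure Python. This is only a sketch of the formulas: it assumes Frobenius norms, W of shape (M, K) and X of shape (N, K), and the function relative_errors is a hypothetical stand-in, not the library's compute_metrics.

```python
import math


def matmul_t(X, W):
    """Return X @ W^T for X of shape (N, K) and W of shape (M, K)."""
    return [[sum(x * w for x, w in zip(xrow, wrow)) for wrow in W] for xrow in X]


def fro_norm(A):
    """Frobenius norm of a matrix stored as a list of rows."""
    return math.sqrt(sum(v * v for row in A for v in row))


def diff(A, B):
    """Elementwise A - B."""
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def relative_errors(W, W_q, X):
    """Return (weight error, output error) as relative Frobenius norms."""
    weight_err = fro_norm(diff(W_q, W)) / fro_norm(W)
    Y, Y_q = matmul_t(X, W), matmul_t(X, W_q)
    output_err = fro_norm(diff(Y_q, Y)) / fro_norm(Y)
    return weight_err, output_err
```

The two numbers need not move together: an error in a weight column that the activations never excite contributes to the weight error but not to the output error.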

Triton-accelerated versions:

from quantkit import nvfp4_optimal_triton, nvfp4_optimal_hessian_triton

scales, quants, W_dq = nvfp4_optimal_triton(W_blocked, dim=-1, return_dequant=True)

# Hessian-aware (requires activations X)
scales, quants, W_dq = nvfp4_optimal_hessian_triton(W_blocked, dim=-1, return_dequant=True, X=X)
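The Hessian-weighted objective r^T H r can be illustrated for a single weight row. The sketch below assumes H = X^T X, a common construction from activations; the source does not state how the library builds H, and both function names are hypothetical.

```python
def hessian_from_activations(X):
    """H = X^T X for activations X of shape (N, K); an assumed construction."""
    K = len(X[0])
    return [[sum(row[i] * row[j] for row in X) for j in range(K)] for i in range(K)]


def hessian_weighted_error(w, w_q, H):
    """Return r^T H r for the residual r = w_q - w."""
    r = [q - v for q, v in zip(w_q, w)]
    Hr = [sum(h * rj for h, rj in zip(hrow, r)) for hrow in H]
    return sum(ri * hri for ri, hri in zip(r, Hr))
```

With H = X^T X, the quantity r^T H r equals the squared norm of X r, so it measures the error in the layer output for that row rather than in the weights themselves. Residuals on input channels with large activations are penalized more than residuals on quiet channels.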

Benchmarks

Benchmarked on the down_proj weight of the first decoder layer from Qwen3-4B, with activations from WikiText-2 (max_seq_len=512, num_samples=2048).

python bench/full_bench.py

NVFP4 (block size 16)

Method      | Weight Error | Output Error | Triton Speedup
Naive       | 10.05%       | 6.89%        | 1.7x
SSE-Optimal | 8.74%        | 6.04%        | 7.0x
H-Optimal   | 9.35%        | 5.31%        | 1.8x

MXFP4 (block size 16)

Method      | Weight Error | Output Error | Triton Speedup
Naive       | 11.77%       | 8.48%        | 1.7x
SSE-Optimal | 11.02%       | 7.67%        | 33x
H-Optimal   | 11.10%       | 7.62%        |

Documentation

Full documentation: quantkit.readthedocs.io

Build locally:

pip install -r docs/requirements.txt
cd docs && make html

Contact
