NVIDIA · Edwardf0t1 · Jun 17, 2026 · Jun 23, 2026 · kevalmorabia97 · Jun 17, 2026
@@ -5,7 +5,7 @@ Common detection for all ModelOpt skills. After this, you know what's available.
 ## Env-1. Get ModelOpt source
 
 ```bash
-ls examples/llm_ptq/hf_ptq.py 2>/dev/null && echo "Source found"
+ls examples/hf_ptq/hf_ptq.py 2>/dev/null && echo "Source found"
 ```
 
 If not found: `git clone https://github.com/NVIDIA/Model-Optimizer.git && cd Model-Optimizer`
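
Review note on the Env-1 check: the changelog in this PR says a relative symlink `examples/llm_ptq` -> `hf_ptq` keeps old paths working for now, so a detection step could accept either location while preferring the new one. The following is a minimal sketch under that assumption; `find_hf_ptq` and `CANDIDATE_DIRS` are hypothetical names for illustration and are not part of ModelOpt.

```python
from pathlib import Path
from typing import Optional

# Directory names come from this diff: the new location first, then the
# legacy one that the changelog says remains available as a symlink.
CANDIDATE_DIRS = ("examples/hf_ptq", "examples/llm_ptq")


def find_hf_ptq(repo_root: Path) -> Optional[Path]:
    """Return the first existing hf_ptq.py under the known example dirs."""
    for rel_dir in CANDIDATE_DIRS:
        script = repo_root / rel_dir / "hf_ptq.py"
        if script.is_file():
            return script
    # Neither location exists: the caller would fall back to cloning the repo.
    return None
```

Preferring the new path means the check keeps working once the symlink is removed in a later release.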

@@ -62,4 +62,4 @@ This matrix covers officially validated combinations. For unlisted models:
 
 - **NVFP4 inference requires Blackwell GPUs** (B100, B200, GB200). Hopper can run FP4 calibration but not inference.
 - INT4_AWQ and W4A8_AWQ are only supported by TRT-LLM (not vLLM or SGLang).
-- Source: `examples/llm_ptq/README.md` and `docs/source/deployment/3_unified_hf.rst`
+- Source: `examples/hf_ptq/README.md` and `docs/source/deployment/3_unified_hf.rst`
@@ -5,7 +5,7 @@ description: This skill should be used when the user asks to "quantize a model",
 
 # ModelOpt Post-Training Quantization
 
-Produce a quantized checkpoint from a pretrained model. **Read `examples/llm_ptq/README.md` first** — it has the support matrix, CLI flags, and accuracy guidance.
+Produce a quantized checkpoint from a pretrained model. **Read `examples/hf_ptq/README.md` first** — it has the support matrix, CLI flags, and accuracy guidance.
 
 ## Step 1 — Environment
 
@@ -19,7 +19,7 @@ Read `skills/common/environment-setup.md` and `skills/common/workspace-managemen
 
 ## Step 2 — Is the model supported?
 
-Check the support table in `examples/llm_ptq/README.md` for verified HF models.
+Check the support table in `examples/hf_ptq/README.md` for verified HF models.
 
 - **Listed** → supported, use `hf_ptq.py` (step 4A/4B)
 - **Not listed** → read `references/unsupported-models.md` to determine if `hf_ptq.py` can still work or if a custom script is needed (step 4C)
@@ -53,7 +53,7 @@ ls modelopt_recipes/huggingface/<model_type>/ptq/ 2>/dev/null  # per-arch; <mode
 
 If a model-specific recipe exists, prefer `--recipe <path>` — but **inspect its include/exclude patterns** rather than assuming (e.g. for VLMs, confirm the vision tower is actually excluded).
 
-**If no model-specific recipe**, choose a format based on GPU (details in `examples/llm_ptq/README.md`):
+**If no model-specific recipe**, choose a format based on GPU (details in `examples/hf_ptq/README.md`):
 
 - **Blackwell** (B100/B200/GB200): `nvfp4` variants
 - **Hopper** (H100/H200) or older: `fp8` or `int4_awq`
@@ -90,9 +90,9 @@ In README table? ─→ YES ──→ SLURM (local or remote)? ──→ LAUNCHE
 
 ```bash
 pip install --no-build-isolation "nvidia-modelopt[hf]"
-pip install -r examples/llm_ptq/requirements.txt
+pip install -r examples/hf_ptq/requirements.txt
 
-python examples/llm_ptq/hf_ptq.py \
+python examples/hf_ptq/hf_ptq.py \
     --pyt_ckpt_path <model> \
     --qformat <format> \
     --calib_size 512 \
@@ -105,7 +105,7 @@ For remote: use `remote_run` from `remote_exec.sh` (see `skills/common/remote-ex
 
 ### 4B — Launcher: supported model on SLURM or local Docker
 
-Write a YAML config using `common/hf_ptq/hf_ptq.sh`. See `references/launcher-guide.md` for the full template.
+Write a YAML config using `common/hf/ptq.sh`. See `references/launcher-guide.md` for the full template.
 
 ```bash
 cd tools/launcher
@@ -179,7 +179,7 @@ Report the gate result before moving on. The report must include source size, ou
 | `skills/common/remote-execution.md` | Step 4A/4C only, if target is remote |
 | `skills/common/slurm-setup.md` | Step 4A/4C only, if using SLURM manually (not launcher) |
 | `references/slurm-setup-ptq.md` | Step 4A/4C only, PTQ-specific SLURM (container, GPU sizing, FSDP2) |
-| `examples/llm_ptq/README.md` | Step 3: support matrix, CLI flags, accuracy |
+| `examples/hf_ptq/README.md` | Step 3: support matrix, CLI flags, accuracy |
 | `modelopt/torch/quantization/config.py` | Step 3: format definitions |
 | `modelopt/torch/export/model_utils.py` | Step 4C: TRT-LLM export type mapping |
 | `modelopt_recipes/` | Step 3: pre-built recipes |
@@ -7,7 +7,7 @@ monitoring), see `skills/common/slurm-setup.md`.
 
 ## 1. Container
 
-Get the recommended image version from `examples/llm_ptq/README.md`, then look for an existing `.sqsh` file:
+Get the recommended image version from `examples/hf_ptq/README.md`, then look for an existing `.sqsh` file:
 
 ```bash
 ls *.sqsh ../*.sqsh ~/containers/*.sqsh 2>/dev/null
@@ -63,17 +63,17 @@ pip install -U transformers --no-deps
 
 Estimate GPU count from model size and available GPU memory. `hf_ptq.py` uses `device_map="auto"` so it fills GPUs automatically — request only as many as needed.
 
-For multi-node PTQ (200B+ params), use `examples/llm_ptq/multinode_ptq.py` with FSDP2 and accelerate:
+For multi-node PTQ (200B+ params), use `examples/hf_ptq/multinode_ptq.py` with FSDP2 and accelerate:
 
 ```bash
 accelerate launch \
-    --config_file examples/llm_ptq/fsdp2.yaml \
+    --config_file examples/hf_ptq/fsdp2.yaml \
     --num_machines $NUM_NODES \
     --num_processes $((NUM_NODES * GPUS_PER_NODE)) \
     --main_process_ip $MASTER_ADDR \
     --main_process_port $MASTER_PORT \
     --machine_rank $SLURM_PROCID \
-    examples/llm_ptq/multinode_ptq.py \
+    examples/hf_ptq/multinode_ptq.py \
         --pyt_ckpt_path <model> \
         --qformat <format> \
         --export_path <output>

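Review note on the multi-node command: the renamed directory appears twice in the `accelerate launch` invocation (config file and script), so building the argument list from a single directory variable would make a future rename a one-line change. This is only a sketch; `multinode_ptq_command` and its parameters are hypothetical, and the flags are the ones already shown in the hunk above.

```python
from typing import List


def multinode_ptq_command(
    num_nodes: int,
    gpus_per_node: int,
    master_addr: str,
    master_port: int,
    machine_rank: int,
    model: str,
    qformat: str,
    export_path: str,
    example_dir: str = "examples/hf_ptq",
) -> List[str]:
    """Build the accelerate launch argument list shown in the SLURM guide."""
    if num_nodes < 1 or gpus_per_node < 1:
        raise ValueError("num_nodes and gpus_per_node must be positive")
    return [
        "accelerate", "launch",
        "--config_file", f"{example_dir}/fsdp2.yaml",
        "--num_machines", str(num_nodes),
        # One process per GPU across all nodes, as in the shell arithmetic.
        "--num_processes", str(num_nodes * gpus_per_node),
        "--main_process_ip", master_addr,
        "--main_process_port", str(master_port),
        "--machine_rank", str(machine_rank),
        f"{example_dir}/multinode_ptq.py",
        "--pyt_ckpt_path", model,
        "--qformat", qformat,
        "--export_path", export_path,
    ]
```

The returned list can be passed to `subprocess.run` without shell quoting concerns.
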
@@ -1,6 +1,6 @@
 # Handling Unlisted Models
 
-The model is not in the verified support table (`examples/llm_ptq/README.md`). This does NOT mean it won't work — ModelOpt auto-detects standard HF modules (linear layers, attention, MoE blocks with `gate`+`experts`). Many unlisted models work with `hf_ptq.py` out of the box.
+The model is not in the verified support table (`examples/hf_ptq/README.md`). This does NOT mean it won't work — ModelOpt auto-detects standard HF modules (linear layers, attention, MoE blocks with `gate`+`experts`). Many unlisted models work with `hf_ptq.py` out of the box.
 
 Follow the investigation steps below to determine if `hf_ptq.py` works or if patches are needed.
 
@@ -147,7 +147,7 @@ class QuantCustomModule(OriginalModule):
 | Fused 2D weights (experts stacked in rows) | Two-level expansion | `_QuantDbrxExpertGLU` |
 | Fused weights + `forward(x, expert_id)` | Expand + reconstruct on export | `_QuantMoELinear` (Step3.5) |
 
-For the full guide, see `examples/llm_ptq/moe.md`.
+For the full guide, see `examples/hf_ptq/README.md`.
 
 **Critical: always check the weight layout.** `nn.Linear` expects `(out_features, in_features)` — the last dimension must be `in_features`. If the fused tensor is `(num_experts, in_dim, out_dim)`, you must transpose (`.T`) when copying. Getting this wrong silently corrupts quantization scales. Inspect the original forward pass to determine which dimension is which.
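
Review note on the weight-layout warning: the transpose requirement can be illustrated without any framework. The sketch below uses plain nested lists as a stand-in for tensors (an assumption for portability, not how ModelOpt does it), and all function names are hypothetical. It shows that a fused `(num_experts, in_dim, out_dim)` tensor must be transposed per expert to obtain the `(out_features, in_features)` layout that `nn.Linear` expects.

```python
from typing import List

Matrix = List[List[float]]


def transpose(m: Matrix) -> Matrix:
    """Swap rows and columns of a 2D nested list."""
    return [list(col) for col in zip(*m)]


def split_fused_experts(fused: List[Matrix]) -> List[Matrix]:
    """(num_experts, in_dim, out_dim) -> list of (out_dim, in_dim) weights."""
    return [transpose(expert) for expert in fused]


def fused_forward(x: List[float], expert: Matrix) -> List[float]:
    """y = x @ expert, where expert has shape (in_dim, out_dim)."""
    out_dim = len(expert[0])
    return [sum(x[i] * expert[i][j] for i in range(len(x))) for j in range(out_dim)]


def linear(x: List[float], weight: Matrix) -> List[float]:
    """y = x @ weight.T, where weight has shape (out_features, in_features)."""
    return [sum(xi * wi for xi, wi in zip(x, row)) for row in weight]


# One expert with in_dim=2 and out_dim=3.
fused = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]]
weights = split_fused_experts(fused)
x = [1.0, 10.0]
# Both paths must agree; skipping the transpose would fail this check.
assert linear(x, weights[0]) == fused_forward(x, fused[0])
```

Comparing the expanded module's output against the original fused forward pass on a small input, as done in the last line, is a cheap way to catch a wrong layout before calibration.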
 

@@ -49,7 +49,7 @@ modelopt_recipes @NVIDIA/modelopt-recipes-codeowners
 /examples/llm_autodeploy @NVIDIA/modelopt-deploy-codeowners
 /examples/llm_distill @NVIDIA/modelopt-torch-distill-codeowners
 /examples/llm_eval @NVIDIA/modelopt-examples-llm_ptq-codeowners
-/examples/llm_ptq @NVIDIA/modelopt-examples-llm_ptq-codeowners
+/examples/hf_ptq @NVIDIA/modelopt-examples-llm_ptq-codeowners
 /examples/llm_qat @NVIDIA/modelopt-examples-llm_qat-codeowners
 /examples/llm_sparsity @NVIDIA/modelopt-torch-sparsity-codeowners
 /examples/megatron_bridge @NVIDIA/modelopt-examples-megatron-codeowners
@@ -60,7 +60,6 @@ modelopt_recipes @NVIDIA/modelopt-recipes-codeowners
 /examples/specdec_bench @NVIDIA/modelopt-torch-speculative-codeowners
 /examples/speculative_decoding @NVIDIA/modelopt-torch-speculative-codeowners
 /examples/torch_onnx @NVIDIA/modelopt-onnx-codeowners
-/examples/vlm_ptq @NVIDIA/modelopt-examples-vlm-codeowners
 /examples/vllm_serve @NVIDIA/modelopt-examples-llm_ptq-codeowners
 /examples/windows @NVIDIA/modelopt-windows-codeowners
 

@@ -9,7 +9,7 @@ on:
         required: true
         type: string
       example:
-        description: "Example name to test (e.g. 'llm_ptq')"
+        description: "Example name to test (e.g. 'hf_ptq')"
         required: true
         type: string
       timeout_minutes:

@@ -55,7 +55,7 @@ jobs:
     strategy:
       fail-fast: false
       matrix:
-        example: [llm_ptq]
+        example: [hf_ptq]
     uses: ./.github/workflows/_example_tests_runner.yml
     secrets: inherit
     with:
@@ -69,7 +69,7 @@ jobs:
     strategy:
       fail-fast: false
       matrix:
-        example: [llm_autodeploy, llm_eval, llm_ptq]
+        example: [llm_autodeploy, llm_eval, hf_ptq]
     uses: ./.github/workflows/_example_tests_runner.yml
     secrets: inherit
     with:

@@ -6,8 +6,9 @@ Changelog
 
 **Deprecations**
 
-- Consolidated ``examples/vlm_ptq`` into ``examples/llm_ptq``. Vision-language model PTQ now shares the ``hf_ptq.py`` entry point and ``scripts/huggingface_example.sh``; pass ``--vlm`` to run the TensorRT-LLM multimodal quickstart smoke test. The ``examples/vlm_ptq/scripts/huggingface_example.sh`` entry point is deprecated: it now prints a warning and forwards to the ``llm_ptq`` script with ``--vlm``, and will be removed in a future release. See `examples/llm_ptq/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/llm_ptq#vlm-quantization>`__.
-- Dropped VILA / NVILA vision-language model support in ``examples/llm_ptq``. VILA's modeling code requires ``transformers<=4.50.0``, which conflicts with ModelOpt's minimum supported ``transformers`` version. The VILA-specific bootstrap (repo clone, ``requirements-vila.txt``) and loading paths in ``example_utils.py`` have been removed.
+- Renamed ``examples/llm_ptq`` to ``examples/hf_ptq`` to reflect that it covers Hugging Face LLM **and** VLM PTQ. A relative symlink ``examples/llm_ptq`` -> ``hf_ptq`` keeps existing paths and commands working; it will be removed in a future release. Please update references to the new ``examples/hf_ptq`` path.
+- Consolidated ``examples/vlm_ptq`` into ``examples/hf_ptq``. Vision-language model PTQ now shares the ``hf_ptq.py`` entry point and ``scripts/huggingface_example.sh``; pass ``--vlm`` to run the TensorRT-LLM multimodal quickstart smoke test. The ``examples/vlm_ptq/scripts/huggingface_example.sh`` entry point is deprecated: it now prints a warning and forwards to the ``hf_ptq`` script with ``--vlm``, and will be removed in a future release. See `examples/hf_ptq/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/hf_ptq#vlm-quantization>`__.
+- Dropped VILA / NVILA vision-language model support in ``examples/hf_ptq``. VILA's modeling code requires ``transformers<=4.50.0``, which conflicts with ModelOpt's minimum supported ``transformers`` version. The VILA-specific bootstrap (repo clone, ``requirements-vila.txt``) and loading paths in ``example_utils.py`` have been removed.
 
 **New Features**
 

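Review note on the changelog's request to update references: a mechanical rewrite of `examples/llm_ptq` to `examples/hf_ptq` covers most of the path edits in this PR. The sketch below is one possible approach; `update_example_paths` is a hypothetical helper. It only touches the directory path, so identifiers such as the `modelopt-examples-llm_ptq-codeowners` team slug, which this PR leaves unchanged, are not affected.

```python
import re

# Match the legacy directory only when it follows "examples/" and is not
# the prefix of a longer name (e.g. "examples/llm_ptq_legacy").
_LEGACY_DIR = re.compile(r"examples/llm_ptq(?![\w-])")


def update_example_paths(text: str) -> str:
    """Rewrite references to the renamed example directory."""
    return _LEGACY_DIR.sub("examples/hf_ptq", text)
```

README anchors still need a manual pass: several links in this PR also change their fragment (for example `#llama-4` becomes `#support-matrix`), which a path-only rewrite cannot infer.
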
@@ -30,7 +30,7 @@ Model Optimizer is also integrated with [NVIDIA Megatron-Bridge](https://github.
 - [2026/05/13] [**Puzzletron**](./examples/puzzletron): A new algorithm for heterogeneous pruning & NAS of LLM and VLM models.
 - [2026/04/15] Customer story: [Domyn compresses Colosseum-355B → 260B using ModelOpt's Minitron pruning + distillation](https://www.domyn.com/blog/domyn-large-the-journey-of-a-european-sovereign-ai-model-for-regulated-industries)
 - [2026/03/17] Customer story: [Bielik.AI builds Bielik Minitron 7B (33% smaller, 50% faster, 90% quality retained) using ModelOpt's Minitron pruning + distillation](https://bielik.ai/en/nvidia-gtc-bielik-minitron-premiere/)
-- [2026/03/11] Model Optimizer quantized Nemotron-3-Super checkpoints are available on Hugging Face for download: [FP8](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8), [NVFP4](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4). Learn more in the [Nemotron 3 Super release blog](https://blogs.nvidia.com/blog/nemotron-3-super-agentic-ai/). Check out how to quantize Nemotron 3 models for deployment acceleration [here](./examples/llm_ptq/README.md)
+- [2026/03/11] Model Optimizer quantized Nemotron-3-Super checkpoints are available on Hugging Face for download: [FP8](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8), [NVFP4](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4). Learn more in the [Nemotron 3 Super release blog](https://blogs.nvidia.com/blog/nemotron-3-super-agentic-ai/). Check out how to quantize Nemotron 3 models for deployment acceleration [here](./examples/hf_ptq/README.md)
 - [2026/03/11] [NeMo Megatron Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge) now supports Nemotron-3-Super quantization (PTQ and QAT) and export workflows using the Model Optimizer library. See the [Quantization (PTQ and QAT) guide](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/super-v3/docs/models/llm/nemotron3-super.md#quantization-ptq-and-qat) for FP8/NVFP4 quantization and HF export instructions.
 - [2025/12/11] [BLOG: Top 5 AI Model Optimization Techniques for Faster, Smarter Inference](https://developer.nvidia.com/blog/top-5-ai-model-optimization-techniques-for-faster-smarter-inference/)
 - [2025/12/08] NVIDIA TensorRT Model Optimizer is now officially rebranded as NVIDIA Model Optimizer.
@@ -42,10 +42,10 @@ Model Optimizer is also integrated with [NVIDIA Megatron-Bridge](https://github.
 - [2025/06/24] [BLOG: Introducing NVFP4 for Efficient and Accurate Low-Precision Inference](https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/)
 - [2025/05/14] [NVIDIA TensorRT Unlocks FP4 Image Generation for NVIDIA Blackwell GeForce RTX 50 Series GPUs](https://developer.nvidia.com/blog/nvidia-tensorrt-unlocks-fp4-image-generation-for-nvidia-blackwell-geforce-rtx-50-series-gpus/)
 - [2025/04/21] [Adobe optimized deployment using Model-Optimizer + TensorRT leading to a 60% reduction in diffusion latency, a 40% reduction in total cost of ownership](https://developer.nvidia.com/blog/optimizing-transformer-based-diffusion-models-for-video-generation-with-nvidia-tensorrt/)
-- [2025/04/05] [NVIDIA Accelerates Inference on Meta Llama 4 Scout and Maverick](https://developer.nvidia.com/blog/nvidia-accelerates-inference-on-meta-llama-4-scout-and-maverick/). Check out how to quantize Llama4 for deployment acceleration [here](./examples/llm_ptq/README.md#llama-4)
+- [2025/04/05] [NVIDIA Accelerates Inference on Meta Llama 4 Scout and Maverick](https://developer.nvidia.com/blog/nvidia-accelerates-inference-on-meta-llama-4-scout-and-maverick/). Check out how to quantize Llama4 for deployment acceleration [here](./examples/hf_ptq/README.md#support-matrix)
 - [2025/03/18] [World's Fastest DeepSeek-R1 Inference with Blackwell FP4 & Increasing Image Generation Efficiency on Blackwell](https://developer.nvidia.com/blog/nvidia-blackwell-delivers-world-record-deepseek-r1-inference-performance/)
 - [2025/02/25] Model Optimizer quantized NVFP4 models available on Hugging Face for download: [DeepSeek-R1-FP4](https://huggingface.co/nvidia/DeepSeek-R1-FP4), [Llama-3.3-70B-Instruct-FP4](https://huggingface.co/nvidia/Llama-3.3-70B-Instruct-FP4), [Llama-3.1-405B-Instruct-FP4](https://huggingface.co/nvidia/Llama-3.1-405B-Instruct-FP4)
-- [2025/01/28] Model Optimizer has added support for NVFP4. Check out an example of NVFP4 PTQ [here](./examples/llm_ptq/README.md#model-quantization-and-trt-llm-conversion).
+- [2025/01/28] Model Optimizer has added support for NVFP4. Check out an example of NVFP4 PTQ [here](./examples/hf_ptq/README.md#getting-started).
 - [2025/01/28] Model Optimizer is now open source!
 
 <details close>
@@ -56,7 +56,7 @@ Model Optimizer is also integrated with [NVIDIA Megatron-Bridge](https://github.
 - [2024/08/28] [Boosting Llama 3.1 405B Performance up to 44% with Model Optimizer on NVIDIA H200 GPUs](https://developer.nvidia.com/blog/boosting-llama-3-1-405b-performance-by-up-to-44-with-nvidia-tensorrt-model-optimizer-on-nvidia-h200-gpus/)
 - [2024/08/28] [Up to 1.9X Higher Llama 3.1 Performance with Medusa](https://developer.nvidia.com/blog/low-latency-inference-chapter-1-up-to-1-9x-higher-llama-3-1-performance-with-medusa-on-nvidia-hgx-h200-with-nvlink-switch/)
 - [2024/08/15] New features in recent releases: [Cache Diffusion](./examples/diffusers/cache_diffusion), [QLoRA workflow with NVIDIA NeMo](https://docs.nvidia.com/nemo-framework/user-guide/24.09/sft_peft/qlora.html), and more. Check out [our blog](https://developer.nvidia.com/blog/nvidia-tensorrt-model-optimizer-v0-15-boosts-inference-performance-and-expands-model-support/) for details.
-- [2024/06/03] Model Optimizer now has an experimental feature to deploy to vLLM as part of our effort to support popular deployment frameworks. Check out the workflow [here](./examples/llm_ptq/README.md#deploy-fp8-quantized-model-using-vllm)
+- [2024/06/03] Model Optimizer now has an experimental feature to deploy to vLLM as part of our effort to support popular deployment frameworks. Check out the workflow [here](./examples/hf_ptq/README.md#vllm)
 - [2024/05/08] [Announcement: Model Optimizer Now Formally Available to Further Accelerate GenAI Inference Performance](https://developer.nvidia.com/blog/accelerate-generative-ai-inference-performance-with-nvidia-tensorrt-model-optimizer-now-publicly-available/)
 - [2024/03/27] [Model Optimizer supercharges TensorRT-LLM to set MLPerf LLM inference records](https://developer.nvidia.com/blog/nvidia-h200-tensor-core-gpus-and-nvidia-tensorrt-llm-set-mlperf-llm-inference-records/)
 - [2024/03/18] [GTC Session: Optimize Generative AI Inference with Quantization in TensorRT-LLM and TensorRT](https://www.nvidia.com/en-us/on-demand/session/gtc24-s63213/)
@@ -102,7 +102,7 @@ more fine-grained control on installed dependencies or for alternative docker im
 
 | **Technique** | **Description** | **Examples** | **Docs** |
 | :------------: | :------------: | :------------: | :------------: |
-| Post Training Quantization | Compress model size by 2x-4x, speeding up inference while preserving model quality! | \[[HF LLMs / VLMs](./examples/llm_ptq/)\] \[[Megatron-Bridge LLMs / VLMs](./examples/megatron_bridge/)\] \[[Diffusers](./examples/diffusers/)\] \[[ONNX](./examples/onnx_ptq/)\] \[[Windows](./examples/windows/)\] | \[[docs](https://nvidia.github.io/Model-Optimizer/guides/1_quantization.html)\] |
+| Post Training Quantization | Compress model size by 2x-4x, speeding up inference while preserving model quality! | \[[HF LLMs / VLMs](./examples/hf_ptq/)\] \[[Megatron-Bridge LLMs / VLMs](./examples/megatron_bridge/)\] \[[Diffusers](./examples/diffusers/)\] \[[ONNX](./examples/onnx_ptq/)\] \[[Windows](./examples/windows/)\] | \[[docs](https://nvidia.github.io/Model-Optimizer/guides/1_quantization.html)\] |
 | Quantization Aware Training / Distillation | Refine accuracy of quantized models even further with a few training steps! | \[[Hugging Face](./examples/llm_qat/)\] \[[Megatron-Bridge](./examples/megatron_bridge)\] | \[[docs](https://nvidia.github.io/Model-Optimizer/guides/1_quantization.html)\] |
 | Pruning | Reduce your model parameters or memory footprint and accelerate inference by removing unnecessary weights! | \[[General](./examples/pruning/)\] \[[Megatron-Bridge](./examples/megatron_bridge/)\] | |
 | Distillation | Reduce deployment model size by teaching small models to behave like larger models! | \[[Hugging Face](./examples/llm_distill/)\] \[[Megatron-Bridge](./examples/megatron_bridge/)\] \[[Megatron-LM](./examples/llm_distill/README.md#knowledge-distillation-kd-in-nvidia-megatron-lm-framework)\] | \[[docs](https://nvidia.github.io/Model-Optimizer/guides/4_distillation.html)\] |
@@ -130,8 +130,7 @@ more fine-grained control on installed dependencies or for alternative docker im
 
 | Model Type | Support Matrix |
 |------------|----------------|
-| LLM Quantization | [View Support Matrix](./examples/llm_ptq/README.md#support-matrix) |
-| VLM Quantization | [View Support Matrix](./examples/llm_ptq/README.md#hugging-face-supported-models) |
+| LLM / VLM Quantization | [View Support Matrix](./examples/hf_ptq/README.md#support-matrix) |
 | Diffusers Quantization | [View Support Matrix](./examples/diffusers/README.md#support-matrix) |
 | ONNX Quantization | [View Support Matrix](./examples/torch_onnx/README.md#onnx-export-supported-llm-models) |
 | Windows Quantization | [View Support Matrix](./examples/windows/README.md#support-matrix) |