NVIDIA · danielkorzekwa · Jun 22, 2026
@@ -1,6 +1,6 @@
 ---
 name: puzzletron
-description: "End-to-end workflow for model pruning and MIP-based optimization. Commands: mip, all, add-model. Usage: /puzzletron <command> [args]"
+description: "End-to-end workflow for model pruning and MIP-based optimization. Commands: mip, all, add-model, eval (list/mmlu). Usage: /puzzletron <command> [args]"
 license: Apache-2.0
 ---
 
@@ -11,7 +11,7 @@ license: Apache-2.0
 **STEP 1 — Check args before doing anything else. This is MANDATORY.**
 
 - If args are **empty**, output the block below verbatim and **STOP immediately. Do NOT proceed to any command.**
-- If the first word of args does **not exactly match** `mip`, `all`, or `add-model`, output the block below verbatim and **STOP immediately. Do NOT proceed to any command.**
+- If the first word of args does **not exactly match** `mip`, `all`, `add-model`, or `eval`, output the block below verbatim and **STOP immediately. Do NOT proceed to any command.**
 
 ---
 
@@ -25,6 +25,9 @@ Available commands:
 - `all <nproc_per_node>` — Run the full Puzzletron pipeline (nproc_per_node: number of GPUs per node)
 - `all progress` — Show live full pipeline progress with timing summary
 - `add-model <hf_model_path>` — Implement descriptor, converter, and configs for an unsupported model
+- `eval list [puzzle_dir]` — List all available checkpoints (teacher + sweep solutions) with their index numbers; auto-discovers puzzle_dir if omitted
+- `eval progress [puzzle_dir]` — Show per-checkpoint MMLU eval status and accuracy; auto-discovers puzzle_dir if omitted
+- `eval mmlu <index|hf_model_path> [--puzzle_dir <dir>] [--limit <N>] [--batch_size <B>]` — Evaluate a checkpoint on MMLU (5-shot); pass index from `eval list` or a direct path; add `--limit N` for a smoke test
 
 Usage: `/puzzletron <command> [args]`
 
@@ -48,7 +51,7 @@ Parse `nproc_per_node` from args using either positional or flag syntax:
 Run the following Bash command, substituting `<nproc_per_node>` with the parsed value:
 
 ```bash
-set -o pipefail && export PYTHONPATH=$PYTHONPATH:/workspace/Model-Optimizer && \
+set -o pipefail && export PYTHONPATH=$PYTHONPATH:. && \
 torchrun --nproc_per_node <nproc_per_node> examples/puzzletron/main.py \
   --config examples/puzzletron/configs/llama-3_1-8B_pruneffn_memory/llama-3_1-8B_pruneffn_memory.yaml \
   2>&1 | tee ./log.txt | grep "Puzzletron Progress"
@@ -82,7 +85,7 @@ Parse `nproc_per_node` from args using either positional or flag syntax:
 Run the following Bash command, substituting `<nproc_per_node>` with the parsed value:
 
 ```bash
-set -o pipefail && export PYTHONPATH=$PYTHONPATH:/workspace/Model-Optimizer && \
+set -o pipefail && export PYTHONPATH=$PYTHONPATH:. && \
 torchrun --nproc_per_node <nproc_per_node> examples/puzzletron/main.py \
   --config examples/puzzletron/configs/llama-3_1-8B_pruneffn_memory/llama-3_1-8B_pruneffn_memory.yaml \
   --mip-only 2>&1 | tee ./log.txt | grep "Puzzletron Progress"
@@ -114,6 +117,73 @@ Run the following Bash command. Present the output to the user wrapped in a fenc
 python3 .agents/skills/puzzletron/mip_sweep.py
 ```
 
+## Command: eval
+
+- If the second word is not exactly `list`, `progress`, or `mmlu`, tell the user: "Unknown eval sub-command. Available: `list`, `progress`, `mmlu`." and **STOP**.
+
+### eval list
+
+Parse `puzzle_dir` from args (third word or `--puzzle_dir <value>`). It is optional.
+
+Run the following Bash command, including `<puzzle_dir>` as an argument when provided, or omitting it to trigger auto-discovery:
+
+```bash
+python3 .agents/skills/puzzletron/eval_list.py [<puzzle_dir>]
+```
+
+Present the output to the user wrapped in a fenced code block (``` ... ```).
+
+### eval progress
+
+Parse `puzzle_dir` from args (third word or `--puzzle_dir <value>`). It is optional.
+
+Run the following Bash command, including `<puzzle_dir>` as an argument when provided, or omitting it to trigger auto-discovery:
+
+```bash
+python3 .agents/skills/puzzletron/eval_progress.py [<puzzle_dir>]
+```
+
+Present the output to the user wrapped in a fenced code block (``` ... ```).
+
+### eval mmlu
+
+Parse args:
+- `index_or_path` — third word. If missing, ask: "Please provide a checkpoint index (from `eval list`) or a direct HF model path." and **STOP**.
+- `--puzzle_dir <dir>` — optional; used when resolving an index.
+- If `index_or_path` matches `^[0-9]+$`, resolve it to a path by running `python3 .agents/skills/puzzletron/eval_list.py [<puzzle_dir>]` and taking the `Path` column of the row whose `#` column equals the index (indices are 0-based; ignore the header, separator, and usage lines). If the index is out of range, tell the user and **STOP**.
+- Otherwise treat `index_or_path` as a literal `hf_model_path`.
+- `--limit <N>` — optional integer; omit the flag entirely if not provided.
+- `--batch_size <B>` — optional integer; default `4` if not provided.
+
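The index lookup described in the list above could be sketched as follows. This is a minimal illustration, not the skill's implementation: `resolve_index` is a hypothetical helper, and it assumes the row layout printed by `eval_list.py` (index first, path as the last whitespace-separated token, no whitespace inside paths).

```python
import re

# Hypothetical helper: maps an index from `eval list` output to its path.
# Assumes each data row starts with the index and ends with the path,
# and that paths contain no whitespace.
ROW = re.compile(r"^(\d+)\s+.*?(\S+)\s*$")


def resolve_index(listing: str, index: int) -> str:
    """Return the path for `index`, or raise IndexError if it is absent."""
    for line in listing.splitlines():
        m = ROW.match(line)
        # Header, separator, and usage lines do not start with digits,
        # so they never match.
        if m and int(m.group(1)) == index:
            return m.group(2)
    raise IndexError(f"checkpoint index {index} not found in listing")
```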
+Derive `output_path` as `<hf_model_path>/eval_results/mmlu` (always; not user-configurable).
+
+Run the following Bash command, substituting the parsed values:
+
+```bash
+PYTHONPATH=.:$PYTHONPATH python examples/llm_eval/lm_eval_hf.py \
+  --model hf \
+  --model_args pretrained=<hf_model_path>,dtype=bfloat16,parallelize=True \
+  --tasks mmlu \
+  --num_fewshot 5 \
+  --batch_size <batch_size> \
+  --output_path <hf_model_path>/eval_results/mmlu \
+  [--limit <N>]
+```
+
+(Replace `[--limit <N>]` with the actual `--limit <N>` flag when provided, or omit it.)
+
+**Note on output file location:** lm_eval does not write results directly into `--output_path`. It creates a subdirectory named after the full model path with `/` replaced by `__`, then writes `results_<timestamp>.json` inside it. For example, for a model at `/workspace/foo/bar`, results land at:
+
+```text
+<output_path>/__workspace__foo__bar/results_<timestamp>.json
+```
+
+`eval_progress.py` handles this automatically via a recursive glob.
+
+Stream output to the user as it arrives. When the command finishes:
+- Report the exit code.
+- Show a results summary table with: model path, total questions evaluated, loglikelihood requests, and the mmlu/category accuracy scores parsed from the output.
+
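The output-location note in the `eval mmlu` section can be illustrated with a short sketch. It assumes only what that note states (every `/` in the model path becomes `__`, and result files are named `results_<timestamp>.json`). Both function names are hypothetical and are not taken from `eval_progress.py`, whose code is not shown here.

```python
import glob
import os


def results_subdir(model_path: str) -> str:
    """Subdirectory name per the note: every '/' becomes '__'."""
    return model_path.replace("/", "__")


def find_results(output_path: str) -> list[str]:
    """Recursively collect results_*.json files under output_path, sorted by name."""
    # '**' with recursive=True matches zero or more nested directories.
    pattern = os.path.join(output_path, "**", "results_*.json")
    return sorted(glob.glob(pattern, recursive=True))
```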
 ## Command: add-model
 
 Parse `hf_model_path` from args (the second word). If missing, ask: "Please provide the HuggingFace model path (local or hub)." and **STOP**.

@@ -207,3 +207,86 @@ Results from: /workspace/puzzle_dir_qwen3_5-0.8b/mip_sweep_results.csv
 ```
 
 Use this table to pick the compression rate that best meets your accuracy/memory budget.
+
+---
+
+### Step 7: Run MMLU eval on all checkpoints
+
+After the sweep you have 7 compressed checkpoints plus the teacher. Evaluate all of them on MMLU (5-shot) to get an external benchmark score alongside the internal sweep losses.
+
+> **User:** evaluate all checkpoints on MMLU
+
+Claude first lists the available checkpoints:
+
+```text
+/puzzletron eval list
+```
+
+Example output:
+
+```text
+#     Label           MMLU    Path
+--------------------------------------------------------------------------------------------
+0     teacher                 /workspace/hf_models/Qwen/Qwen3.5-0.8B
+1     10,000 MiB              .../target_memory_10000MiB.../solution_0
+2     10,194 MiB              .../target_memory_10194_.../solution_0
+3     12,233 MiB              .../target_memory_12233_.../solution_0
+4     14,272 MiB              .../target_memory_14272_.../solution_0
+5     16,311 MiB              .../target_memory_16310_.../solution_0
+6     18,350 MiB              .../target_memory_18349_.../solution_0
+7     20,389 MiB              .../target_memory_20388_.../solution_0
+
+Usage: /puzzletron eval mmlu <index>
+       /puzzletron eval mmlu <index> --limit 10   (smoke test)
+```
+
+Claude then evaluates all 8 checkpoints sequentially in a background task (`/puzzletron eval mmlu 0` through `/puzzletron eval mmlu 7`). Results are saved next to each checkpoint at `<checkpoint>/eval_results/mmlu/`.
+
+> **User:** show eval progress
+
+```text
+/puzzletron eval progress
+```
+
+Example output mid-run:
+
+```text
+MMLU eval progress  (3/8 done)
+──────────────────────────────────────────────────────────────────
+  Status      Checkpoint       MMLU acc  Path
+──────────────────────────────────────────────────────────────────
+  [DONE]      teacher            0.5038  /workspace/hf_models/Qwen/Qwen3.5-0.8B
+  [DONE]      10,000 MiB         0.2365  .../target_memory_10000MiB.../solution_0
+  [DONE]      10,194 MiB         0.2417  .../target_memory_10194_.../solution_0
+  [RUNNING]   12,233 MiB            ...  .../target_memory_12233_.../solution_0
+  [ ]         14,272 MiB        pending
+  [ ]         16,311 MiB        pending
+  [ ]         18,350 MiB        pending
+  [ ]         20,389 MiB        pending
+──────────────────────────────────────────────────────────────────
+  Done:    3/8
+  Running: 12,233 MiB
+  Pending: 14,272 MiB, 16,311 MiB, 18,350 MiB, 20,389 MiB
+```
+
+> **User:** show both in one table
+
+Claude joins the sweep CSV with the per-checkpoint MMLU JSON results:
+
+```text
+  rate    target_mem    actual_mem     num_params   lm_loss   top_1   top_5  top_10    MMLU
+----------------------------------------------------------------------------------------------------
+  teacher                                                                               0.5038
+    0.50      10,194.4      10,143.3    888,813,280    3.2367  0.3663  0.6384  0.7251  0.2417
+    0.60      12,233.2      11,719.5    909,901,856    2.6377  0.4434  0.7198  0.7981  0.2446
+    0.70      14,272.1      14,083.8    941,534,720    1.8532  0.5855  0.8176  0.8735  0.2374
+    0.80      16,311.0      15,660.1    962,623,296    1.5385  0.6448  0.8576  0.9046  0.2636
+    0.90      18,349.9      18,024.4    994,256,160    1.2447  0.7064  0.8914  0.9278  0.3119
+    1.00      20,388.7      20,388.7  1,025,889,024    1.1067  0.7365  0.9079  0.9399  0.5038
+```
+
+**Key observations for Qwen3.5-0.8B:**
+
+- The 1.0 rate (20,389 MiB) matches teacher MMLU exactly (0.5038) — a useful sanity check.
+- At rates 0.5 through 0.8, MMLU stays between 0.24 and 0.26 (random chance for 4-choice questions is 0.25) even though token-level `top_1` rises steadily from 0.37 to 0.64. MMLU is a much stricter signal than token accuracy.
+- The 0.9 rate (18,350 MiB, ~90 % of teacher memory) is the only compressed model with any meaningful MMLU recovery (0.31).
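
The join behind the combined table in Step 7 might look like the sketch below. It is an illustration under assumptions, not the code Claude runs: `join_sweep_with_mmlu` is a hypothetical helper, the `target_mem` key is taken from the table header rather than from the actual CSV column names, and MMLU scores are assumed to be keyed by the whole-MiB labels that `eval list` prints.

```python
def join_sweep_with_mmlu(sweep_rows, mmlu_by_mib):
    """Attach an 'mmlu' value to each sweep row.

    sweep_rows: list of dicts with a 'target_mem' value in MiB (assumed key name).
    mmlu_by_mib: dict mapping whole-MiB targets (as shown by `eval list`) to accuracy.
    Rows without a finished eval get mmlu=None.
    """
    joined = []
    for row in sweep_rows:
        # Round to whole MiB so 18349.9 lines up with the "18,350 MiB" label.
        key = round(float(row["target_mem"]))
        joined.append({**row, "mmlu": mmlu_by_mib.get(key)})
    return joined


# Values copied from the tables in this step.
sweep = [
    {"rate": 0.50, "target_mem": 10194.4, "top_1": 0.3663},
    {"rate": 0.90, "target_mem": 18349.9, "top_1": 0.7064},
]
mmlu = {10194: 0.2417, 18350: 0.3119}
for row in join_sweep_with_mmlu(sweep, mmlu):
    print(f"{row['rate']:.2f}  {row['target_mem']:>10,.1f}  {row['mmlu']:.4f}")
```

Keying on the rounded target keeps the join independent of the long directory names, at the cost of assuming that no two targets round to the same MiB value.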
@@ -0,0 +1,138 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Generated with Claude Code
+"""List available checkpoints for MMLU eval.
+
+Usage:
+  python eval_list.py [puzzle_dir]
+
+If puzzle_dir is omitted, scans for puzzle_dir_* directories under the repo
+root. If exactly one is found it is used automatically; if multiple are found
+they are listed and the script exits asking the user to specify one.
+
+Output columns (padded with spaces):
+  #    Label        MMLU    Path
+
+MMLU column shows "done" if eval_results/mmlu/ already exists.
+"""
+
+import glob
+import os
+import re
+import sys
+
+
+def parse_memory_mib(dir_name):
+    """Parse memory in MiB from a target_memory_<val>MiB directory name.
+
+    Directory names encode decimal values with underscores replacing the decimal
+    point, e.g. target_memory_10194_364013671875MiB -> 10194.364013671875 MiB.
+    Round numbers like target_memory_10000MiB have no underscore.
+    """
+    m = re.search(r"target_memory_([\d_]+)MiB", dir_name)
+    if not m:
+        return None
+    parts = m.group(1).split("_", 1)
+    return float(parts[0]) if len(parts) == 1 else float(f"{parts[0]}.{parts[1]}")
+
+
+def find_teacher(puzzle_dir):
+    """Find the original HF model path from the YAML config matching puzzle_dir."""
+    puzzle_dir_abs = os.path.abspath(puzzle_dir)
+    puzzle_dir_name = os.path.basename(puzzle_dir_abs)
+    for yaml_path in glob.glob("examples/puzzletron/configs/**/*.yaml", recursive=True):
+        try:
+            # Use a context manager so the file handle is closed promptly.
+            with open(yaml_path) as f:
+                content = f.read()
+        except OSError:
+            continue
+        if puzzle_dir_abs in content or puzzle_dir_name in content:
+            m = re.search(r"^input_hf_model_path\s*:\s*(\S+)", content, re.MULTILINE)
+            if m:
+                return m.group(1)
+    return None
+
+
+def list_checkpoints(puzzle_dir):
+    """Print available checkpoints (teacher + sweep solutions) with MMLU eval status."""
+    teacher_path = find_teacher(puzzle_dir)
+
+    ckpt_dirs = sorted(
+        glob.glob(f"{puzzle_dir}/mip/puzzle_solutions/*/solutions--checkpoints/solution_0")
+    )
+
+    entries = []
+    if teacher_path:
+        entries.append(("teacher", teacher_path))
+    for ckpt in ckpt_dirs:
+        dir_name = ckpt.split("/mip/puzzle_solutions/")[1].split("/")[0]
+        mem = parse_memory_mib(dir_name)
+        label = f"{mem:,.0f} MiB" if mem is not None else dir_name
+        entries.append((label, ckpt))
+
+    if not entries:
+        print(f"No checkpoints found under {puzzle_dir}.")
+        sys.exit(1)
+
+    def has_mmlu(path):
+        return os.path.isdir(os.path.join(path, "eval_results", "mmlu"))
+
+    max_label = max(len(e[0]) for e in entries)
+    print(f"\n{'#':<4}  {'Label':<{max_label}}  {'MMLU':^6}  Path")
+    print("-" * (4 + 2 + max_label + 2 + 6 + 2 + 60))
+    for i, (label, path) in enumerate(entries):
+        eval_mark = "done" if has_mmlu(path) else ""
+        print(f"{i:<4}  {label:<{max_label}}  {eval_mark:^6}  {path}")
+    print()
+    print("Usage: /puzzletron eval mmlu <index>")
+    print("       /puzzletron eval mmlu <index> --limit 10   (smoke test)")
+
+
+# --- main ---
+
+if len(sys.argv) > 1:
+    puzzle_dir = sys.argv[1].rstrip("/")
+    if not os.path.isdir(puzzle_dir):
+        print(f"Directory not found: {puzzle_dir}")
+        sys.exit(1)
+    list_checkpoints(puzzle_dir)
+else:
+    # Auto-discover puzzle_dir_* in the current directory and its parent
+    candidates = sorted(glob.glob("puzzle_dir_*") + glob.glob("../puzzle_dir_*"))
+    # Also check /workspace (matches nothing when that directory is absent)
+    candidates += sorted(glob.glob("/workspace/puzzle_dir_*"))
+    # Deduplicate while preserving order
+    seen: set = set()
+    deduped = []
+    for c in candidates:
+        abs_c = os.path.abspath(c)
+        if abs_c not in seen:
+            seen.add(abs_c)
+            deduped.append(c)
+    candidates = deduped
+    candidates = [c for c in candidates if os.path.isdir(c)]
+
+    if len(candidates) == 1:
+        list_checkpoints(candidates[0])
+    elif len(candidates) == 0:
+        print("No puzzle_dir_* directories found. Please specify the path explicitly:")
+        print("  /puzzletron eval list <puzzle_dir>")
+        sys.exit(1)
+    else:
+        print("Multiple puzzle directories found. Please specify one:")
+        for i, c in enumerate(candidates):
+            print(f"  {i}  {c}")
+        print("\nUsage: /puzzletron eval list <puzzle_dir>")
+        sys.exit(1)
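
The directory-name decoding and label formatting in `eval_list.py` can be exercised in isolation. The sketch below copies `parse_memory_mib` from the script so that it runs standalone; `label_for` is a hypothetical wrapper around the same f-string that `list_checkpoints` uses.

```python
import re


def parse_memory_mib(dir_name):
    # Copied from eval_list.py: the first underscore stands in for the decimal point.
    m = re.search(r"target_memory_([\d_]+)MiB", dir_name)
    if not m:
        return None
    parts = m.group(1).split("_", 1)
    return float(parts[0]) if len(parts) == 1 else float(f"{parts[0]}.{parts[1]}")


def label_for(dir_name):
    # Hypothetical wrapper: whole-MiB label, or the raw name if parsing fails.
    mem = parse_memory_mib(dir_name)
    return f"{mem:,.0f} MiB" if mem is not None else dir_name


print(label_for("target_memory_10194_364013671875MiB"))
print(label_for("target_memory_10000MiB"))
```

Because labels are rounded to whole MiB, two targets that differ only in the fractional part would get the same label; the index column remains the unambiguous handle.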