From aa009fb7e1549ba1ab9489d4a36ceaf261c6f5bb Mon Sep 17 00:00:00 2001
From: Daniel Korzekwa <dkorzekwa@nvidia.com>
Date: Mon, 22 Jun 2026 06:04:36 -0700
Subject: [PATCH 1/3] add puzzletron eval mmlu command

Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
---
 .agents/skills/puzzletron/SKILL.md | 38 ++++++++++++++++++++++++++----
 1 file changed, 34 insertions(+), 4 deletions(-)
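
Reviewer note: the `eval mmlu` argument rules in this patch (optional `--limit`, `--batch_size` defaulting to 4) can be sketched as below. This is only an illustration of the substitution the skill text describes. The helper name `build_mmlu_command` is hypothetical and not part of the patch, and the sketch leaves out the `PYTHONPATH` prefix that the Bash command sets.

```python
import shlex


def build_mmlu_command(hf_model_path, limit=None, batch_size=4):
    """Return the argv list for the MMLU run described in SKILL.md.

    Hypothetical helper for illustration only; the skill itself asks the
    agent to substitute values into a Bash command.
    """
    argv = [
        "python", "examples/llm_eval/lm_eval_hf.py",
        "--model", "hf",
        "--model_args", f"pretrained={hf_model_path},dtype=bfloat16,parallelize=True",
        "--tasks", "mmlu",
        "--num_fewshot", "5",
        "--batch_size", str(batch_size),
    ]
    # --limit is omitted entirely unless the user supplied it.
    if limit is not None:
        argv += ["--limit", str(limit)]
    return argv


# Smoke-test style invocation: 10 examples per task, default batch size.
print(shlex.join(build_mmlu_command("/workspace/foo/bar", limit=10)))
```

Building an argv list rather than a single string keeps the optional flag handling explicit and avoids quoting mistakes if the model path contains unusual characters.
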
diff --git a/.agents/skills/puzzletron/SKILL.md b/.agents/skills/puzzletron/SKILL.md
index a18398d8fd7..eac1fba3d5e 100644
--- a/.agents/skills/puzzletron/SKILL.md
+++ b/.agents/skills/puzzletron/SKILL.md
@@ -1,6 +1,6 @@
 ---
 name: puzzletron
-description: "End-to-end workflow for model pruning and MIP-based optimization. Commands: mip, all, add-model. Usage: /puzzletron <command> [args]"
+description: "End-to-end workflow for model pruning and MIP-based optimization. Commands: mip, all, add-model, eval. Usage: /puzzletron <command> [args]"
 license: Apache-2.0
 ---
 
@@ -11,7 +11,7 @@ license: Apache-2.0
 **STEP 1 — Check args before doing anything else. This is MANDATORY.**
 
 - If args are **empty**, output the block below verbatim and **STOP immediately. Do NOT proceed to any command.**
-- If the first word of args does **not exactly match** `mip`, `all`, or `add-model`, output the block below verbatim and **STOP immediately. Do NOT proceed to any command.**
+- If the first word of args does **not exactly match** `mip`, `all`, `add-model`, or `eval`, output the block below verbatim and **STOP immediately. Do NOT proceed to any command.**
 
 ---
 
@@ -25,6 +25,7 @@ Available commands:
 - `all <nproc_per_node>` — Run the full Puzzletron pipeline (nproc_per_node: number of GPUs per node)
 - `all progress` — Show live full pipeline progress with timing summary
 - `add-model <hf_model_path>` — Implement descriptor, converter, and configs for an unsupported model
+- `eval mmlu <hf_model_path> [--limit <N>] [--batch_size <B>]` — Evaluate a checkpoint on MMLU (5-shot); add `--limit N` for a smoke test
 
 Usage: `/puzzletron <command> [args]`
 
@@ -48,7 +49,7 @@ Parse `nproc_per_node` from args using either positional or flag syntax:
 Run the following Bash command, substituting `<nproc_per_node>` with the parsed value:
 
 ```bash
-set -o pipefail && export PYTHONPATH=$PYTHONPATH:/workspace/Model-Optimizer && \
+set -o pipefail && export PYTHONPATH=$PYTHONPATH:. && \
 torchrun --nproc_per_node <nproc_per_node> examples/puzzletron/main.py \
   --config examples/puzzletron/configs/llama-3_1-8B_pruneffn_memory/llama-3_1-8B_pruneffn_memory.yaml \
   2>&1 | tee ./log.txt | grep "Puzzletron Progress"
@@ -82,7 +83,7 @@ Parse `nproc_per_node` from args using either positional or flag syntax:
 Run the following Bash command, substituting `<nproc_per_node>` with the parsed value:
 
 ```bash
-set -o pipefail && export PYTHONPATH=$PYTHONPATH:/workspace/Model-Optimizer && \
+set -o pipefail && export PYTHONPATH=$PYTHONPATH:. && \
 torchrun --nproc_per_node <nproc_per_node> examples/puzzletron/main.py \
   --config examples/puzzletron/configs/llama-3_1-8B_pruneffn_memory/llama-3_1-8B_pruneffn_memory.yaml \
   --mip-only 2>&1 | tee ./log.txt | grep "Puzzletron Progress"
@@ -114,6 +115,35 @@ Run the following Bash command. Present the output to the user wrapped in a fenc
 python3 .agents/skills/puzzletron/mip_sweep.py
 ```
 
+## Command: eval
+
+- If the second word is not exactly `mmlu`, tell the user: "Unknown eval sub-command. Available: `mmlu`." and **STOP**.
+
+### eval mmlu
+
+Parse args:
+- `hf_model_path` — third word (positional) or `--hf_model_path <value>`. If missing, ask: "Please provide the HuggingFace model path (local or hub)." and **STOP**.
+- `--limit <N>` — optional integer; omit the flag entirely if not provided.
+- `--batch_size <B>` — optional integer; default `4` if not provided.
+
+Run the following Bash command, substituting the parsed values:
+
+```bash
+PYTHONPATH=.:$PYTHONPATH python examples/llm_eval/lm_eval_hf.py \
+  --model hf \
+  --model_args pretrained=<hf_model_path>,dtype=bfloat16,parallelize=True \
+  --tasks mmlu \
+  --num_fewshot 5 \
+  --batch_size <batch_size> \
+  [--limit <N>]
+```
+
+(Replace `[--limit <N>]` with the actual `--limit <N>` flag when provided, or omit it.)
+
+Stream output to the user as it arrives. When the command finishes:
+- Report the exit code.
+- Show a results summary table with: model path, total questions evaluated, loglikelihood requests, and the mmlu/category accuracy scores parsed from the output.
+
 ## Command: add-model
 
 Parse `hf_model_path` from args (the second word). If missing, ask: "Please provide the HuggingFace model path (local or hub)." and **STOP**.

From 38b606d78679dd0721da2b6c737c2a85e22f910d Mon Sep 17 00:00:00 2001
From: Daniel Korzekwa <dkorzekwa@nvidia.com>
Date: Mon, 22 Jun 2026 07:07:47 -0700
Subject: [PATCH 2/3] Update puzzletron eval skill

Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
---
 .agents/skills/puzzletron/SKILL.md         |  48 ++++-
 .agents/skills/puzzletron/eval_list.py     | 138 +++++++++++++++
 .agents/skills/puzzletron/eval_progress.py | 196 +++++++++++++++++++++
 3 files changed, 378 insertions(+), 4 deletions(-)
 create mode 100644 .agents/skills/puzzletron/eval_list.py
 create mode 100644 .agents/skills/puzzletron/eval_progress.py
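
Reviewer note: `eval mmlu <index>` resolves the index from the table that `eval_list.py` prints. Because that output also contains a header, a divider, and usage lines, one way to do the lookup is to match on the `#` column rather than count lines. The sketch below assumes the row layout printed by `list_checkpoints` and assumes paths contain no spaces. `resolve_index` is a hypothetical name, and the second sample path is made up for the example.

```python
import re


def resolve_index(table_text, index):
    """Return the path in the row whose '#' column equals index, else None."""
    for line in table_text.splitlines():
        # Data rows start with the integer index followed by whitespace.
        m = re.match(r"(\d+)\s+\S", line)
        if m and int(m.group(1)) == index:
            # The path is the last column; assumes it contains no spaces.
            return line.rsplit(None, 1)[-1]
    return None


# Sample text shaped like eval_list.py output (second path is illustrative).
sample = "\n".join([
    "",
    "#     Label       MMLU    Path",
    "-" * 40,
    "0     teacher             /workspace/hf_models/Qwen/Qwen3.5-0.8B",
    "1     10,000 MiB   done   /tmp/example/solution_0",
    "",
    "Usage: /puzzletron eval mmlu <index>",
])
print(resolve_index(sample, 1))
```

Returning `None` for an unknown index maps directly onto the "out of range, tell the user and STOP" rule in the skill text.
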

diff --git a/.agents/skills/puzzletron/SKILL.md b/.agents/skills/puzzletron/SKILL.md
index eac1fba3d5e..48064e505b8 100644
--- a/.agents/skills/puzzletron/SKILL.md
+++ b/.agents/skills/puzzletron/SKILL.md
@@ -1,6 +1,6 @@
 ---
 name: puzzletron
-description: "End-to-end workflow for model pruning and MIP-based optimization. Commands: mip, all, add-model, eval. Usage: /puzzletron <command> [args]"
+description: "End-to-end workflow for model pruning and MIP-based optimization. Commands: mip, all, add-model, eval (list/mmlu). Usage: /puzzletron <command> [args]"
 license: Apache-2.0
 ---
 
@@ -25,7 +25,9 @@ Available commands:
 - `all <nproc_per_node>` — Run the full Puzzletron pipeline (nproc_per_node: number of GPUs per node)
 - `all progress` — Show live full pipeline progress with timing summary
 - `add-model <hf_model_path>` — Implement descriptor, converter, and configs for an unsupported model
-- `eval mmlu <hf_model_path> [--limit <N>] [--batch_size <B>]` — Evaluate a checkpoint on MMLU (5-shot); add `--limit N` for a smoke test
+- `eval list [puzzle_dir]` — List all available checkpoints (teacher + sweep solutions) with their index numbers; auto-discovers puzzle_dir if omitted
+- `eval progress [puzzle_dir]` — Show per-checkpoint MMLU eval status and accuracy; auto-discovers puzzle_dir if omitted
+- `eval mmlu <index|hf_model_path> [--puzzle_dir <dir>] [--limit <N>] [--batch_size <B>]` — Evaluate a checkpoint on MMLU (5-shot); pass index from `eval list` or a direct path; add `--limit N` for a smoke test
 
 Usage: `/puzzletron <command> [args]`
 
@@ -117,15 +119,44 @@ python3 .agents/skills/puzzletron/mip_sweep.py
 
 ## Command: eval
 
-- If the second word is not exactly `mmlu`, tell the user: "Unknown eval sub-command. Available: `mmlu`." and **STOP**.
+- If the second word is not exactly `list`, `progress`, or `mmlu`, tell the user: "Unknown eval sub-command. Available: `list`, `progress`, `mmlu`." and **STOP**.
+
+### eval list
+
+Parse `puzzle_dir` from args (third word or `--puzzle_dir <value>`). It is optional.
+
+Run the following Bash command, including `<puzzle_dir>` as an argument when provided, or omitting it to trigger auto-discovery:
+
+```bash
+python3 .agents/skills/puzzletron/eval_list.py [<puzzle_dir>]
+```
+
+Present the output to the user wrapped in a fenced code block (``` ... ```).
+
+### eval progress
+
+Parse `puzzle_dir` from args (third word or `--puzzle_dir <value>`). It is optional.
+
+Run the following Bash command, including `<puzzle_dir>` as an argument when provided, or omitting it to trigger auto-discovery:
+
+```bash
+python3 .agents/skills/puzzletron/eval_progress.py [<puzzle_dir>]
+```
+
+Present the output to the user wrapped in a fenced code block (``` ... ```).
 
 ### eval mmlu
 
 Parse args:
-- `hf_model_path` — third word (positional) or `--hf_model_path <value>`. If missing, ask: "Please provide the HuggingFace model path (local or hub)." and **STOP**.
+- `index_or_path` — third word. If missing, ask: "Please provide a checkpoint index (from `eval list`) or a direct HF model path." and **STOP**.
+- `--puzzle_dir <dir>` — optional; used when resolving an index.
+- If `index_or_path` matches `^[0-9]+$`, resolve it to a path by running `python3 .agents/skills/puzzletron/eval_list.py [<puzzle_dir>]` and taking the `Path` column of the row whose `#` column equals the index (indices are 0-based; ignore the header, divider, and usage lines). If no row has that index, tell the user and **STOP**.
+- Otherwise treat `index_or_path` as a literal `hf_model_path`.
 - `--limit <N>` — optional integer; omit the flag entirely if not provided.
 - `--batch_size <B>` — optional integer; default `4` if not provided.
 
+Derive `output_path` as `<hf_model_path>/eval_results/mmlu` (always; not user-configurable).
+
 Run the following Bash command, substituting the parsed values:
 
 ```bash
@@ -135,11 +166,20 @@ PYTHONPATH=.:$PYTHONPATH python examples/llm_eval/lm_eval_hf.py \
   --tasks mmlu \
   --num_fewshot 5 \
   --batch_size <batch_size> \
+  --output_path <hf_model_path>/eval_results/mmlu \
   [--limit <N>]
 ```
 
 (Replace `[--limit <N>]` with the actual `--limit <N>` flag when provided, or omit it.)
 
+**Note on output file location:** lm_eval does not write results directly into `--output_path`. It creates a subdirectory named after the full model path with `/` replaced by `__`, then writes `results_<timestamp>.json` inside it. For example, for a model at `/workspace/foo/bar`, results land at:
+
+```text
+<output_path>/__workspace__foo__bar/results_<timestamp>.json
+```
+
+`eval_progress.py` handles this automatically via a recursive glob.
+
 Stream output to the user as it arrives. When the command finishes:
 - Report the exit code.
 - Show a results summary table with: model path, total questions evaluated, loglikelihood requests, and the mmlu/category accuracy scores parsed from the output.
diff --git a/.agents/skills/puzzletron/eval_list.py b/.agents/skills/puzzletron/eval_list.py
new file mode 100644
index 00000000000..eab99c93524
--- /dev/null
+++ b/.agents/skills/puzzletron/eval_list.py
@@ -0,0 +1,138 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Generated with Claude Code
+"""List available checkpoints for MMLU eval.
+
+Usage:
+  python eval_list.py [puzzle_dir]
+
+If puzzle_dir is omitted, scans for puzzle_dir_* directories in the current
+directory, its parent, and /workspace. If exactly one is found it is used
+automatically; if several are found, they are listed and the user must pick one.
+
+Output columns (tab-separated):
+  #    Label        MMLU    Path
+
+MMLU column shows "done" if eval_results/mmlu/ exists (the directory alone; results are not checked).
+"""
+
+import glob
+import os
+import re
+import sys
+
+
+def parse_memory_mib(dir_name):
+    """Parse memory in MiB from a target_memory_<val>MiB directory name.
+
+    Directory names encode decimal values with underscores replacing the decimal
+    point, e.g. target_memory_10194_364013671875MiB -> 10194.364013671875 MiB.
+    Round numbers like target_memory_10000MiB have no underscore.
+    """
+    m = re.search(r"target_memory_([\d_]+)MiB", dir_name)
+    if not m:
+        return None
+    parts = m.group(1).split("_", 1)
+    return float(parts[0]) if len(parts) == 1 else float(f"{parts[0]}.{parts[1]}")
+
+
+def find_teacher(puzzle_dir):
+    """Find the original HF model path from the YAML config matching puzzle_dir."""
+    puzzle_dir_abs = os.path.abspath(puzzle_dir)
+    puzzle_dir_name = os.path.basename(puzzle_dir_abs)
+    for yaml_path in glob.glob("examples/puzzletron/configs/**/*.yaml", recursive=True):
+        try:
+            content = open(yaml_path).read()
+        except OSError:
+            continue
+        if puzzle_dir_abs in content or puzzle_dir_name in content:
+            m = re.search(r"^input_hf_model_path\s*:\s*(\S+)", content, re.MULTILINE)
+            if m:
+                return m.group(1)
+    return None
+
+
+def list_checkpoints(puzzle_dir):
+    """Print available checkpoints (teacher + sweep solutions) with MMLU eval status."""
+    teacher_path = find_teacher(puzzle_dir)
+
+    ckpt_dirs = sorted(
+        glob.glob(f"{puzzle_dir}/mip/puzzle_solutions/*/solutions--checkpoints/solution_0")
+    )
+
+    entries = []
+    if teacher_path:
+        entries.append(("teacher", teacher_path))
+    for ckpt in ckpt_dirs:
+        dir_name = ckpt.split("/mip/puzzle_solutions/")[1].split("/")[0]
+        mem = parse_memory_mib(dir_name)
+        label = f"{mem:,.0f} MiB" if mem is not None else dir_name
+        entries.append((label, ckpt))
+
+    if not entries:
+        print(f"No checkpoints found under {puzzle_dir}.")
+        sys.exit(1)
+
+    def has_mmlu(path):
+        return os.path.isdir(os.path.join(path, "eval_results", "mmlu"))
+
+    max_label = max(len(e[0]) for e in entries)
+    print(f"\n{'#':<4}  {'Label':<{max_label}}  {'MMLU':^6}  Path")
+    print("-" * (4 + 2 + max_label + 2 + 6 + 2 + 60))
+    for i, (label, path) in enumerate(entries):
+        eval_mark = "done" if has_mmlu(path) else ""
+        print(f"{i:<4}  {label:<{max_label}}  {eval_mark:^6}  {path}")
+    print()
+    print("Usage: /puzzletron eval mmlu <index>")
+    print("       /puzzletron eval mmlu <index> --limit 10   (smoke test)")
+
+
+# --- main ---
+
+if len(sys.argv) > 1:
+    puzzle_dir = sys.argv[1].rstrip("/")
+    if not os.path.isdir(puzzle_dir):
+        print(f"Directory not found: {puzzle_dir}")
+        sys.exit(1)
+    list_checkpoints(puzzle_dir)
+else:
+    # Auto-discover puzzle_dir_* in the current directory and its parent
+    candidates = sorted(glob.glob("puzzle_dir_*") + glob.glob("../puzzle_dir_*"))
+    # Also check /workspace; the glob is simply empty if it does not exist
+    candidates += sorted(glob.glob("/workspace/puzzle_dir_*"))
+    # Deduplicate while preserving order
+    seen: set = set()
+    deduped = []
+    for c in candidates:
+        abs_c = os.path.abspath(c)
+        if abs_c not in seen:
+            seen.add(abs_c)
+            deduped.append(c)
+    candidates = deduped
+    candidates = [c for c in candidates if os.path.isdir(c)]
+
+    if len(candidates) == 1:
+        list_checkpoints(candidates[0])
+    elif len(candidates) == 0:
+        print("No puzzle_dir_* directories found. Please specify the path explicitly:")
+        print("  /puzzletron eval list <puzzle_dir>")
+        sys.exit(1)
+    else:
+        print("Multiple puzzle directories found. Please specify one:")
+        for i, c in enumerate(candidates):
+            print(f"  {i}  {c}")
+        print("\nUsage: /puzzletron eval list <puzzle_dir>")
+        sys.exit(1)
diff --git a/.agents/skills/puzzletron/eval_progress.py b/.agents/skills/puzzletron/eval_progress.py
new file mode 100644
index 00000000000..fa56b5834ba
--- /dev/null
+++ b/.agents/skills/puzzletron/eval_progress.py
@@ -0,0 +1,196 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Generated with Claude Code
+"""Progress report for /puzzletron eval mmlu (all checkpoints).
+
+Usage:
+  python eval_progress.py [puzzle_dir]
+
+Rebuilds the same checkpoint list as eval_list.py, then checks for
+eval_results/mmlu/ and its JSON results to report status and accuracy.
+"""
+
+import contextlib
+import glob
+import json
+import os
+import re
+import subprocess  # nosec B404
+import sys
+
+SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
+
+
+def parse_memory_mib(dir_name):
+    """Parse memory in MiB from a target_memory_<val>MiB directory name."""
+    m = re.search(r"target_memory_([\d_]+)MiB", dir_name)
+    if not m:
+        return None
+    parts = m.group(1).split("_", 1)
+    return float(parts[0]) if len(parts) == 1 else float(f"{parts[0]}.{parts[1]}")
+
+
+def find_teacher(puzzle_dir):
+    """Find the original HF model path from the YAML config matching puzzle_dir."""
+    puzzle_dir_abs = os.path.abspath(puzzle_dir)
+    puzzle_dir_name = os.path.basename(puzzle_dir_abs)
+    for yaml_path in glob.glob("examples/puzzletron/configs/**/*.yaml", recursive=True):
+        try:
+            content = open(yaml_path).read()
+        except OSError:
+            continue
+        if puzzle_dir_abs in content or puzzle_dir_name in content:
+            m = re.search(r"^input_hf_model_path\s*:\s*(\S+)", content, re.MULTILINE)
+            if m:
+                return m.group(1)
+    return None
+
+
+def get_checkpoints(puzzle_dir):
+    """Return list of (label, path) for teacher + all sweep solution checkpoints."""
+    teacher_path = find_teacher(puzzle_dir)
+    ckpt_dirs = sorted(
+        glob.glob(f"{puzzle_dir}/mip/puzzle_solutions/*/solutions--checkpoints/solution_0")
+    )
+    entries = []
+    if teacher_path:
+        entries.append(("teacher", teacher_path))
+    for ckpt in ckpt_dirs:
+        dir_name = ckpt.split("/mip/puzzle_solutions/")[1].split("/")[0]
+        mem = parse_memory_mib(dir_name)
+        label = f"{mem:,.0f} MiB" if mem is not None else dir_name
+        entries.append((label, ckpt))
+    return entries
+
+
+def get_running_checkpoint():
+    """Return the checkpoint path currently being evaluated by lm_eval, or None."""
+    try:
+        result = subprocess.run(  # nosec B603 B607
+            ["ps", "-ww", "aux"], capture_output=True, text=True
+        )
+        for line in result.stdout.splitlines():
+            if "lm_eval_hf.py" in line and "pretrained=" in line:
+                m = re.search(r"pretrained=(/[^,\s]+)", line)
+                if m:
+                    return m.group(1)
+    except Exception:
+        pass
+    return None
+
+
+def get_mmlu_accuracy(path):
+    """Return overall MMLU accuracy from saved JSON results, or None if not done."""
+    results_dir = os.path.join(path, "eval_results", "mmlu")
+    if not os.path.isdir(results_dir):
+        return None
+    # lm_eval saves results under a subdirectory named after the model path
+    # (slashes replaced by __), then results_<timestamp>.json inside it
+    for fname in glob.glob(f"{results_dir}/**/results_*.json", recursive=True):
+        data = None
+        with contextlib.suppress(OSError, json.JSONDecodeError, KeyError):
+            data = json.load(open(fname))
+        if data is not None:
+            results = data.get("results", {})
+            if "mmlu" in results:
+                return results["mmlu"].get("acc,none", results["mmlu"].get("acc"))
+    # results dir exists but no readable JSON yet — still running
+    return "running"
+
+
+def fmt_acc(acc):
+    """Format accuracy value for display."""
+    if acc is None:
+        return ""
+    if acc == "running":
+        return "running"
+    return f"{acc:.4f}"
+
+
+# --- main ---
+
+puzzle_dir = sys.argv[1].rstrip("/") if len(sys.argv) > 1 else None
+
+if puzzle_dir is None:
+    candidates = sorted(glob.glob("puzzle_dir_*") + glob.glob("../puzzle_dir_*"))
+    candidates += sorted(glob.glob("/workspace/puzzle_dir_*"))
+    seen: set = set()
+    deduped = []
+    for c in candidates:
+        abs_c = os.path.abspath(c)
+        if abs_c not in seen:
+            seen.add(abs_c)
+            deduped.append(c)
+    candidates = [c for c in deduped if os.path.isdir(c)]
+    if len(candidates) == 1:
+        puzzle_dir = candidates[0]
+    elif len(candidates) == 0:
+        print("No puzzle_dir_* found. Specify: /puzzletron eval progress <puzzle_dir>")
+        sys.exit(1)
+    else:
+        print("Multiple puzzle directories found. Please specify one:")
+        for i, c in enumerate(candidates):
+            print(f"  {i}  {c}")
+        print("\nUsage: /puzzletron eval progress <puzzle_dir>")
+        sys.exit(1)
+
+entries = get_checkpoints(puzzle_dir)
+if not entries:
+    print(f"No checkpoints found under {puzzle_dir}.")
+    sys.exit(1)
+
+running_ckpt = get_running_checkpoint()
+
+done = []
+running = []
+pending = []
+for label, path in entries:
+    acc = get_mmlu_accuracy(path)
+    if acc not in (None, "running"):
+        done.append((label, path))
+    elif acc == "running" or (
+        running_ckpt and os.path.abspath(path) == os.path.abspath(running_ckpt)
+    ):
+        running.append((label, path))
+    else:
+        pending.append((label, path))
+
+DIV = "─" * 66
+print(f"\nMMLU eval progress  ({len(done)}/{len(entries)} done)")
+print(DIV)
+print(f"  {'Status':<10}  {'Checkpoint':<14}  {'MMLU acc':>9}  Path")
+print(DIV)
+for label, path in entries:
+    acc = get_mmlu_accuracy(path)
+    is_running = (acc == "running") or (
+        running_ckpt and os.path.abspath(path) == os.path.abspath(running_ckpt)
+    )
+    if acc not in (None, "running"):
+        status = "[DONE]"
+        acc_str = f"{acc:.4f}"
+    elif is_running:
+        status = "[RUNNING]"
+        acc_str = "..."
+    else:
+        status = "[ ]"
+        acc_str = "pending"
+    print(f"  {status:<10}  {label:<14}  {acc_str:>9}  {path}")
+print(DIV)
+print(f"  Done:    {len(done)}/{len(entries)}")
+if running:
+    print(f"  Running: {', '.join(lbl for lbl, _ in running)}")
+if pending:
+    print(f"  Pending: {', '.join(lbl for lbl, _ in pending)}")

From c1fbcc432172145351a084b6122907fdfa7ec7e6 Mon Sep 17 00:00:00 2001
From: Daniel Korzekwa <dkorzekwa@nvidia.com>
Date: Mon, 22 Jun 2026 08:16:27 -0700
Subject: [PATCH 3/3] add puzzletron eval to the tutorial

Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
---
 .../puzzletron/adding_new_model_tutorial.md   | 83 +++++++++++++++++++
 1 file changed, 83 insertions(+)
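
Reviewer note: the tutorial's combined table joins sweep rows with per-checkpoint MMLU scores. A minimal sketch of one possible join key is the label format `eval_list.py` already uses (`f"{mem:,.0f} MiB"`). The dictionary keys `rate`, `target_mem`, and `lm_loss` mirror the tutorial's table headers but are assumptions about how the CSV would be loaded, and both function names are hypothetical. The sample values are copied from the tutorial tables.

```python
def label_for(target_memory_mib):
    """Format a memory target the way eval_list.py labels checkpoints."""
    return f"{target_memory_mib:,.0f} MiB"


def join_sweep_with_mmlu(sweep_rows, mmlu_by_label):
    """Attach an 'mmlu' value (or None if not evaluated) to each sweep row."""
    joined = []
    for row in sweep_rows:
        label = label_for(row["target_mem"])
        joined.append({**row, "mmlu": mmlu_by_label.get(label)})
    return joined


# Two rows taken from the tutorial's example tables.
sweep_rows = [
    {"rate": 0.50, "target_mem": 10194.4, "lm_loss": 3.2367},
    {"rate": 1.00, "target_mem": 20388.7, "lm_loss": 1.1067},
]
mmlu_by_label = {"10,194 MiB": 0.2417, "20,389 MiB": 0.5038}

for row in join_sweep_with_mmlu(sweep_rows, mmlu_by_label):
    print(row["rate"], row["target_mem"], row["mmlu"])
```

Joining on the rounded label is convenient but lossy: two targets that round to the same whole MiB would collide, so a real implementation might prefer to key on the checkpoint directory name instead.
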

diff --git a/.agents/skills/puzzletron/adding_new_model_tutorial.md b/.agents/skills/puzzletron/adding_new_model_tutorial.md
index 275380297a0..cf5fe6acd10 100644
--- a/.agents/skills/puzzletron/adding_new_model_tutorial.md
+++ b/.agents/skills/puzzletron/adding_new_model_tutorial.md
@@ -207,3 +207,86 @@ Results from: /workspace/puzzle_dir_qwen3_5-0.8b/mip_sweep_results.csv
 ```
 
 Use this table to pick the compression rate that best meets your accuracy/memory budget.
+
+---
+
+### Step 7: Run MMLU eval on all checkpoints
+
+After the sweep you have 7 compressed checkpoints plus the teacher. Evaluate all of them on MMLU (5-shot) to get an external benchmark score alongside the internal sweep losses.
+
+> **User:** evaluate all checkpoints on MMLU
+
+Claude first lists the available checkpoints:
+
+```text
+/puzzletron eval list
+```
+
+Example output:
+
+```text
+#     Label           MMLU    Path
+--------------------------------------------------------------------------------------------
+0     teacher                 /workspace/hf_models/Qwen/Qwen3.5-0.8B
+1     10,000 MiB              .../target_memory_10000MiB.../solution_0
+2     10,194 MiB              .../target_memory_10194_.../solution_0
+3     12,233 MiB              .../target_memory_12233_.../solution_0
+4     14,272 MiB              .../target_memory_14272_.../solution_0
+5     16,311 MiB              .../target_memory_16310_.../solution_0
+6     18,350 MiB              .../target_memory_18349_.../solution_0
+7     20,389 MiB              .../target_memory_20388_.../solution_0
+
+Usage: /puzzletron eval mmlu <index>
+       /puzzletron eval mmlu <index> --limit 10   (smoke test)
+```
+
+Claude then runs all 8 checkpoints sequentially in a background task (`/puzzletron eval mmlu 0` through `7`). Results are saved next to each checkpoint at `<checkpoint>/eval_results/mmlu/`.
+
+> **User:** show eval progress
+
+```text
+/puzzletron eval progress
+```
+
+Example output mid-run:
+
+```text
+MMLU eval progress  (3/8 done)
+──────────────────────────────────────────────────────────────────
+  Status      Checkpoint       MMLU acc  Path
+──────────────────────────────────────────────────────────────────
+  [DONE]      teacher            0.5038  /workspace/hf_models/Qwen/Qwen3.5-0.8B
+  [DONE]      10,000 MiB         0.2365  .../target_memory_10000MiB.../solution_0
+  [DONE]      10,194 MiB         0.2417  .../target_memory_10194_.../solution_0
+  [RUNNING]   12,233 MiB            ...  .../target_memory_12233_.../solution_0
+  [ ]         14,272 MiB        pending
+  [ ]         16,311 MiB        pending
+  [ ]         18,350 MiB        pending
+  [ ]         20,389 MiB        pending
+──────────────────────────────────────────────────────────────────
+  Done:    3/8
+  Running: 12,233 MiB
+  Pending: 14,272 MiB, 16,311 MiB, 18,350 MiB, 20,389 MiB
+```
+
+> **User:** show both in one table
+
+Claude joins the sweep CSV with the per-checkpoint MMLU JSON results:
+
+```text
+  rate    target_mem    actual_mem     num_params   lm_loss   top_1   top_5  top_10    MMLU
+----------------------------------------------------------------------------------------------------
+  teacher                                                                               0.5038
+    0.50      10,194.4      10,143.3    888,813,280    3.2367  0.3663  0.6384  0.7251  0.2417
+    0.60      12,233.2      11,719.5    909,901,856    2.6377  0.4434  0.7198  0.7981  0.2446
+    0.70      14,272.1      14,083.8    941,534,720    1.8532  0.5855  0.8176  0.8735  0.2374
+    0.80      16,311.0      15,660.1    962,623,296    1.5385  0.6448  0.8576  0.9046  0.2636
+    0.90      18,349.9      18,024.4    994,256,160    1.2447  0.7064  0.8914  0.9278  0.3119
+    1.00      20,388.7      20,388.7  1,025,889,024    1.1067  0.7365  0.9079  0.9399  0.5038
+```
+
+**Key observations for Qwen3.5-0.8B:**
+
+- The 1.0 rate (20,389 MiB) matches teacher MMLU exactly (0.5038) — a useful sanity check.
+- At rates 0.5 to 0.8, MMLU stays between 0.24 and 0.26 (random chance for 4-choice questions is 0.25) even though token-level `top_1` rises steadily from 0.37 to 0.64. MMLU is a much stricter signal than token accuracy.
+- The 0.9 rate (18,350 MiB, ~90% of teacher memory) is the only compressed model with meaningful MMLU recovery (0.31).