78 changes: 74 additions & 4 deletions .agents/skills/puzzletron/SKILL.md
@@ -1,6 +1,6 @@
---
name: puzzletron
description: "End-to-end workflow for model pruning and MIP-based optimization. Commands: mip, all, add-model. Usage: /puzzletron <command> [args]"
description: "End-to-end workflow for model pruning and MIP-based optimization. Commands: mip, all, add-model, eval (list/mmlu). Usage: /puzzletron <command> [args]"
license: Apache-2.0
---

@@ -11,7 +11,7 @@ license: Apache-2.0
**STEP 1 — Check args before doing anything else. This is MANDATORY.**

- If args are **empty**, output the block below verbatim and **STOP immediately. Do NOT proceed to any command.**
- If the first word of args does **not exactly match** `mip`, `all`, or `add-model`, output the block below verbatim and **STOP immediately. Do NOT proceed to any command.**
- If the first word of args does **not exactly match** `mip`, `all`, `add-model`, or `eval`, output the block below verbatim and **STOP immediately. Do NOT proceed to any command.**

---

@@ -25,6 +25,9 @@ Available commands:
- `all <nproc_per_node>` — Run the full Puzzletron pipeline (nproc_per_node: number of GPUs per node)
- `all progress` — Show live full pipeline progress with timing summary
- `add-model <hf_model_path>` — Implement descriptor, converter, and configs for an unsupported model
- `eval list [puzzle_dir]` — List all available checkpoints (teacher + sweep solutions) with their index numbers; auto-discovers puzzle_dir if omitted
- `eval progress [puzzle_dir]` — Show per-checkpoint MMLU eval status and accuracy; auto-discovers puzzle_dir if omitted
- `eval mmlu <index|hf_model_path> [--puzzle_dir <dir>] [--limit <N>] [--batch_size <B>]` — Evaluate a checkpoint on MMLU (5-shot); pass index from `eval list` or a direct path; add `--limit N` for a smoke test

Usage: `/puzzletron <command> [args]`

@@ -48,7 +51,7 @@ Parse `nproc_per_node` from args using either positional or flag syntax:
Run the following Bash command, substituting `<nproc_per_node>` with the parsed value:

```bash
set -o pipefail && export PYTHONPATH=$PYTHONPATH:/workspace/Model-Optimizer && \
set -o pipefail && export PYTHONPATH=$PYTHONPATH:. && \
torchrun --nproc_per_node <nproc_per_node> examples/puzzletron/main.py \
--config examples/puzzletron/configs/llama-3_1-8B_pruneffn_memory/llama-3_1-8B_pruneffn_memory.yaml \
2>&1 | tee ./log.txt | grep "Puzzletron Progress"
@@ -82,7 +85,7 @@ Parse `nproc_per_node` from args using either positional or flag syntax:
Run the following Bash command, substituting `<nproc_per_node>` with the parsed value:

```bash
set -o pipefail && export PYTHONPATH=$PYTHONPATH:/workspace/Model-Optimizer && \
set -o pipefail && export PYTHONPATH=$PYTHONPATH:. && \
torchrun --nproc_per_node <nproc_per_node> examples/puzzletron/main.py \
--config examples/puzzletron/configs/llama-3_1-8B_pruneffn_memory/llama-3_1-8B_pruneffn_memory.yaml \
--mip-only 2>&1 | tee ./log.txt | grep "Puzzletron Progress"
@@ -114,6 +117,73 @@ Run the following Bash command. Present the output to the user wrapped in a fenc
python3 .agents/skills/puzzletron/mip_sweep.py
```

## Command: eval

- If the second word is not exactly `list`, `progress`, or `mmlu`, tell the user: "Unknown eval sub-command. Available: `list`, `progress`, `mmlu`." and **STOP**.

### eval list

Parse `puzzle_dir` from args (third word or `--puzzle_dir <value>`). It is optional.

Run the following Bash command, including `<puzzle_dir>` as an argument when provided, or omitting it to trigger auto-discovery:

```bash
python3 .agents/skills/puzzletron/eval_list.py [<puzzle_dir>]
```

Present the output to the user wrapped in a fenced code block (``` ... ```).

### eval progress

Parse `puzzle_dir` from args (third word or `--puzzle_dir <value>`). It is optional.

Run the following Bash command, including `<puzzle_dir>` as an argument when provided, or omitting it to trigger auto-discovery:

```bash
python3 .agents/skills/puzzletron/eval_progress.py [<puzzle_dir>]
```

Present the output to the user wrapped in a fenced code block (``` ... ```).

### eval mmlu

Parse args:
- `index_or_path` — third word. If missing, ask: "Please provide a checkpoint index (from `eval list`) or a direct HF model path." and **STOP**.
- `--puzzle_dir <dir>` — optional; used when resolving an index.
- If `index_or_path` matches `^[0-9]+$`, resolve it to a path by running `python3 .agents/skills/puzzletron/eval_list.py [<puzzle_dir>]` and taking the path from the row whose `#` column equals the index (indices are 0-based; ignore the header, separator, and usage lines). If no row has that index, tell the user the index is out of range and **STOP**.
- Otherwise treat `index_or_path` as a literal `hf_model_path`.
- `--limit <N>` — optional integer; omit the flag entirely if not provided.
- `--batch_size <B>` — optional integer; default `4` if not provided.

Derive `output_path` as `<hf_model_path>/eval_results/mmlu` (always; not user-configurable).
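
The index resolution described above can be sketched in Python. This is a minimal illustration, not part of the skill: the helper name `resolve_checkpoint` is hypothetical, and it assumes checkpoint paths contain no spaces and that data rows of `eval_list.py` start with an integer index and end with the path.

```python
import re


def resolve_checkpoint(listing: str, index_or_path: str) -> str:
    """Map an index from the `eval list` output to its path.

    Hypothetical helper: `listing` is the captured stdout of eval_list.py.
    Anything that is not a plain non-negative integer is returned unchanged
    and treated as a literal hf_model_path.
    """
    if not re.fullmatch(r"[0-9]+", index_or_path):
        return index_or_path

    wanted = int(index_or_path)
    for line in listing.splitlines():
        fields = line.split()
        # Data rows look like: "<index>  <label...>  [done]  <path>".
        # The header starts with "#", so it never matches the integer test.
        if len(fields) >= 2 and re.fullmatch(r"[0-9]+", fields[0]):
            if int(fields[0]) == wanted:
                return fields[-1]  # assumes the path has no spaces
    raise IndexError(f"checkpoint index {wanted} is out of range")
```

Matching on the `#` column rather than counting output lines keeps the lookup correct even though the listing also prints a header, a separator, and usage hints.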

Run the following Bash command, substituting the parsed values:

```bash
PYTHONPATH=.:$PYTHONPATH python examples/llm_eval/lm_eval_hf.py \
--model hf \
--model_args pretrained=<hf_model_path>,dtype=bfloat16,parallelize=True \
--tasks mmlu \
--num_fewshot 5 \
--batch_size <batch_size> \
--output_path <hf_model_path>/eval_results/mmlu \
[--limit <N>]
```

(Replace `[--limit <N>]` with the actual `--limit <N>` flag when provided, or omit it.)

**Note on output file location:** lm_eval does not write results directly into `--output_path`. It creates a subdirectory named after the full model path with `/` replaced by `__`, then writes `results_<timestamp>.json` inside it. For example, for a model at `/workspace/foo/bar`, results land at:

```text
<output_path>/__workspace__foo__bar/results_<timestamp>.json
```

`eval_progress.py` handles this automatically via a recursive glob.
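
As an illustration of the recursive-glob lookup, here is a minimal sketch. `eval_progress.py` is not shown in this diff, so this is not its implementation; the function name `find_latest_results` is hypothetical, and picking the newest file by modification time is an assumption about how repeated runs should be handled.

```python
import glob
import os
from typing import Optional


def find_latest_results(output_path: str) -> Optional[str]:
    """Return the newest results_*.json anywhere below output_path, or None."""
    # "**" with recursive=True matches zero or more nested directories, so this
    # finds the file inside the model-named subdirectory without knowing it.
    pattern = os.path.join(output_path, "**", "results_*.json")
    matches = glob.glob(pattern, recursive=True)
    if not matches:
        return None
    return max(matches, key=os.path.getmtime)
```

Calling `find_latest_results("<hf_model_path>/eval_results/mmlu")` avoids having to reconstruct the `__`-joined subdirectory name by hand.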

Stream output to the user as it arrives. When the command finishes:
- Report the exit code.
- Show a results summary table with: model path, total questions evaluated, loglikelihood requests, and the mmlu/category accuracy scores parsed from the output.

## Command: add-model

Parse `hf_model_path` from args (the second word). If missing, ask: "Please provide the HuggingFace model path (local or hub)." and **STOP**.
83 changes: 83 additions & 0 deletions .agents/skills/puzzletron/adding_new_model_tutorial.md
@@ -207,3 +207,86 @@ Results from: /workspace/puzzle_dir_qwen3_5-0.8b/mip_sweep_results.csv
```

Use this table to pick the compression rate that best meets your accuracy/memory budget.

---

### Step 7: Run MMLU eval on all checkpoints

After the sweep you have 7 compressed checkpoints plus the teacher. Evaluate all of them on MMLU (5-shot) to get an external benchmark score alongside the internal sweep losses.

> **User:** evaluate all checkpoints on MMLU

Claude first lists the available checkpoints:

```text
/puzzletron eval list
```

Example output:

```text
#     Label        MMLU   Path
--------------------------------------------------------------------------------------------
0     teacher             /workspace/hf_models/Qwen/Qwen3.5-0.8B
1     10,000 MiB          .../target_memory_10000MiB.../solution_0
2     10,194 MiB          .../target_memory_10194_.../solution_0
3     12,233 MiB          .../target_memory_12233_.../solution_0
4     14,272 MiB          .../target_memory_14272_.../solution_0
5     16,311 MiB          .../target_memory_16310_.../solution_0
6     18,350 MiB          .../target_memory_18349_.../solution_0
7     20,389 MiB          .../target_memory_20388_.../solution_0

Usage: /puzzletron eval mmlu <index>
       /puzzletron eval mmlu <index> --limit 10 (smoke test)
```

Claude then evaluates all 8 checkpoints sequentially in a background task (`/puzzletron eval mmlu 0` through `/puzzletron eval mmlu 7`). Results are saved next to each checkpoint at `<checkpoint>/eval_results/mmlu/`.
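
For readers who want to reproduce the sequential run outside the skill, the following sketch builds the same `lm_eval_hf.py` command that the `eval mmlu` section of `SKILL.md` documents and runs it once per checkpoint. The helper names `build_mmlu_command` and `run_all` are hypothetical; the flags and defaults are copied from the skill, and the script is assumed to be launched from the repository root.

```python
import os
import subprocess


def build_mmlu_command(hf_model_path, batch_size=4, limit=None):
    """Build the argv list for one MMLU (5-shot) evaluation."""
    cmd = [
        "python", "examples/llm_eval/lm_eval_hf.py",
        "--model", "hf",
        "--model_args", f"pretrained={hf_model_path},dtype=bfloat16,parallelize=True",
        "--tasks", "mmlu",
        "--num_fewshot", "5",
        "--batch_size", str(batch_size),
        "--output_path", f"{hf_model_path}/eval_results/mmlu",
    ]
    if limit is not None:  # omit the flag entirely when no limit is given
        cmd += ["--limit", str(limit)]
    return cmd


def run_all(checkpoint_paths, **kwargs):
    """Evaluate checkpoints one after another; return {path: exit code}."""
    env = dict(os.environ)
    # Mirror `PYTHONPATH=.:$PYTHONPATH` from the skill's command.
    parts = [".", env.get("PYTHONPATH", "")]
    env["PYTHONPATH"] = os.pathsep.join(p for p in parts if p)
    codes = {}
    for path in checkpoint_paths:
        codes[path] = subprocess.run(build_mmlu_command(path, **kwargs), env=env).returncode
    return codes
```

Passing `limit=10` to `run_all` gives the smoke-test variant before committing to a full run over all checkpoints.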

> **User:** show eval progress

```text
/puzzletron eval progress
```

Example output mid-run:

```text
MMLU eval progress (3/8 done)
──────────────────────────────────────────────────────────────────
Status     Checkpoint  MMLU acc  Path
──────────────────────────────────────────────────────────────────
[DONE]     teacher     0.5038    /workspace/hf_models/Qwen/Qwen3.5-0.8B
[DONE]     10,000 MiB  0.2365    .../target_memory_10000MiB.../solution_0
[DONE]     10,194 MiB  0.2417    .../target_memory_10194_.../solution_0
[RUNNING]  12,233 MiB  ...       .../target_memory_12233_.../solution_0
[ ]        14,272 MiB  pending
[ ]        16,311 MiB  pending
[ ]        18,350 MiB  pending
[ ]        20,389 MiB  pending
──────────────────────────────────────────────────────────────────
Done: 3/8
Running: 12,233 MiB
Pending: 14,272 MiB, 16,311 MiB, 18,350 MiB, 20,389 MiB
```

> **User:** show both in one table

Claude joins the sweep CSV with the per-checkpoint MMLU JSON results:

```text
rate     target_mem  actual_mem     num_params  lm_loss   top_1   top_5  top_10    MMLU
----------------------------------------------------------------------------------------------------
teacher                                                                          0.5038
0.50       10,194.4    10,143.3    888,813,280   3.2367  0.3663  0.6384  0.7251  0.2417
0.60       12,233.2    11,719.5    909,901,856   2.6377  0.4434  0.7198  0.7981  0.2446
0.70       14,272.1    14,083.8    941,534,720   1.8532  0.5855  0.8176  0.8735  0.2374
0.80       16,311.0    15,660.1    962,623,296   1.5385  0.6448  0.8576  0.9046  0.2636
0.90       18,349.9    18,024.4    994,256,160   1.2447  0.7064  0.8914  0.9278  0.3119
1.00       20,388.7    20,388.7  1,025,889,024   1.1067  0.7365  0.9079  0.9399  0.5038
```
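
A join like the one above could be sketched as follows. Everything here is illustrative: the CSV column name `target_mem` is taken from the table header and may differ in the real `mip_sweep_results.csv`, the `results["mmlu"]["acc,none"]` key is an assumption about the lm_eval results JSON layout, and both function names are hypothetical.

```python
import csv
import io
import json


def mmlu_accuracy(results_json_text):
    """Extract the aggregate MMLU accuracy from an lm_eval results file.

    Assumes the JSON layout {"results": {"mmlu": {"acc,none": <float>}}}.
    """
    data = json.loads(results_json_text)
    return data["results"]["mmlu"]["acc,none"]


def join_sweep_with_mmlu(sweep_csv_text, mmlu_by_mem):
    """Attach an MMLU score to each sweep row.

    mmlu_by_mem maps the target memory rounded to whole MiB (the same value
    shown in the checkpoint labels) to an accuracy. Rows without a finished
    eval get None.
    """
    rows = []
    for row in csv.DictReader(io.StringIO(sweep_csv_text)):
        key = round(float(row["target_mem"]))  # assumed column name
        row["MMLU"] = mmlu_by_mem.get(key)
        rows.append(row)
    return rows
```

Joining on the rounded target memory lines up a sweep row such as 18,349.9 MiB with the checkpoint labelled 18,350 MiB.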

**Key observations for Qwen3.5-0.8B:**

- The 1.0 rate (20,389 MiB) matches teacher MMLU exactly (0.5038) — a useful sanity check.
- At rates 0.5 to 0.8, MMLU stays between 0.24 and 0.26, essentially the 0.25 random-chance level for 4-choice questions, even though token-level `top_1` rises steadily with the rate (from 0.37 at rate 0.5 to 0.64 at rate 0.8). MMLU is a much stricter signal than token accuracy.
- The 0.9 rate (18,350 MiB, ~90 % of teacher memory) is the only compressed model with any meaningful MMLU recovery (0.31).
138 changes: 138 additions & 0 deletions .agents/skills/puzzletron/eval_list.py
@@ -0,0 +1,138 @@
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Generated with Claude Code
"""List available checkpoints for MMLU eval.

Usage:
    python eval_list.py [puzzle_dir]

If puzzle_dir is omitted, scans for puzzle_dir_* directories in the current
directory, its parent, and /workspace. If exactly one is found it is used
automatically; if multiple are found they are listed and the script exits
asking the user to specify one.

Output columns (space-padded):
    #  Label  MMLU  Path

MMLU column shows "done" if eval_results/mmlu/ already exists.
"""

import glob
import os
import re
import sys


def parse_memory_mib(dir_name):
    """Parse memory in MiB from a target_memory_<val>MiB directory name.

    Directory names encode decimal values with underscores replacing the decimal
    point, e.g. target_memory_10194_364013671875MiB -> 10194.364013671875 MiB.
    Round numbers like target_memory_10000MiB have no underscore.
    """
    m = re.search(r"target_memory_([\d_]+)MiB", dir_name)
    if not m:
        return None
    parts = m.group(1).split("_", 1)
    return float(parts[0]) if len(parts) == 1 else float(f"{parts[0]}.{parts[1]}")


def find_teacher(puzzle_dir):
    """Find the original HF model path from the YAML config matching puzzle_dir."""
    puzzle_dir_abs = os.path.abspath(puzzle_dir)
    puzzle_dir_name = os.path.basename(puzzle_dir_abs)
    for yaml_path in glob.glob("examples/puzzletron/configs/**/*.yaml", recursive=True):
        try:
            # Use a context manager so the file handle is closed promptly.
            with open(yaml_path) as f:
                content = f.read()
        except OSError:
            continue
        if puzzle_dir_abs in content or puzzle_dir_name in content:
            m = re.search(r"^input_hf_model_path\s*:\s*(\S+)", content, re.MULTILINE)
            if m:
                return m.group(1)
    return None


def list_checkpoints(puzzle_dir):
    """Print available checkpoints (teacher + sweep solutions) with MMLU eval status."""
    teacher_path = find_teacher(puzzle_dir)

    ckpt_dirs = sorted(
        glob.glob(f"{puzzle_dir}/mip/puzzle_solutions/*/solutions--checkpoints/solution_0")
    )

    entries = []
    if teacher_path:
        entries.append(("teacher", teacher_path))
    for ckpt in ckpt_dirs:
        dir_name = ckpt.split("/mip/puzzle_solutions/")[1].split("/")[0]
        mem = parse_memory_mib(dir_name)
        label = f"{mem:,.0f} MiB" if mem is not None else dir_name
        entries.append((label, ckpt))

    if not entries:
        print(f"No checkpoints found under {puzzle_dir}.")
        sys.exit(1)

    def has_mmlu(path):
        return os.path.isdir(os.path.join(path, "eval_results", "mmlu"))

    max_label = max(len(e[0]) for e in entries)
    print(f"\n{'#':<4}  {'Label':<{max_label}}  {'MMLU':^6}  Path")
    print("-" * (4 + 2 + max_label + 2 + 6 + 2 + 60))
    for i, (label, path) in enumerate(entries):
        eval_mark = "done" if has_mmlu(path) else ""
        print(f"{i:<4}  {label:<{max_label}}  {eval_mark:^6}  {path}")
    print()
    print("Usage: /puzzletron eval mmlu <index>")
    print("       /puzzletron eval mmlu <index> --limit 10 (smoke test)")


# --- main ---

if len(sys.argv) > 1:
    puzzle_dir = sys.argv[1].rstrip("/")
    if not os.path.isdir(puzzle_dir):
        print(f"Directory not found: {puzzle_dir}")
        sys.exit(1)
    list_checkpoints(puzzle_dir)
else:
    # Auto-discover puzzle_dir_* in the current directory and its parent
    candidates = sorted(glob.glob("puzzle_dir_*") + glob.glob("../puzzle_dir_*"))
    # Also check /workspace if we're inside it
    candidates += sorted(glob.glob("/workspace/puzzle_dir_*"))
    # Deduplicate by absolute path while preserving order
    seen: set = set()
    deduped = []
    for c in candidates:
        abs_c = os.path.abspath(c)
        if abs_c not in seen:
            seen.add(abs_c)
            deduped.append(c)
    candidates = deduped
    candidates = [c for c in candidates if os.path.isdir(c)]

    if len(candidates) == 1:
        list_checkpoints(candidates[0])
    elif len(candidates) == 0:
        print("No puzzle_dir_* directories found. Please specify the path explicitly:")
        print("  /puzzletron eval list <puzzle_dir>")
        sys.exit(1)
    else:
        print("Multiple puzzle directories found. Please specify one:")
        for i, c in enumerate(candidates):
            print(f"  {i}  {c}")
        print("\nUsage: /puzzletron eval list <puzzle_dir>")
        sys.exit(1)