From aa009fb7e1549ba1ab9489d4a36ceaf261c6f5bb Mon Sep 17 00:00:00 2001
From: Daniel Korzekwa
Date: Mon, 22 Jun 2026 06:04:36 -0700
Subject: [PATCH 1/3] add puzzletron eval mmlu command

Signed-off-by: Daniel Korzekwa
---
 .agents/skills/puzzletron/SKILL.md | 38 ++++++++++++++++++++++++++----
 1 file changed, 34 insertions(+), 4 deletions(-)

diff --git a/.agents/skills/puzzletron/SKILL.md b/.agents/skills/puzzletron/SKILL.md
index a18398d8fd7..eac1fba3d5e 100644
--- a/.agents/skills/puzzletron/SKILL.md
+++ b/.agents/skills/puzzletron/SKILL.md
@@ -1,6 +1,6 @@
 ---
 name: puzzletron
-description: "End-to-end workflow for model pruning and MIP-based optimization. Commands: mip, all, add-model. Usage: /puzzletron [args]"
+description: "End-to-end workflow for model pruning and MIP-based optimization. Commands: mip, all, add-model, eval. Usage: /puzzletron [args]"
 license: Apache-2.0
 ---
 
@@ -11,7 +11,7 @@ license: Apache-2.0
 **STEP 1 — Check args before doing anything else. This is MANDATORY.**
 
 - If args are **empty**, output the block below verbatim and **STOP immediately. Do NOT proceed to any command.**
-- If the first word of args does **not exactly match** `mip`, `all`, or `add-model`, output the block below verbatim and **STOP immediately. Do NOT proceed to any command.**
+- If the first word of args does **not exactly match** `mip`, `all`, `add-model`, or `eval`, output the block below verbatim and **STOP immediately. Do NOT proceed to any command.**
 
 ---
 
@@ -25,6 +25,7 @@ Available commands:
 - `all ` — Run the full Puzzletron pipeline (nproc_per_node: number of GPUs per node)
 - `all progress` — Show live full pipeline progress with timing summary
 - `add-model ` — Implement descriptor, converter, and configs for an unsupported model
+- `eval mmlu [--limit ] [--batch_size ]` — Evaluate a checkpoint on MMLU (5-shot); add `--limit N` for a smoke test
 
 Usage: `/puzzletron [args]`
 
@@ -48,7 +49,7 @@ Parse `nproc_per_node` from args using either positional or flag syntax:
 Run the following Bash command, substituting `` with the parsed value:
 
 ```bash
-set -o pipefail && export PYTHONPATH=$PYTHONPATH:/workspace/Model-Optimizer && \
+set -o pipefail && export PYTHONPATH=$PYTHONPATH:. && \
 torchrun --nproc_per_node examples/puzzletron/main.py \
     --config examples/puzzletron/configs/llama-3_1-8B_pruneffn_memory/llama-3_1-8B_pruneffn_memory.yaml \
     2>&1 | tee ./log.txt | grep "Puzzletron Progress"
@@ -82,7 +83,7 @@ Parse `nproc_per_node` from args using either positional or flag syntax:
 Run the following Bash command, substituting `` with the parsed value:
 
 ```bash
-set -o pipefail && export PYTHONPATH=$PYTHONPATH:/workspace/Model-Optimizer && \
+set -o pipefail && export PYTHONPATH=$PYTHONPATH:. && \
 torchrun --nproc_per_node examples/puzzletron/main.py \
     --config examples/puzzletron/configs/llama-3_1-8B_pruneffn_memory/llama-3_1-8B_pruneffn_memory.yaml \
     --mip-only 2>&1 | tee ./log.txt | grep "Puzzletron Progress"
@@ -114,6 +115,35 @@ Run the following Bash command. Present the output to the user wrapped in a fenc
 python3 .agents/skills/puzzletron/mip_sweep.py
 ```
 
+## Command: eval
+
+- If the second word is not exactly `mmlu`, tell the user: "Unknown eval sub-command. Available: `mmlu`." and **STOP**.
+
+### eval mmlu
+
+Parse args:
+- `hf_model_path` — third word (positional) or `--hf_model_path `. If missing, ask: "Please provide the HuggingFace model path (local or hub)." and **STOP**.
+- `--limit ` — optional integer; omit the flag entirely if not provided.
+- `--batch_size ` — optional integer; default `4` if not provided.
+
+Run the following Bash command, substituting the parsed values:
+
+```bash
+PYTHONPATH=.:$PYTHONPATH python examples/llm_eval/lm_eval_hf.py \
+    --model hf \
+    --model_args pretrained=,dtype=bfloat16,parallelize=True \
+    --tasks mmlu \
+    --num_fewshot 5 \
+    --batch_size \
+    [--limit ]
+```
+
+(Replace `[--limit ]` with the actual `--limit ` flag when provided, or omit it.)
+
+Stream output to the user as it arrives. When the command finishes:
+- Report the exit code.
+- Show a results summary table with: model path, total questions evaluated, loglikelihood requests, and the mmlu/category accuracy scores parsed from the output.
+
 ## Command: add-model
 
 Parse `hf_model_path` from args (the second word). If missing, ask: "Please provide the HuggingFace model path (local or hub)." and **STOP**.
From 38b606d78679dd0721da2b6c737c2a85e22f910d Mon Sep 17 00:00:00 2001
From: Daniel Korzekwa
Date: Mon, 22 Jun 2026 07:07:47 -0700
Subject: [PATCH 2/3] Update puzzletron eval skill

Signed-off-by: Daniel Korzekwa
---
 .agents/skills/puzzletron/SKILL.md         |  48 ++++-
 .agents/skills/puzzletron/eval_list.py     | 138 +++++++++++++++
 .agents/skills/puzzletron/eval_progress.py | 196 +++++++++++++++++++++
 3 files changed, 378 insertions(+), 4 deletions(-)
 create mode 100644 .agents/skills/puzzletron/eval_list.py
 create mode 100644 .agents/skills/puzzletron/eval_progress.py

diff --git a/.agents/skills/puzzletron/SKILL.md b/.agents/skills/puzzletron/SKILL.md
index eac1fba3d5e..48064e505b8 100644
--- a/.agents/skills/puzzletron/SKILL.md
+++ b/.agents/skills/puzzletron/SKILL.md
@@ -1,6 +1,6 @@
 ---
 name: puzzletron
-description: "End-to-end workflow for model pruning and MIP-based optimization. Commands: mip, all, add-model, eval. Usage: /puzzletron [args]"
+description: "End-to-end workflow for model pruning and MIP-based optimization. Commands: mip, all, add-model, eval (list/mmlu). Usage: /puzzletron [args]"
 license: Apache-2.0
 ---
 
@@ -25,7 +25,9 @@ Available commands:
 - `all ` — Run the full Puzzletron pipeline (nproc_per_node: number of GPUs per node)
 - `all progress` — Show live full pipeline progress with timing summary
 - `add-model ` — Implement descriptor, converter, and configs for an unsupported model
-- `eval mmlu [--limit ] [--batch_size ]` — Evaluate a checkpoint on MMLU (5-shot); add `--limit N` for a smoke test
+- `eval list [puzzle_dir]` — List all available checkpoints (teacher + sweep solutions) with their index numbers; auto-discovers puzzle_dir if omitted
+- `eval progress [puzzle_dir]` — Show per-checkpoint MMLU eval status and accuracy; auto-discovers puzzle_dir if omitted
+- `eval mmlu [--puzzle_dir ] [--limit ] [--batch_size ]` — Evaluate a checkpoint on MMLU (5-shot); pass index from `eval list` or a direct path; add `--limit N` for a smoke test
 
 Usage: `/puzzletron [args]`
 
@@ -117,15 +119,44 @@ python3 .agents/skills/puzzletron/mip_sweep.py
 
 ## Command: eval
 
-- If the second word is not exactly `mmlu`, tell the user: "Unknown eval sub-command. Available: `mmlu`." and **STOP**.
+- If the second word is not exactly `list`, `progress`, or `mmlu`, tell the user: "Unknown eval sub-command. Available: `list`, `progress`, `mmlu`." and **STOP**.
+
+### eval list
+
+Parse `puzzle_dir` from args (third word or `--puzzle_dir `). It is optional.
+
+Run the following Bash command, including `` as an argument when provided, or omitting it to trigger auto-discovery:
+
+```bash
+python3 .agents/skills/puzzletron/eval_list.py []
+```
+
+Present the output to the user wrapped in a fenced code block (``` ... ```).
+
+### eval progress
+
+Parse `puzzle_dir` from args (third word or `--puzzle_dir `). It is optional.
+
+Run the following Bash command, including `` as an argument when provided, or omitting it to trigger auto-discovery:
+
+```bash
+python3 .agents/skills/puzzletron/eval_progress.py []
+```
+
+Present the output to the user wrapped in a fenced code block (``` ... ```).
 
 ### eval mmlu
 
 Parse args:
-- `hf_model_path` — third word (positional) or `--hf_model_path `. If missing, ask: "Please provide the HuggingFace model path (local or hub)." and **STOP**.
+- `index_or_path` — third word. If missing, ask: "Please provide a checkpoint index (from `eval list`) or a direct HF model path." and **STOP**.
+- `--puzzle_dir ` — optional; used when resolving an index.
+- If `index_or_path` matches `^[0-9]+$`, resolve it to a path by running `python3 .agents/skills/puzzletron/eval_list.py []` and taking the `Path` column of the row whose `#` column equals the index (indices are 0-based; skip the header, separator, and usage lines). If no row has that index, tell the user and **STOP**.
+- Otherwise treat `index_or_path` as a literal `hf_model_path`.
 - `--limit ` — optional integer; omit the flag entirely if not provided.
 - `--batch_size ` — optional integer; default `4` if not provided.
 
+Derive `output_path` as `/eval_results/mmlu` (always; not user-configurable).
+
 Run the following Bash command, substituting the parsed values:
 
 ```bash
@@ -135,11 +166,20 @@ PYTHONPATH=.:$PYTHONPATH python examples/llm_eval/lm_eval_hf.py \
     --tasks mmlu \
     --num_fewshot 5 \
     --batch_size \
+    --output_path /eval_results/mmlu \
     [--limit ]
 ```
 
 (Replace `[--limit ]` with the actual `--limit ` flag when provided, or omit it.)
 
+**Note on output file location:** lm_eval does not write results directly into `--output_path`. It creates a subdirectory named after the full model path with `/` replaced by `__`, then writes `results_.json` inside it. For example, for a model at `/workspace/foo/bar`, results land at:
+
+```text
+/__workspace__foo__bar/results_.json
+```
+
+`eval_progress.py` handles this automatically via a recursive glob.
+
 Stream output to the user as it arrives. When the command finishes:
 - Report the exit code.
 - Show a results summary table with: model path, total questions evaluated, loglikelihood requests, and the mmlu/category accuracy scores parsed from the output.
diff --git a/.agents/skills/puzzletron/eval_list.py b/.agents/skills/puzzletron/eval_list.py
new file mode 100644
index 00000000000..eab99c93524
--- /dev/null
+++ b/.agents/skills/puzzletron/eval_list.py
@@ -0,0 +1,138 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Generated with Claude Code
+"""List available checkpoints for MMLU eval.
+
+Usage:
+    python eval_list.py [puzzle_dir]
+
+If puzzle_dir is omitted, scans for puzzle_dir_* directories under the repo
+root. If exactly one is found it is used automatically; if multiple are found
+they are listed and the script exits asking the user to specify one.
+
+Output columns (space-aligned):
+    #  Label  MMLU  Path
+
+MMLU column shows "done" if eval_results/mmlu/ already exists.
+"""
+
+import glob
+import os
+import re
+import sys
+
+
+def parse_memory_mib(dir_name):
+    """Parse memory in MiB from a target_memory_MiB directory name.
+
+    Directory names encode decimal values with underscores replacing the decimal
+    point, e.g. target_memory_10194_364013671875MiB -> 10194.364013671875 MiB.
+    Round numbers like target_memory_10000MiB have no underscore.
+    """
+    m = re.search(r"target_memory_([\d_]+)MiB", dir_name)
+    if not m:
+        return None
+    parts = m.group(1).split("_", 1)
+    return float(parts[0]) if len(parts) == 1 else float(f"{parts[0]}.{parts[1]}")
+
+
+def find_teacher(puzzle_dir):
+    """Find the original HF model path from the YAML config matching puzzle_dir."""
+    puzzle_dir_abs = os.path.abspath(puzzle_dir)
+    puzzle_dir_name = os.path.basename(puzzle_dir_abs)
+    for yaml_path in glob.glob("examples/puzzletron/configs/**/*.yaml", recursive=True):
+        try:
+            content = open(yaml_path).read()
+        except OSError:
+            continue
+        if puzzle_dir_abs in content or puzzle_dir_name in content:
+            m = re.search(r"^input_hf_model_path\s*:\s*(\S+)", content, re.MULTILINE)
+            if m:
+                return m.group(1)
+    return None
+
+
+def list_checkpoints(puzzle_dir):
+    """Print available checkpoints (teacher + sweep solutions) with MMLU eval status."""
+    teacher_path = find_teacher(puzzle_dir)
+
+    ckpt_dirs = sorted(
+        glob.glob(f"{puzzle_dir}/mip/puzzle_solutions/*/solutions--checkpoints/solution_0")
+    )
+
+    entries = []
+    if teacher_path:
+        entries.append(("teacher", teacher_path))
+    for ckpt in ckpt_dirs:
+        dir_name = ckpt.split("/mip/puzzle_solutions/")[1].split("/")[0]
+        mem = parse_memory_mib(dir_name)
+        label = f"{mem:,.0f} MiB" if mem is not None else dir_name
+        entries.append((label, ckpt))
+
+    if not entries:
+        print(f"No checkpoints found under {puzzle_dir}.")
+        sys.exit(1)
+
+    def has_mmlu(path):
+        return os.path.isdir(os.path.join(path, "eval_results", "mmlu"))
+
+    max_label = max(len(e[0]) for e in entries)
+    print(f"\n{'#':<4} {'Label':<{max_label}} {'MMLU':^6} Path")
+    print("-" * (4 + 2 + max_label + 2 + 6 + 2 + 60))
+    for i, (label, path) in enumerate(entries):
+        eval_mark = "done" if has_mmlu(path) else ""
+        print(f"{i:<4} {label:<{max_label}} {eval_mark:^6} {path}")
+    print()
+    print("Usage: /puzzletron eval mmlu ")
+    print(" /puzzletron eval mmlu --limit 10 (smoke test)")
+
+
+# --- main ---
+
+if len(sys.argv) > 1:
+    puzzle_dir = sys.argv[1].rstrip("/")
+    if not os.path.isdir(puzzle_dir):
+        print(f"Directory not found: {puzzle_dir}")
+        sys.exit(1)
+    list_checkpoints(puzzle_dir)
+else:
+    # Auto-discover puzzle_dir_* under the repo root
+    candidates = sorted(glob.glob("puzzle_dir_*") + glob.glob("../puzzle_dir_*"))
+    # Also check /workspace if we're inside it
+    candidates += sorted(glob.glob("/workspace/puzzle_dir_*"))
+    # Deduplicate while preserving order
+    seen: set = set()
+    deduped = []
+    for c in candidates:
+        abs_c = os.path.abspath(c)
+        if abs_c not in seen:
+            seen.add(abs_c)
+            deduped.append(c)
+    candidates = deduped
+    candidates = [c for c in candidates if os.path.isdir(c)]
+
+    if len(candidates) == 1:
+        list_checkpoints(candidates[0])
+    elif len(candidates) == 0:
+        print("No puzzle_dir_* directories found. Please specify the path explicitly:")
+        print(" /puzzletron eval list ")
+        sys.exit(1)
+    else:
+        print("Multiple puzzle directories found. Please specify one:")
+        for i, c in enumerate(candidates):
+            print(f" {i} {c}")
+        print("\nUsage: /puzzletron eval list ")
+        sys.exit(1)
diff --git a/.agents/skills/puzzletron/eval_progress.py b/.agents/skills/puzzletron/eval_progress.py
new file mode 100644
index 00000000000..fa56b5834ba
--- /dev/null
+++ b/.agents/skills/puzzletron/eval_progress.py
@@ -0,0 +1,196 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Generated with Claude Code
+"""Progress report for /puzzletron eval mmlu (all checkpoints).
+
+Usage:
+    python eval_progress.py [puzzle_dir]
+
+Reads checkpoint list from eval_list.py and checks eval_results/mmlu/ presence
+plus JSON results to determine status and accuracy for each checkpoint.
+"""
+
+import contextlib
+import glob
+import json
+import os
+import re
+import subprocess  # nosec B404
+import sys
+
+SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
+
+
+def parse_memory_mib(dir_name):
+    """Parse memory in MiB from a target_memory_MiB directory name."""
+    m = re.search(r"target_memory_([\d_]+)MiB", dir_name)
+    if not m:
+        return None
+    parts = m.group(1).split("_", 1)
+    return float(parts[0]) if len(parts) == 1 else float(f"{parts[0]}.{parts[1]}")
+
+
+def find_teacher(puzzle_dir):
+    """Find the original HF model path from the YAML config matching puzzle_dir."""
+    puzzle_dir_abs = os.path.abspath(puzzle_dir)
+    puzzle_dir_name = os.path.basename(puzzle_dir_abs)
+    for yaml_path in glob.glob("examples/puzzletron/configs/**/*.yaml", recursive=True):
+        try:
+            content = open(yaml_path).read()
+        except OSError:
+            continue
+        if puzzle_dir_abs in content or puzzle_dir_name in content:
+            m = re.search(r"^input_hf_model_path\s*:\s*(\S+)", content, re.MULTILINE)
+            if m:
+                return m.group(1)
+    return None
+
+
+def get_checkpoints(puzzle_dir):
+    """Return list of (label, path) for teacher + all sweep solution checkpoints."""
+    teacher_path = find_teacher(puzzle_dir)
+    ckpt_dirs = sorted(
+        glob.glob(f"{puzzle_dir}/mip/puzzle_solutions/*/solutions--checkpoints/solution_0")
+    )
+    entries = []
+    if teacher_path:
+        entries.append(("teacher", teacher_path))
+    for ckpt in ckpt_dirs:
+        dir_name = ckpt.split("/mip/puzzle_solutions/")[1].split("/")[0]
+        mem = parse_memory_mib(dir_name)
+        label = f"{mem:,.0f} MiB" if mem is not None else dir_name
+        entries.append((label, ckpt))
+    return entries
+
+
+def get_running_checkpoint():
+    """Return the checkpoint path currently being evaluated by lm_eval, or None."""
+    try:
+        result = subprocess.run(  # nosec B603 B607
+            ["ps", "-ww", "aux"], capture_output=True, text=True
+        )
+        for line in result.stdout.splitlines():
+            if "lm_eval_hf.py" in line and "pretrained=" in line:
+                m = re.search(r"pretrained=(/[^,\s]+)", line)
+                if m:
+                    return m.group(1)
+    except Exception:
+        pass
+    return None
+
+
+def get_mmlu_accuracy(path):
+    """Return overall MMLU accuracy from saved JSON results, or None if not done."""
+    results_dir = os.path.join(path, "eval_results", "mmlu")
+    if not os.path.isdir(results_dir):
+        return None
+    # lm_eval saves results under a subdirectory named after the model path
+    # (slashes replaced by __), then results_.json inside it
+    for fname in glob.glob(f"{results_dir}/**/results_*.json", recursive=True):
+        data = None
+        with contextlib.suppress(OSError, json.JSONDecodeError, KeyError):
+            data = json.load(open(fname))
+        if data is not None:
+            results = data.get("results", {})
+            if "mmlu" in results:
+                return results["mmlu"].get("acc,none") or results["mmlu"].get("acc")
+    # results dir exists but no readable JSON yet — still running
+    return "running"
+
+
+def fmt_acc(acc):
+    """Format accuracy value for display."""
+    if acc is None:
+        return ""
+    if acc == "running":
+        return "running"
+    return f"{acc:.4f}"
+
+
+# --- main ---
+
+puzzle_dir = sys.argv[1].rstrip("/") if len(sys.argv) > 1 else None
+
+if puzzle_dir is None:
+    candidates = sorted(glob.glob("puzzle_dir_*") + glob.glob("../puzzle_dir_*"))
+    candidates += sorted(glob.glob("/workspace/puzzle_dir_*"))
+    seen: set = set()
+    deduped = []
+    for c in candidates:
+        abs_c = os.path.abspath(c)
+        if abs_c not in seen:
+            seen.add(abs_c)
+            deduped.append(c)
+    candidates = [c for c in deduped if os.path.isdir(c)]
+    if len(candidates) == 1:
+        puzzle_dir = candidates[0]
+    elif len(candidates) == 0:
+        print("No puzzle_dir_* found. Specify: /puzzletron eval progress ")
+        sys.exit(1)
+    else:
+        print("Multiple puzzle directories found. Please specify one:")
+        for i, c in enumerate(candidates):
+            print(f" {i} {c}")
+        print("\nUsage: /puzzletron eval progress ")
+        sys.exit(1)
+
+entries = get_checkpoints(puzzle_dir)
+if not entries:
+    print(f"No checkpoints found under {puzzle_dir}.")
+    sys.exit(1)
+
+running_ckpt = get_running_checkpoint()
+
+done = []
+running = []
+pending = []
+for label, path in entries:
+    acc = get_mmlu_accuracy(path)
+    if acc not in (None, "running"):
+        done.append((label, path))
+    elif acc == "running" or (
+        running_ckpt and os.path.abspath(path) == os.path.abspath(running_ckpt)
+    ):
+        running.append((label, path))
+    else:
+        pending.append((label, path))
+
+DIV = "─" * 66
+print(f"\nMMLU eval progress ({len(done)}/{len(entries)} done)")
+print(DIV)
+print(f" {'Status':<10} {'Checkpoint':<14} {'MMLU acc':>9} Path")
+print(DIV)
+for label, path in entries:
+    acc = get_mmlu_accuracy(path)
+    is_running = (acc == "running") or (
+        running_ckpt and os.path.abspath(path) == os.path.abspath(running_ckpt)
+    )
+    if acc not in (None, "running"):
+        status = "[DONE]"
+        acc_str = f"{acc:.4f}"
+    elif is_running:
+        status = "[RUNNING]"
+        acc_str = "..."
+    else:
+        status = "[ ]"
+        acc_str = "pending"
+    print(f" {status:<10} {label:<14} {acc_str:>9} {path}")
+print(DIV)
+print(f" Done: {len(done)}/{len(entries)}")
+if running:
+    print(f" Running: {', '.join(lbl for lbl, _ in running)}")
+if pending:
+    print(f" Pending: {', '.join(lbl for lbl, _ in pending)}")
From c1fbcc432172145351a084b6122907fdfa7ec7e6 Mon Sep 17 00:00:00 2001
From: Daniel Korzekwa
Date: Mon, 22 Jun 2026 08:16:27 -0700
Subject: [PATCH 3/3] add puzzletron eval to the tutorial

Signed-off-by: Daniel Korzekwa
---
 .../puzzletron/adding_new_model_tutorial.md   | 83 +++++++++++++++++++
 1 file changed, 83 insertions(+)

diff --git a/.agents/skills/puzzletron/adding_new_model_tutorial.md b/.agents/skills/puzzletron/adding_new_model_tutorial.md
index 275380297a0..cf5fe6acd10 100644
--- a/.agents/skills/puzzletron/adding_new_model_tutorial.md
+++ b/.agents/skills/puzzletron/adding_new_model_tutorial.md
@@ -207,3 +207,86 @@ Results from: /workspace/puzzle_dir_qwen3_5-0.8b/mip_sweep_results.csv
 ```
 
 Use this table to pick the compression rate that best meets your accuracy/memory budget.
+
+---
+
+### Step 7: Run MMLU eval on all checkpoints
+
+After the sweep you have 7 compressed checkpoints plus the teacher. Evaluate all of them on MMLU (5-shot) to get an external benchmark score alongside the internal sweep losses.
+
+> **User:** evaluate all checkpoints on MMLU
+
+Claude first lists the available checkpoints:
+
+```text
+/puzzletron eval list
+```
+
+Example output:
+
+```text
+#     Label        MMLU    Path
+--------------------------------------------------------------------------------------------
+0     teacher              /workspace/hf_models/Qwen/Qwen3.5-0.8B
+1     10,000 MiB           .../target_memory_10000MiB.../solution_0
+2     10,194 MiB           .../target_memory_10194_.../solution_0
+3     12,233 MiB           .../target_memory_12233_.../solution_0
+4     14,272 MiB           .../target_memory_14272_.../solution_0
+5     16,311 MiB           .../target_memory_16310_.../solution_0
+6     18,350 MiB           .../target_memory_18349_.../solution_0
+7     20,389 MiB           .../target_memory_20388_.../solution_0
+
+Usage: /puzzletron eval mmlu
+       /puzzletron eval mmlu --limit 10 (smoke test)
+```
+
+Claude then runs all 8 checkpoints sequentially in a background task (`/puzzletron eval mmlu 0` through `/puzzletron eval mmlu 7`). Results are saved next to each checkpoint at `/eval_results/mmlu/`.
+
+> **User:** show eval progress
+
+```text
+/puzzletron eval progress
+```
+
+Example output mid-run:
+
+```text
+MMLU eval progress (3/8 done)
+──────────────────────────────────────────────────────────────────
+  Status      Checkpoint       MMLU acc  Path
+──────────────────────────────────────────────────────────────────
+  [DONE]      teacher            0.5038  /workspace/hf_models/Qwen/Qwen3.5-0.8B
+  [DONE]      10,000 MiB         0.2365  .../target_memory_10000MiB.../solution_0
+  [DONE]      10,194 MiB         0.2417  .../target_memory_10194_.../solution_0
+  [RUNNING]   12,233 MiB            ...  .../target_memory_12233_.../solution_0
+  [ ]         14,272 MiB        pending
+  [ ]         16,311 MiB        pending
+  [ ]         18,350 MiB        pending
+  [ ]         20,389 MiB        pending
+──────────────────────────────────────────────────────────────────
+  Done: 3/8
+  Running: 12,233 MiB
+  Pending: 14,272 MiB, 16,311 MiB, 18,350 MiB, 20,389 MiB
+```
+
+> **User:** show both in one table
+
+Claude joins the sweep CSV with the per-checkpoint MMLU JSON results:
+
+```text
+    rate  target_mem  actual_mem     num_params  lm_loss   top_1   top_5  top_10    MMLU
+----------------------------------------------------------------------------------------------------
+ teacher                                                                          0.5038
+    0.50    10,194.4    10,143.3    888,813,280   3.2367  0.3663  0.6384  0.7251  0.2417
+    0.60    12,233.2    11,719.5    909,901,856   2.6377  0.4434  0.7198  0.7981  0.2446
+    0.70    14,272.1    14,083.8    941,534,720   1.8532  0.5855  0.8176  0.8735  0.2374
+    0.80    16,311.0    15,660.1    962,623,296   1.5385  0.6448  0.8576  0.9046  0.2636
+    0.90    18,349.9    18,024.4    994,256,160   1.2447  0.7064  0.8914  0.9278  0.3119
+    1.00    20,388.7    20,388.7  1,025,889,024   1.1067  0.7365  0.9079  0.9399  0.5038
+```
+
+**Key observations for Qwen3.5-0.8B:**
+
+- The 1.0 rate (20,389 MiB) matches teacher MMLU exactly (0.5038), which is a useful sanity check.
+- Below rate 0.9, MMLU falls to 0.24 to 0.26, near the 0.25 random-chance level for 4-choice questions, even though token-level `top_1` still improves steadily with rate (0.37 at rate 0.50, 0.64 at rate 0.80). MMLU is a much stricter signal than token accuracy.
+- The 0.9 rate (18,350 MiB, ~90% of teacher memory) is the only compressed model with any meaningful MMLU recovery (0.31).
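
The key observations can be checked with a few lines of arithmetic. This is a minimal sketch, not part of the skill scripts: it assumes the MMLU values are copied by hand from the combined table, takes 0.25 as the chance level for 4-choice questions, and uses a hypothetical 0.03 margin to decide what counts as "above chance". The names `MMLU_BY_RATE`, `retention`, and `above_chance` are illustrative only.

```python
# MMLU accuracy per compression rate, copied by hand from the combined table.
TEACHER_MMLU = 0.5038
MMLU_BY_RATE = {
    0.50: 0.2417,
    0.60: 0.2446,
    0.70: 0.2374,
    0.80: 0.2636,
    0.90: 0.3119,
    1.00: 0.5038,
}
CHANCE = 0.25  # assumption: uniform guessing over 4 answer choices


def retention(acc, teacher=TEACHER_MMLU):
    """Fraction of the teacher's MMLU accuracy that a checkpoint keeps."""
    return acc / teacher


def above_chance(acc, margin=0.03):
    """True if accuracy beats chance by more than `margin` (hypothetical threshold)."""
    return acc - CHANCE > margin


for rate, acc in sorted(MMLU_BY_RATE.items()):
    flag = "above chance" if above_chance(acc) else "near chance"
    print(f"rate {rate:.2f}: MMLU {acc:.4f}  retention {retention(acc):.1%}  {flag}")
```

With this margin, only rates 0.90 and 1.00 are flagged as above chance, which agrees with the observations above. The margin is a judgment call; a stricter analysis would derive it from the standard error that lm_eval reports alongside each accuracy.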