
Submission: Asymmetric Logit Rescale + cap-fit bit allocation (PR #2140 fork) [draft, pending 8×H100] #2164

Draft
vimeto wants to merge 1 commit into openai:main from vimeto:submission/2026-05-13-asymlogit-capfit

@vimeto vimeto commented May 22, 2026

PR #2140 fork + 3 levers + cap-fit (8×H100 SXM submission)

Status: ⏳ still waiting on the 8×H100 SXM RunPod run; GPU availability has been essentially nil (as of 2026-05-13). The recipe and env block are already validated on 1×A100 SECURE. The challenge has already ended, and 8×H100 pods have been unavailable the entire time we have been trying to get one, so we are posting this draft PR without the H100 numbers rather than waiting indefinitely.

TL;DR

upstream PR #2140 (s42 ref: 1.05591)
                  + ASYM_LOGIT_RESCALE=1     (−0.00132 BPB)
                  + GPTQ_CALIBRATION_BATCHES=32  (−0.00044)
                  + GPTQ_RESERVE_SECONDS=2.0     (−0.00012)
                  + EMBED_BITS=7                  (cap-fit: −487 KB)
                  + MLP_CLIP_SIGMAS=11.5          (cap-fit: −574 KB)
                  ─────────────────────────
                  ≈ 1.0554 BPB projected, 15,981,515 bytes (under 16 MB cap)

This PR adds records/submission_2026_05_13_pr2140_h100_capfit/, a faithful port of upstream PR #2140 with three BPB levers and a critical cap-fit correction.

Why the bundled artifact is currently 17.20 MB (over cap)

The .ptz artifact shipped in this PR is a placeholder from our LUMI training run, at 17,196,199 bytes (over the 16 MB submission cap). This is intentional and explained below; the H100 production run will replace it with the cap-compliant artifact.

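The LUMI run used the script's silent defaults for two bit-allocation hyperparameters that PR #2140's published env block explicitly sets:

| Env var | LUMI run (script default) | Cap-fit value | Size cost of the default |
|---|---|---|---|
| EMBED_BITS | 8 | 7 | ~487 KB |
| MLP_CLIP_SIGMAS | 10.0 | 11.5 | ~574 KB |

Total cap miss attributable to these two env vars: ~1.06 MB (487 KB + 574 KB), which accounts for most of the ~1.20 MB by which the LUMI artifact exceeds the cap.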

We validated this empirically on 2026-05-13 with a 1×A100 SECURE pod running QUANTIZE_ONLY=1 against the LUMI fp32 weights:

| Config | Artifact size | Pre-TTT BPB | Est. post-TTT BPB |
|---|---|---|---|
| EMBED=8, MLP_CLIP=10.0 (LUMI defaults) | 17.04 MB | 1.06532 | ~1.0533 |
| EMBED=7, MLP_CLIP=10.0 | 16.56 MB | 1.06619 | ~1.0542 |
| EMBED=6, MLP_CLIP=10.0 | 16.10 MB | 1.06978 | ~1.0578 |
| EMBED=7, MLP_CLIP=11.5 | 15.98 MB | 1.06739 | ~1.0554 |

The selected EMBED=7 + MLP_CLIP=11.5 configuration costs about +0.002 BPB versus the LUMI defaults but fits under the 16 MB cap with an ~18 KB margin. The MLP_CLIP_SIGMAS change from 10.0 to 11.5 is counter-intuitive: looser clipping gives better int6 grid usage on the MLP weights, which in turn gives a smaller LQER residual. This was discovered on 2026-05-13 by comparison with records/upstream_pr2135/.
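The byte budget behind this choice can be checked with a few lines of arithmetic. The sketch below only restates numbers quoted in this PR (the 16,000,000-byte cap, the two artifact sizes, and the A100 table) and assumes 1 MB = 10^6 bytes, which is consistent with 17,196,199 bytes being reported as 17.20 MB. The variable names are illustrative and do not come from train_gpt.py.

```python
CAP_BYTES = 16_000_000  # submission cap quoted in this PR

lumi_artifact_bytes = 17_196_199    # placeholder artifact from the LUMI run
capfit_artifact_bytes = 15_981_515  # projected cap-fit artifact

overage_bytes = lumi_artifact_bytes - CAP_BYTES   # how far the placeholder is over
margin_bytes = CAP_BYTES - capfit_artifact_bytes  # headroom of the cap-fit config
explained_kb = 487 + 574                          # EMBED_BITS + MLP_CLIP_SIGMAS savings

# (config, artifact size in MB, estimated post-TTT BPB) from the A100 table
configs = [
    ("EMBED=8, MLP_CLIP=10.0", 17.04, 1.0533),
    ("EMBED=7, MLP_CLIP=10.0", 16.56, 1.0542),
    ("EMBED=6, MLP_CLIP=10.0", 16.10, 1.0578),
    ("EMBED=7, MLP_CLIP=11.5", 15.98, 1.0554),
]

# Keep only configs under the cap, then take the lowest estimated BPB.
fitting = [c for c in configs if c[1] * 1_000_000 <= CAP_BYTES]
best = min(fitting, key=lambda c: c[2])

print(f"placeholder overage: {overage_bytes:,} bytes")
print(f"explained by the two env vars: ~{explained_kb} KB")
print(f"cap-fit margin: {margin_bytes:,} bytes")
print(f"selected config: {best[0]}")
```

Only one of the four measured configurations fits, so the selection is forced by the cap rather than by BPB.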

Once the 8×H100 RunPod run completes, this PR will be updated with the cap-compliant ~15.98 MB artifact and the placeholder will be replaced; the H100 stdout should then match every anchor in expected_log.txt, including the ≤16 MB Total submission size.

Architecture lineage

PR #2140 (SP8192_CaseOps_Progressive3k_ShortDocTTT) is itself built on PR #2014 (last clean record per upstream Issue #2127 audit) plus token-only n-gram tilt from PR #1145 / PR #2018. Our additions are surgical:

  1. ASYM_LOGIT_RESCALE=1 (lever −0.00132 BPB): adds learnable softcap_pos/softcap_neg scalars and replaces the symmetric tanh logit softcap with an asymmetric variant. Must be set during BOTH train and TTT (otherwise the state_dict will not match).
  2. GPTQ_CALIBRATION_BATCHES=32 (lever −0.00044): doubles PR #2140's default of 16; more Hessian data = better int6 grid choice.
  3. GPTQ_RESERVE_SECONDS=2.0 (lever −0.00012): halves PR #2140's default of 4.0; gives the training loop 2 more seconds before GPTQ starts.
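To illustrate the first lever, here is a minimal scalar sketch of a symmetric tanh softcap next to one plausible asymmetric variant. The exact form used in train_gpt.py is not shown in this PR, so the piecewise definition below is an assumption, as are the function names and the cap values 30.0 and 20.0. In the real model softcap_pos and softcap_neg are learnable scalars stored in the checkpoint, which is why the flag has to match between train and TTT.

```python
import math


def symmetric_softcap(logit: float, cap: float) -> float:
    """Standard tanh softcap: squashes the logit smoothly into (-cap, cap)."""
    return cap * math.tanh(logit / cap)


def asymmetric_softcap(logit: float, softcap_pos: float, softcap_neg: float) -> float:
    """Assumed asymmetric variant: separate caps for positive and negative logits.

    Positive logits saturate at +softcap_pos, negative ones at -softcap_neg.
    Both branches have slope 1 at zero, so the function stays smooth there.
    """
    cap = softcap_pos if logit >= 0.0 else softcap_neg
    return cap * math.tanh(logit / cap)


# Hypothetical cap values, chosen only for illustration.
for x in (-100.0, -5.0, 0.0, 5.0, 100.0):
    print(x, symmetric_softcap(x, 30.0), asymmetric_softcap(x, 30.0, 20.0))
```

With equal caps the asymmetric form reduces to the symmetric one, so the variant strictly generalizes the baseline.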

Reproduction

The submission contract is satisfied by a single Python script that drives the whole pipeline end to end (train + GPTQ + AWQ-lite + LQER + serialize + post-quant eval + TTT eval).

# 8×H100 SXM pod, vimetoivonen/pgolf:pr2140 image
git clone https://github.com/<your-fork>/parameter-golf.git /workspace/repo
cd /workspace/repo
bash scripts/runpod_pr2140_repro.sh

The wrapper handles the HF data download (romeerp/parameter-golf-caseops-v1, audit-clean per Issue #2127), an optional throughput smoke test, and a single torchrun --nproc_per_node=8 records/submission_2026_05_13_pr2140_h100_capfit/train_gpt.py invocation. The full env block is in REPRODUCE.md. Wallclock: ~22 min (~540 s train, ~720 s eval phase).
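For readers who want to see the shape of the launch without opening the wrapper, the sketch below assembles the torchrun command and the five env vars named in this PR. It is not a substitute for scripts/runpod_pr2140_repro.sh: the full env block lives in REPRODUCE.md and contains settings not listed here, and build_launch is a hypothetical helper written for this illustration.

```python
import os

# The five settings named in this PR; REPRODUCE.md holds the complete env block.
LEVER_ENV = {
    "ASYM_LOGIT_RESCALE": "1",
    "GPTQ_CALIBRATION_BATCHES": "32",
    "GPTQ_RESERVE_SECONDS": "2.0",
    "EMBED_BITS": "7",
    "MLP_CLIP_SIGMAS": "11.5",
}

SCRIPT = "records/submission_2026_05_13_pr2140_h100_capfit/train_gpt.py"


def build_launch(base_env=None):
    """Return (command, environment) for the 8-GPU run without executing it."""
    env = dict(os.environ if base_env is None else base_env)
    env.update(LEVER_ENV)  # lever settings override anything inherited
    cmd = ["torchrun", "--nproc_per_node=8", SCRIPT]
    return cmd, env


cmd, env = build_launch(base_env={})
print(" ".join(cmd))
# On the pod, the pair could be passed to subprocess.run(cmd, env=env, check=True).
```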

Files in this PR

| File | Role |
|---|---|
| records/submission_2026_05_13_pr2140_h100_capfit/train_gpt.py | The fork (4990 lines). Verbatim copy of specs/batch200_frontier/run_pr2140_ref.py. PR #2140 + 3 levers + LUMI compat patches (no-ops on H100). |
| records/.../prepare_caseops_data.py | HuggingFace dataset prep (romeerp/parameter-golf-caseops-v1). |
| records/.../online_ngram_tilt.py + online_ngram_state.c | Token-only n-gram tilt helper with a clang-aware compile fallback. |
| records/.../submission.json | Structured metadata: recipe identity, lineage, projected BPB targets + tolerances. The val_bpb/val_bpb_postquant/artifact_bytes fields hold projected targets pending the H100 run. |
| records/.../README.md | Full context, lineage diagram, hardware/software, reproduction. |
| records/.../expected_log.txt | Reference log of what the H100 run is expected to produce. Regex-match the actual stdout against the anchor lines here. |
| records/.../lumi_logs.txt | The exact 8×MI250X train + TTT logs that produced the placeholder artifact (val_bpb 1.05320, 17.20 MB). Provenance reference. |
| records/.../final_model.int6.ptz | Placeholder (8×MI250X artifact at 17.20 MB). Replaced by the H100 cap-fit artifact when the production run completes. |

Validation (regex-match against expected_log.txt)

When the H100 run completes, match its stdout against the anchor lines in expected_log.txt. The load-bearing lines to check:

| Anchor | Expected | Why |
|---|---|---|
| embed_bits: 7 | exactly 7 | Cap-fit lever; 8 → 487 KB over cap |
| mlp_clip_sigmas: 11.5 | exactly 11.5 | Cap-fit lever; 10.0 → 574 KB over cap |
| gptq_calibration_batches: 32 | 32 | Lever −0.00044 BPB |
| gptq_reserve_seconds: 2.0 | 2.0 | Lever −0.00012 BPB |
| softcap_pos / softcap_neg in passthrough (float16) | present | ASYM_LOGIT_RESCALE evidence (lever −0.00132) |
| gptq (int7)+lqer_asym: tok_emb.weight | int7 (not int8) | Direct consequence of EMBED_BITS=7 |
| stopping_early ... step: | ~5067 (±800) | Wallclock-capped, not iter-capped |
| diagnostic pre-quantization ... val_bpb: | ~1.058 ±0.002 | Pre-quant target |
| diagnostic quantized ... val_bpb: | ~1.06739 ±0.002 | Post-quant pre-TTT (A100-validated) |
| Total submission size quantized+pergroup: | ≤ 16,000,000 bytes | The cap; expected ~15,981,515 |
| quantized_ttt_phased ... val_bpb: | ~1.0554 ±0.003 | Final result (seed lottery) |

lumi_logs.txt is the negative reference — it uses the pre-cap-fit env (embed_bits=8, mlp_clip_sigmas=10.0, 17.20 MB artifact), so a correct H100 run should differ from it on exactly those cap-fit anchors.
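A minimal sketch of the regex match for the four exact-value anchors is shown below. The patterns assume each setting is logged as `name: value` on its own line, as the anchor column suggests; the real line format is defined by expected_log.txt, so the patterns should be adjusted against that file. ANCHORS and missing_anchors are hypothetical names used only here.

```python
import re

# Assumed "name: value" log format, taken from the anchor column of the table.
ANCHORS = {
    "embed_bits": r"embed_bits:\s*7\b",
    "mlp_clip_sigmas": r"mlp_clip_sigmas:\s*11\.5\b",
    "gptq_calibration_batches": r"gptq_calibration_batches:\s*32\b",
    "gptq_reserve_seconds": r"gptq_reserve_seconds:\s*2\.0\b",
}


def missing_anchors(stdout: str) -> list[str]:
    """Return the names of anchors that do not appear in the captured stdout."""
    return [name for name, pattern in ANCHORS.items()
            if not re.search(pattern, stdout)]


# Synthetic log fragments for illustration; these are not real run output.
h100_like = (
    "embed_bits: 7\nmlp_clip_sigmas: 11.5\n"
    "gptq_calibration_batches: 32\ngptq_reserve_seconds: 2.0\n"
)
lumi_like = h100_like.replace("embed_bits: 7", "embed_bits: 8").replace(
    "mlp_clip_sigmas: 11.5", "mlp_clip_sigmas: 10.0"
)

print(missing_anchors(h100_like))
print(missing_anchors(lumi_like))
```

A log with the pre-cap-fit settings fails only on the two cap-fit anchors, which is the behavior the negative reference is meant to demonstrate.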

Acceptance criteria for the H100 run

| Metric | Expected | Hard cap |
|---|---|---|
| Post-TTT val_bpb | ~1.0554 ± 0.003 | |
| Pre-quant val_bpb | ~1.058 ± 0.002 | |
| Post-quant pre-TTT val_bpb | ~1.06739 ± 0.002 | |
| Artifact size | ~15,981,515 ± 200,000 bytes | ≤ 16,000,000 bytes |
| Train wallclock | ~540 s | ≤ 600 s (OpenAI train cap) |
| Eval phase wallclock | ~720 s end-to-end (includes ~125 s lrzip and ~540 s TTT) | ⚠ 600 s (Track-A eval cap). The run is non-record on eval budget; the train budget is met. |
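The criteria above can be turned into a small check once the H100 numbers exist. The sketch below is an illustration under stated assumptions: check_run and its argument names are hypothetical, the values must be parsed from the run's stdout by the caller, and for artifact size and train wallclock only the hard caps are enforced. The eval wallclock is left out because the run is already expected to exceed that cap.

```python
def check_run(post_ttt_bpb, prequant_bpb, postquant_bpb,
              artifact_bytes, train_seconds):
    """Return the names of the acceptance criteria that the run violates."""
    failures = []
    # BPB targets: (measured value, expected value, tolerance) from the table.
    bpb_targets = {
        "post_ttt_bpb": (post_ttt_bpb, 1.0554, 0.003),
        "prequant_bpb": (prequant_bpb, 1.058, 0.002),
        "postquant_bpb": (postquant_bpb, 1.06739, 0.002),
    }
    for name, (value, expected, tolerance) in bpb_targets.items():
        if abs(value - expected) > tolerance:
            failures.append(name)
    if artifact_bytes > 16_000_000:  # submission size cap
        failures.append("artifact_bytes")
    if train_seconds > 600:          # train wallclock cap
        failures.append("train_seconds")
    return failures


# Projected values pass; swapping in the placeholder artifact size fails the cap.
print(check_run(1.0554, 1.058, 1.06739, 15_981_515, 540))
print(check_run(1.0554, 1.058, 1.06739, 17_196_199, 540))
```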

Cross-references

sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request May 22, 2026