Submission: Asymmetric Logit Rescale + cap-fit bit allocation (PR #2140 fork) [draft, pending 8×H100] by vimeto · Pull Request #2164 · openai/parameter-golf

vimeto · 2026-05-22T10:53:33Z

PR #2140 fork + 3 levers + cap-fit (8×H100 SXM submission)

Status: ⏳ still waiting on the 8×H100 SXM RunPod run. As of 2026-05-13, 8×H100 pods were essentially unavailable, and they have stayed unavailable ever since. The recipe and env block are already validated on 1×A100 SECURE. The challenge has already ended, so rather than wait indefinitely we are opening this draft PR now, without the actual H100 numbers.

TL;DR

upstream PR #2140 (s42 ref: 1.05591)
                  + ASYM_LOGIT_RESCALE=1     (−0.00132 BPB)
                  + GPTQ_CALIBRATION_BATCHES=32  (−0.00044)
                  + GPTQ_RESERVE_SECONDS=2.0     (−0.00012)
                  + EMBED_BITS=7                  (cap-fit: −487 KB)
                  + MLP_CLIP_SIGMAS=11.5          (cap-fit: −574 KB)
                  ─────────────────────────
                  ≈ 1.0554 BPB projected, 15,981,515 bytes (under 16 MB cap)

This PR adds records/submission_2026_05_13_pr2140_h100_capfit/, a faithful port of upstream PR #2140 with three BPB levers and a critical cap-fit correction.

Why the bundled artifact is currently 17.20 MB (over cap)

The .ptz artifact shipped in this PR is a placeholder from our LUMI training run, at 17,196,199 bytes (over the 16 MB submission cap). This is intentional and explained below; the H100 production run will replace it with the cap-compliant artifact.

The LUMI run used the script's silent defaults for two bit-allocation hyperparameters that PR #2140's published env block explicitly sets:

EMBED_BITS: LUMI ran with 8 (script default) instead of PR #2140's 7, which costs ~487 KB
MLP_CLIP_SIGMAS: LUMI ran with 10.0 (script default) instead of PR #2140's 11.5, which costs ~574 KB

Total cap miss attributable to these two env vars: ~1.06 MB, which accounts for most of the ~1.20 MB by which the LUMI artifact exceeds the cap.
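
The attribution can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the byte counts quoted in this PR and assumes decimal units (1 KB = 1,000 bytes, 1 MB = 1,000,000 bytes); the variable names are illustrative and do not come from the training script.

```python
# Byte counts quoted in this PR; decimal units are an assumption.
CAP_BYTES = 16_000_000
LUMI_ARTIFACT_BYTES = 17_196_199

# Approximate per-variable costs from the A100 QUANTIZE_ONLY comparison.
EMBED_BITS_COST_KB = 487        # EMBED_BITS 8 instead of 7
MLP_CLIP_SIGMAS_COST_KB = 574   # MLP_CLIP_SIGMAS 10.0 instead of 11.5

overshoot_bytes = LUMI_ARTIFACT_BYTES - CAP_BYTES
explained_bytes = (EMBED_BITS_COST_KB + MLP_CLIP_SIGMAS_COST_KB) * 1_000
unexplained_bytes = overshoot_bytes - explained_bytes

print(f"over cap:    {overshoot_bytes:,} bytes")
print(f"explained:   {explained_bytes:,} bytes")
print(f"unexplained: {unexplained_bytes:,} bytes")
```

Under these assumptions the two variables explain about 1.06 MB of the roughly 1.20 MB overshoot; the remainder is not attributed to either variable in this PR.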

We validated this empirically on 2026-05-13 with a 1×A100 SECURE pod running QUANTIZE_ONLY=1 against the LUMI fp32 weights:

| Config | Artifact size | Pre-TTT BPB | Est. post-TTT BPB |
|---|---|---|---|
| EMBED=8, MLP_CLIP=10.0 (LUMI defaults) | 17.04 MB | 1.06532 | ~1.0533 |
| EMBED=7, MLP_CLIP=10.0 | 16.56 MB | 1.06619 | ~1.0542 |
| EMBED=6, MLP_CLIP=10.0 | 16.10 MB | 1.06978 | ~1.0578 |
| EMBED=7, MLP_CLIP=11.5 ⭐ | 15.98 MB | 1.06739 | ~1.0554 |

The selected EMBED=7 + MLP_CLIP=11.5 configuration costs about 0.002 BPB relative to the LUMI defaults but fits under the 16 MB cap with an ~18 KB margin. The MLP_CLIP_SIGMAS=10.0 → 11.5 change is counter-intuitive: looser clipping gives better int6 grid usage on the MLP weights, which in turn leaves a smaller LQER residual. Discovered 2026-05-13 via comparison with records/upstream_pr2135/.
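
The selection rule implied by the table is "lowest estimated post-TTT BPB among the configurations that fit under the cap". A minimal sketch of that rule follows, with the rows copied from the table above; the tuple layout and names are illustrative, not taken from the repository.

```python
CAP_MB = 16.00  # submission cap in decimal megabytes, as used in the table

# (label, artifact size in MB, estimated post-TTT BPB), copied from the table.
candidates = [
    ("EMBED=8, MLP_CLIP=10.0", 17.04, 1.0533),
    ("EMBED=7, MLP_CLIP=10.0", 16.56, 1.0542),
    ("EMBED=6, MLP_CLIP=10.0", 16.10, 1.0578),
    ("EMBED=7, MLP_CLIP=11.5", 15.98, 1.0554),
]

# Keep only configurations that fit, then take the lowest estimated BPB.
fitting = [c for c in candidates if c[1] <= CAP_MB]
best = min(fitting, key=lambda c: c[2])

# Margin of the projected artifact size quoted in the TL;DR.
margin_bytes = 16_000_000 - 15_981_515

print(best[0], f"margin={margin_bytes:,} bytes")
```

Only one of the four measured configurations fits, so the choice is forced by the cap rather than by BPB.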

Once the 8×H100 RunPod run completes, this PR will be updated with the cap-compliant ~15.98 MB artifact and the placeholder will be replaced; the H100 stdout should then match every anchor in expected_log.txt, including the ≤16 MB Total submission size.

Architecture lineage

PR #2140 (SP8192_CaseOps_Progressive3k_ShortDocTTT) is itself built on PR #2014 (last clean record per upstream Issue #2127 audit) plus token-only n-gram tilt from PR #1145 / PR #2018. Our additions are surgical:

ASYM_LOGIT_RESCALE=1 (lever −0.00132 BPB): adds learnable softcap_pos/softcap_neg scalars and replaces the symmetric tanh logit softcap with an asymmetric variant. Must be set during BOTH train and TTT (otherwise state_dict mismatch).
GPTQ_CALIBRATION_BATCHES=32 (lever −0.00044): doubles PR #2140's default of 16; more Hessian data gives a better int6 grid choice.
GPTQ_RESERVE_SECONDS=2.0 (lever −0.00012): halves PR #2140's default of 4.0; gives the training loop 2 more seconds before GPTQ starts.
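
To make the first lever concrete, here is a minimal sketch of one way an asymmetric tanh softcap could look. The functional form is an assumption: this PR text names the softcap_pos/softcap_neg scalars but does not show the formula, and in the real script they would be learnable tensor parameters rather than plain floats. The cap values below are hypothetical.

```python
import math

def symmetric_softcap(logit: float, cap: float) -> float:
    """Symmetric tanh softcap: the output lies in [-cap, cap]."""
    return cap * math.tanh(logit / cap)

def asymmetric_softcap(logit: float, cap_pos: float, cap_neg: float) -> float:
    """Assumed asymmetric variant: a separate cap for each sign of the logit.

    Both branches have slope 1 at zero, so the function stays continuous
    and smooth where the two caps meet.
    """
    cap = cap_pos if logit >= 0.0 else cap_neg
    return cap * math.tanh(logit / cap)

# Hypothetical cap values, chosen only for illustration.
for x in (-100.0, -5.0, 0.0, 5.0, 100.0):
    print(x, round(asymmetric_softcap(x, cap_pos=30.0, cap_neg=20.0), 4))
```

With equal caps this reduces to the symmetric softcap, which is one reason the two extra scalars have to exist in both the train and TTT graphs: a checkpoint saved with them cannot be loaded into a model built without them.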

Reproduction

The submission contract is satisfied by a single Python script driving end-to-end (train + GPTQ + AWQ-lite + LQER + serialize + post-quant eval + TTT eval).

# 8×H100 SXM pod, vimetoivonen/pgolf:pr2140 image
git clone https://github.com/<your-fork>/parameter-golf.git /workspace/repo
cd /workspace/repo
bash scripts/runpod_pr2140_repro.sh

The wrapper handles HF data download (romeerp/parameter-golf-caseops-v1 — Issue #2127 audit-clean), optional throughput smoke, and a single torchrun --nproc_per_node=8 records/submission_2026_05_13_pr2140_h100_capfit/train_gpt.py. Full env block in REPRODUCE.md. Wallclock: ~22 min (~540 s train, ~720 s eval phase).
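
Because the LUMI cap miss came from silent script defaults, it may help to assert the overrides before launching. The sketch below covers only the five variables named in this PR (REPRODUCE.md holds the full env block) and does not reproduce scripts/runpod_pr2140_repro.sh; the helper names are hypothetical.

```python
import os

# The five overrides named in this PR; an illustrative subset of the env block.
REQUIRED_ENV = {
    "ASYM_LOGIT_RESCALE": "1",
    "GPTQ_CALIBRATION_BATCHES": "32",
    "GPTQ_RESERVE_SECONDS": "2.0",
    "EMBED_BITS": "7",
    "MLP_CLIP_SIGMAS": "11.5",
}

TRAIN_SCRIPT = "records/submission_2026_05_13_pr2140_h100_capfit/train_gpt.py"

def build_launch(base_env):
    """Return (env, argv) for the torchrun call with the overrides applied."""
    env = dict(base_env)
    env.update(REQUIRED_ENV)
    argv = ["torchrun", "--nproc_per_node=8", TRAIN_SCRIPT]
    return env, argv

def missing_overrides(env):
    """List overrides that are unset or differ, so silent defaults are caught."""
    return [key for key, value in REQUIRED_ENV.items() if env.get(key) != value]

env, argv = build_launch(os.environ)
print(" ".join(argv), "| missing:", missing_overrides(env))
```

The returned pair can be handed to `subprocess.run(argv, env=env, check=True)` on the pod; calling `missing_overrides` on the environment of a hand-written launch script flags the EMBED_BITS and MLP_CLIP_SIGMAS omissions described earlier.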

Files in this PR

| File | Role |
|---|---|
| `records/submission_2026_05_13_pr2140_h100_capfit/train_gpt.py` | The fork (4990 lines). Verbatim copy of `specs/batch200_frontier/run_pr2140_ref.py`. PR #2140 + 3 levers + LUMI compat patches (no-ops on H100). |
| `records/.../prepare_caseops_data.py` | HuggingFace dataset prep (`romeerp/parameter-golf-caseops-v1`). |
| `records/.../online_ngram_tilt.py` + `online_ngram_state.c` | Token-only n-gram tilt helper. clang-aware compile fallback. |
| `records/.../submission.json` | Structured metadata: recipe identity, lineage, projected BPB targets + tolerances. The `val_bpb`/`val_bpb_postquant`/`artifact_bytes` fields hold projected targets pending the H100 run. |
| `records/.../README.md` | Full context, lineage diagram, hardware/software, reproduction. |
| `records/.../expected_log.txt` | Reference log of what the H100 run is expected to produce. Regex-match the actual stdout against the anchor lines here. |
| `records/.../lumi_logs.txt` | The exact 8×MI250X train + TTT logs that produced the placeholder artifact (val_bpb 1.05320, 17.20 MB). Provenance reference. |
| `records/.../final_model.int6.ptz` | Placeholder (8×MI250X artifact at 17.20 MB). Replaced by the H100 cap-fit artifact when the production run completes. |

Validation (regex-match against `expected_log.txt`)

When the H100 run completes, match its stdout against the anchor lines in expected_log.txt. The load-bearing lines to check:

| Anchor | Expected | Why |
|---|---|---|
| `embed_bits: 7` | exactly 7 | Cap-fit lever; `8` adds ~487 KB and pushes the artifact over cap |
| `mlp_clip_sigmas: 11.5` | exactly 11.5 | Cap-fit lever; `10.0` adds ~574 KB and pushes the artifact over cap |
| `gptq_calibration_batches: 32` | 32 | Lever −0.00044 BPB |
| `gptq_reserve_seconds: 2.0` | 2.0 | Lever −0.00012 BPB |
| `softcap_pos` / `softcap_neg` in `passthrough (float16)` | present | ASYM_LOGIT_RESCALE evidence (lever −0.00132) |
| `gptq (int7)+lqer_asym: tok_emb.weight` | int7 (not int8) | Direct consequence of EMBED_BITS=7 |
| `stopping_early ... step:` | ~5067 (±800) | Wallclock-capped, not iter-capped |
| `diagnostic pre-quantization ... val_bpb:` | ~1.058 ±0.002 | Pre-quant target |
| `diagnostic quantized ... val_bpb:` | ~1.06739 ±0.002 | Post-quant pre-TTT (A100-validated) |
| `Total submission size quantized+pergroup:` | ≤ 16,000,000 bytes | The cap; expected ~15,981,515 |
| `quantized_ttt_phased ... val_bpb:` | ~1.0554 ±0.003 | Final result (seed lottery) |

lumi_logs.txt is the negative reference — it uses the pre-cap-fit env (embed_bits=8, mlp_clip_sigmas=10.0, 17.20 MB artifact), so a correct H100 run should differ from it on exactly those cap-fit anchors.
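
The anchor check can be scripted. The sketch below covers the exact-match anchors and the size cap only; the patterns are derived from the anchor strings in the table, and the precise log layout (spacing, trailing units) is an assumption that should be adjusted to whatever expected_log.txt actually contains. The sample log text is hypothetical.

```python
import re

# Patterns derived from the anchor table; the exact log layout is an assumption.
ANCHORS = {
    "embed_bits": r"embed_bits:\s*7\b",
    "mlp_clip_sigmas": r"mlp_clip_sigmas:\s*11\.5\b",
    "gptq_calibration_batches": r"gptq_calibration_batches:\s*32\b",
    "gptq_reserve_seconds": r"gptq_reserve_seconds:\s*2\.0\b",
    "tok_emb_int7": r"gptq \(int7\)\+lqer_asym:\s*tok_emb\.weight",
}
SIZE_PATTERN = r"Total submission size quantized\+pergroup:\s*(\d[\d,]*)"
CAP_BYTES = 16_000_000

def check_log(log_text):
    """Return {anchor name: passed} for the exact-match anchors and the cap."""
    results = {name: re.search(pattern, log_text) is not None
               for name, pattern in ANCHORS.items()}
    size = re.search(SIZE_PATTERN, log_text)
    results["size_under_cap"] = (
        size is not None and int(size.group(1).replace(",", "")) <= CAP_BYTES
    )
    return results

# Hypothetical stdout excerpt, for illustration only.
sample = (
    "embed_bits: 7\nmlp_clip_sigmas: 11.5\n"
    "gptq_calibration_batches: 32\ngptq_reserve_seconds: 2.0\n"
    "gptq (int7)+lqer_asym: tok_emb.weight\n"
    "Total submission size quantized+pergroup: 15981515 bytes\n"
)
print(check_log(sample))
```

Run against lumi_logs.txt, a checker of this shape would be expected to fail on the `embed_bits`, `mlp_clip_sigmas`, int7, and size anchors, which is what makes that file useful as a negative reference.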

Acceptance criteria for the H100 run

| Metric | Expected | Hard cap |
|---|---|---|
| Post-TTT val_bpb | ~1.0554 ± 0.003 | n/a |
| Pre-quant val_bpb | ~1.058 ± 0.002 | n/a |
| Post-quant pre-TTT val_bpb | ~1.06739 ± 0.002 | n/a |
| Artifact size | ~15,981,515 ± 200,000 bytes | ≤ 16,000,000 bytes |
| Train wallclock | ~540 s | ≤ 600 s (OpenAI train cap) |
| Eval phase wallclock | ~540 s | ⚠ Track-A eval cap is 600 s; expected ~720 s end-to-end, including ~125 s lrzip + ~540 s TTT, so this is a non-record on eval budget. Train budget is met. |
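
The tolerance bands and the hard caps are separate checks: the ±200,000-byte band on artifact size extends above 16,000,000 bytes, so a run can be "within tolerance" and still over cap. A minimal sketch that keeps the two apart, using the numbers from the table above; the dictionary keys are illustrative and are not the field names of submission.json.

```python
# (expected value, tolerance) copied from the acceptance table.
TARGETS = {
    "post_ttt_val_bpb": (1.0554, 0.003),
    "pre_quant_val_bpb": (1.058, 0.002),
    "post_quant_val_bpb": (1.06739, 0.002),
    "artifact_bytes": (15_981_515, 200_000),
}
# Hard limits that apply regardless of the tolerance bands.
HARD_CAPS = {"artifact_bytes": 16_000_000, "train_seconds": 600}

def evaluate(run):
    """Return {check name: passed} for tolerance bands and hard caps."""
    report = {}
    for name, (expected, tolerance) in TARGETS.items():
        report[name] = abs(run[name] - expected) <= tolerance
    for name, cap in HARD_CAPS.items():
        report[name + "_hard_cap"] = run[name] <= cap
    return report

# Projected values from this PR, used here as a hypothetical run result.
projected = {
    "post_ttt_val_bpb": 1.0554,
    "pre_quant_val_bpb": 1.058,
    "post_quant_val_bpb": 1.06739,
    "artifact_bytes": 15_981_515,
    "train_seconds": 540,
}
print(evaluate(projected))
```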

Cross-references

REPRODUCE.md (single source of truth)
RESEARCH_LUMI_NCCL_MISALIGNMENT.md (why LUMI infrastructure could not produce a cap-fit run end-to-end)
records/locked_1.05320_s42/ (the LUMI logs the placeholder artifact comes from)
Upstream PR #2140 "Corrected: PR #2014 stack + LeakyReLU 0.3 + token-only in-timer n-gram TTT (val_bpb 1.0570)", Issue #2127 "Train/val data leakage in CaseOps records — prepare_caseops_data.py default overlaps 80% of val docs with training data", PR #1145 "Record: 1.1109 BPB FullGPTQ XSA11 + online (legal) ngram augment"

Add submission: Asymmetric Logit Rescale + cap-fit bit allocation (PR openai#2140 fork) (commit 7b27177)

sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request May 22, 2026

research(2026-05-22): daily research log — cap-fit bit alloc PR openai#2164 new, SOTA 1.05651 locked day 18 https://claude.ai/code/session_019u4yjMahxnGXKzKzYp9gv7 (commit df43b09)

vimeto wants to merge 1 commit into openai:main from vimeto:submission/2026-05-13-asymlogit-capfit
