Submission: Asymmetric Logit Rescale + cap-fit bit allocation (PR #2140 fork) [draft, pending 8×H100] #2164
Draft
vimeto wants to merge 1 commit into
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request May 22, 2026
…i#2164 new, SOTA 1.05651 locked day 18 https://claude.ai/code/session_019u4yjMahxnGXKzKzYp9gv7
PR #2140 fork + 3 levers + cap-fit (8×H100 SXM submission)
Status: ⏳ still waiting on the 8×H100 SXM RunPod run. As of 2026-05-13, no 8×H100 pods are available. The recipe and env block are already validated on 1×A100 SECURE. The challenge has already ended, and we have been trying to secure a RunPod 8×H100 pod since then without success, so we are posting this draft PR without the H100 numbers rather than waiting indefinitely.
TL;DR
This PR adds `records/submission_2026_05_13_pr2140_h100_capfit/`, a faithful port of upstream PR #2140 with three BPB levers and a critical cap-fit correction.
Why the bundled artifact is currently 17.20 MB (over cap)
The `.ptz` artifact shipped in this PR is a placeholder from our LUMI training run, at 17,196,199 bytes (over the 16 MB submission cap). This is intentional and explained below; the H100 production run will replace it with the cap-compliant artifact.
The LUMI run used the script's silent defaults for two bit-allocation hyperparameters that PR #2140's published env block explicitly sets:
- `EMBED_BITS`: LUMI ran with `8` (script default) instead of PR #2140's `7`, which costs ~487 KB.
- `MLP_CLIP_SIGMAS`: LUMI ran with `10.0` (script default) instead of PR #2140's `11.5`, which costs ~574 KB.
Total cap miss attributable to these two env vars: ~1.06 MB, which accounts for most of the 1.20 MB by which the LUMI artifact exceeds the cap.
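The arithmetic behind that attribution can be checked with a few lines of Python. This is only a sketch using the figures quoted above; it assumes that "16 MB" means 16,000,000 bytes and that 1 KB means 1,000 bytes, neither of which is stated explicitly here.

```python
# Figures quoted in this description.
LUMI_ARTIFACT_BYTES = 17_196_199
EMBED_BITS_COST_KB = 487   # EMBED_BITS=8 instead of 7
MLP_CLIP_COST_KB = 574     # MLP_CLIP_SIGMAS=10.0 instead of 11.5

# Assumptions: decimal units (1 MB = 1,000,000 bytes, 1 KB = 1,000 bytes).
CAP_BYTES = 16_000_000
BYTES_PER_KB = 1_000

overshoot_bytes = LUMI_ARTIFACT_BYTES - CAP_BYTES
attributed_bytes = (EMBED_BITS_COST_KB + MLP_CLIP_COST_KB) * BYTES_PER_KB
unattributed_bytes = overshoot_bytes - attributed_bytes

print(f"overshoot:    {overshoot_bytes / 1e6:.2f} MB")
print(f"attributed:   {attributed_bytes / 1e6:.2f} MB")
print(f"unattributed: {unattributed_bytes / 1e3:.0f} KB")
```

Under those unit assumptions, the two env vars explain most of the overshoot, with a smaller remainder that the per-variable estimates do not cover.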
We validated this empirically on 2026-05-13 with a 1×A100 SECURE pod running `QUANTIZE_ONLY=1` against the LUMI fp32 weights.
The selected `EMBED=7 + MLP_CLIP=11.5` configuration costs roughly +0.002 BPB vs the LUMI defaults but fits under the 16 MB cap with 18 KB margin. The `MLP_CLIP_SIGMAS=10.0 → 11.5` change is counter-intuitive: looser clipping gives better int6 grid usage on MLP weights, which in turn gives a smaller LQER residual. Discovered 2026-05-13 via comparison with `records/upstream_pr2135/`.
Once the 8×H100 RunPod run completes, this PR will be updated with the cap-compliant ~15.98 MB artifact and the placeholder will be replaced; the H100 stdout should then match every anchor in `expected_log.txt`, including the ≤16 MB `Total submission size`.
Architecture lineage
PR #2140 (`SP8192_CaseOps_Progressive3k_ShortDocTTT`) is itself built on PR #2014 (last clean record per upstream Issue #2127 audit) plus token-only n-gram tilt from PR #1145 / PR #2018. Our additions are surgical:
- `ASYM_LOGIT_RESCALE=1` (lever −0.00132 BPB): adds learnable `softcap_pos` / `softcap_neg` scalars; replaces the symmetric tanh logit softcap with an asymmetric variant. Set during BOTH train and TTT (otherwise state_dict mismatch).
- `GPTQ_CALIBRATION_BATCHES=32` (lever −0.00044): doubles PR #2140's default of 16; more Hessian data = better int6 grid choice.
- `GPTQ_RESERVE_SECONDS=2.0` (lever −0.00012): halves PR #2140's default of 4.0; gives the training loop 2 more seconds before GPTQ starts.
Reproduction
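One reproduction-relevant detail from the lever list above: `ASYM_LOGIT_RESCALE=1` changes the logit softcap itself and adds two parameters, which is why it has to be set for both train and TTT. The scalar sketch below shows one plausible reading of "asymmetric tanh softcap". The exact formula in `train_gpt.py` is not given in this description, so the piecewise form, the function names, and the cap values 30.0 / 15.0 are all assumptions, and in the model the two caps would be learnable tensors rather than plain floats.

```python
import math


def symmetric_softcap(logit: float, cap: float) -> float:
    """Symmetric tanh softcap: squashes the logit smoothly towards +/- cap."""
    return cap * math.tanh(logit / cap)


def asymmetric_softcap(logit: float, softcap_pos: float, softcap_neg: float) -> float:
    """Assumed asymmetric variant: a separate cap for each sign of the logit.

    Both branches are 0 at logit == 0 and have slope 1 there, so the
    piecewise function stays continuous with a continuous first derivative.
    """
    cap = softcap_pos if logit >= 0.0 else softcap_neg
    return cap * math.tanh(logit / cap)


if __name__ == "__main__":
    # Hypothetical cap values, chosen only to make the asymmetry visible.
    for x in (-100.0, -5.0, 0.0, 5.0, 100.0):
        print(x, symmetric_softcap(x, 30.0), asymmetric_softcap(x, 30.0, 15.0))
```

With equal caps the asymmetric form reduces to the symmetric one, which makes it easy to sanity-check against a baseline.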
The submission contract is satisfied by a single Python script driving end-to-end (train + GPTQ + AWQ-lite + LQER + serialize + post-quant eval + TTT eval).
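As a rough illustration of the single-command launch, the sketch below assembles the environment overrides named in this description together with the `torchrun` invocation. It only builds and prints the command and does not start a run. `ENV_OVERRIDES` and `build_launch` are hypothetical names, and the dictionary covers only the variables mentioned here; the complete env block lives in `REPRODUCE.md`.

```python
import os
import shlex

SCRIPT = "records/submission_2026_05_13_pr2140_h100_capfit/train_gpt.py"

# Only the variables named in this PR description, not the full env block.
ENV_OVERRIDES = {
    "EMBED_BITS": "7",
    "MLP_CLIP_SIGMAS": "11.5",
    "ASYM_LOGIT_RESCALE": "1",
    "GPTQ_CALIBRATION_BATCHES": "32",
    "GPTQ_RESERVE_SECONDS": "2.0",
}


def build_launch(nproc_per_node=8):
    """Return (command, environment) for a single-node torchrun launch."""
    env = {**os.environ, **ENV_OVERRIDES}
    cmd = ["torchrun", f"--nproc_per_node={nproc_per_node}", SCRIPT]
    return cmd, env


cmd, env = build_launch()
overrides = " ".join(f"{key}={value}" for key, value in ENV_OVERRIDES.items())
print(overrides, shlex.join(cmd))
# To actually launch: subprocess.run(cmd, env=env, check=True)
```

Keeping the overrides in one dictionary makes it harder to fall back silently to script defaults, which is the failure mode that produced the over-cap LUMI artifact.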
The wrapper handles HF data download (`romeerp/parameter-golf-caseops-v1`, Issue #2127 audit-clean), an optional throughput smoke test, and a single `torchrun --nproc_per_node=8 records/submission_2026_05_13_pr2140_h100_capfit/train_gpt.py`. Full env block in `REPRODUCE.md`. Wallclock: ~22 min (~540 s train, ~720 s eval phase).
Files in this PR
- `records/submission_2026_05_13_pr2140_h100_capfit/train_gpt.py`: `specs/batch200_frontier/run_pr2140_ref.py`. PR #2140 + 3 levers + LUMI compat patches (no-ops on H100).
- `records/.../prepare_caseops_data.py`: `romeerp/parameter-golf-caseops-v1`
- `records/.../online_ngram_tilt.py` + `online_ngram_state.c`
- `records/.../submission.json`: `val_bpb` / `val_bpb_postquant` / `artifact_bytes` fields hold projected targets pending the H100 run.
- `records/.../README.md`
- `records/.../expected_log.txt`
- `records/.../lumi_logs.txt`
- `records/.../final_model.int6.ptz`
Validation (regex-match against `expected_log.txt`)
When the H100 run completes, match its stdout against the anchor lines in `expected_log.txt`. The load-bearing lines to check:
- `embed_bits: 7` (a value of `8` → 487 KB over cap)
- `mlp_clip_sigmas: 11.5` (a value of `10.0` → 574 KB over cap)
- `gptq_calibration_batches: 32`
- `gptq_reserve_seconds: 2.0`
- `softcap_pos` / `softcap_neg` in `passthrough (float16)`
- `gptq (int7)+lqer_asym: tok_emb.weight`
- `stopping_early ... step:`
- `diagnostic pre-quantization ... val_bpb:`
- `diagnostic quantized ... val_bpb:`
- `Total submission size quantized+pergroup:`
- `quantized_ttt_phased ... val_bpb:`
`lumi_logs.txt` is the negative reference: it uses the pre-cap-fit env (`embed_bits=8`, `mlp_clip_sigmas=10.0`, 17.20 MB artifact), so a correct H100 run should differ from it on exactly those cap-fit anchors.
Acceptance criteria for the H100 run
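The anchor comparison from the Validation section could be automated along the following lines. This is a sketch under stated assumptions, not the project's checker: `check_log` is a hypothetical helper, the 16 MB cap is taken to mean 16,000,000 bytes, the size line is assumed to carry an integer byte count after the colon, and both sample logs are made up for illustration.

```python
import re

CAP_BYTES = 16_000_000  # assumption: "16 MB" means 16,000,000 bytes

# Cap-fit and lever anchors from the Validation list, written as regexes.
REQUIRED_ANCHORS = [
    r"embed_bits:\s*7\b",
    r"mlp_clip_sigmas:\s*11\.5\b",
    r"gptq_calibration_batches:\s*32\b",
    r"gptq_reserve_seconds:\s*2\.0\b",
]
# Assumption: the byte count follows the colon as a plain integer.
SIZE_LINE = re.compile(r"Total submission size quantized\+pergroup:\s*(\d+)")


def check_log(text):
    """Return a list of problems found in the log; an empty list means it passes."""
    problems = [f"missing anchor: {pat}" for pat in REQUIRED_ANCHORS
                if not re.search(pat, text)]
    match = SIZE_LINE.search(text)
    if match is None:
        problems.append("missing size line")
    elif int(match.group(1)) > CAP_BYTES:
        problems.append(f"artifact over cap: {match.group(1)} bytes")
    return problems


# Made-up sample logs, for illustration only.
GOOD_LOG = """embed_bits: 7
mlp_clip_sigmas: 11.5
gptq_calibration_batches: 32
gptq_reserve_seconds: 2.0
Total submission size quantized+pergroup: 15982000
"""
BAD_LOG = GOOD_LOG.replace("embed_bits: 7", "embed_bits: 8") \
                  .replace("mlp_clip_sigmas: 11.5", "mlp_clip_sigmas: 10.0") \
                  .replace("15982000", "17196199")

print(check_log(GOOD_LOG))
print(check_log(BAD_LOG))
```

The remaining anchors (`stopping_early ... step:`, the `val_bpb:` lines) would need patterns taken from `expected_log.txt` itself, since their full format is not spelled out here.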
Cross-references
- `REPRODUCE.md` (single source of truth)
- `RESEARCH_LUMI_NCCL_MISALIGNMENT.md` (why LUMI infrastructure could not produce a cap-fit run end-to-end)
- `records/locked_1.05320_s42/` (the LUMI logs the placeholder artifact comes from)