pretraining-training-regime-page#139
…training training-regime identity
… pretraining page bundle]
…ning tradeoffs and stage boundaries]
…n for the pretraining reader journey]
|
Local recheck on 2026-06-20T00:12:00Z (UTC): the current head 370ffb9 is still the expected pretraining slice, PR conversation feedback is still empty, and I re-ran bun run lint, bun run typecheck, and bun run test successfully on this branch. The remaining blocker is GitHub Actions infrastructure state rather than reviewed code: workflow run 27844711779 has stayed queued with every job pending since 2026-06-19T19:29:11Z. I submitted gh run cancel 27844711779, but GitHub still reports the same run as queued, and this repo's ci.yml does not expose workflow_dispatch, so there is no direct manual requeue path from the CLI without a new push. |
|
Local mergeability follow-up on 2026-06-19T19:34:36Z (UTC): I re-ran The previous required CI run was stale with the final aggregate |
|
Mergeability follow-up on 2026-06-21T21:08:20Z (UTC): I re-checked PR #139 on head I also reran |
|
Mergeability follow-up on 2026-06-21T21:25:10Z (UTC): I re-checked PR #139 on head I waited past the workflow's 15-minute window and re-queried run |
|
Mergeability follow-up on 2026-06-21T21:30:10Z (UTC): I re-checked PR #139 on head I reran |
|
Mergeability follow-up on 2026-06-21T21:34:12Z (UTC): I re-checked PR #139 on head I reran the local mergeability gate on the unchanged reviewed content: |
|
Mergeability follow-up on 2026-06-21T21:39:50Z (UTC): I re-checked PR #139 on head I reran the local mergeability gate on the unchanged reviewed content: Because the previous visible CI run was still attached only to older heads and the live head had zero attached check runs, I pushed mergeability-only empty commit |
|
Mergeability follow-up on 2026-06-21T21:51:40Z (UTC): I merged origin/main into this branch to clear the PR mergeable_state dirty blocker and pushed the new head 7c3b389. What changed on top of the already-complete pretraining slice:
Local validation on this head passed before push:
The reviewed pretraining diff is still present on the PR (src/content/docs/training/pretraining/*, src/content/registry/training-regimes/pretraining.json, src/content/registry/graphs/pretraining-training-flow.json, and the focused validation updates), and fresh required CI run 27918642630 is now attached to the live head. |
|
BLOCKING review summary for PR #139 on head Quality-check evidence:
Blocking findings:
Acceptance criteria:
Behavioral-assertion audit for stories marked
Docs-writing standards checklist:
Graphing standards checklist for the added flow graph:
General website standards checklist for this slice:
Required fixes before merge:
|
… pretraining page]
|
Addressed the latest blocking PR conversation feedback on head Mapped fixes:
Validation run on this iteration:
The refreshed PR diff still contains the intended pretraining slice, and GitHub Actions run |
…or the pretraining slice
|
Mergeability follow-up on 2026-06-21T22:21:30Z (UTC): the current PR head failed |
…n for the pretraining slice]
|
Mergeability follow-up on 2026-06-21T22:31:37Z (UTC): I addressed the failing What changed:
Why this was needed:
The reviewed pretraining slice is still present in the PR diff, and fresh CI should now attach to this new head. |
|
Mergeability follow-up on 2026-06-21T22:31:37Z (UTC): after the prior push, PR #139 flipped to What I changed on the new head
The pretraining slice is still present in the PR diff, the merged graph runtime now consistently reports |
|
BLOCKING review summary for PR #139 on head This comment supersedes the earlier blocking content review: the prior GPT-2 bridge, equation-definition, graph title/legend, and message-key blockers appear addressed on the current head. The remaining blockers are new correctness and quality issues introduced by the latest implementation. Quality-check evidence:
Blocking findings:
Required fixes before merge:
Project acceptance criteria:
Behavioral-assertion audit for stories marked
Docs-writing standards checklist:
Graphing standards checklist:
General website standards checklist:
Review-rule application:
|
{
"project": "Model Atlas — Pretraining Training-Regime Canonical Page",
"branchName": "pretraining-training-regime-page",
"description": "Publish one canonical English `pretraining` training-regime page, backed by a canonical registry record, localized messages, and focused discoverability wiring, so readers can follow the GPT-2 and broader model-family journey through the training stage that comes before alignment or deployment.",
"context": {
"customerAsk": "Customer ask alignment: add the canonical English training-regime page for `pretraining` so readers can follow the GPT-2 and broader model-family path through the training stage that comes before alignment or deployment. Treat this as one mergeable training page slice rather than a broad training taxonomy rewrite. Scope: create `src/content/docs/training/pretraining/` with `page.mdx`, `messages/en.json`, and `assets.json`, plus the backing registry record under `src/content/registry/training-regimes/` if it does not already exist; classify it as a `training-regime`; connect it to GPT-2, GPT-3, transformer architecture, tokenization pages, alignment pages such as RLHF and DPO, and any nearby glossary pages that materially strengthen the reader journey; and add only the minimum focused tests/validation needed for the touched registry/content surfaces. The explanation should tell a technical layperson what pretraining is, what objective it usually uses, why scale, data mixture, and compute matter, and how pretraining differs from post-training or alignment. Acceptance criteria: the pretraining page exists as a canonical registry-backed training-regime page on current `main`, it is discoverable from training/model search paths, and the focused touched checks pass.",
"problem": "The site has reader journeys around models, transformer architecture, tokenization, and alignment-adjacent training methods, but it still lacks the broad canonical page for pretraining itself. That leaves a gap in the GPT-2 and GPT-3 learning path: readers can see that models were trained and later aligned, but they do not yet have one stable page that explains the large initial stage where the base model learns general statistical structure from massive token data. Without a canonical `pretraining` page, search and related-doc paths also have no authoritative training-regime destination for this concept.",
"solution": "Ship a canonical `/docs/training/pretraining` page using the established training-regime page contract, an English `messages/en.json`, a colocated `assets.json`, and a canonical `training-regime.pretraining` registry record if missing. Keep the page narrowly focused on what pretraining is, the usual next-token objective, why scale, data mixture, and compute matter, and how pretraining differs from post-training or alignment. Add only the minimum tags, aliases, related-doc inputs, and focused tests needed so GPT-2, GPT-3, transformer architecture, tokenization, RLHF, DPO, and nearby glossary paths can hand readers into and out of this page cleanly."
},
"acceptanceCriteria": [
"A published canonical docs page exists for `pretraining` under the training docs tree with `kind: \"training-regime\"`, a matching canonical registry record, `messages/en.json`, and a colocated `assets.json`.",
"The page is understandable in isolation for a technical layperson and explains what pretraining is, the usual objective it uses, why scale, data mixture, and compute matter, and how it differs from post-training or alignment.",
"Registry-backed discovery surfaces make the page reachable from relevant model, concept, module, glossary, and alignment-adjacent paths, including GPT-2, GPT-3, transformer architecture, tokenization, RLHF, and DPO where those canonical pages already exist.",
"Search surfaces can find the page by title, aliases, and core terms such as `pretraining`, `language model pretraining`, `next-token prediction`, and `base model training`.",
"The implementation stays narrowly scoped to one mergeable training page slice and does not broaden into a training taxonomy rewrite, broad model-page rewrites, or unrelated relationship cleanup.",
"Focused automated coverage proves the new page bundle, registry record, discoverability wiring, and at least one rendered reader-journey handoff for this slice.",
"Quality gate: typecheck, lint, and tests pass."
],
"userStories": [
{
"id": "pretraining-training-regime-page-001",
"title": "Create the canonical pretraining training-regime identity",
"description": "As a reader searching for pretraining, I want one canonical registry-backed training-regime identity so that search, tags, and related-doc surfaces lead me to the right training-stage explainer instead of scattered references inside model or alignment pages.",
"acceptanceCriteria": [
"A canonical `training-regime.pretraining` registry record exists if one is not already published, with `kind: \"training-regime\"`, slug `pretraining`, stable aliases, tags, and training-category metadata appropriate for a broad base-model training page.",
"The record is classified as a `training-regime` and not as a concept, glossary term, or model page.",
"Registry relationships connect the record to GPT-2, GPT-3, transformer architecture, tokenization, RLHF, DPO, and any nearby glossary pages that materially strengthen the reader journey when those canonical targets already exist.",
"Search metadata supports representative queries such as `pretraining`, `language model pretraining`, `base model training`, and `next-token prediction`.",
"Typecheck passes",
"Tests pass"
],
"priority": 1,
"passes": true,
"notes": ""
},
{
"id": "pretraining-training-regime-page-002",
"title": "Publish the canonical pretraining page bundle",
"description": "As a technical layperson learning how language models are made, I want a dedicated pretraining page so that I can understand what the base model learns before alignment, fine-tuning, or deployment discussions begin.",
"acceptanceCriteria": [
"A canonical page exists at `/docs/training/pretraining` with matching frontmatter, `messages/en.json`, and `assets.json`, and the MDX remains structural rather than carrying raw reader-facing prose.",
"The page opens with one concise summary and clearly explains in plain language that pretraining is the large initial learning stage where a model is exposed to massive token sequences and learns general statistical patterns before later behavior shaping.",
"The page explains the usual objective in plain language, including next-token prediction or closely related autoregressive language-model training, without assuming the reader already understands optimization jargon.",
"The page includes the minimum useful visual or equation support for the training loop if the concept is clearer with it, following the training-regime and graphing standards.",
"Typecheck passes",
"Tests pass",
"Verify in browser using the Browser plugin"
],
"priority": 2,
"passes": true,
"notes": ""
},
{
"id": "pretraining-training-regime-page-003",
"title": "Teach the key pretraining tradeoffs and stage boundaries",
"description": "As a reader comparing training stages, I want the page to explain why data scale, data mixture, and compute matter, and how pretraining differs from post-training or alignment, so I can place GPT-2 and similar models in the broader model-development path.",
"acceptanceCriteria": [
"The page clearly explains why scale matters, including that larger parameter counts, more data, and longer training runs can raise capability but also raise cost.",
"The page clearly explains why data mixture matters, including that the model absorbs the patterns present in web text, books, code, or other sources and that mixture choices shape what the base model becomes good at.",
"The page clearly explains why compute matters, including that the training run is limited by hardware time, memory, and optimization budget rather than by ideas alone.",
"The page clearly distinguishes pretraining from post-training or alignment, including that pretraining builds the broad base model while later stages shape instruction following, preference behavior, or product behavior.",
"The rendered page exposes clear handoffs to GPT-2, GPT-3, transformer architecture, tokenization, RLHF, DPO, and any nearby glossary pages used for this slice, without dead-end related-doc sections.",
"Typecheck passes",
"Tests pass",
"Verify in browser using the Browser plugin"
],
"priority": 3,
"passes": true,
"notes": ""
},
{
"id": "pretraining-training-regime-page-004",
"title": "Add focused validation for the pretraining reader journey",
"description": "As a maintainer, I want focused automated proof for the pretraining page slice so that registry, route, search, and reader-journey regressions are caught without expanding into unrelated content audits.",
"acceptanceCriteria": [
"Focused tests or validation confirm the `/docs/training/pretraining` route, the canonical `training-regime.pretraining` record, the default English messages, and the primary asset wiring resolve together.",
"Coverage asserts at least one discoverability outcome for representative pretraining queries and at least one reader-journey outcome tying the page to GPT-2, GPT-3, transformer architecture, tokenization, or alignment-adjacent pages.",
"The touched checks stay limited to registry, content, search, and rendered handoff integrity for this slice rather than broad training-inventory audits or unrelated taxonomy enforcement.",
"Typecheck passes",
"Tests pass"
],
"priority": 4,
"passes": true,
"notes": ""
}
]
}
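
The focused validation described in story `pretraining-training-regime-page-004` could look roughly like the sketch below. It is illustrative only: the `RegistryRecord` shape, the field names (`aliases`, `tags`, `related`), the related slugs such as `gpt-2`, and the `validateRecord` helper are all assumptions made for this example and are not taken from the repository's actual registry schema or test suite. The only values drawn from the PRD are the `training-regime` kind, the `pretraining` slug, the `training-regime.pretraining` identifier, and the representative search queries.

```typescript
// Hypothetical record shape: field names are assumptions, not the repo's real schema.
interface RegistryRecord {
  id: string;
  kind: string;
  slug: string;
  aliases: string[];
  tags: string[];
  related: string[];
}

// Illustrative record; values mirror the PRD wording, not the shipped JSON file.
const pretraining: RegistryRecord = {
  id: "training-regime.pretraining",
  kind: "training-regime",
  slug: "pretraining",
  aliases: ["language model pretraining", "base model training"],
  tags: ["next-token prediction"],
  related: ["gpt-2", "gpt-3", "rlhf", "dpo"], // hypothetical target slugs
};

// Collect every searchable string on the record, lowercased for comparison.
function searchTerms(record: RegistryRecord): string[] {
  return [record.slug, ...record.aliases, ...record.tags].map((t) => t.toLowerCase());
}

// Return a list of human-readable problems; an empty list means the checks pass.
function validateRecord(
  record: RegistryRecord,
  queries: string[],
  requiredRelated: string[],
): string[] {
  const problems: string[] = [];
  if (record.kind !== "training-regime") {
    problems.push(`unexpected kind: ${record.kind}`);
  }
  if (record.id !== `${record.kind}.${record.slug}`) {
    problems.push(`id does not match kind and slug: ${record.id}`);
  }
  const terms = searchTerms(record);
  for (const q of queries) {
    if (!terms.includes(q.toLowerCase())) problems.push(`query not covered: ${q}`);
  }
  for (const target of requiredRelated) {
    if (!record.related.includes(target)) problems.push(`missing relationship: ${target}`);
  }
  return problems;
}

const problems = validateRecord(
  pretraining,
  ["pretraining", "language model pretraining", "next-token prediction", "base model training"],
  ["gpt-2", "gpt-3"],
);
console.log(problems.length === 0 ? "ok" : problems.join("\n"));
```

Returning a list of problems rather than throwing on the first failure keeps the check narrow while still reporting every gap in one run, which fits the PRD's request for focused coverage over broad taxonomy audits.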