tokenizer-mismatch-module-page#123
|
Completed story tokenizer-mismatch-module-page-004. Added a focused tokenizer-mismatch runtime contract test that verifies the canonical route, published registry classification, English messages, search aliases, and tokenization tag discovery together. Also fixed mergeability-only test contracts revealed by the full suite by updating the module sidebar inventory plus bundled graph/table runtime inventories for the new tokenizer-mismatch assets. Local verification for this iteration: make lint, make typecheck, make validate-data, make test. |
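For reviewers who want a mental model of what that combined contract check covers, here is a minimal sketch. It is not the test added in this PR: the in-memory `REGISTRY` and `MESSAGES_EN` stand-ins, the `contract_problems` helper, and the `status`, `aliases`, and `tags` field names are all hypothetical. Only the record id, the slug, `moduleType: tokenizer`, and the route come from the story text.

```python
# Hypothetical stand-ins for the registry and messages/en.json.
# Field names other than the id, slug, and moduleType are assumptions.
REGISTRY = {
    "module.tokenizer-mismatch": {
        "slug": "tokenizer-mismatch",
        "moduleType": "tokenizer",
        "status": "published",
        "aliases": ["wrong tokenizer", "special token mismatch"],
        "tags": ["tokenization"],
    }
}
MESSAGES_EN = {"module.tokenizer-mismatch": {"title": "Tokenizer mismatch"}}


def contract_problems(record_id, registry, messages):
    """Return a list of contract violations; an empty list means the contract holds."""
    record = registry.get(record_id)
    if record is None:
        return [f"missing registry record: {record_id}"]
    problems = []
    if record.get("status") != "published":
        problems.append("record is not published")
    if record.get("moduleType") != "tokenizer":
        problems.append("moduleType is not 'tokenizer'")
    route = "/docs/modules/" + str(record.get("slug"))
    if route != "/docs/modules/tokenizer-mismatch":
        problems.append(f"unexpected route: {route}")
    if record_id not in messages:
        problems.append("no English messages for record")
    if "wrong tokenizer" not in record.get("aliases", []):
        problems.append("search alias 'wrong tokenizer' missing")
    if "tokenization" not in record.get("tags", []):
        problems.append("tokenization tag missing")
    return problems


# The healthy fixture passes; a slug regression is reported as a route problem.
ok = contract_problems("module.tokenizer-mismatch", REGISTRY, MESSAGES_EN)
broken = {
    "module.tokenizer-mismatch": {
        **REGISTRY["module.tokenizer-mismatch"],
        "slug": "tok-mismatch",
    }
}
regressions = contract_problems("module.tokenizer-mismatch", broken, MESSAGES_EN)
```

Checking route, classification, messages, alias, and tag in one place means a regression in any single surface fails the same focused check, without relying on inventory snapshots.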
|
BLOCKING Primary findings:
Quality checks:
Project acceptance criteria:
User story acceptance criteria:
Website standards:
Docs writing standards:
Graphing standards:
Required fixes before this can merge:
|
38598bd to 149af60
|
Addressed the blocking feedback on the current PR head.
Fixed item 1: removed the false GPT-2 release attribution from
Fixed item 2: replaced the unsupported GPT-2-only citation set with four references that support the actual claims on the page:
Fixed item 3: rebased the branch onto current
Local verification on the rebased head:
|
…match as a first-class tokenization module]
…nizer mismatch explainer page
… discoverable from nearby tokenization and model-input paths]
…for the tokenizer mismatch page contract]
…for the tokenizer mismatch page contract]
…for the tokenizer mismatch page contract]
149af60 to ee830ab
|
Addressed the follow-up mergeability failure on the current PR head. GitHub
I rebased the branch onto current
The PR head is now |
|
BLOCKING — reviewed PR #123 on head Primary findings:
Quality checks:
Project acceptance criteria:
User story acceptance criteria:
Website standards:
Docs writing standards:
Graphing standards:
Required fixes before merge:
|
{
"project": "Tokenizer Mismatch Canonical Page",
"branchName": "tokenizer-mismatch-module-page",
"description": "Add one canonical English docs page for tokenizer mismatch so readers can understand why the wrong tokenizer, special-token handling, or prompt-format assumptions cause practical failures or degraded model behavior, and find that explanation through the tokenizer-family discovery surfaces.",
"context": {
"customerAsk": "Customer ask alignment: add a canonical English page for `tokenizer mismatch`, which is explicitly called for in the roadmap bundle but is still absent from the shipped docs tree. Treat this as one narrow tokenization-module slice on current `main`. Scope: create the correct canonical page kind and route for `tokenizer mismatch` according to the current standards, including any missing registry record, `page.mdx`, `messages/en.json`, and `assets.json`; connect it to existing tokenization, model, and serving pages such as `tokenizers-overview`, `bpe`, `wordpiece`, `sentencepiece`, `special-tokens`, and `embedding`; and add only the minimum focused validation/tests needed for the touched surfaces. The prose should explain in simple language what tokenizer mismatch is, why it causes practical failures or degraded behavior, and where readers will see it in real model usage. Keep the slice English-only and avoid broad tokenization-family rewrites. Acceptance criteria: the canonical tokenizer-mismatch page exists on current `main`, is registry-backed and discoverable, and focused quality gates for the touched content surfaces pass.",
"problem": "The roadmap already calls for a `tokenizer mismatch` page, but the shipped docs tree still does not provide one. Readers can learn what tokens, embeddings, special tokens, BPE, WordPiece, and SentencePiece are, yet they do not have a dedicated page explaining the practical failure mode that appears when a model is paired with the wrong tokenizer, different special-token handling, or text that is segmented differently from what the model was trained to expect. That leaves a discovery gap across the tokenizer family and nearby model-input or serving paths.",
"solution": "Publish `tokenizer-mismatch` as a canonical `module` page in the tokenization family, backed by a valid `module.tokenizer-mismatch` registry record if missing, English-localized messages, and focused discovery metadata that ties it to `tokenizers-overview`, `bpe`, `wordpiece`, `sentencepiece`, `special-tokens`, `embedding`, and representative model or serving pages where those relationships are already justified. Add only narrow behavioral validation for the route, registry linkage, and discovery surfaces touched by this page."
},
"acceptanceCriteria": [
"A published `/docs/modules/tokenizer-mismatch` page exists on current `main` with canonical module-page structure and plain-language copy explaining what tokenizer mismatch is, why it matters, and where readers see it in practice.",
"A valid published registry record exists for `module.tokenizer-mismatch` with tokenization-family classification, aliases, tags, summary metadata, and related-document fields aligned with project docs and the data model.",
"The page and registry-backed discovery metadata connect `tokenizer-mismatch` to `tokenizers-overview`, `bpe`, `wordpiece`, `sentencepiece`, `special-tokens`, and `embedding`, plus only those representative model or serving pages whose relationship is already justified by shipped content.",
"Search documents and at least one relevant tag or related-doc discovery surface can return or surface the canonical tokenizer mismatch page for representative queries such as `tokenizer mismatch`, `wrong tokenizer`, or `special token mismatch`.",
"The landed slice remains English-only and does not reopen unrelated tokenization-family rewrites, locale infrastructure, or non-touched content bundles.",
"Quality gate: make typecheck, make lint, and focused tests for the touched page, messages, registry, and discovery behavior pass."
],
"userStories": [
{
"id": "tokenizer-mismatch-module-page-001",
"title": "Establish tokenizer mismatch as a first-class tokenization module",
"description": "As a reader browsing tokenizer topics, I want tokenizer mismatch classified as a canonical tokenization page so that related docs and taxonomy surfaces place it next to the right neighboring topics.",
"acceptanceCriteria": [
"A published registry record exists for `module.tokenizer-mismatch` with stable id, canonical slug `tokenizer-mismatch`, `moduleType: tokenizer`, tokenization-family grouping, valid aliases, tags, summary fields, and reviewer-checkable related-document metadata.",
"Registry classification follows `docs/documentation-site-pages-needed.md` by treating `tokenizer mismatch` as a module-family tokenization page rather than a broad concept page or glossary entry.",
"Registry relationships connect `tokenizer-mismatch` to `tokenizers-overview`, `bpe`, `wordpiece`, `sentencepiece`, `special-tokens`, and `embedding`, plus representative model or serving pages only where those relationships are concrete and non-duplicative.",
"The slice does not broaden into unrelated tokenization taxonomy cleanup outside the metadata required to land this page cleanly.",
"Typecheck passes",
"Tests pass"
],
"priority": 1,
"passes": true,
"notes": ""
},
{
"id": "tokenizer-mismatch-module-page-002",
"title": "Publish the canonical tokenizer mismatch explainer page",
"description": "As a technical layperson learning how models consume text, I want a dedicated tokenizer mismatch page so that I can understand why using the wrong tokenizer causes practical failures or degraded behavior.",
"acceptanceCriteria": [
"A canonical module page exists at `/docs/modules/tokenizer-mismatch` with matching frontmatter, `messages/en.json`, and any required local `assets.json`.",
"The page opens with one folded `openingSummary` and explains tokenizer mismatch in plain language before narrowing into model-specific or serving-specific examples.",
"The page explains that a mismatch can come from using the wrong vocabulary, different token-splitting rules, incompatible special tokens, or prompt formatting assumptions that do not match the model's training or serving setup.",
"The page explains reader-visible consequences such as shifted token counts, broken chat-template boundaries, degraded completions, or failures around special tokens and embeddings, without turning into a benchmark page or implementation tutorial.",
"Customer-facing copy is English-only, understandable in isolation, and free of process or authoring meta language.",
"Typecheck passes",
"Tests pass",
"Verify in browser using the Browser plugin"
],
"priority": 2,
"passes": true,
"notes": ""
},
{
"id": "tokenizer-mismatch-module-page-003",
"title": "Make tokenizer mismatch discoverable from nearby tokenization and model-input paths",
"description": "As a reader who searches for tokenization problems or lands on nearby pages, I want discovery surfaces to route me into tokenizer mismatch so that I can find the explanation without already knowing the slug.",
"acceptanceCriteria": [
"Search documents include `/docs/modules/tokenizer-mismatch` with title, aliases, tags, and core body terms that let representative queries such as `tokenizer mismatch`, `wrong tokenizer`, `prompt token mismatch`, and `special token mismatch` find the page.",
"The canonical page renders tag and related-doc surfaces that connect it to `tokenizers-overview`, `bpe`, `wordpiece`, `sentencepiece`, `special-tokens`, and `embedding`, plus representative model or serving pages where those canonical targets exist and are contextually justified.",
"At least one neighboring shipped discovery surface or related-doc path can lead readers into `tokenizer-mismatch` without typing the route directly.",
"Browser-visible rendering shows the page title, summary, tags, and related-doc links without missing-content placeholders or broken links.",
"Typecheck passes",
"Tests pass",
"Verify in browser using the Browser plugin"
],
"priority": 3,
"passes": true,
"notes": ""
},
{
"id": "tokenizer-mismatch-module-page-004",
"title": "Add focused validation for the tokenizer mismatch page contract",
"description": "As a maintainer, I want targeted automated proof for the tokenizer mismatch slice so that route, registry, message, and discovery regressions are caught without unrelated test expansion.",
"acceptanceCriteria": [
"Focused validation or test coverage confirms the `tokenizer-mismatch` page route, registry record, and default English messages resolve together.",
"Coverage asserts at least one page-specific related-doc, tag, or search expectation for `tokenizer-mismatch`.",
"Coverage fails if the page loses its canonical route, registry linkage, tokenization-family classification, or representative discovery visibility.",
"Validation stays scoped to observable behavior for this page slice and does not require unrelated inventory snapshots, locale-manifest churn, or repository-wide taxonomy counts.",
"Typecheck passes",
"Tests pass"
],
"priority": 4,
"passes": true,
"notes": ""
}
]
}
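
As a quick illustration of the failure mode the stories above ask the page to explain, the following toy sketch shows two of the listed causes: a wrong vocabulary and different token-splitting rules. Both vocabularies and all helper names are invented for this example and do not correspond to any real tokenizer or to code in this PR.

```python
# Toy illustration only: two hypothetical vocabularies that assign
# different ids to the same tokens. Not a real tokenizer.
VOCAB_A = {"<bos>": 0, "hello": 1, "world": 2, "<unk>": 3}
VOCAB_B = {"<unk>": 0, "world": 1, "<bos>": 2, "hello": 3}


def encode(text, vocab):
    """Prepend the start token, then map whitespace-split words to ids."""
    ids = [vocab["<bos>"]]
    for word in text.split():
        ids.append(vocab.get(word, vocab["<unk>"]))
    return ids


def decode(ids, vocab):
    """Map ids back to token strings using the given vocabulary."""
    inverse = {token_id: token for token, token_id in vocab.items()}
    return " ".join(inverse[token_id] for token_id in ids)


ids = encode("hello world", VOCAB_A)

# Same vocabulary on both sides: the text round-trips.
matched = decode(ids, VOCAB_A)

# Wrong vocabulary: the ids are valid but mean something else,
# including a start token that lands in the wrong position.
mismatched = decode(ids, VOCAB_B)

# Different splitting rules also shift token counts for the same text.
word_count = len("hello world".split())
char_count = len([c for c in "hello world" if not c.isspace()])
```

Here `matched` is `<bos> hello world`, while `mismatched` is `<unk> world <bos>`: nothing crashes, but the model-facing input is silently wrong. The same text is 2 tokens under the word split and 10 under the character split, which is the kind of shifted token count the page is expected to describe.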