feat(mdxish): introduce new HTMLBlock micromark tokenizer#1439
feat(mdxish): introduce new HTMLBlock micromark tokenizer#1439maximilianfalco wants to merge 14 commits into
Conversation
also removing the htmlblock preprocessing step
we can simplify it significantly since the tokenizer now ensures a single shape coming in
| } | ||
| // Reject so the block re-parses as a paragraph, deferring to the | ||
| // text tokenizer which preserves trailing content in the same line. | ||
| return nok(code); |
There was a problem hiding this comment.
even if the HTMLBlock is the start of the paragraph but has trailing text, it'll start as a flow construct and will entirely reject once it reaches the trailing text and defer to the text construct. this is so that the block and the trailing content is not split into 2 paragraphs
for cases like this basically
<HTMLBlock>{`<p>no crash trailing</p>`}</HTMLBlock> hello world
| * token at both flow (block) and text (inline) levels. | ||
| * | ||
| * Prevents the markdown parser from consuming `<script>/<style>` tags inside | ||
| * the block, and ensures blockquote `> ` markers are properly stripped before |
There was a problem hiding this comment.
the root cause of the bug ticket is that when HTML blocks are contained in callouts, their line is prefixed by a > as it should. but the protectHtmlBlock function protects the entire thing! including the prefix > which resulted in it appearing in the final product
eaglethrost
left a comment
There was a problem hiding this comment.
Overall looks really good and much more robust at parsing HTMLBlocks!
I think this can also solve RM-15880: Broken "View as markdown" for HTML Block and replace PR #1410, but that can may be in a separate PR since we need to add the tokenizer to stripComments
There was a problem hiding this comment.
Nice we can heavily simplify this!
| expect(htmlProp).toBe('<pre>```javascript\nconst x = 1;\n```</pre>'); | ||
| describe('inside lists', () => { | ||
| it('handles HTMLBlock in an unordered list item', () => { | ||
| const hast = mdxish('- <HTMLBlock>{`<span>listed</span>`}</HTMLBlock>'); |
There was a problem hiding this comment.
More tests suggestion:
- HTMLBlock that spans multiple lines and have text before & after it, and having some random amount of empty spaces in between. E.g.
Hello <HTMLBlock>{\`
<p><strong">Hello</strong>, World!</p>
\`}</HTMLBlock>
there
- Nested HTMLBlocks
<HTMLBlock>{`<HTMLBlock>{inner}</HTMLBlock>`}</HTMLBlock>
There was a problem hiding this comment.
More tests suggestion:
- HTMLBlock that spans multiple lines and have text before & after it, and having some random amount of empty spaces in between. E.g.
Hello <HTMLBlock>{\` <p><strong">Hello</strong>, World!</p> \`}</HTMLBlock> there
i dont think we should cover this case tho, even MDX themselve dont support this so i think this is just plain malformed and wrong syntax
- Nested HTMLBlocks
<HTMLBlock>{`<HTMLBlock>{inner}</HTMLBlock>`}</HTMLBlock>
added this one with a bit of a tweak!
…rm-16139-htmlblock-tokenizer
|
closing this PR as per this comment #1484 (comment) cc @eaglethrost |
| 🎫 Resolve [RM-16726](https://linear.app/readme-io/issue/RM-16726/htmlblock-not-rendering-in-tables) | | :-----------------: | ## 🎯 What does this PR do? To try to fix an issue where `<HTMLBlock>` is not rendering inside JSX `<Table>`, this PR makes substantial changes to how we parse HTMLBlocks syntax by moving away from the string-level content protection we've been doing and reusing the existing MDX tokenizer for it. **Root cause of rendering issue:** We have a preprocessing step in the pipeline where HTMLBlock bodies encoded into an HTML-comment marker (`<!--RDMX_HTMLBLOCK:…-->`) in `preprocessJSXExpressions`, then decoded back further down the pipeline to be transformed to HTMLBlock nodes. When the `<HTMLBlock>` is inside a `<Table>`, the table transformer which still has the encoded HTMLBlock fails to parse it since it uses remarkMdx which turns out rejects HTML comments, making the table never parsed. The blocks were encoded because we didn't want its content to be modified by other preprocessing steps & it's usage of the curly braces could cause expression parsing issues. **Approach:** We now actually can stop protecting and decoding. Now that the `mdxComponent` tokenizer can capture component bodies, including multiline `{`…`}` template literals, thanks to the brace-aware body states added in #1455, we can now let the tokenizer claim `<HTMLBlock>` and read its body straight from the parsed template-literal expression. No marker round-trip, no comment for remarkMdx to choke on. (This is the same direction as @maximilianfalco's HTMLBlock-tokenizer work in #1439.) **What changed:** - **Tokenizer claims `<HTMLBlock>`.** Split the exclusion set so the micromark `mdxComponent` construct captures `<HTMLBlock>` (new `TOKENIZER_MDX_COMPONENT_EXCLUDED_TAGS`), while the remark string-reparse transforms still leave it alone — re-parsing it there is what would mangle bodies containing unbalanced-looking braces. - **Adjust the html block transformer(`mdxish-html-blocks.ts`)** Now the transformer deals with different input data to extract: 1. **JSX element** (`mdxJsxFlowElement`/`mdxJsxTextElement`) — block context (e.g. `<Callout>`) and table cells (after their remarkMdx re-parse); 2. **Raw HTML blob** — single-line top-level, or nested in raw HTML like an inline `<div>` (CommonMark slurps these whole, so we split them back out); 3. **Inline-in-paragraph** — `<HTMLBlock>` open/close arriving as separate siblings around the expression. - **`mdxish-tables`** keeps a table as a JSX `<Table>` when a cell contains an `<HTMLBlock>` (block-level content a GFM cell can't represent). - **Removed the marker machinery entirely:** `protectHTMLBlockContent` + the `RDMX_HTMLBLOCK` markers, the base64 encode/decode paths, and the table-specific comment-neutralization workaround. HTMLBlock handling collapses from four locations down to one. ## 🧪 QA tips - [ ] Render an `<HTMLBlock>` inside a `<Table>` cell and confirm the HTML renders without breaking the table, and sibling cells still get markdown: ```mdx <Table> <tbody> <tr> <td>**bold** still works</td> <td><HTMLBlock>{`<div style="color: red;">Hello</div>`}</HTMLBlock></td> </tr> </tbody> </Table> ``` - [ ] Confirm `safeMode`/`runScripts` survive, and multiple HTMLBlocks in one table all render. - [ ] Confirm top-level `<HTMLBlock>` and `<HTMLBlock>` in a generic `<div>` still render as before. - [ ] New coverage added in `__tests__/lib/mdxish/html-blocks.test.ts`. Demo (before & after): https://github.com/user-attachments/assets/7b40ead4-a1d3-4053-8b3a-3b9513c9b730
🎯 What does this PR do?
protectHTMLBlockContentran on raw markdown before parsing, base64-encoding template content including blockquote>markers. The markdown parser never stripped them because they were hidden in the encoded blob.Introduces a dedicated micromark tokenizer for
<HTMLBlock>...</HTMLBlock>in MDXISH, replacing theprotectHTMLBlockContentpreprocessing hack that was causing stray>characters inside callouts.The tokenizer operates at both levels:
<HTMLBlock>at the start of a line with multiline content via line continuations. Rejects trailing non-whitespace so it falls through to the text tokenizer.Note
with the new tokenizer, the HTML block now comes in in a single unified shape so we can simplify the mdxish html block transformer significantly!
Warning
De/serialization for malformed HTML block syntax (like split opening tags and all that) is still kinda flaky but I do think this is fine for now just because these cases should never even happen in the first place. inserting HTML blocks via the editor would almost guarantee that the tags arent split that way so these theoritical cases should never happen in the first place
🧪 QA tips
All this should render fine when using the engine.
Malformed Source Code
More Testing Source Code
📸 Screenshot or Loom