feat(mdxish): introduce new HTMLBlock micromark tokenizer by maximilianfalco · Pull Request #1439 · readmeio/markdown

maximilianfalco · 2026-04-16T08:24:27Z

🎫 Resolve RM-16139

🎯 What does this PR do?

protectHTMLBlockContent ran on raw markdown before parsing, base64-encoding template content including blockquote > markers. The markdown parser never stripped them because they were hidden in the encoded blob.

Introduces a dedicated micromark tokenizer for <HTMLBlock>...</HTMLBlock> in MDXISH, replacing the protectHTMLBlockContent preprocessing hack that was causing stray > characters inside callouts.
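
The ordering problem described above can be reproduced in a small sketch. Everything in it is an assumption made for illustration: the function names are hypothetical, `encodeURIComponent` stands in for the real base64 step, and the blockquote handling is a toy line-based stripper rather than the actual markdown parser.

```typescript
// Hypothetical stand-ins: encodeURIComponent replaces the real base64 step,
// and the function names below are illustrative, not the library's API.
const OPEN = '<HTMLBlock>';
const CLOSE = '</HTMLBlock>';

// Old approach: hide the template body on the raw string, before any parsing.
function protectBeforeParsing(markdown: string): string {
  return markdown.replace(
    /<HTMLBlock>\{`([\s\S]*?)`\}<\/HTMLBlock>/g,
    (_match: string, body: string) => OPEN + encodeURIComponent(body) + CLOSE
  );
}

// Toy blockquote handling: drop one leading "> " (or ">") marker per line.
function stripBlockquoteMarkers(markdown: string): string {
  return markdown
    .split('\n')
    .map((line) => line.replace(/^> ?/, ''))
    .join('\n');
}

// Later pipeline step: decode the hidden body back into a template literal.
function restore(markdown: string): string {
  return markdown.replace(
    /<HTMLBlock>([^<]*)<\/HTMLBlock>/g,
    (_match: string, blob: string) => OPEN + '{`' + decodeURIComponent(blob) + '`}' + CLOSE
  );
}

const calloutSource = ['> <HTMLBlock>{`', '>   <p>hi</p>', '> `}</HTMLBlock>'].join('\n');

// Encode first: the inner "> " markers are invisible to the blockquote step.
const encodedFirst = restore(stripBlockquoteMarkers(protectBeforeParsing(calloutSource)));

// Strip markers while the body is still visible: no stray ">" survives.
const strippedFirst = stripBlockquoteMarkers(calloutSource);

console.log(JSON.stringify(encodedFirst));
console.log(JSON.stringify(strippedFirst));
```

In the encode-first path the inner lines still carry their `>` prefix after decoding, which is the stray character seen inside callouts.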

The tokenizer operates at both levels:

Flow (block-level): handles <HTMLBlock> at the start of a line with multiline content via line continuations. Rejects trailing non-whitespace so it falls through to the text tokenizer.
Text (inline): handles inline HTMLBlocks within paragraphs. Can span lines. Trailing text naturally stays in the same paragraph, matching MDX behavior.
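
As a rough illustration of the text-level behavior, the sketch below splits a paragraph into plain-text and HTMLBlock segments, letting a block span a line break. It is a regex-based approximation with hypothetical names (`Segment`, `splitParagraph`), not the micromark tokenizer itself.

```typescript
// Hypothetical segment shape for illustration only.
type Segment =
  | { kind: 'text'; value: string }
  | { kind: 'htmlBlock'; raw: string };

function splitParagraph(paragraph: string): Segment[] {
  // [\s\S] lets a block continue across line breaks inside the paragraph.
  const pattern = /<HTMLBlock\b[^>]*>[\s\S]*?<\/HTMLBlock>/g;
  const segments: Segment[] = [];
  let cursor = 0;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(paragraph)) !== null) {
    if (match.index > cursor) {
      // Text before the block stays in the same paragraph.
      segments.push({ kind: 'text', value: paragraph.slice(cursor, match.index) });
    }
    segments.push({ kind: 'htmlBlock', raw: match[0] });
    cursor = match.index + match[0].length;
  }
  if (cursor < paragraph.length) {
    // Trailing text also stays in the same paragraph.
    segments.push({ kind: 'text', value: paragraph.slice(cursor) });
  }
  return segments;
}

const inlineSegments = splitParagraph(
  'before <HTMLBlock>{`<span>inline middle</span>`}</HTMLBlock> after'
);
console.log(inlineSegments.map((segment) => segment.kind).join(','));
```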

Note

With the new tokenizer, the HTML block now arrives in a single unified shape, so we can simplify the mdxish HTML block transformer significantly!
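
To make the "single shape" point concrete, here is a minimal sketch of what a transformer could do once every block arrives as one raw `<HTMLBlock ...>...</HTMLBlock>` string. The `readHtmlBlock` helper and the `HtmlBlockParts` shape are hypothetical, and treating `safeMode={true}` and `safeMode="true"` as equivalent is an assumption based only on the QA samples in this PR.

```typescript
// Hypothetical result shape; the real transformer's node type is not shown in this PR.
interface HtmlBlockParts {
  safeMode: boolean;
  html: string;
}

function readHtmlBlock(raw: string): HtmlBlockParts | null {
  // One opening tag (possibly spanning lines), a body, one closing tag.
  const match = /^<HTMLBlock\b([^>]*)>([\s\S]*)<\/HTMLBlock>$/.exec(raw.trim());
  if (match === null) return null;

  const attributes = match[1];
  const body = match[2].trim();

  // Unwrap {`...`} when present; otherwise keep the body as plain HTML.
  const template = /^\{`([\s\S]*)`\}$/.exec(body);
  const html = template !== null ? template[1] : body;

  // Assumption: both attribute spellings from the QA samples enable safe mode.
  const safeMode = /\bsafeMode=(?:\{true\}|"true")/.test(attributes);

  return { safeMode, html };
}

console.log(readHtmlBlock('<HTMLBlock>{`<div>simple flow block</div>`}</HTMLBlock>'));
```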

Warning

De/serialization for malformed HTML block syntax (like split opening tags and all that) is still kinda flaky, but I do think this is fine for now because these cases should never happen in the first place. Inserting HTML blocks via the editor almost guarantees that the tags aren't split that way, so these cases stay theoretical.

🧪 QA tips

All this should render fine when using the engine.

Malformed Source Code

> 🚧 It compiles!
>
> new thing
> <HTMLBlock>{`
>   <strong style="color: olive">Hello, Worldsdfsdf!</strong>
> `}</HTMLBlock> 

fsdfsdf

hello <HTMLBlock>{`<strong style="color: olive">Hello, World!</strong>`}</HTMLBlock> world

<HTMLBlock>{`<strong style="color: olive">Hello, World!</strong>`}</HTMLBlock> hello

hello <HTMLBlock>{`<strong style="color: o
live">Hello, World!</strong>`}</HTMLBlock>.  sdfsdfsdsdfsdf

<HTMLBlock>


{`sdfsd
  <strong style="color: olive">Hello, 


<br> Worldsdfsdf!</strong>


`}</HTMLBlock>

More Testing Source Code

> 🚧 It compiles!
>
> <HTMLBlock>{`
>   <strong style="color: olive">Hello, World!</strong>
> `}</HTMLBlock>

> ⚠️  Warning
>
> <HTMLBlock>{`
>   <script>alert("test")</script>
>   <p>safe content</p>
> `}</HTMLBlock>

> 🚧 Blank lines in callout
>
> <HTMLBlock>{`
>   <div>first</div>
>
>   <div>second</div>
> `}</HTMLBlock>

> 📘
>
> <HTMLBlock>{`<p>empty callout title</p>`}</HTMLBlock>

<HTMLBlock>{`<p>plain blockquote</p>`}</HTMLBlock>

<HTMLBlock>{`
  <div>multiline blockquote line1</div>
  <div>multiline blockquote line2</div>
`}</HTMLBlock>

<HTMLBlock>{`<div>simple flow block</div>`}</HTMLBlock>

<HTMLBlock>{`
<ul>
  <li>multiline</li>
  <li>flow block</li>
</ul>
`}</HTMLBlock>

<HTMLBlock>{`
<div>before blank line</div>
<div>after blank line</div>
`}</HTMLBlock>

<HTMLBlock>{`
<script>console.log("script not consumed")</script>
<p>visible</p>
`}</HTMLBlock>

<HTMLBlock>{`
<style>.red { color: red; }</style>
<p class="red">styled</p>
`}</HTMLBlock>

<HTMLBlock><em>no template literal</em></HTMLBlock>

<HTMLBlock 
safeMode={true}>{`<script>alert("XSS")</script><p>Content</p>`}</HTMLBlock>

<HTMLBlock>{`<div>trailing whitespace</div>`}</HTMLBlock>

before <HTMLBlock>{`<span>inline middle</span>`}</HTMLBlock> after

some text <HTMLBlock>{`<em>inline end</em>`}</HTMLBlock>

text <HTMLBlock safeMode="true">{`<script>xss</script>`}</HTMLBlock> more

hello <HTMLBlock>{`<strong style="color: o
live">multiline inline</strong>`}</HTMLBlock> world

<HTMLBlock>{`<strong>flow with trailing</strong>`}</HTMLBlock>
trailing text

<HTMLBlock>{`<strong style="color: o
live">multiline flow trailing</strong>`}</HTMLBlock>. trailing after multiline

<HTMLBlock>{`<p>no crash trailing</p>`}</HTMLBlock> hello world

<HTMLBlock>{`<p>no crash punctuation</p>`}</HTMLBlock>. stuff

- <HTMLBlock>{`<span>unordered list</span>`}</HTMLBlock>

1. <HTMLBlock>{`<span>ordered list</span>`}</HTMLBlock>

<HTMLBlock>{`
<pre><code>
const foo = () => {
  const bar = {
    baz: 'blammo'
  }
  return bar
}
</code></pre>
`}</HTMLBlock>

<HTMLBlock>
{`sdfsd
  <strong style="color: olive">Hello,
<br> Worldsdfsdf!</strong>
`}</HTMLBlock>

📸 Screenshot or Loom

Before	After


maximilianfalco · 2026-04-16T14:25:07Z

+      }
+      // Reject so the block re-parses as a paragraph, deferring to the
+      // text tokenizer which preserves trailing content in the same line.
+      return nok(code);


even if the HTMLBlock is at the start of the paragraph, when it has trailing text it'll start as a flow construct, reject entirely once it reaches the trailing text, and defer to the text construct. this is so that the block and the trailing content are not split into 2 paragraphs

for cases like this basically

<HTMLBlock>{`<p>no crash trailing</p>`}</HTMLBlock> hello world
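
The fallback described in this comment can be approximated without micromark. The sketch below is a simplified model under stated assumptions: `tryFlowHtmlBlock` and `parseBlock` are hypothetical names, returning `null` plays the role of `nok(code)`, and the "text construct" is reduced to treating the whole source as one paragraph.

```typescript
// Hypothetical stand-in for the flow construct: it succeeds only when nothing
// but whitespace follows the closing tag on the same line.
function tryFlowHtmlBlock(source: string): string | null {
  const open = '<HTMLBlock';
  const close = '</HTMLBlock>';
  if (source.indexOf(open) !== 0) return null;

  const closeStart = source.indexOf(close);
  if (closeStart === -1) return null;

  const afterClose = closeStart + close.length;
  const lineEnd = source.indexOf('\n', afterClose);
  const trailing = source.slice(afterClose, lineEnd === -1 ? source.length : lineEnd);

  // Trailing non-whitespace: reject, mirroring the `nok(code)` path.
  if (trailing.trim() !== '') return null;

  return source.slice(0, afterClose);
}

function parseBlock(source: string): { type: 'htmlBlock' | 'paragraph'; value: string } {
  const flow = tryFlowHtmlBlock(source);
  if (flow !== null) return { type: 'htmlBlock', value: flow };
  // Rejected: the whole line, block plus trailing text, stays one paragraph.
  return { type: 'paragraph', value: source };
}

console.log(parseBlock('<HTMLBlock>{`<p>no crash trailing</p>`}</HTMLBlock> hello world').type);
console.log(parseBlock('<HTMLBlock>{`<div>simple flow block</div>`}</HTMLBlock>').type);
```

Rejecting the whole construct, rather than accepting the block and starting a new paragraph for the remainder, is what keeps the trailing text attached to the block.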

maximilianfalco · 2026-04-16T14:26:38Z

+ * token at both flow (block) and text (inline) levels.
+ *
+ * Prevents the markdown parser from consuming `<script>/<style>` tags inside
+ * the block, and ensures blockquote `> ` markers are properly stripped before


the root cause of the bug ticket is that when HTML blocks are contained in callouts, each of their lines is prefixed by a >, as it should be. but the protectHTMLBlockContent function protects the entire thing, including the > prefix, which resulted in it appearing in the final output

eaglethrost

Overall looks really good and much more robust at parsing HTMLBlocks!

I think this can also solve RM-15880: Broken "View as markdown" for HTML Block and replace PR #1410, but that can maybe be in a separate PR since we need to add the tokenizer to stripComments

eaglethrost · 2026-04-17T04:28:42Z

Nice we can heavily simplify this!

eaglethrost · 2026-04-17T04:38:54Z

-    expect(htmlProp).toBe('<pre>```javascript\nconst x = 1;\n```</pre>');
+  describe('inside lists', () => {
+    it('handles HTMLBlock in an unordered list item', () => {
+      const hast = mdxish('- <HTMLBlock>{`<span>listed</span>`}</HTMLBlock>');


eaglethrost · 2026-04-17T04:41:40Z

More tests suggestion:

An HTMLBlock that spans multiple lines and has text before & after it, with some random amount of empty space in between. E.g.

Hello <HTMLBlock>{\` <p><strong">Hello</strong>, World!</p> \`}</HTMLBlock> there

Nested HTMLBlocks

<HTMLBlock>{`<HTMLBlock>{inner}</HTMLBlock>`}</HTMLBlock>

More tests suggestion:

An HTMLBlock that spans multiple lines and has text before & after it, with some random amount of empty space in between. E.g.

Hello <HTMLBlock>{\` <p><strong">Hello</strong>, World!</p> \`}</HTMLBlock> there

i don't think we should cover this case tho, even MDX itself doesn't support this, so i think this is just plain malformed and wrong syntax

Nested HTMLBlocks

<HTMLBlock>{`<HTMLBlock>{inner}</HTMLBlock>`}</HTMLBlock>

added this one with a bit of a tweak!
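
The tweak made for the nested case is not shown in this thread. Purely as an illustration of why nesting is tricky, the sketch below finds the closing tag of the outer block by ignoring anything inside backticks, so an inner `</HTMLBlock>` within the template literal is skipped. The `findOuterClose` helper is hypothetical and is not claimed to match the PR's tokenizer.

```typescript
// Returns the index just past the closing tag that ends the outer block,
// or -1 if no closing tag is found outside a template literal.
function findOuterClose(source: string): number {
  const close = '</HTMLBlock>';
  let inTemplate = false;
  for (let i = 0; i < source.length; i += 1) {
    const char = source[i];
    if (inTemplate && char === '\\') {
      // Skip the escaped character, e.g. an escaped backtick.
      i += 1;
      continue;
    }
    if (char === '`') {
      inTemplate = !inTemplate;
      continue;
    }
    if (!inTemplate && source.slice(i, i + close.length) === close) {
      return i + close.length;
    }
  }
  return -1;
}

const nested = '<HTMLBlock>{`<HTMLBlock>{inner}</HTMLBlock>`}</HTMLBlock>';
const naiveEnd = nested.indexOf('</HTMLBlock>') + '</HTMLBlock>'.length;

console.log(naiveEnd < nested.length); // the first closing tag is the inner one
console.log(findOuterClose(nested) === nested.length);
```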


maximilianfalco · 2026-05-27T23:53:08Z

closing this PR as per this comment #1484 (comment)

cc @eaglethrost


| 🎫 Resolve [RM-16726](https://linear.app/readme-io/issue/RM-16726/htmlblock-not-rendering-in-tables) |
| :-----------------: |

## 🎯 What does this PR do?

To try to fix an issue where `<HTMLBlock>` is not rendering inside JSX `<Table>`, this PR makes substantial changes to how we parse HTMLBlock syntax by moving away from the string-level content protection we've been doing and reusing the existing MDX tokenizer for it.

**Root cause of rendering issue:** We have a preprocessing step in the pipeline where HTMLBlock bodies are encoded into an HTML-comment marker (``) in `preprocessJSXExpressions`, then decoded back further down the pipeline to be transformed to HTMLBlock nodes. When the `<HTMLBlock>` is inside a `<Table>`, the table transformer, which still sees the encoded HTMLBlock, fails to parse it because it uses remarkMdx, which turns out to reject HTML comments, so the table is never parsed. The blocks were encoded because we didn't want their content to be modified by other preprocessing steps, and their use of curly braces could cause expression parsing issues.

**Approach:** We can now stop protecting and decoding. Now that the `mdxComponent` tokenizer can capture component bodies, including multiline `{`…`}` template literals, thanks to the brace-aware body states added in #1455, we can let the tokenizer claim `<HTMLBlock>` and read its body straight from the parsed template-literal expression. No marker round-trip, no comment for remarkMdx to choke on. (This is the same direction as @maximilianfalco's HTMLBlock-tokenizer work in #1439.)

**What changed:**

- **Tokenizer claims `<HTMLBlock>`.** Split the exclusion set so the micromark `mdxComponent` construct captures `<HTMLBlock>` (new `TOKENIZER_MDX_COMPONENT_EXCLUDED_TAGS`), while the remark string-reparse transforms still leave it alone. Re-parsing it there is what would mangle bodies containing unbalanced-looking braces.
- **Adjust the html block transformer (`mdxish-html-blocks.ts`).** The transformer now handles three kinds of input to extract from:
  1. **JSX element** (`mdxJsxFlowElement`/`mdxJsxTextElement`): block context (e.g. `<Callout>`) and table cells (after their remarkMdx re-parse);
  2. **Raw HTML blob**: single-line top-level, or nested in raw HTML like an inline `<div>` (CommonMark slurps these whole, so we split them back out);
  3. **Inline-in-paragraph**: `<HTMLBlock>` open/close arriving as separate siblings around the expression.
- **`mdxish-tables`** keeps a table as a JSX `<Table>` when a cell contains an `<HTMLBlock>` (block-level content a GFM cell can't represent).
- **Removed the marker machinery entirely:** `protectHTMLBlockContent` + the `RDMX_HTMLBLOCK` markers, the base64 encode/decode paths, and the table-specific comment-neutralization workaround. HTMLBlock handling collapses from four locations down to one.

## 🧪 QA tips

- [ ] Render an `<HTMLBlock>` inside a `<Table>` cell and confirm the HTML renders without breaking the table, and sibling cells still get markdown:

  ```mdx
  <Table>
    <tbody>
      <tr>
        <td>**bold** still works</td>
        <td><HTMLBlock>{`<div style="color: red;">Hello</div>`}</HTMLBlock></td>
      </tr>
    </tbody>
  </Table>
  ```

- [ ] Confirm `safeMode`/`runScripts` survive, and multiple HTMLBlocks in one table all render.
- [ ] Confirm top-level `<HTMLBlock>` and `<HTMLBlock>` in a generic `<div>` still render as before.
- [ ] New coverage added in `__tests__/lib/mdxish/html-blocks.test.ts`.

Demo (before & after): https://github.com/user-attachments/assets/7b40ead4-a1d3-4053-8b3a-3b9513c9b730
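
The `mdxish-tables` rule above (keep the JSX `<Table>` when any cell holds an `<HTMLBlock>`) can be pictured as a simple predicate. This is a sketch only: the `TableCell` and `TableRow` shapes and the `tableNeedsJsx` name are hypothetical, and the real transformer presumably works on parsed nodes rather than cell source strings.

```typescript
// Hypothetical cell/table shapes for illustration only.
interface TableCell {
  source: string;
}
type TableRow = TableCell[];

// A GFM table cell cannot represent block-level content, so any cell that
// contains an <HTMLBlock> forces the table to stay a JSX <Table>.
function tableNeedsJsx(rows: TableRow[]): boolean {
  return rows.some((row) => row.some((cell) => /<HTMLBlock\b/.test(cell.source)));
}

const qaRows: TableRow[] = [
  [
    { source: '**bold** still works' },
    { source: '<HTMLBlock>{`<div style="color: red;">Hello</div>`}</HTMLBlock>' },
  ],
];

console.log(tableNeedsJsx(qaRows));
```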

…RM-16888) Adds the HTMLBlock micromark tokenizer from PR #1439 to the stripComments pipeline, preventing multiline HTMLBlock content from crashing the parser when htmlFlow intercepts inner HTML tags.

maximilianfalco added 4 commits April 16, 2026 15:34

feat: add new htmlblock tokenizer

27c7a48

feat: integrate new tokenizer into mdxish pipeline

f9788a0

also removing the htmlblock preprocessing step

fix: simplify mdxishHtmlBlocks transformer

631aebd

we can simplify it significantly since the tokenizer now ensures a single shape coming in

fix: preserve trailing text after closing tag

3001457

maximilianfalco requested review from Jadenzzz and eaglethrost April 16, 2026 08:28

maximilianfalco added 4 commits April 16, 2026 18:52

fix: allow htmlblock opening tag to span multiple lines

fb6450e

chore: add tests

d804324

fix: preserve trailing text after closing tag

18a081d

Merge branch 'next' of https://github.com/readmeio/markdown into falco/…

47b87b1

…rm-16139-htmlblock-tokenizer

maximilianfalco requested review from kevinports and rafegoldberg April 16, 2026 13:56

maximilianfalco marked this pull request as ready for review April 16, 2026 13:56

maximilianfalco commented Apr 16, 2026


eaglethrost reviewed Apr 17, 2026


maximilianfalco added 6 commits April 17, 2026 15:20

Merge branch 'next' of https://github.com/readmeio/markdown into falco/…

26f1124

…rm-16139-htmlblock-tokenizer

chore: removed as State casts

69def82

chore: add stricter and more robust test expectations

10d4f91

chore: add htmlblock transformer tests

24390a9

chore: move transformer tests in to the transformer test file

012ebb8

chore: tweak nested html block test

989aa8f

maximilianfalco mentioned this pull request May 27, 2026

fix(mdxish): <HTMLBlocks> inside <Table> not rendering #1484

Merged

4 tasks

maximilianfalco closed this May 27, 2026

jarrod-lyra mentioned this pull request Jun 5, 2026

fix: extract HTMLBlocks before parsing in stripComments #1506

Open

1 task

maximilianfalco wants to merge 14 commits into next from falco/rm-16139-htmlblock-tokenizer


2 participants
