Skip to content

feat(mdxish): introduce new HTMLBlock micromark tokenizer#1439

Closed
maximilianfalco wants to merge 14 commits into
nextfrom
falco/rm-16139-htmlblock-tokenizer
Closed

feat(mdxish): introduce new HTMLBlock micromark tokenizer#1439
maximilianfalco wants to merge 14 commits into
nextfrom
falco/rm-16139-htmlblock-tokenizer

Conversation

@maximilianfalco

@maximilianfalco maximilianfalco commented Apr 16, 2026

Copy link
Copy Markdown
Contributor
🎫 Resolve RM-16139

🎯 What does this PR do?

protectHTMLBlockContent ran on raw markdown before parsing, base64-encoding template content including blockquote > markers. The markdown parser never stripped them because they were hidden in the encoded blob.

Introduces a dedicated micromark tokenizer for <HTMLBlock>...</HTMLBlock> in MDXISH, replacing the protectHTMLBlockContent preprocessing hack that was causing stray > characters inside callouts.

The tokenizer operates at both levels:

  • Flow (block-level): handles <HTMLBlock> at the start of a line with multiline content via line continuations. Rejects trailing non-whitespace so it falls through to the text tokenizer.
  • Text (inline): handles inline HTMLBlocks within paragraphs. Can span lines. Trailing text naturally stays in the same paragraph, matching MDX behavior.

Note

with the new tokenizer, the HTML block now comes in in a single unified shape so we can simplify the mdxish html block transformer significantly!

Warning

De/serialization for malformed HTML block syntax (like split opening tags and all that) is still kinda flaky but I do think this is fine for now just because these cases should never even happen in the first place. inserting HTML blocks via the editor would almost guarantee that the tags arent split that way so these theoritical cases should never happen in the first place

🧪 QA tips

All this should render fine when using the engine.

Malformed Source Code
> 🚧 It compiles!
>
> new thing
> <HTMLBlock>{`
>   <strong style="color: olive">Hello, Worldsdfsdf!</strong>
> `}</HTMLBlock> 

fsdfsdf

hello <HTMLBlock>{`<strong style="color: olive">Hello, World!</strong>`}</HTMLBlock> world

<HTMLBlock>{`<strong style="color: olive">Hello, World!</strong>`}</HTMLBlock> hello

hello <HTMLBlock>{`<strong style="color: o
live">Hello, World!</strong>`}</HTMLBlock>.  sdfsdfsdsdfsdf

<HTMLBlock>


{`sdfsd
  <strong style="color: olive">Hello, 


<br> Worldsdfsdf!</strong>


`}</HTMLBlock>
More Testing Source Code
> 🚧 It compiles!
>
> <HTMLBlock>{`
>   <strong style="color: olive">Hello, World!</strong>
> `}</HTMLBlock>

> ⚠️  Warning
>
> <HTMLBlock>{`
>   <script>alert("test")</script>
>   <p>safe content</p>
> `}</HTMLBlock>

> 🚧 Blank lines in callout
>
> <HTMLBlock>{`
>   <div>first</div>
>
>   <div>second</div>
> `}</HTMLBlock>

> 📘
>
> <HTMLBlock>{`<p>empty callout title</p>`}</HTMLBlock>

<HTMLBlock>{`<p>plain blockquote</p>`}</HTMLBlock>

<HTMLBlock>{`
  <div>multiline blockquote line1</div>
  <div>multiline blockquote line2</div>
`}</HTMLBlock>

<HTMLBlock>{`<div>simple flow block</div>`}</HTMLBlock>

<HTMLBlock>{`
<ul>
  <li>multiline</li>
  <li>flow block</li>
</ul>
`}</HTMLBlock>

<HTMLBlock>{`
<div>before blank line</div>
<div>after blank line</div>
`}</HTMLBlock>

<HTMLBlock>{`
<script>console.log("script not consumed")</script>
<p>visible</p>
`}</HTMLBlock>

<HTMLBlock>{`
<style>.red { color: red; }</style>
<p class="red">styled</p>
`}</HTMLBlock>

<HTMLBlock><em>no template literal</em></HTMLBlock>

<HTMLBlock 
safeMode={true}>{`<script>alert("XSS")</script><p>Content</p>`}</HTMLBlock>

<HTMLBlock>{`<div>trailing whitespace</div>`}</HTMLBlock>

before <HTMLBlock>{`<span>inline middle</span>`}</HTMLBlock> after

some text <HTMLBlock>{`<em>inline end</em>`}</HTMLBlock>

text <HTMLBlock safeMode="true">{`<script>xss</script>`}</HTMLBlock> more

hello <HTMLBlock>{`<strong style="color: o
live">multiline inline</strong>`}</HTMLBlock> world

<HTMLBlock>{`<strong>flow with trailing</strong>`}</HTMLBlock>
trailing text

<HTMLBlock>{`<strong style="color: o
live">multiline flow trailing</strong>`}</HTMLBlock>. trailing after multiline

<HTMLBlock>{`<p>no crash trailing</p>`}</HTMLBlock> hello world

<HTMLBlock>{`<p>no crash punctuation</p>`}</HTMLBlock>. stuff

- <HTMLBlock>{`<span>unordered list</span>`}</HTMLBlock>

1. <HTMLBlock>{`<span>ordered list</span>`}</HTMLBlock>

<HTMLBlock>{`
<pre><code>
const foo = () => {
  const bar = {
    baz: 'blammo'
  }
  return bar
}
</code></pre>
`}</HTMLBlock>

<HTMLBlock>
{`sdfsd
  <strong style="color: olive">Hello,
<br> Worldsdfsdf!</strong>
`}</HTMLBlock>

📸 Screenshot or Loom

Before After
Screenshot 2026-04-17 at 00 21 07 Screenshot 2026-04-16 at 18 25 00

also removing the htmlblock preprocessing step
we can simplify it significantly since the tokenizer now ensures a
single shape coming in
@maximilianfalco maximilianfalco marked this pull request as ready for review April 16, 2026 13:56
}
// Reject so the block re-parses as a paragraph, deferring to the
// text tokenizer which preserves trailing content in the same line.
return nok(code);

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

even if the HTMLBlock is the start of the paragraph but has trailing text, it'll start as a flow construct and will entirely reject once it reaches the trailing text and defer to the text construct. this is so that the block and the trailing content is not split into 2 paragraphs

for cases like this basically

<HTMLBlock>{`<p>no crash trailing</p>`}</HTMLBlock> hello world

* token at both flow (block) and text (inline) levels.
*
* Prevents the markdown parser from consuming `<script>/<style>` tags inside
* the block, and ensures blockquote `> ` markers are properly stripped before

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the root cause of the bug ticket is that when HTML blocks are contained in callouts, their line is prefixed by a > as it should. but the protectHtmlBlock function protects the entire thing! including the prefix > which resulted in it appearing in the final product

@eaglethrost eaglethrost left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks really good and much more robust at parsing HTMLBlocks!

I think this can also solve RM-15880: Broken "View as markdown" for HTML Block and replace PR #1410, but that can may be in a separate PR since we need to add the tokenizer to stripComments

Comment thread __tests__/compilers/html-block.test.ts
Comment thread lib/micromark/html-block-component/syntax.ts Outdated
Comment thread lib/mdxish.ts

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice we can heavily simplify this!

Comment thread processor/transform/mdxish/mdxish-html-blocks.ts
Comment thread components/HTMLBlock/index.tsx
Comment thread __tests__/compilers/html-block.test.ts Outdated
expect(htmlProp).toBe('<pre>```javascript\nconst x = 1;\n```</pre>');
describe('inside lists', () => {
it('handles HTMLBlock in an unordered list item', () => {
const hast = mdxish('- <HTMLBlock>{`<span>listed</span>`}</HTMLBlock>');

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More tests suggestion:

  • HTMLBlock that spans multiple lines and have text before & after it, and having some random amount of empty spaces in between. E.g.
Hello <HTMLBlock>{\`
<p><strong">Hello</strong>, World!</p>

\`}</HTMLBlock>

there
  • Nested HTMLBlocks
<HTMLBlock>{`<HTMLBlock>{inner}</HTMLBlock>`}</HTMLBlock>

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More tests suggestion:

  • HTMLBlock that spans multiple lines and have text before & after it, and having some random amount of empty spaces in between. E.g.
Hello <HTMLBlock>{\`
<p><strong">Hello</strong>, World!</p>

\`}</HTMLBlock>

there

i dont think we should cover this case tho, even MDX themselve dont support this so i think this is just plain malformed and wrong syntax

  • Nested HTMLBlocks
<HTMLBlock>{`<HTMLBlock>{inner}</HTMLBlock>`}</HTMLBlock>

added this one with a bit of a tweak!

@maximilianfalco

Copy link
Copy Markdown
Contributor Author

closing this PR as per this comment #1484 (comment)

cc @eaglethrost

eaglethrost added a commit that referenced this pull request Jun 4, 2026
| 🎫 Resolve
[RM-16726](https://linear.app/readme-io/issue/RM-16726/htmlblock-not-rendering-in-tables)
|
| :-----------------: |

## 🎯 What does this PR do?

To try to fix an issue where `<HTMLBlock>` is not rendering inside JSX
`<Table>`, this PR makes substantial changes to how we parse HTMLBlocks
syntax by moving away from the string-level content protection we've
been doing and reusing the existing MDX tokenizer for it.

**Root cause of rendering issue:** We have a preprocessing step in the
pipeline where HTMLBlock bodies encoded into an HTML-comment marker
(`<!--RDMX_HTMLBLOCK:…-->`) in `preprocessJSXExpressions`, then decoded
back further down the pipeline to be transformed to HTMLBlock nodes.
When the `<HTMLBlock>` is inside a `<Table>`, the table transformer
which still has the encoded HTMLBlock fails to parse it since it uses
remarkMdx which turns out rejects HTML comments, making the table never
parsed. The blocks were encoded because we didn't want its content to be
modified by other preprocessing steps & it's usage of the curly braces
could cause expression parsing issues.

**Approach:** We now actually can stop protecting and decoding. Now that
the `mdxComponent` tokenizer can capture component bodies, including
multiline `{`…`}` template literals, thanks to the brace-aware body
states added in #1455, we can now let the tokenizer claim `<HTMLBlock>`
and read its body straight from the parsed template-literal expression.
No marker round-trip, no comment for remarkMdx to choke on. (This is the
same direction as @maximilianfalco's HTMLBlock-tokenizer work in #1439.)

**What changed:**

- **Tokenizer claims `<HTMLBlock>`.** Split the exclusion set so the
micromark `mdxComponent` construct captures `<HTMLBlock>` (new
`TOKENIZER_MDX_COMPONENT_EXCLUDED_TAGS`), while the remark
string-reparse transforms still leave it alone — re-parsing it there is
what would mangle bodies containing unbalanced-looking braces.
- **Adjust the html block transformer(`mdxish-html-blocks.ts`)** Now the
transformer deals with different input data to extract:
1. **JSX element** (`mdxJsxFlowElement`/`mdxJsxTextElement`) — block
context (e.g. `<Callout>`) and table cells (after their remarkMdx
re-parse);
2. **Raw HTML blob** — single-line top-level, or nested in raw HTML like
an inline `<div>` (CommonMark slurps these whole, so we split them back
out);
3. **Inline-in-paragraph** — `<HTMLBlock>` open/close arriving as
separate siblings around the expression.
- **`mdxish-tables`** keeps a table as a JSX `<Table>` when a cell
contains an `<HTMLBlock>` (block-level content a GFM cell can't
represent).
- **Removed the marker machinery entirely:** `protectHTMLBlockContent` +
the `RDMX_HTMLBLOCK` markers, the base64 encode/decode paths, and the
table-specific comment-neutralization workaround. HTMLBlock handling
collapses from four locations down to one.

## 🧪 QA tips

- [ ] Render an `<HTMLBlock>` inside a `<Table>` cell and confirm the
HTML renders without breaking the table, and sibling cells still get
markdown:
  ```mdx
  <Table>
    <tbody>
      <tr>
        <td>**bold** still works</td>
<td><HTMLBlock>{`<div style="color: red;">Hello</div>`}</HTMLBlock></td>
      </tr>
    </tbody>
  </Table>
  ```
- [ ] Confirm `safeMode`/`runScripts` survive, and multiple HTMLBlocks
in one table all render.
- [ ] Confirm top-level `<HTMLBlock>` and `<HTMLBlock>` in a generic
`<div>` still render as before.
- [ ] New coverage added in `__tests__/lib/mdxish/html-blocks.test.ts`.

Demo (before & after):


https://github.com/user-attachments/assets/7b40ead4-a1d3-4053-8b3a-3b9513c9b730
jarrod-lyra added a commit that referenced this pull request Jun 5, 2026
…RM-16888)

Adds the HTMLBlock micromark tokenizer from PR #1439 to the stripComments
pipeline, preventing multiline HTMLBlock content from crashing the parser
when htmlFlow intercepts inner HTML tags.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants