
automata: fix bug in reverse suffix/inner optimization #1364

Draft
BurntSushi wants to merge 1 commit into master from ag/fix-reverse-optimizations
Conversation

@BurntSushi

Member

A minimal reproducer of this bug is on a haystack of zabb with the
regex .bb|b. The regex crate will report a match at 2..3, but the
correct match is 1..4.

While this seems like a simple regex, a pretty specific set
of circumstances is required to trigger the bug:

  1. There are no prefix literals that activate a standard prefix
    literal scan.
  2. There needs to be an extractable suffix or inner literal.
  3. An actual match needs to be present in the haystack.
  4. The regex and haystack

Crucially, note that because of (3), this bug will never lead to
Regex::is_match providing a false positive or a false negative.
This bug is strictly about leftmost-first match semantics: in some
cases, the regex crate will report an incorrect match span.

(4) could do with a bit more explanation, since it's rather subtle.
Let's trace the minimal example through the regex crate's "reverse
suffix" optimization.

During compilation, there is no prefix literal that can be extracted.
The . defeats that class of optimization. Moreover, there is a suffix
literal in the regex. That is, all matches for .bb|b must end with
b. The regex crate sees this and will scan for matches of b. It will
then attempt to match the regex in reverse at each candidate match of
b. Let's see what happens:

  • Find first occurrence of b at offset 2 in zabb.
  • Start reverse confirmation step at offset 2.
  • The second alternation branch, b in .bb|b, matches at 2..3.
  • The second alternation branch is reported as the overall match.
    This happens because the first alternation branch, .bb, does not
    have a match ending at offset 3.
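
The trace above can be reproduced with a small std-only model. This is only a sketch: the function names are hypothetical, the regex .bb|b is hard-coded rather than compiled, and the quadratic "trip wire" and the forward search for the match end are left out. It is not the regex crate's implementation. It only shows how confirming the first literal candidate in reverse can disagree with a plain leftmost-first search.

```rust
/// Returns true if `.bb|b` matches exactly the bytes in `s`.
/// `.` is modeled as "any byte except `\n`" (an assumption of this sketch).
fn matches_exactly(s: &[u8]) -> bool {
    match s {
        [c, b'b', b'b'] => *c != b'\n', // first branch: `.bb`
        [b'b'] => true,                 // second branch: `b`
        _ => false,
    }
}

/// Reference answer: try every start offset from left to right and,
/// at each one, prefer the `.bb` branch over the `b` branch.
fn leftmost_first(hay: &[u8]) -> Option<(usize, usize)> {
    for start in 0..hay.len() {
        for len in [3, 1] {
            let end = start + len;
            if end <= hay.len() && matches_exactly(&hay[start..end]) {
                return Some((start, end));
            }
        }
    }
    None
}

/// Simplified model of the buggy strategy: scan for the suffix literal
/// `b`, then look backwards for a match ending right after that literal.
fn naive_reverse_suffix(hay: &[u8]) -> Option<(usize, usize)> {
    let mut at = 0;
    while let Some(i) = hay[at..].iter().position(|&b| b == b'b') {
        let end = at + i + 1; // candidate match ends just past the literal
        // Smallest start such that hay[start..end] is a match.
        if let Some(start) = (0..end).find(|&s| matches_exactly(&hay[s..end])) {
            return Some((start, end));
        }
        at = end;
    }
    None
}

fn main() {
    println!("{:?}", leftmost_first(b"zabb"));
    println!("{:?}", naive_reverse_suffix(b"zabb"));
}
```

In this model, the first candidate literal already yields a confirmed match, so the search never gets to consider the longer match that starts earlier but ends later. Both functions still agree on whether a match exists at all.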

The fundamental problem here is that there is an overlap between the
reverse automaton for confirming the match and the literal scan. Small
changes, even to the haystack, can result in the bug disappearing.
For example, with a haystack of zbb, the correct match of 0..3 is
reported. This occurs because there is a quadratic "trip wire" that
triggers in this case that causes the search to bail out and fall back
to a DFA without using any literal optimizations.

This bug also applies to the "reverse inner" optimization. This
can happen when the literal is extracted from inside the regex
as opposed to it being a suffix literal. For example, the regex
(?:..acbb|b)a(?:c|d) on the haystack xzbacbbac reported a match at
2..5, but the correct match is 1..9.
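
To double check the spans quoted for the reverse inner example, here is a hand-written, std-only oracle for (?:..acbb|b)a(?:c|d). The helper names are hypothetical and . is assumed to mean "any byte except \n". It says nothing about how the optimization itself works. It only confirms that 2..5 is a valid match, but that a match starting further left exists.

```rust
/// Matches the trailing `a(?:c|d)` at the start of `s`.
fn tail_matches(s: &[u8]) -> bool {
    matches!(s, [b'a', b'c' | b'd', ..])
}

/// Tries to match `(?:..acbb|b)a(?:c|d)` anchored at `start` and
/// returns the end offset. The `..acbb` branch is preferred, with a
/// fallback to the `b` branch if the first branch (or its tail) fails.
fn match_at(hay: &[u8], start: usize) -> Option<usize> {
    let rest = &hay[start..];
    if let [x, y, b'a', b'c', b'b', b'b', after @ ..] = rest {
        if *x != b'\n' && *y != b'\n' && tail_matches(after) {
            return Some(start + 8); // 6 bytes for `..acbb`, 2 for the tail
        }
    }
    if let [b'b', after @ ..] = rest {
        if tail_matches(after) {
            return Some(start + 3); // 1 byte for `b`, 2 for the tail
        }
    }
    None
}

/// Leftmost-first search: the first start offset with any match wins.
fn leftmost_first_inner(hay: &[u8]) -> Option<(usize, usize)> {
    (0..hay.len()).find_map(|s| match_at(hay, s).map(|e| (s, e)))
}

fn main() {
    let hay = b"xzbacbbac";
    println!("{:?}", leftmost_first_inner(hay));
    println!("{:?}", match_at(hay, 2));
}
```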

Note that #1355 technically fixes this problem and is much simpler, but
in so doing, makes the reverse suffix and inner optimizations completely
ineffective.

Fixes #1354, Closes #1355

@BurntSushi

BurntSushi commented Jun 7, 2026

Member Author

I wrote this PR with an LLM, so this still needs careful review. At a glance, the solution here seems very overcomplicated. But it does pass tests and rebar benchmarks.


Development

Successfully merging this pull request may close these issues.

Reverse suffix doesn't return the leftmost match when alternates overlap
