automata: fix bug in reverse suffix/inner optimization#1364
Draft
BurntSushi wants to merge 1 commit into
Draft
Conversation
Member
Author
|
I wrote this PR with an LLM, so this still needs careful review. From a glance, the solution here seems very over complicated. But it does pass tests and rebar benchmarks. |
A minimal reproducer of this bug is on a haystack of `zabb` with the regex `.bb|b`. The `regex` crate will report a match at `2..3`, but the correct match is `1..4`. While this seems like a simple regex, there are a pretty specific set of circumstances required to trigger the bug: 1. There are no prefix literals that activate a standard prefix literal scan. 2. There needs to be an extractable *suffix* or *inner* literal. 3. An actual match needs to be present in the haystack. 4. The regex and haystack Crucially, note that because of (3), this bug will never lead to `Regex::is_match` providing a false positive *or* a false negative. This bug is strictly about leftmost-first match semantics being incorrect in some cases and will report an incorrect match span. (4) could do with a bit more explanation, since it's rather subtle. Let's trace the minimal example through the regex crate's "reverse suffix" optimization. During compilation, there is no prefix literal that can be extracted. The `.` defeats that class of optimization. Moreover, there is a suffix literal in the regex. That is, all matches for `.bb|b` must end with `b`. The regex crate sees this and will scan for matches of `b`. It will then attempt to match the regex in reverse at each candidate match of `b`. Let's see what happens: * Find first occurrence of `b` at offset `2` in `zabb`. * Start reverse confirmation step at offset `2`. * The second alternation branch, `b` in `.bb|b`, matches at `2..3`. * The second alternation branch is reported as the overall match. This happens because the first alternation branch, `.bb`, does _not_ have a match ending at offset `3`. The fundamental problem here is that there is an overlap between the reverse automaton for confirming the match and the literal scan. Small changes, even to the haystack, can result in the bug disappearing. For example, with a haystack of `zbb`, the correct match of `0..3` is reported. This occurs because there is a quadratic "trip wire" that triggers in this case that causes the search to bail out and fall back to a DFA without using any literal optimizations. This bug also applies to the "reverse inner" optimization. This can happen when the literal is extracted from inside the regex as opposed to it being a suffix literal. For example, the regex `(?:..acbb|b)a(?:c|d)` on the haystack `xzbacbbac` reported a match at `2..5`, but the correct match is `1..9`. Note that #1355 technically fixes this problem and is much simpler, but in so doing, makes the reverse suffix and inner optimizations completely ineffective. Fixes #1354, Closes #1355
a9c5543 to
fcc4980
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
A minimal reproducer of this bug is on a haystack of
zabbwith theregex
.bb|b. Theregexcrate will report a match at2..3, but thecorrect match is
1..4.While this seems like a simple regex, there are a pretty specific set
of circumstances required to trigger the bug:
literal scan.
Crucially, note that because of (3), this bug will never lead to
Regex::is_matchproviding a false positive or a false negative.This bug is strictly about leftmost-first match semantics being
incorrect in some cases and will report an incorrect match span.
(4) could do with a bit more explanation, since it's rather subtle.
Let's trace the minimal example through the regex crate's "reverse
suffix" optimization.
During compilation, there is no prefix literal that can be extracted.
The
.defeats that class of optimization. Moreover, there is a suffixliteral in the regex. That is, all matches for
.bb|bmust end withb. The regex crate sees this and will scan for matches ofb. It willthen attempt to match the regex in reverse at each candidate match of
b. Let's see what happens:bat offset2inzabb.2.bin.bb|b, matches at2..3.This happens because the first alternation branch,
.bb, does nothave a match ending at offset
3.The fundamental problem here is that there is an overlap between the
reverse automaton for confirming the match and the literal scan. Small
changes, even to the haystack, can result in the bug disappearing.
For example, with a haystack of
zbb, the correct match of0..3isreported. This occurs because there is a quadratic "trip wire" that
triggers in this case that causes the search to bail out and fall back
to a DFA without using any literal optimizations.
This bug also applies to the "reverse inner" optimization. This
can happen when the literal is extracted from inside the regex
as opposed to it being a suffix literal. For example, the regex
(?:..acbb|b)a(?:c|d)on the haystackxzbacbbacreported a match at2..5, but the correct match is1..9.Note that #1355 technically fixes this problem and is much simpler, but
in so doing, makes the reverse suffix and inner optimizations completely
ineffective.
Fixes #1354, Closes #1355