Skip to content

fix(filters): escape - in regex_url character class (#1601)#1880

Merged
sarahyurick merged 14 commits into
NVIDIA-NeMo:mainfrom
SAY-5:fix/regex-url-overly-permissive-range-1601
May 12, 2026
Merged

fix(filters): escape - in regex_url character class (#1601)#1880
sarahyurick merged 14 commits into
NVIDIA-NeMo:mainfrom
SAY-5:fix/regex-url-overly-permissive-range-1601

Conversation

@SAY-5
Copy link
Copy Markdown
Contributor

@SAY-5 SAY-5 commented Apr 28, 2026

Closes #1601.

Summary

  • The regex_url pattern in nemo_curator/stages/text/utils/constants.py had the unescaped class [$-_@.&+]. Python's re engine reads $-_ as a range from U+0024 ($) through U+005F (_), which silently includes <, >, ;, :, =, etc.
  • Result: regex_url.findall("see http://x.com<bad> end") previously returned ['http://x.com<bad>'] instead of ['http://x.com']. The pattern is used by UrlsFilter and RepeatingTopNGramsFilter (called transitively via repeating_n_gram paths), so this bleed inflates the urls_ratio score on documents that interleave URLs with HTML or punctuation.
  • Fix: escape the - ([$\-_@.&+]) so it is treated as a literal character. The class now contains exactly the set of URL-safe characters the original author intended.

Behavior change

Characters that the buggy range accidentally admitted (<, >, ;, :, =, ASCII 0x25, 0x5F that weren't otherwise listed) are no longer captured. Characters the original code intentionally listed ($, _, @, ., &, +, -, !, *, (, ), ,, percent-encoded escapes) all still match.

Test plan

  • Added tests/stages/text/utils/test_regex_url.py with four regression cases:
    • <bad> after a URL no longer extends the match,
    • ;next after a URL no longer extends the match,
    • all explicitly-allowed characters still match,
    • both http:// and https:// continue to match.
  • Verified the new tests fail against the unfixed regex and pass against the fixed one (standalone Python re reproduction; full pytest run requires the Ray/cosmos_xenna stack which CI will exercise).

Authorship

Signed off per DCO. The patch is one line; the regression tests document why.

The `regex_url` pattern used in heuristic string filters contained
the unescaped class `[$-_@.&+]`. Python's re engine reads `$-_` as
a *range* spanning U+0024 ($) through U+005F (_), which silently
includes `<`, `>`, `;`, `:`, `=`, and other punctuation. Matches
therefore bled past the end of legitimate URLs into surrounding
HTML/punctuation:

    re.findall(URL_RE, 'see http://x.com<bad> end')
    # before: ['http://x.com<bad>']
    # after:  ['http://x.com']

Escaping the `-` makes the regex engine treat it as a literal char,
so the class is exactly the set of URL-safe characters the original
author intended ([$_@.&+]).

Adds regression tests under tests/stages/text/utils/test_regex_url.py
covering: angle-bracket bleed, semicolon bleed, retention of all
explicitly-allowed characters, and parity for `http`/`https`.

Signed-off-by: SAY-5 <say.apm35@gmail.com>
@SAY-5 SAY-5 requested a review from a team as a code owner April 28, 2026 08:48
@SAY-5 SAY-5 requested review from huvunvidia and removed request for a team April 28, 2026 08:48
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented Apr 28, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 28, 2026

Greptile Summary

This PR fixes a silent regex bug in regex_url where the unescaped [$-_...] character class was interpreted as a Unicode range (U+0024–U+005F), causing URL matches to bleed into adjacent HTML/punctuation. The fix escapes - and also adds :, =, ?, #, ~ — characters that were previously only matched as accidental side-effects of the broken range but are genuine RFC 3986 URL characters. A url_regex injection hook is also added to both UrlsFilter and PornographicUrlsFilter.

  • constants.py: Replaces the broken pattern with r"https?://(?:[A-Za-z0-9$\\-_@.&+/:=?#~]|[!*\\(\\),]|(?:%[0-9A-Fa-f]{2}))+". The escaped -, explicit /, and added :=?#~ address all issues surfaced in the issue thread.
  • string.py: UrlsFilter and PornographicUrlsFilter each grow an optional url_regex parameter (string or compiled re.Pattern) so callers can substitute a stricter or looser pattern without subclassing.
  • test_filters.py: Four regression tests cover the bleeding-HTML case, path/query/fragment matching, the full intended allowed-character set, and both string and pre-compiled pattern injection.

Confidence Score: 5/5

The change is safe to merge — the regex fix is correct, all previously-missing URL characters are now explicitly listed, and the customisation hook is well-guarded.

The regex fix is correct: - is properly escaped, and all characters previously captured only as an accidental side-effect of the broken range (/, :, =, ?, #) are now explicitly included. The url_regex injection handles the string/compiled-pattern/None cases correctly. Regression tests cover the key before/after behaviors.

No files require special attention.

Important Files Changed

Filename Overview
nemo_curator/stages/text/utils/constants.py Core fix: - is escaped in the character class and :, =, ?, #, ~, / are now explicit; the new regex is RFC-3986 correct and well-commented.
nemo_curator/stages/text/filters/heuristic/string.py Adds optional url_regex parameter to UrlsFilter and PornographicUrlsFilter; the string-vs-compiled branching logic is correct.
tests/stages/text/modules/test_filters.py Regression tests cover the HTML-bleed fix, path/query/fragment matching, allowed-character set, and both string and pre-compiled url_regex injection modes.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["Input text"] --> B["UrlsFilter / PornographicUrlsFilter"]
    B --> C{"url_regex param provided?"}
    C -- "str" --> D["re.compile(url_regex)"]
    C -- "re.Pattern" --> E["use as-is"]
    C -- "None" --> F["use default regex_url"]
    D --> G["self._url_regex"]
    E --> G
    F --> G
    G --> H["self._url_regex.findall(text)"]
    H --> I["score_document()"]
    I --> J["keep_document()"]
Loading

Reviews (12): Last reviewed commit: "Merge branch 'main' into fix/regex-url-o..." | Re-trigger Greptile

@svcnvidia-nemo-ci svcnvidia-nemo-ci added the waiting-on-maintainers Waiting on maintainers to respond label Apr 30, 2026
Copy link
Copy Markdown
Contributor

@sarahyurick sarahyurick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @SAY-5 thanks for opening! Left some requests.

Comment thread nemo_curator/stages/text/utils/constants.py Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add the parameters as described here: #1601 (comment)

Then tests can be added to the existing pytest file. I do not think we need to add a separate test file here.

@svcnvidia-nemo-ci svcnvidia-nemo-ci added waiting-on-customer Waiting on the original author to respond and removed waiting-on-maintainers Waiting on maintainers to respond labels May 4, 2026
Signed-off-by: Sai Asish Y <saiasish.cnp@gmail.com>
@SAY-5
Copy link
Copy Markdown
Contributor Author

SAY-5 commented May 6, 2026

@sarahyurick addressed both review points:

  • Parameterized regex_url on UrlsFilter and PornographicUrlsFilter (per the issue #1601 thread). Both accept url_regex: re.Pattern | str | None = None and fall back to the project default when omitted, so existing call sites are unaffected.
  • Folded the tests into the existing tests/stages/text/modules/test_filters.py suite alongside test_urls, and removed the standalone test_regex_url.py. New cases cover the HTML-tag/semicolon regression, the allowed-character set, and the new url_regex parameter (string + compiled re.Pattern forms).

The escaped-dash regex itself was already in place from the original commit; left that alone since it produces the same compiled pattern as your suggestion form. Happy to switch to your literal form if you'd prefer that diff.

Thanks for the review!

@svcnvidia-nemo-ci svcnvidia-nemo-ci removed the waiting-on-customer Waiting on the original author to respond label May 7, 2026
Copy link
Copy Markdown
Contributor

@sarahyurick sarahyurick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great, thank you @SAY-5 ! I left some tiny comments.

# `[$-_…]` the regex engine treats `$-_` as a *range* spanning U+0024
# through U+005F, which silently includes `<`, `>`, `;`, `:`, etc., so
# matches bleed past the actual URL into surrounding HTML/punctuation.
# See https://github.com/NVIDIA-NeMo/Curator/issues/1601.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove link please.

# Regression for #1601: previously the `[$-_…]` range matched `<`,
# `>`, `;`, `:`, etc., so a URL match bled past the URL into the
# surrounding HTML/punctuation.
from nemo_curator.stages.text.utils.constants import regex_url
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move this import to the top of the file instead?

# The fix must not regress on characters the original character
# class intended to allow: letters, digits, `$`, `_`, `@`, `.`,
# `&`, `+`, `-`, `!`, `*`, `(`, `)`, `,`, and percent-encoded escapes.
from nemo_curator.stages.text.utils.constants import regex_url
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above.

def test_urls_filter_accepts_compiled_pattern(self) -> None:
# The custom regex argument should also accept a pre-compiled
# `re.Pattern` instance, not just a string.
import re
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above.

@svcnvidia-nemo-ci svcnvidia-nemo-ci added the waiting-on-customer Waiting on the original author to respond label May 8, 2026
Signed-off-by: SAY-5 <SAY-5@users.noreply.github.com>
@sarahyurick
Copy link
Copy Markdown
Contributor

/ok to test 5661033

@sarahyurick
Copy link
Copy Markdown
Contributor

/ok to test a144420

@SAY-5
Copy link
Copy Markdown
Contributor Author

SAY-5 commented May 12, 2026

Heads up on the L0 GPU-text failure: pytest output shows 59 passed, 2 skipped, 742 deselected, 34 warnings in 530.91s, then a swigvarlink no __module__ attribute deprecation warning, then SIGSEGV at process exit (exit code 139). The crash is during interpreter shutdown after all stages-text tests passed, and the regex_url change only touches Python regex pattern literals, no C/CUDA/SWIG. Happy to retest if you can /ok to test again.

@sarahyurick
Copy link
Copy Markdown
Contributor

/ok to test 1ffe18e

@sarahyurick
Copy link
Copy Markdown
Contributor

Thanks @SAY-5 no worries, I will make sure the PR gets merged. Thanks for contributing!

@sarahyurick
Copy link
Copy Markdown
Contributor

/ok to test ee0b57f

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: Overly permissive url regex

3 participants