fix(filters): escape `-` in regex_url character class (#1601) by SAY-5 · Pull Request #1880 · NVIDIA-NeMo/Curator

SAY-5 · 2026-04-28T08:48:06Z

Closes #1601.

Summary

The regex_url pattern in nemo_curator/stages/text/utils/constants.py had the unescaped class [$-_@.&+]. Python's re engine reads $-_ as a range from U+0024 ($) through U+005F (_), which silently includes <, >, ;, :, =, etc.
Result: regex_url.findall("see http://x.com<bad> end") previously returned ['http://x.com<bad>'] instead of ['http://x.com']. The pattern is used by UrlsFilter and RepeatingTopNGramsFilter (called transitively via repeating_n_gram paths), so this bleed inflates the urls_ratio score on documents that interleave URLs with HTML or punctuation.
Fix: escape the - ([$\-_@.&+]) so it is treated as a literal character. The class now contains exactly the set of URL-safe characters the original author intended.

Behavior change

Characters that the buggy range accidentally admitted (<, >, ;, :, =, ASCII 0x25, 0x5F that weren't otherwise listed) are no longer captured. Characters the original code intentionally listed ($, _, @, ., &, +, -, !, *, (, ), ,, percent-encoded escapes) all still match.

Test plan

Added tests/stages/text/utils/test_regex_url.py with four regression cases:
- <bad> after a URL no longer extends the match,
- ;next after a URL no longer extends the match,
- all explicitly-allowed characters still match,
- both http:// and https:// continue to match.
Verified the new tests fail against the unfixed regex and pass against the fixed one (standalone Python re reproduction; full pytest run requires the Ray/cosmos_xenna stack which CI will exercise).

Authorship

Signed off per DCO. The patch is one line; the regression tests document why.

The `regex_url` pattern used in heuristic string filters contained the unescaped class `[$-_@.&+]`. Python's re engine reads `$-_` as a *range* spanning U+0024 ($) through U+005F (_), which silently includes `<`, `>`, `;`, `:`, `=`, and other punctuation. Matches therefore bled past the end of legitimate URLs into surrounding HTML/punctuation: re.findall(URL_RE, 'see http://x.com<bad> end') # before: ['http://x.com<bad>'] # after: ['http://x.com'] Escaping the `-` makes the regex engine treat it as a literal char, so the class is exactly the set of URL-safe characters the original author intended ([$_@.&+]). Adds regression tests under tests/stages/text/utils/test_regex_url.py covering: angle-bracket bleed, semicolon bleed, retention of all explicitly-allowed characters, and parity for `http`/`https`. Signed-off-by: SAY-5 <say.apm35@gmail.com>

copy-pr-bot · 2026-04-28T08:48:09Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

greptile-apps · 2026-04-28T08:49:24Z

Greptile Summary

This PR fixes a silent regex bug in regex_url where the unescaped [$-_...] character class was interpreted as a Unicode range (U+0024–U+005F), causing URL matches to bleed into adjacent HTML/punctuation. The fix escapes - and also adds :, =, ?, #, ~ — characters that were previously only matched as accidental side-effects of the broken range but are genuine RFC 3986 URL characters. A url_regex injection hook is also added to both UrlsFilter and PornographicUrlsFilter.

constants.py: Replaces the broken pattern with r"https?://(?:[A-Za-z0-9$\\-_@.&+/:=?#~]|[!*\$\$,]|(?:%[0-9A-Fa-f]{2}))+". The escaped -, explicit /, and added :=?#~ address all issues surfaced in the issue thread.
string.py: UrlsFilter and PornographicUrlsFilter each grow an optional url_regex parameter (string or compiled re.Pattern) so callers can substitute a stricter or looser pattern without subclassing.
test_filters.py: Four regression tests cover the bleeding-HTML case, path/query/fragment matching, the full intended allowed-character set, and both string and pre-compiled pattern injection.

Confidence Score: 5/5

The change is safe to merge — the regex fix is correct, all previously-missing URL characters are now explicitly listed, and the customisation hook is well-guarded.

The regex fix is correct: - is properly escaped, and all characters previously captured only as an accidental side-effect of the broken range (/, :, =, ?, #) are now explicitly included. The url_regex injection handles the string/compiled-pattern/None cases correctly. Regression tests cover the key before/after behaviors.

No files require special attention.

Important Files Changed

Filename	Overview
nemo_curator/stages/text/utils/constants.py	Core fix: `-` is escaped in the character class and `:`, `=`, `?`, `#`, `~`, `/` are now explicit; the new regex is RFC-3986 correct and well-commented.
nemo_curator/stages/text/filters/heuristic/string.py	Adds optional `url_regex` parameter to `UrlsFilter` and `PornographicUrlsFilter`; the string-vs-compiled branching logic is correct.
tests/stages/text/modules/test_filters.py	Regression tests cover the HTML-bleed fix, path/query/fragment matching, allowed-character set, and both string and pre-compiled `url_regex` injection modes.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["Input text"] --> B["UrlsFilter / PornographicUrlsFilter"]
    B --> C{"url_regex param provided?"}
    C -- "str" --> D["re.compile(url_regex)"]
    C -- "re.Pattern" --> E["use as-is"]
    C -- "None" --> F["use default regex_url"]
    D --> G["self._url_regex"]
    E --> G
    F --> G
    G --> H["self._url_regex.findall(text)"]
    H --> I["score_document()"]
    I --> J["keep_document()"]

_{Reviews (12): Last reviewed commit: "Merge branch 'main' into fix/regex-url-o..." | Re-trigger Greptile}

sarahyurick

Hi @SAY-5 thanks for opening! Left some requests.

sarahyurick · 2026-05-04T17:54:35Z

Please add the parameters as described here: #1601 (comment)

Then tests can be added to the existing pytest file. I do not think we need to add a separate test file here.

Signed-off-by: Sai Asish Y <saiasish.cnp@gmail.com>

SAY-5 · 2026-05-06T21:27:16Z

@sarahyurick addressed both review points:

Parameterized regex_url on UrlsFilter and PornographicUrlsFilter (per the issue #1601 thread). Both accept url_regex: re.Pattern | str | None = None and fall back to the project default when omitted, so existing call sites are unaffected.
Folded the tests into the existing tests/stages/text/modules/test_filters.py suite alongside test_urls, and removed the standalone test_regex_url.py. New cases cover the HTML-tag/semicolon regression, the allowed-character set, and the new url_regex parameter (string + compiled re.Pattern forms).

The escaped-dash regex itself was already in place from the original commit; left that alone since it produces the same compiled pattern as your suggestion form. Happy to switch to your literal form if you'd prefer that diff.

Thanks for the review!

sarahyurick

This is great, thank you @SAY-5 ! I left some tiny comments.

sarahyurick · 2026-05-07T21:37:26Z

+# `[$-_…]` the regex engine treats `$-_` as a *range* spanning U+0024
+# through U+005F, which silently includes `<`, `>`, `;`, `:`, etc., so
+# matches bleed past the actual URL into surrounding HTML/punctuation.
+# See https://github.com/NVIDIA-NeMo/Curator/issues/1601.


Remove link please.

sarahyurick · 2026-05-07T21:37:40Z

+        # Regression for #1601: previously the `[$-_…]` range matched `<`,
+        # `>`, `;`, `:`, etc., so a URL match bled past the URL into the
+        # surrounding HTML/punctuation.
+        from nemo_curator.stages.text.utils.constants import regex_url


Can you move this import to the top of the file instead?

sarahyurick · 2026-05-07T21:37:59Z

+        # The fix must not regress on characters the original character
+        # class intended to allow: letters, digits, `$`, `_`, `@`, `.`,
+        # `&`, `+`, `-`, `!`, `*`, `(`, `)`, `,`, and percent-encoded escapes.
+        from nemo_curator.stages.text.utils.constants import regex_url


Same comment as above.

sarahyurick · 2026-05-07T21:38:38Z

+    def test_urls_filter_accepts_compiled_pattern(self) -> None:
+        # The custom regex argument should also accept a pre-compiled
+        # `re.Pattern` instance, not just a string.
+        import re


Same comment as above.

Signed-off-by: SAY-5 <SAY-5@users.noreply.github.com>

sarahyurick · 2026-05-08T21:20:39Z

/ok to test 5661033

sarahyurick · 2026-05-11T21:44:19Z

/ok to test a144420

SAY-5 · 2026-05-12T01:30:16Z

Heads up on the L0 GPU-text failure: pytest output shows 59 passed, 2 skipped, 742 deselected, 34 warnings in 530.91s, then a swigvarlink no __module__ attribute deprecation warning, then SIGSEGV at process exit (exit code 139). The crash is during interpreter shutdown after all stages-text tests passed, and the regex_url change only touches Python regex pattern literals, no C/CUDA/SWIG. Happy to retest if you can /ok to test again.

sarahyurick · 2026-05-12T16:56:36Z

/ok to test 1ffe18e

sarahyurick · 2026-05-12T16:58:05Z

Thanks @SAY-5 no worries, I will make sure the PR gets merged. Thanks for contributing!

sarahyurick · 2026-05-12T19:39:17Z

/ok to test ee0b57f

SAY-5 requested a review from a team as a code owner April 28, 2026 08:48

SAY-5 requested review from huvunvidia and removed request for a team April 28, 2026 08:48

github-actions Bot added the community-request label Apr 28, 2026

svcnvidia-nemo-ci added the waiting-on-maintainers Waiting on maintainers to respond label Apr 30, 2026

sarahyurick requested changes May 4, 2026

View reviewed changes

svcnvidia-nemo-ci added waiting-on-customer Waiting on the original author to respond and removed waiting-on-maintainers Waiting on maintainers to respond labels May 4, 2026

address review: parameterize url_regex, fold tests into existing suite

9abd3fe

Signed-off-by: Sai Asish Y <saiasish.cnp@gmail.com>

svcnvidia-nemo-ci removed the waiting-on-customer Waiting on the original author to respond label May 7, 2026

sarahyurick reviewed May 7, 2026

View reviewed changes

svcnvidia-nemo-ci added the waiting-on-customer Waiting on the original author to respond label May 8, 2026

test(filters): address remaining review nits

4004624

Signed-off-by: SAY-5 <SAY-5@users.noreply.github.com>

sarahyurick approved these changes May 8, 2026

View reviewed changes

Merge branch 'main' into fix/regex-url-overly-permissive-range-1601

5661033

copy-pr-bot Bot temporarily deployed to test May 8, 2026 21:21 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci May 8, 2026 21:21 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci May 11, 2026 20:24 Inactive

copy-pr-bot Bot had a problem deploying to nemo-ci May 11, 2026 20:24 Error

copy-pr-bot Bot temporarily deployed to nemo-ci May 11, 2026 20:24 Inactive

svcnvidia-nemo-ci removed the waiting-on-maintainers Waiting on maintainers to respond label May 11, 2026

Merge branch 'main' into fix/regex-url-overly-permissive-range-1601

a144420

copy-pr-bot Bot temporarily deployed to test May 11, 2026 21:49 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci May 11, 2026 21:49 Inactive

Merge branch 'main' into fix/regex-url-overly-permissive-range-1601

1ffe18e

Merge branch 'main' into fix/regex-url-overly-permissive-range-1601

ee0b57f

Conversation

SAY-5 commented Apr 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Behavior change

Test plan

Authorship

Uh oh!

copy-pr-bot Bot commented Apr 28, 2026

Uh oh!

greptile-apps Bot commented Apr 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Flowchart

Uh oh!

sarahyurick left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

sarahyurick May 4, 2026

Choose a reason for hiding this comment

Uh oh!

SAY-5 commented May 6, 2026

Uh oh!

sarahyurick left a comment

Choose a reason for hiding this comment

Uh oh!

sarahyurick May 7, 2026

Choose a reason for hiding this comment

Uh oh!

sarahyurick May 7, 2026

Choose a reason for hiding this comment

Uh oh!

sarahyurick May 7, 2026

Choose a reason for hiding this comment

Uh oh!

sarahyurick May 7, 2026

Choose a reason for hiding this comment

Uh oh!

sarahyurick commented May 8, 2026

Uh oh!

sarahyurick commented May 11, 2026

Uh oh!

SAY-5 commented May 12, 2026

Uh oh!

sarahyurick commented May 12, 2026

Uh oh!

sarahyurick commented May 12, 2026

Uh oh!

sarahyurick commented May 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

SAY-5 commented Apr 28, 2026 •

edited

Loading

greptile-apps Bot commented Apr 28, 2026 •

edited

Loading