Skip to content

Regex support in Python→Laurel translation#623

Open
ssomayyajula wants to merge 7 commits intomainfrom
feat-laurel-regex
Open

Regex support in Python→Laurel translation#623
ssomayyajula wants to merge 7 commits intomainfrom
feat-laurel-regex

Conversation

@ssomayyajula
Copy link
Contributor

Description of changes:

Add re.compile, re.match, re.search, re.fullmatch for Python->Laurel

  • Laurel prelude: from_regex constructor on Any, re_Match composite type with bodiless methods.
  • Core prelude: re_to_regex normalizer (str|Pattern→regex), re_fullmatch_bool/re_match_bool/re_search_bool with correct anchoring, re_compile, re_fullmatch, re_match, re_search with specs.
  • PyFactory: re_compile_str with concreteEval calling pythonRegexToCore.
  • ReToCore: fixed bug — added LMonoTy type annotations to all regex expression builders so concreteEval output survives the SMT encoder (which rejects unannotated ops). The old pipeline never triggered this because the old Python→Core translator panics on re.match and DiffTestCore round-trips through text which re-annotates everything.
  • LaurelToCoreTranslator: map Laurel UserDefined "regex" to Core regex sort.
  • StrataMain: pass ReFactory to Laurel pipeline's verifyCore.

Test: end-to-end test_regex.py covering fullmatch/match/search with literal and compiled patterns, character classes, quantifiers, alternation, dot, and anchoring differences between modes.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Add re.compile, re.match, re.search, re.fullmatch flowing through
Laurel to Core's SMT-LIB string theory.

Laurel prelude: from_regex constructor on Any, re_Match composite type
with uninterpreted method stubs.

Core prelude: re_to_regex normalizer (str|Pattern→regex),
re_fullmatch_bool/re_match_bool/re_search_bool with correct anchoring,
re_compile, re_fullmatch, re_match, re_search with disjunctive requires.

PyFactory: re_compile_str with concreteEval calling pythonRegexToCore.

ReToCore: fix latent bug — add LMonoTy type annotations to all regex
expression builders so concreteEval output survives the SMT encoder
(which rejects unannotated ops). The old pipeline never triggered this
because the old Python→Core translator panics on re.match and
DiffTestCore round-trips through text which re-annotates everything.

LaurelToCoreTranslator: map Laurel UserDefined "regex" to Core regex sort.
StrataMain: pass ReFactory to Laurel pipeline's verifyCore.

Test: end-to-end test_regex.py covering fullmatch/match/search with
literal and compiled patterns, character classes, quantifiers,
alternation, dot, and anchoring differences between modes.
@ssomayyajula ssomayyajula requested a review from a team March 19, 2026 23:15
match name.uniqueId.bind model.refToDef.get? with
-- Composite types map to "Composite"; datatypes map to their own name.
-- "regex" is a Core builtin sort used by the regex prelude; pass it through.
if name.text == "regex" then .tcons "regex" []
Copy link
Contributor

@keyboardDrummer keyboardDrummer Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is incorrect. Use .TCore "regex" instead of a .Userdefined "regex" type, so you don't need this change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that comment is LLM nonsense; I'll just follow the convention of the other builtin Laurel types that map to Core types by DDM ops for them.

// Core-only declarations (not expressed in Laurel)
// =====================================================================

// /////////////////////////////////////////////////////////////////////////////////////
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you tried putting all of this in the Laurel prelude instead of the Core one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The initial few functions may have to stay in Core because they call the SMTLIB regex functions, but the functions that call them might be movable to Laurel. Let me look into that...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Managed to move them to the Laurel prelude, but getting all asserts with regexes going from pass -> unknown b/c those functions are no longer marked inline. Any existing way to do that in Laurel?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any existing way to do that in Laurel?

Not right now. I think I will change it so Laurel will just mark everything inline when using the Python front-end.


-- Type annotations for regex ops. concreteEval runs after the type-checker,
-- so expressions it produces must already carry annotations for the SMT encoder.
private def reTy : Lambda.LMonoTy := .tcons "regex" []
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding these! Nit: You can use mty[regex] and mty[str → regex], etc. (where mty stands for monotype) to construct these .tcons terms. Note you'll need to open LTy.Syntax.

See Factory.lean for some inspiration, if you like.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 2376558.

concreteEval := some
(fun _ args => match args with
| [LExpr.strConst () s] =>
let (expr, maybe_err) := pythonRegexToCore s .fullmatch
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function (also the pre-existing reCompileFunc) has a problem in that it hard-codes .fullmatch (pythonRegexToCore generates different outputs based on the mode). Making this parametric in the mode won't suffice: we'd need to look ahead in the program to see how the output of re.compile is used and provide that appropriate mode here. Too messy.

I think we need a different solution where Python's re.compile is essentially a no-op semantically. However, Python's re.search, re.match, and re.fullmatch can be implemented using such factory functions, with hard-coded match modes.

Also, could you tell me why reCompileStrFunc doesn't return an exception (unlike reCompileFunc)? pythonRegexToCore does model exceptions -- Python-analog patternError and Strata's own unimplemented where we can return a havoced regex. It'd be a shame to lose this info. at the Laurel level.

Copy link
Contributor

@shigoel shigoel Mar 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in b868c65. Exception checking added in caa351a and tests added in c5156e4

shigoel and others added 5 commits March 21, 2026 16:43
Replace single re_compile_str (hardcoded .fullmatch mode) with three
mode-specific factory functions: re_fullmatch_str, re_match_str,
re_search_str. Each compiles the pattern with the correct MatchMode
so anchors (^/$) are handled properly.

re.compile is now a semantic no-op (returns the pattern string unchanged).
The match mode is applied at the point where matching happens, not at
compile time. This fixes incorrect behavior for patterns with explicit
anchors used across different match modes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add re_pattern_error factory function that surfaces genuinely malformed
  patterns as RePatternError, modeling Python's re.error. Unimplemented
  features return NoError (Python would succeed; sound over-approximation
  comes from mode-specific factory staying uninterpreted).
- Prelude checks re_pattern_error first and returns exception(err) for
  pattern errors before attempting the match.
- Add RePatternError variant to the Laurel Error datatype.
- Remove dead reCompileFunc (PyReCompile) — never wired into either
  pipeline.
- Replace old candidate-translation comment with architecture overview
  documenting the double-parse methodology.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add ~130 new test assertions covering anchors across all modes,
  escaped metacharacters, colon patterns, multi-char anchor patterns,
  anchors inside alternation, compile+anchor interaction, and more.
- Add 9 expected-failure tests (currently unknown due to Laurel pipeline
  SAT-finding limitation) documenting properties the solver should
  disprove but cannot.
- Remove from_regex constructor from Any datatype — unused since
  re.compile is now a no-op returning the pattern string.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace manual .tcons construction with mty[regex], mty[string → regex],
etc. for readability.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add type annotations (mty[Error], mty[string → Error]) to
  RePatternError/NoError constructors in rePatternErrorFunc concreteEval,
  fixing SMT encoding errors for malformed patterns.
- Add malformed pattern tests in two styles:
  - m != None (passes): proves exception value is produced
  - try/except (unknown): idiomatic Python pattern; VC simplifies to
    (assert true) after concreteEval but quantified prelude axioms make
    SAT-finding intractable for the solver.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants