Special-case alternation with anchors in Regex analysis and code gen #64097

@stephentoub

In our corpus of real-world regex patterns, it is reasonably common for developers to write patterns like:

\d{5}$|\d{5}-\d{4}$

with the same end anchor at the end of each branch. In such cases, we could refactor this into:

(?:\d{5}|\d{5}-\d{4})$

making the trailing $ anchor available to an optimization like #62697. For this particular pattern, we could also extend our optimization that factors common prefixes out of branches, turning it into:

\d{5}(?:|-\d{4})$
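
As a quick sanity check of the two rewrites above, here is a minimal sketch that uses Python's `re` module (not the .NET engine, so it only illustrates the equivalence) to confirm that all three forms report the same match span on a few inputs. The helper names `match_span` and `all_forms_agree` and the sample inputs are made up for illustration.

```python
import re

# The three forms discussed above: the original pattern, the version with
# the end anchor hoisted out of the alternation, and the version with the
# common \d{5} prefix factored out as well.
ORIGINAL = r"\d{5}$|\d{5}-\d{4}$"
ANCHOR_HOISTED = r"(?:\d{5}|\d{5}-\d{4})$"
PREFIX_FACTORED = r"\d{5}(?:|-\d{4})$"
FORMS = [ORIGINAL, ANCHOR_HOISTED, PREFIX_FACTORED]

# Illustrative inputs only; they are not taken from the pattern corpus.
SAMPLES = [
    "12345",
    "12345-6789",
    "zip 12345",
    "12345-678",
    "1234",
    "12345-6789 ",
    "98765-4321\n",  # Python's $ also matches just before a trailing newline
    "123456",
]


def match_span(pattern, text):
    """Return the (start, end) span of the first match, or None."""
    m = re.search(pattern, text)
    return m.span() if m else None


def all_forms_agree(text):
    """True if all three forms produce the same span (or all fail) on text."""
    return len({match_span(p, text) for p in FORMS}) == 1
```

Agreement on a few samples is not a proof, but it is a cheap way to catch a rewrite that changes which branch wins.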

It also appears to be reasonably common to look for something at the beginning or end of the input, e.g.

^\d{15}|\d{18}$

Our FindFirstChar optimizations don't help with such a construct because of the alternation, but we could special-case it in code generation. For example, suppose we restricted this to patterns rooted in an alternation with exactly two branches, one with a beginning anchor and one with an end anchor, where we can compute a fixed length for the second branch. We could then generate code along the lines of:

FindFirstChar() => return true iff at the beginning;
Go()
{
   // Try to match first branch.  If it succeeds, match.
   // Otherwise, update bumpalong and jump to input.Length - ComputedFixedLength(secondBranch).
   // Try to match second branch.  If it succeeds, match. Else, fail.
}
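
To make the control flow above concrete, here is a rough Python sketch of the two-branch case. It is not the generated .NET code: `find_begin_or_end` is a hypothetical helper, the anchors are stripped from the branch bodies and handled by position instead, and the end anchor is treated as the absolute end of input (the case where `$` matches before a trailing newline is deliberately ignored).

```python
import re


def find_begin_or_end(text, first_body, second_body, second_len):
    """Search for '^first_body' or 'second_body at end of input'.

    Hypothetical helper. second_len is the fixed match length of
    second_body, assumed to have been computed ahead of time.
    Returns the (start, end) span of the match, or None.
    """
    # Branch 1 is anchored at the beginning, so position 0 is the only
    # place it can match.
    m = re.compile(first_body).match(text, 0)
    if m:
        return m.span()

    # Branch 2 must end at the end of the input. With a fixed length there
    # is exactly one possible start position, so jump straight to it.
    start = len(text) - second_len
    if start < 0:
        return None  # input is too short for the second branch
    m = re.compile(second_body).fullmatch(text, start)
    return m.span() if m else None
```

For the example pattern, `find_begin_or_end(text, r"\d{15}", r"\d{18}", 18)` examines at most two starting positions, regardless of the input length.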

This could be generalized to alternations of more than two branches, as long as every branch is rooted with an anchor and every end-anchored branch has a computable fixed length: iterate through the branches, jump to the beginning for a beginning anchor or to end - fixed length for an end anchor, and run the match.
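
The generalization to any number of anchored branches might look like the following Python sketch. Everything here is an assumption made for illustration: the `(kind, body, fixed_len)` branch representation, the helper name `find_anchored_alternation`, and the treatment of the end anchor as the absolute end of input. The sketch also makes one choice the description above leaves open: candidates are tried in order of start position (ties broken by branch order) so that the reported match is still the leftmost one, as a normal search would report.

```python
import re


def find_anchored_alternation(text, branches):
    """Match an alternation in which every branch is rooted in an anchor.

    branches is a list of (kind, body, fixed_len) tuples (a hypothetical
    representation): kind is "begin" or "end", body is the branch pattern
    without its anchor, and fixed_len is the fixed match length of an
    "end" branch (ignored for "begin" branches).
    Returns the (start, end) span of the match, or None.
    """
    candidates = []
    for order, (kind, body, fixed_len) in enumerate(branches):
        # Each branch has exactly one position where it could start.
        start = 0 if kind == "begin" else len(text) - fixed_len
        if start >= 0:
            candidates.append((start, order, kind, body))

    # Leftmost start first; among equal starts, earlier branches first.
    for start, _order, kind, body in sorted(candidates, key=lambda c: (c[0], c[1])):
        pattern = re.compile(body)
        if kind == "begin":
            m = pattern.match(text, start)
        else:
            # An end-anchored branch must consume the rest of the input.
            m = pattern.fullmatch(text, start)
        if m:
            return m.span()
    return None
```

If only a yes/no answer is needed, the sort can be dropped and the branches tried in their written order.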
