Special-case alternation with anchors in Regex analysis and code gen #64097
Description
From looking at our corpus of real-world regex patterns, it looks reasonably common for developers to write patterns like:
\d{5}$|\d{5}-\d{4}$
with the same end anchor at the end of each branch. In such cases, we could refactor this into:
(?:\d{5}|\d{5}-\d{4})$
making the $ anchor then available to an optimization like #62697. For this particular pattern, we could also augment our optimization that factors out common prefixes from branches, and turn it into:
\d{5}(?:|-\d{4})$
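As a quick sanity check of the two rewrites above, the three forms can be compared on a few inputs. This is only a sketch: it uses Python's `re` module as a stand-in for the .NET engine, and the sample strings and helper names (`span`, `all_forms_agree`) are made up for illustration.

```python
import re

# The three forms discussed above: the original alternation, the version with
# the shared end anchor hoisted out, and the version with the common prefix
# factored out as well.
ORIGINAL = re.compile(r"\d{5}$|\d{5}-\d{4}$")
ANCHOR_HOISTED = re.compile(r"(?:\d{5}|\d{5}-\d{4})$")
PREFIX_FACTORED = re.compile(r"\d{5}(?:|-\d{4})$")

# Hypothetical sample inputs, chosen for illustration only.
SAMPLES = ["12345", "12345-6789", "zip: 12345", "123456", "12345-678", "abc", ""]


def span(pattern, text):
    """Return the (start, end) of the first match, or None if there is none."""
    m = pattern.search(text)
    return m.span() if m else None


def all_forms_agree(text):
    """True when the three patterns report the same first match for text."""
    return span(ORIGINAL, text) == span(ANCHOR_HOISTED, text) == span(PREFIX_FACTORED, text)


if __name__ == "__main__":
    for s in SAMPLES:
        print(repr(s), span(ORIGINAL, s), all_forms_agree(s))
```

Agreement on a handful of strings is not a proof of equivalence, but it is a cheap way to catch a rewrite that changes which text gets matched.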
It also appears to be reasonably common to look for something at the beginning or end of the input, e.g.
^\d{15}|\d{18}$
Our FindFirstChar optimizations don't help with such a construct because of the alternation, but we could special-case such a pattern in code generation. For example, if we restricted it to patterns rooted in an alternation with exactly two branches, one with a beginning anchor and one with an end anchor, and we could compute a fixed length for the second branch, we could generate code along the lines of:
FindFirstChar() => return true iff at the beginning;
Go()
{
// Try to match first branch. If it succeeds, match.
// Otherwise, jump to input.Length - ComputedFixedLength(secondBranch) and update bumpalong to that position.
// Try to match second branch. If it succeeds, match. Else, fail.
}
This could be generalized to alternations of more than two branches, as long as every branch is rooted with an anchor, just iterating through each branch, jumping to either the beginning for a beginning anchor or to end - fixed length for an ending anchor, and running the match.
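The idea can be sketched outside the code generator. The following is a minimal Python model, not the proposed .NET implementation: `AnchoredBranch`, `candidate_start`, and `match_anchored_alternation` are hypothetical names, each branch body is stored with its anchor stripped, and the end anchor is treated as the absolute end of the input (no special handling of a trailing newline). Trying candidates in order of start position is an assumption of this sketch, made so that the result agrees with an ordinary leftmost search.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class AnchoredBranch:
    body: re.Pattern       # branch pattern with its anchor stripped
    anchor: str            # "begin" or "end"
    fixed_length: int = 0  # characters the body always consumes (used for "end")


def candidate_start(branch, text):
    """The only position where this branch could match, or None if impossible."""
    if branch.anchor == "begin":
        return 0
    start = len(text) - branch.fixed_length
    return start if start >= 0 else None


def match_anchored_alternation(branches, text):
    """Try each branch only at its single candidate position; no bumpalong scan."""
    candidates = []
    for branch in branches:
        start = candidate_start(branch, text)
        if start is not None:
            candidates.append((start, branch))
    # Stable sort: earlier positions first, original branch order on ties.
    candidates.sort(key=lambda c: c[0])
    for start, branch in candidates:
        m = branch.body.match(text, start)
        if m is None:
            continue
        # Guard: an end-anchored branch must finish exactly at the end of input.
        if branch.anchor == "end" and m.end() != len(text):
            continue
        return m.span()
    return None


# Model of ^\d{15}|\d{18}$ from the text above.
BRANCHES = [
    AnchoredBranch(re.compile(r"\d{15}"), "begin"),
    AnchoredBranch(re.compile(r"\d{18}"), "end", fixed_length=18),
]

if __name__ == "__main__":
    for s in ["1" * 15, "x" + "1" * 18, "abc"]:
        print(repr(s), match_anchored_alternation(BRANCHES, s))
```

Each branch is attempted at most once, so the work no longer depends on scanning every starting position in the input.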