Make sanitizeGpxContent work on long input strings #96
Open
StBr65 wants to merge 1 commit into julien-nc:main from
Conversation
Change the regular expression so that the matched and captured text is as short as possible, and so that the expression only matches `<time>` and `</time>` tags that belong together.

In the original version, the expression matches and captures strings between mostly unrelated `<time>` and `</time>` tags because of the `.*` (dot asterisk) in the expression. This results in big captures that only succeed with an increased value of the `pcre.backtrack_limit` PHP setting, and a huge number of `<time>`-to-`</time>` tag combinations are tried out.

When `preg_replace` internally iterates over the input string, the `.*` construct matches from the first `<time>` tag to the last `</time>` tag, then from the second `<time>` tag to the last `</time>` tag, then from the third to the last, and so forth; then it starts over from the first `<time>` tag to the second-to-last `</time>` tag, from the second `<time>` tag to the second-to-last `</time>` tag, and so forth, until every combination of a `<time>` tag and a subsequent `</time>` tag has been tried. In a file with 30,000 time elements this results in over 450 million attempts (30,000 + 29,999 + 29,998 + …). It takes a very long time and is needless.

The suggested change ensures that the text matched and captured is only the text between related `<time>` and `</time>` tags, and that it is less than 50 characters in each iteration. With the suggested change the expression does not match the characters `[` and `<` after the `<time>` tag, which stops the matching and capturing at the first occurrence of a timezone name, or at the related `</time>` if no timezone name is present in the element. Because only related `<time>` and `</time>` tags are tried, this results in 30,000 attempts for a file with 30,000 time elements instead of 450 million.

On my server, with a high value of `pcre.backtrack_limit` in the PHP settings, a request to handle a GPX file with 30,000+ time elements gives up with a timeout after an hour.
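The 450-million figure follows from summing the descending series of pairings; a quick sanity check (in Python, as a worst-case estimate — the exact number of internal steps depends on the regex engine):

```python
# Worst-case number of <time>…</time> pairings a greedy .* forces the
# engine to consider for n time elements: each start tag can be paired
# with every subsequent end tag, i.e. n + (n-1) + … + 1 = n*(n+1)/2.
n = 30_000
greedy_attempts = n * (n + 1) // 2

# With the restricted pattern, each <time> pairs only with its own
# </time>, so the work grows linearly instead of quadratically.
restricted_attempts = n

print(greedy_attempts)      # 450015000, i.e. "over 450 million"
print(restricted_attempts)  # 30000
```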
With the suggested change, the same file is processed in less than a second, and it works with the default `pcre.backtrack_limit` value. I believe it will work on input of any size, because no capture is longer than 50 characters, and the test for a null result should then no longer be needed, but I have not verified that.
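The behavioral difference described above can be illustrated with a small sketch. The GPX snippet and patterns below are illustrative, not the repository's actual code; they only demonstrate how a greedy `.*` spans unrelated tags while a character class excluding `[` and `<` stays inside a single element:

```python
import re

# Two track points; the first carries a bracketed timezone name.
gpx = (
    "<trkpt><time>2024-05-01T10:00:00Z[Europe/Berlin]</time></trkpt>"
    "<trkpt><time>2024-05-01T10:00:05Z</time></trkpt>"
)

# Greedy pattern: .* runs from the first <time> to the LAST </time>,
# so both elements collapse into one oversized match.
greedy = re.findall(r"<time>(.*)</time>", gpx, re.DOTALL)

# Restricted pattern: [^<[]* cannot cross a '<' or '[', so each match
# stops at the timezone name (or the closing tag) of its own element.
restricted = re.findall(r"<time>([^<\[]*)", gpx)

print(len(greedy))   # 1 — one huge capture spanning unrelated tags
print(restricted)    # ['2024-05-01T10:00:00Z', '2024-05-01T10:00:05Z']
```

Because the restricted capture can never exceed the length of one timestamp, the engine has nothing to backtrack over, which is why the default `pcre.backtrack_limit` suffices.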