Regex bug in UrlRequest.php
-
The function extractAllUrls() runs a preg_match_all call whose character class should also exclude closing parentheses and semicolons, not just hash signs (anchors) and question marks (query strings).
Line 1225 of UrlRequest.php:
    preg_match_all(
        '/' . str_replace('/', '\/', $baseUrl) . '[^"\'#\? ]+/i', // find this
        $this->_response['body'], // in this
        $matches // save matches into this array
    )

Otherwise HTML like this will be crawled:

    style="background-image: url(http://www.example.com/wp-content/uploads/2018/08/image.jpg);"

… and return

    http://www.example.com/wp-content/uploads/2018/08/image.jpg);

including the parentheses and semicolon. This of course causes 404 errors in the static HTML output. Fortunately it’s a simple fix in the regex pattern:

    '/' . str_replace('/', '\/', $baseUrl) . '[^"\'#\?); ]+/i'

Thanks!
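For anyone who wants to check the difference, here is a minimal standalone sketch that runs both character classes against the sample HTML. The helper name matchUrls() and the value of $baseUrl are assumptions made for illustration; they are not taken from the plugin.

```php
<?php
// Hypothetical helper for illustration only; not part of UrlRequest.php.
// Builds the pattern the same way as the snippet from the post, with the
// character class passed in so both variants can be compared.
function matchUrls(string $body, string $baseUrl, string $charClass): array
{
    $pattern = '/' . str_replace('/', '\/', $baseUrl) . $charClass . '+/i';
    preg_match_all($pattern, $body, $matches);
    return $matches[0]; // full-pattern matches
}

$baseUrl = 'http://www.example.com'; // assumed value of $baseUrl
$body = 'style="background-image: url(http://www.example.com/wp-content/uploads/2018/08/image.jpg);"';

$original = matchUrls($body, $baseUrl, '[^"\'#\? ]');   // current character class
$fixed    = matchUrls($body, $baseUrl, '[^"\'#\?); ]'); // proposed character class

echo $original[0], "\n"; // ends with ");"
echo $fixed[0], "\n";    // ends with ".jpg"
```

One trade-off to keep in mind: with the proposed class, a legitimate URL that contains ")" or ";" in its path would be cut off at that character.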