Link that breaks make_clickable()

andy.carver
(@andycarver)

10 years, 4 months ago

Hi,

I’m a great fan of make_clickable(). So it grieves me to report that it breaks on the following real, live URL:

http://www.legislature.mi.gov/(S(u4kr0jxgdjux4q3eqpsx030i))/mileg.aspx?page=getobject&objectname=2015-HJR-BB&query=on

Why is this? I think it would be great if this could be remedied. In the meantime, does anyone know a workaround?

Thanks!


(@anevins)

WCLDN 2018 Contributor | Volunteer support

It doesn’t look like you’re using real spaces, or you copied a string from somewhere like email and it contained nbsp spaces.

Thread Starter

andy.carver

(@andycarver)

10 years, 3 months ago

“Real spaces” where? This was not copied out of an email, nor typed; it is a real, working URL (on the Michigan State Legislature’s website). And I don’t see any spaces in it, encoded or converted or otherwise.

Andrew Nevins

(@anevins)


10 years, 3 months ago

I must be looking at the wrong sentence. Which sentence on the page you linked shows the issue?

Thread Starter

andy.carver

(@andycarver)

10 years, 3 months ago

I may not have made the issue clear. It’s that when I feed this URL to the WP function make_clickable(), the function returns the URL with only an initial part of it “clickable” (hyperlinked).

It closes the hyperlink somewhere around one of the parentheses that appear in the early part of the URL. Though I’m a newbie at RegEx, this makes me suspect that the opening or closing parenthesis is breaking the RegEx…

Thread Starter

andy.carver

(@andycarver)

10 years, 3 months ago

I think I might have found the problem, within the source code for that function. About half-way thru the source, you’ll see this comment:

# Unroll the Loop: Only allow puctuation URL character if followed by a non-punctuation URL character

Now, it so happens that the code classifies a closing parenthesis bracket ‘)’ as a punctuation character. Thus, the code is specifically set up not to allow two of them in sequence ‘))’ such as we find in this URL 🙁
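To see the effect outside WordPress, here is a minimal sketch in plain PHP. It copies only the URL-body fragment of the pattern (not the whole make_clickable() RegEx) and anchors it at the start of the string. The function name match_url_body() is made up for illustration.

```php
<?php
// Hypothetical helper: runs only the "URL body" fragment of the pattern,
// anchored at the start of $text, and returns the part it accepts.
function match_url_body(string $text): string {
    $fragment = '~^
        [\w\x80-\xff#%\~/@\[\]*(+=&$-]*+        # non-punctuation characters, "(" included
        (?:
            [\'.,;:!?)]                         # one punctuation character, ")" included
            [\w\x80-\xff#%\~/@\[\]*(+=&$-]++    # must be followed by non-punctuation
        )*
    ~x';
    return preg_match($fragment, $text, $m) ? $m[0] : '';
}

echo match_url_body('www.legislature.mi.gov/(S(u4kr0jxgdjux4q3eqpsx030i))/mileg.aspx'), "\n";
// prints: www.legislature.mi.gov/(S(u4kr0jxgdjux4q3eqpsx030i
```

The match stops right before the "))", because the first ")" counts as punctuation and the character after it is not a non-punctuation character.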

Thread Starter

andy.carver

(@andycarver)

10 years, 3 months ago

<aside>Rather ironically, it classifies ‘(‘ as non-punctuation, while it classifies ‘)’ as punctuation…</aside>

Thread Starter

andy.carver

(@andycarver)

10 years, 3 months ago

Yes, that’s the problem — when you adjust the RegEx to accept ‘)’ likewise as “Non-punctuation”, it handles the above URL just fine.

Thread Starter

andy.carver

(@andycarver)

10 years, 3 months ago

I guess they wanted to avoid including a ‘)’ that closed a pair of brackets surrounding the whole URL. Good idea, but the solution leaves something to be desired… like, accepting the above URL 🙁

I wonder if RegEx can count how many ‘(‘s have occurred, and not let any more than that number of ‘)’s occur? I’m too much of a newbie to know, but that would seem an ideal solution…

Thread Starter

andy.carver

(@andycarver)

10 years, 3 months ago

Some things I’ve come across suggest that recursion is the key to matching possibly nested brackets. For example, see http://www.regular-expressions.info/recurse.html.
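As a rough illustration of what recursion buys you, here is a minimal PHP (PCRE) sketch that matches one properly balanced pair of round brackets, nesting included. It is a generic textbook-style pattern, not anything taken from make_clickable(), and first_balanced_group() is a made-up name.

```php
<?php
// Hypothetical helper: returns the first balanced "(...)" group in $text, or null.
function first_balanced_group(string $text): ?string {
    $pattern = '~
        \(              # opening bracket
        (?:
            [^()]++     # a run of characters that are not brackets
          |
            (?R)        # or a nested balanced group: recurse into the whole pattern
        )*+
        \)              # the matching closing bracket
    ~x';
    return preg_match($pattern, $text, $m) ? $m[0] : null;
}

echo first_balanced_group('x (S(abc)) y)'), "\n";
// prints: (S(abc))
```

The stray ")" at the end of the sample text is ignored, since every ")" the pattern accepts has to close a "(" it opened earlier.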

Thread Starter

andy.carver

(@andycarver)

10 years, 3 months ago

Silly me: the problem they were addressing wasn’t how to prevent extra ‘)’s from being included at the end of the URL. They had already addressed that, very specifically, by counting ‘(‘s and ‘)’s in their text-replacing callback function ( _make_url_clickable_cb() ).
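The counting idea can be sketched in a few lines of plain PHP. This is only an illustration of the approach, not a copy of the WordPress callback, and split_unbalanced_parens() is a made-up name.

```php
<?php
// Hypothetical helper: moves surplus trailing ")" characters out of the URL.
// Returns [url, suffix], where suffix holds the brackets that were peeled off.
function split_unbalanced_parens(string $url): array {
    $suffix = '';
    // While there are more ")" than "(", peel one ")" off the end of the URL.
    while (substr_count($url, '(') < substr_count($url, ')') && substr($url, -1) === ')') {
        $url    = substr($url, 0, -1);
        $suffix = ')' . $suffix;
    }
    return [$url, $suffix];
}

print_r(split_unbalanced_parens('http://example.com/a_(b))'));
// url: http://example.com/a_(b)   suffix: )
```

A URL whose brackets already pair up, such as the one in this thread, comes back unchanged with an empty suffix.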

But that leaves me even more puzzled: having taken care of that problem, WHY shouldn’t they include ) in the list of characters, one of which must follow any “punctuation” character? Doing so fixes the problem with the above URL…

Here’s the place in make_clickable()’s code I’m talking about; the single change I’m suggesting — inserting a ) — is on the next to last line here (sorry for the wordwrap):

[\\w\\x80-\\xff#%\\~/@\\[\\]*(+=&$-]*+         # Non-punctuation URL character
(?:                                            # Unroll the Loop: Only allow puctuation URL character if followed by a non-punctuation URL character
    [\'.,;:!?)]                                # Punctuation URL character
    [\\w\\x80-\\xff#%\\~/@\\[\\]*(+=&$)-]++    # Non-punctuation URL character --PLUS )
)*
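For anyone who wants to try the one-character change without editing WordPress itself, here is a small standalone sketch in plain PHP. It runs only this fragment, anchored at the start of the string, with and without the extra ). The helper name match_url_body() and its $patched flag are made up for the comparison.

```php
<?php
// Hypothetical helper: applies the URL-body fragment to $text.
// With $patched = true, ")" is also allowed in the class that follows punctuation.
function match_url_body(string $text, bool $patched): string {
    $extra    = $patched ? ')' : '';
    $fragment = '~^[\w\x80-\xff#%\~/@\[\]*(+=&$-]*+'
              . '(?:[\'.,;:!?)][\w\x80-\xff#%\~/@\[\]*(+=&$' . $extra . '-]++)*~';
    return preg_match($fragment, $text, $m) ? $m[0] : '';
}

$tail = 'www.legislature.mi.gov/(S(u4kr0jxgdjux4q3eqpsx030i))/mileg.aspx'
      . '?page=getobject&objectname=2015-HJR-BB&query=on';

echo match_url_body($tail, false), "\n"; // stops before the "))"
echo match_url_body($tail, true), "\n";  // accepts the whole string
```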

Thread Starter

andy.carver

(@andycarver)

10 years, 3 months ago

But that cheap and easy solution also leaves something to be desired, because any closing punctuation is accepted if it’s followed by a ) .

So… let’s go for the recursive solution. Here is mine, with much debt to Jan Goyvaerts’s website!

The following code block is what you should replace the old code with inside make_clickable() — I have tried to add enough of the context to find the location in the original:

$url_clickable = '~
                                ([\\s(<.,;:!?])                                        # 1: Leading whitespace, or punctuation
                                (                                                      # 2: URL
                                        [\\w]{1,20}+://                                # Scheme and hier-part prefix
                                        (?=\S{1,2000}\s)                               # Limit to URLs less than about 2000 characters long

# here begins a recursive solution, based on plugging the original (herewith-replaced) RegEx code (see next line) into an idea nicked from JAN GOYVAERTS:

                                        (?:[\\w\\x80-\\xff#%\\~/@\\[\\]*+=&$-](?:[\'.,;:!?][\\w\\x80-\\xff#%\\~/@\\[\\]*+=&$-]++)*)*+        # basically the original, less ( and ), quantified
                                    	(?\'brackets\'
                                    		\(                                                                                               # Opening round bracket, of a proper pair
                                    			(?:
                                    				[\\w\\x80-\\xff#%\\~/@\\[\\]*+=&$-](?:[\'.,;:!?][\\w\\x80-\\xff#%\\~/@\\[\\]*+=&$-]++)*|(?P>brackets)    # recursive subroutine call
                                    			)*
                                    		\)                                                                                               # Closing round bracket, of a proper pair
                                            (?:[\\w\\x80-\\xff#%\\~/@\\[\\]*+=&$-](?:[\'.,;:!?][\\w\\x80-\\xff#%\\~/@\\[\\]*+=&$-]++)*)*+    # another insertion of the basically original code
                                    	)*
                                )
# this line is NOW SUPERFLUOUS  (\)?)                                                  # 3: Trailing closing parenthesis (for parethesis balancing post processing)
# end of recursive solution
# ... BUT, you MUST now use my new, drastically shortened version of the callback function _make_url_clickable_cb(), next ... I call it _my_make_url_clickable_cb() ... ;)
                        ~xS'; // The regex is a non-anchored pattern and does not have a single fixed starting character.
                              // Tell PCRE to spend more time optimizing since, when used on a page load, it will probably be used several times.

                        $ret = preg_replace_callback( $url_clickable, '_my_make_url_clickable_cb', $ret );

And here is the new, drastically shortened version of the callback function you must use:

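function _my_make_url_clickable_cb($matches) {
	$url = esc_url($matches[2]); // 2: the URL, brackets already balanced by the RegEx
	if ( empty($url) ) {
		return $matches[0]; // esc_url() rejected it, so leave the text untouched
	}

	// 1: the leading whitespace or punctuation, kept outside the link
	return $matches[1] . "<a href=\"$url\" rel=\"nofollow\">$url</a>";
}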

Thread Starter

andy.carver

(@andycarver)

10 years ago

Now I’ve got a more efficient and robust RegEx for my improved make_clickable():

$url_clickable = '~
        ([\\s(<.,;:!?])                                        # 1: Leading whitespace, or punctuation
        (                                                      # 2: URL
                [\\w]{1,20}+://                                # Scheme and hier-part prefix
                (?=\S{1,2000}\s)                               # Limit to URLs less than about 2000 characters long

	# here begins a recursive solution, based somewhat on an idea or two from JAN GOYVAERTS:

		[\w\x80-\xff#%\~@\[\]*+=&$-]++			# Non-punctuation URL character -- excluding / HERE ONLY
		(?:
		 [\'.,;:!?]					# punctuation URL character
		  (?![/@])  					# Let us avoid having one of these right after the punctuation
		 [\w\x80-\xff#%\~/@\[\]*+=&$-]++		# Allow punctuation URL character only if followed by a non-punctuation URL character
		)++
		(?:
		 (
		  \(						# Opening round bracket, of a proper pair
		   (?:
		    [\w\x80-\xff#%\~/@\[\]*+=&$-]++		# Non-punctuation URL character -- including /
		    (?:[\'.,;:!?](?![/@])[\w\x80-\xff#%\~/@\[\]*+=&$-]++)*+
		   |
		    (?3)    				# recursive subroutine call
		   )*+
		  \)					# Closing round bracket, of a proper pair
		 )
		 (?:
		  [\w\x80-\xff#%\~/@\[\]*+=&$-]++
		  (?:[\'.,;:!?](?![/@])[\w\x80-\xff#%\~/@\[\]*+=&$-]++)*+
		 )?+
		)*+
        )
# this line is NOW SUPERFLUOUS  (\)?)                    # 3: Trailing closing parenthesis (for parethesis balancing post processing)
# end of recursive solution
# ... BUT, you MUST now use my new, drastically SHORTENED version of the callback function _make_url_clickable_cb(), next ... I call it _my_make_url_clickable_cb() ... ;)
                        ~xS'; // The regex is a non-anchored pattern and does not have a single fixed starting character.
                              // Tell PCRE to spend more time optimizing since, when used on a page load, it will probably be used several times.

                        $ret = preg_replace_callback( $url_clickable, '_my_make_url_clickable_cb', $ret );
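If you’d like to try the pattern without touching a WordPress install, the following is a minimal standalone sketch. It reuses the RegEx above in compact form, and it makes two assumptions that are not part of the original: linkify() is a made-up stand-in for make_clickable(), and htmlspecialchars() stands in for the WordPress-only esc_url().

```php
<?php
// Hypothetical stand-in for make_clickable(). The text is padded with spaces
// because the pattern needs a character before the URL and whitespace after it.
function linkify(string $text): string {
    $pattern = '~
        ([\s(<.,;:!?])                                  # 1: leading whitespace or punctuation
        (                                               # 2: URL
            [\w]{1,20}+://
            (?=\S{1,2000}\s)
            [\w\x80-\xff#%\~@\[\]*+=&$-]++
            (?:[\'.,;:!?](?![/@])[\w\x80-\xff#%\~/@\[\]*+=&$-]++)++
            (?:
                (                                       # 3: one balanced bracket pair
                    \(
                    (?:
                        [\w\x80-\xff#%\~/@\[\]*+=&$-]++
                        (?:[\'.,;:!?](?![/@])[\w\x80-\xff#%\~/@\[\]*+=&$-]++)*+
                      |
                        (?3)                            # recurse into group 3
                    )*+
                    \)
                )
                (?:
                    [\w\x80-\xff#%\~/@\[\]*+=&$-]++
                    (?:[\'.,;:!?](?![/@])[\w\x80-\xff#%\~/@\[\]*+=&$-]++)*+
                )?+
            )*+
        )
    ~x';
    $linked = preg_replace_callback($pattern, function (array $m): string {
        $url = htmlspecialchars($m[2]); // assumption: replaces WordPress esc_url()
        return $m[1] . "<a href=\"$url\" rel=\"nofollow\">$url</a>";
    }, " $text ");
    // Strip the padding again; fall back to the input if PCRE reported an error.
    return $linked === null ? $text : substr($linked, 1, -1);
}

echo linkify('(see http://example.com/page)'), "\n";
// prints: (see <a href="http://example.com/page" rel="nofollow">http://example.com/page</a>)
```

The closing bracket of the surrounding "(see …)" stays outside the link, while a URL with properly paired brackets inside it, like the Michigan one, is linked in full.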


The topic ‘Link that breaks make_clickable()’ is closed to new replies.
