Empty links due to validate_url/filter_var

Since the new URL filters were introduced last week, some posts were not processed by some of my plugins. Today, I tried to find out the cause.

It turned out that the URLs of all these posts contain non-Latin characters, and filter_var with the FILTER_VALIDATE_URL option (which is called in validate_url) does not consider such URLs valid. As a result, in the line

$entry_link = rewrite_relative_url($site_url, clean($item->get_link()));

in rssutils.php, the left-hand side ends up as the empty string, which in turn is written to the DB and causes my plugins to skip these posts, since they check $article['link'] in the article_filter hook.
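
For illustration, a minimal reproduction (the example URL is made up; any non-ASCII path triggers it):

	// hypothetical URL with non-Latin characters in the path
	$url = "https://example.com/notícias/eleições";

	// FILTER_VALIDATE_URL only accepts ASCII URLs, so this returns false,
	// even though browsers and feed readers handle the URL just fine:
	var_dump(filter_var($url, FILTER_VALIDATE_URL)); // bool(false)

	// the percent-encoded equivalent passes and is returned unchanged:
	var_dump(filter_var("https://example.com/not%C3%ADcias/elei%C3%A7%C3%B5es", FILTER_VALIDATE_URL));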

An example post is contained in the r/brasil feed. myfeedsucks shows the correct links, but if it only uses the FeedParser class, that is expected, since the class only calls clean, not validate_url.

Maybe this is an edge case of my PHP installation, and even if it isn’t, I don’t know what to do. Changing the code in validate_url to

if (filter_var($url, FILTER_VALIDATE_URL) === false && filter_var(htmlentities($url), FILTER_VALIDATE_URL) === false)

could work, but I don’t know if that is a good idea (maybe apply htmlentities to the path component only?).
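
For illustration, roughly why that would pass (hypothetical example URL): htmlentities turns the non-ASCII characters into ASCII entity references, which filter_var then accepts, although the result is no longer a usable URL.

	$url = "https://example.com/ação";

	// htmlentities turns "ç" and "ã" into ASCII entity references:
	$escaped = htmlentities($url); // "https://example.com/a&ccedil;&atilde;o"

	// FILTER_VALIDATE_URL accepts the escaped string, since it is pure ASCII:
	var_dump(filter_var($escaped, FILTER_VALIDATE_URL) !== false); // bool(true)

	// but the validated string is not the original URL anymore; a web server
	// would see "&ccedil;" literally, so this only works as a validity check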

If somebody could confirm that this isn't only happening to me, I'd be obliged; otherwise, I'm sorry for the noise.

those probably should be urlencode()d, but i'm not sure what the best place for that is, or whether it would affect security somehow.

Got the same issue: it seems links with special chars (ç, ä, ö, [space], etc.) don't work anymore. For some feeds I am the originator and can url-encode them there; for others I can't.

again i suppose the only way is urlencode()ing the URL passed to filter_var(), because it specifically rejects all non-ascii characters (and it's not wrong, such URLs are technically not valid).

main problem with this is feeds with URLs that are already urlencoded; double urlencoding would break things.
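
For what it's worth, always decoding before encoding would normalize both spellings to the same result (a sketch with a made-up path segment):

	// the same path segment, raw and already percent-encoded:
	$raw     = "coração";
	$encoded = "cora%C3%A7%C3%A3o";

	// decoding first makes both normalize to the same encoded form,
	// so already-encoded feeds don't get double-encoded:
	var_dump(rawurlencode(rawurldecode($raw)));     // "cora%C3%A7%C3%A3o"
	var_dump(rawurlencode(rawurldecode($encoded))); // "cora%C3%A7%C3%A3o"

	// caveat: a literal "%" followed by two hex digits would be misread
	// as an escape sequence and silently decoded, so this is a heuristic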

I can confirm this issue, too.
Thank you, fox, for investigating this issue.

I wrote a small proof-of-concept code change for the beginning of the validate method in the UrlHelper class. It should also be able to handle the double-urlencoding problem. The biggest difference is that parse_url is called before the filter_var function, which the PHP documentation doesn't really recommend. I would therefore suggest using a regex, as referenced in the code comments, just to be "sure" there is no malicious input. Certainly this regex would have a performance impact, but that shouldn't be much of a problem…
This static URL validation method would also be a good candidate for a unit test (a rough sketch follows the code below).

	$url = clean($url);

	// fix protocol-relative URLs
	if (strpos($url, "//") === 0)
		$url = "https:" . $url;

	// we should not pass $url to parse_url without a (basic) check first;
	// one could use the RFC 3986-compliant regex from https://urlregex.com/ here,
	// just to be "sure" there is no malicious input

	// always decode here, so already-encoded URLs don't get double-encoded below
	$tokens = parse_url(rawurldecode($url));

	// parse_url returns false for seriously malformed URLs
	if ($tokens === false || empty($tokens['host']))
		return false;

	// this could be checked via regex, too
	if (!in_array(strtolower($tokens['scheme'] ?? ''), ['http', 'https']))
		return false;

	if (!empty($tokens['path'])) {
		// encode the path, but preserve the "/" segment delimiters
		$tokens['path'] = implode("/", array_map("rawurlencode", explode("/", $tokens['path'])));
	}

	$url = self::build_url($tokens);

	// $url needs to contain ASCII characters only, otherwise it is not valid
	if (filter_var($url, FILTER_VALIDATE_URL) === false) {
		return false;
	}

	if ($extended_filtering) { [...]
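
Picking up the unit-test idea, a rough sketch (hypothetical PHPUnit code; the method name and expected results are assumptions based on the proof of concept above):

	use PHPUnit\Framework\TestCase;

	final class UrlHelperTest extends TestCase {
		public function test_validate() {
			// valid: plain ASCII, non-ASCII path, protocol-relative
			$this->assertNotFalse(UrlHelper::validate("https://example.com/feed"));
			$this->assertNotFalse(UrlHelper::validate("https://example.com/notícias"));
			$this->assertNotFalse(UrlHelper::validate("//example.com/feed"));

			// invalid: missing host, unsupported scheme
			$this->assertFalse(UrlHelper::validate("https://"));
			$this->assertFalse(UrlHelper::validate("ftp://example.com/file"));
		}
	}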

i’d rather keep scheme checking as is, instead of using regular expressions.

other than that i guess decoding the URL at the beginning should take care of possible double-urlencoding later on.

https://git.tt-rss.org/fox/tt-rss/commit/b5710baf3439343bbc65c2fc1586aefd0b94d99c

what about this?

fox, thank you very much; most of the issues with non-ASCII URL paths should now be solved.

I tested your changes and found another small issue: the punycode/IDNA conversion of host names should happen before calling filter_var, otherwise non-ASCII host names are rejected outright. If I move the conversion (l.85-91) to just before the filter_var check, everything works as intended as far as I can tell (a rough sketch follows the list below).

I tested:

  • IPv4 addresses (with and without port)
  • IPv6 addresses (with and without port)
  • URLs with non-ascii host names
  • URLs with ascii host names
  • URLs with ascii paths
  • URLs with non-ascii paths (encoded and decoded)
  • URLs starting with “//”
  • several invalid URLs and stuff
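
For reference, roughly what I mean (a sketch only; idn_to_ascii comes from PHP's intl extension, and the surrounding token-handling names are assumptions based on the code above):

	// convert a non-ASCII host to punycode *before* the filter_var check,
	// otherwise e.g. "müller.example" is rejected outright
	if (!empty($tokens['host']) && preg_match('/[^\x20-\x7E]/', $tokens['host'])) {
		$tokens['host'] = idn_to_ascii($tokens['host']);

		if ($tokens['host'] === false)
			return false;
	}

	$url = self::build_url($tokens);

	if (filter_var($url, FILTER_VALIDATE_URL) === false)
		return false;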

ah yes, good catch. this would also remove the need to build_url() twice.

https://git.tt-rss.org/fox/tt-rss/commit/a897c4165bebf975129edfb75ba3878f1deab9ec

Seems not.

For example

someurl/xml_sound_serie?seriename=%22PLAISIR%20D%27OFFRIR%22

once decoded with rawurldecode, contains a [space] character, and filter_var with FILTER_VALIDATE_URL fails at that point.

yeah, the problem is that we urldecode the entire URL but then only urlencode the path back, leaving the decoded query parameters as-is. this needs to be further refined.
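
One possible refinement (a sketch only; not necessarily what the commit below does) is to re-encode the decoded query component the same way as the path, preserving the "&" and "=" delimiters:

	if (!empty($tokens['query'])) {
		// re-encode each key and value, but keep the "&" and "=" delimiters,
		// so e.g. seriename=%22PLAISIR%20D%27OFFRIR%22 survives the round trip
		$tokens['query'] = implode("&", array_map(function ($pair) {
			return implode("=", array_map("rawurlencode", explode("=", $pair, 2)));
		}, explode("&", $tokens['query'])));
	}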

https://git.tt-rss.org/fox/tt-rss/commit/8fb2baecdccf02b50f048966b32b4a53541627d4

quickly entering horrible hacks territory