Empty links due to validate_url/filter_var

Since the new URL filters were introduced last week, some posts were not processed by some of my plugins. Today, I tried to find out the cause.

It turned out that the URLs of all these posts contain non-Latin characters, and filter_var with the FILTER_VALIDATE_URL option (which is called in validate_url) does not consider such URLs valid. As a result, in the line

$entry_link = rewrite_relative_url($site_url, clean($item->get_link()));

in rssutils.php, the left-hand side ends up as the empty string, which in turn is written to the DB and causes my plugins to skip these posts, since they check $article['link'] in the article_filter hook.
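
For illustration, a minimal reproduction (the example URL is made up; any non-ASCII path triggers it):

	// hypothetical URL with non-Latin characters in the path
	$url = "https://example.com/notícias/eleições";

	// FILTER_VALIDATE_URL only accepts ASCII URLs, so this returns false,
	// even though browsers and feed readers handle the URL just fine:
	var_dump(filter_var($url, FILTER_VALIDATE_URL)); // bool(false)

	// the percent-encoded equivalent passes and is returned unchanged:
	var_dump(filter_var("https://example.com/not%C3%ADcias/elei%C3%A7%C3%B5es", FILTER_VALIDATE_URL));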

An example post is contained in the r/brasil feed. myfeedsucks shows the correct links, but if it only uses the FeedParser class, that is expected, since the class only calls clean, not validate_url.

Maybe this is an edge case of my PHP installation, and even if it isn’t, I don’t know what to do. Changing the code in validate_url to

if (filter_var($url, FILTER_VALIDATE_URL) === false && filter_var(htmlentities($url), FILTER_VALIDATE_URL) === false)

could work, but I don’t know if that is a good idea (maybe apply htmlentities to the path component only?).
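
For illustration, roughly why that would pass (hypothetical example URL): htmlentities turns the non-ASCII characters into ASCII entity references, which filter_var then accepts, although the result is no longer a usable URL.

	$url = "https://example.com/ação";

	// htmlentities turns "ç" and "ã" into ASCII entity references:
	$escaped = htmlentities($url); // "https://example.com/a&ccedil;&atilde;o"

	// FILTER_VALIDATE_URL accepts the escaped string, since it is pure ASCII:
	var_dump(filter_var($escaped, FILTER_VALIDATE_URL) !== false); // bool(true)

	// but the validated string is not the original URL anymore; a web server
	// would see "&ccedil;" literally, so this only works as a validity check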

If somebody could confirm that this isn't only happening to me, I'd be obliged; otherwise, I'm sorry for the noise.

those probably should be urlencode()d, but i'm not sure what the best place for that is, or whether it would affect security somehow.

Got the same issue: it seems links with special chars (ç, ä, ö, [space], etc.) don't work anymore. For some feeds I am the originator and can url-encode them there; for others I can't.

again i suppose the only way is urlencode()ing the URL passed to filter_var(), because it specifically rejects all non-ascii characters (and it's not wrong, such URLs are technically not valid).

main problem with this is feeds with URLs that are already urlencoded; double urlencoding would break things.
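
For what it's worth, always decoding before encoding would normalize both spellings to the same result (a sketch with a made-up path segment):

	// the same path segment, raw and already percent-encoded:
	$raw     = "coração";
	$encoded = "cora%C3%A7%C3%A3o";

	// decoding first makes both normalize to the same encoded form,
	// so already-encoded feeds don't get double-encoded:
	var_dump(rawurlencode(rawurldecode($raw)));     // "cora%C3%A7%C3%A3o"
	var_dump(rawurlencode(rawurldecode($encoded))); // "cora%C3%A7%C3%A3o"

	// caveat: a literal "%" followed by two hex digits would be misread
	// as an escape sequence and silently decoded, so this is a heuristic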

I can confirm this issue, too.
Thank you, fox, for investigating this issue.

I wrote a small proof-of-concept code change for the beginning of the validate method in the UrlHelper class. It should also be able to handle the double-urlencoding problem. The biggest difference is that parse_url is called before the filter_var function, which the PHP documentation doesn't really recommend. I would therefore suggest using a regex, as referenced in the code comments, just to be "sure" there is no malicious input. Certainly this regex would have a performance impact, but that shouldn't be much of a problem…
This static URL validation method would also be a good candidate for a unit test (a rough sketch follows the code below).

	$url = clean($url);

	// fix protocol-relative URLs
	if (strpos($url, "//") === 0)
		$url = "https:" . $url;

	// we should not pass $url to parse_url without a (basic) check first;
	// one could use the RFC 3986-compliant regex from https://urlregex.com/ here,
	// just to be "sure" there is no malicious input

	// always decode here, so already-encoded URLs don't get double-encoded below
	$tokens = parse_url(rawurldecode($url));

	// parse_url returns false for seriously malformed URLs
	if ($tokens === false || empty($tokens['host']))
		return false;

	// this could be checked via regex, too
	if (!in_array(strtolower($tokens['scheme'] ?? ''), ['http', 'https']))
		return false;

	if (!empty($tokens['path'])) {
		// encode the path, but preserve the "/" segment delimiters
		$tokens['path'] = implode("/", array_map("rawurlencode", explode("/", $tokens['path'])));
	}

	$url = self::build_url($tokens);

	// $url needs to contain ASCII characters only, otherwise it is not valid
	if (filter_var($url, FILTER_VALIDATE_URL) === false) {
		return false;
	}

	if ($extended_filtering) { [...]
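
Picking up the unit-test idea, a rough sketch (hypothetical PHPUnit code; the method name and expected results are assumptions based on the proof of concept above):

	use PHPUnit\Framework\TestCase;

	final class UrlHelperTest extends TestCase {
		public function test_validate() {
			// valid: plain ASCII, non-ASCII path, protocol-relative
			$this->assertNotFalse(UrlHelper::validate("https://example.com/feed"));
			$this->assertNotFalse(UrlHelper::validate("https://example.com/notícias"));
			$this->assertNotFalse(UrlHelper::validate("//example.com/feed"));

			// invalid: missing host, unsupported scheme
			$this->assertFalse(UrlHelper::validate("https://"));
			$this->assertFalse(UrlHelper::validate("ftp://example.com/file"));
		}
	}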

i’d rather keep scheme checking as is, instead of using regular expressions.

other than that i guess decoding the URL at the beginning should take care of possible double-urlencoding later on.

https://git.tt-rss.org/fox/tt-rss/commit/b5710baf3439343bbc65c2fc1586aefd0b94d99c

what about this?

fox, thank you very much; most of the issues with non-ASCII URL paths should now be solved.

I tested your changes and found another small issue: the punycode/IDNA conversion of host names should happen before calling filter_var, otherwise non-ASCII host names are rejected outright. If I move the conversion (l.85-91) to just before the filter_var check, everything works as intended as far as I can tell (a rough sketch follows the list below).

I tested:

  • IPv4 addresses (with and without port)
  • IPv6 addresses (with and without port)
  • URLs with non-ascii host names
  • URLs with ascii host names
  • URLs with ascii paths
  • URLs with non-ascii paths (encoded and decoded)
  • URLs starting with “//”
  • several invalid URLs and stuff
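
For reference, roughly what I mean (a sketch only; idn_to_ascii comes from PHP's intl extension, and the surrounding token-handling names are assumptions based on the code above):

	// convert a non-ASCII host to punycode *before* the filter_var check,
	// otherwise e.g. "müller.example" is rejected outright
	if (!empty($tokens['host']) && preg_match('/[^\x20-\x7E]/', $tokens['host'])) {
		$tokens['host'] = idn_to_ascii($tokens['host']);

		if ($tokens['host'] === false)
			return false;
	}

	$url = self::build_url($tokens);

	if (filter_var($url, FILTER_VALIDATE_URL) === false)
		return false;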

ah yes, good catch. this would also remove the need to build_url() twice.

https://git.tt-rss.org/fox/tt-rss/commit/a897c4165bebf975129edfb75ba3878f1deab9ec

Seems not.

For example

someurl/xml_sound_serie?seriename=%22PLAISIR%20D%27OFFRIR%22

once decoded with rawurldecode, contains a [space] character, and filter_var with FILTER_VALIDATE_URL fails at that point.

yeah, the problem is that we urldecode the entire URL but then only urlencode the path back, leaving the decoded query parameters as-is. this needs to be further refined.
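
One possible refinement (a sketch only; not necessarily what the commit below does) is to re-encode the decoded query component the same way as the path, preserving the "&" and "=" delimiters:

	if (!empty($tokens['query'])) {
		// re-encode each key and value, but keep the "&" and "=" delimiters,
		// so e.g. seriename=%22PLAISIR%20D%27OFFRIR%22 survives the round trip
		$tokens['query'] = implode("&", array_map(function ($pair) {
			return implode("=", array_map("rawurlencode", explode("=", $pair, 2)));
		}, explode("&", $tokens['query'])));
	}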

https://git.tt-rss.org/fox/tt-rss/commit/8fb2baecdccf02b50f048966b32b4a53541627d4

quickly entering horrible hacks territory