Tiny Tiny RSS: Community

Problem with <img srcset>

(Posting in Unsupported just to be safe)

When updating this feed with image caching enabled, it tries to to download lots and lots of invalid urls and spams the server with invalid requests.

Steps to reproduce:

  1. Add this feed
  2. enable “check media” in the options for this feed
  3. wait for it to update or do it manually (or rather don’t! it’s a wonder they didn’t ban me automatically)

tt-rss version (including git commit id): v20.05 (d63329baa1)

Platform (i.e. Linux distro, PHP, PostgreSQL, etc) versions:

  • Arch Linux
  • PHP 7.4.6
  • PostgreSQL 12.2
  • Apache 2.4.43
  • PHP-FPM 7.4.6

Basically, <img srcset> syntax is complicated. I first thought my feed was broken (it contains URLs in srcset which include commas) but after testing my browsers (Firefox 78.0a1) behavior and reading the W3 spec closely I think it is valid after all because why wouldn’t it be. Putting JSON in srcset would have been about as overengineered but at least fairly easy to handle.

As I’m interpreting it (from point 2):

A valid non-empty URL that does not start or end with a U+002C COMMA character (,), referencing a non-interactive, optionally animated, image resource that is neither paged nor scripted.

A URL may actually contain a comma as long as it’s not at the start or the end.

I also wouldn’t be opposed to removing srcset entirely but as the feed is following the spec I don’t think it’s fair to spam them with (invalid) requests as it does now (also in this case one url gets split into 4 making it even worse). And (I think) it’s always going to get a 404 it’s going to retry every time it updates.

This ugly regex seems to work, but there may be some edge-cases I missed:

$matches = [];
preg_match_all(
  '/(?:\A|,)\s*(?P<url>(?!,)\S+(?<!,))\s*(?P<size>\s\d+w|\s\d+(?:\.\d+)?(?:[eE][+-]?\d+)?x|)\s*(?=,|\Z)/',
  $srcset, $matches, PREG_SET_ORDER
);
foreach ($matches as $m) {
  printf("url: %s\nsize: %s\n", $m["url"], $m["size"]);
}
Explanation of regex
  • (?:\A|,)\s*: each “image candidate string” starts with either a “,” or the start of the string “\A” followed by any number of whitespace characters
  • (?P<url>(?!,)\S+(?<!,)): then a url which neither starts (?!,) not ends (?<!,) with a comma and doesn’t contain spaces [probably more permissive than it should be]
  • then any number of whitespace characters
  • then optionally a size specifier which must if present start with a whitespace character and then either
    • something like 20w
    • or something like 1.23e-45x
  • more optional whitespace
  • (?=,|\Z): it must be followed by either end of string or a comma

thanks for this post. scrcset is definitely over-engineered (just like modern web-related stack in general, i guess) and i didn’t want to bother with it initially because of stuff like this.

can you file a PR? i’m not sure if regexp is the best way, especially a complicated one like this, but i can’t think of any cleaner alternatives right now.

After taking a quick look at this, I’m wondering if using preg_split() might not be a simpler choice? It will basically function like explode() but with the added benefit of being able to define when to use the comma as a delimiter.

Just thinking out loud…

Not sure, preg_split() might not work. Take e.g. http://foo/bar,baz 20w,http://foo/bar. What would you split on in this case? A [wx] before the comma? Would that work?

Fwiw:

If we were to go with my regex:

  • where to put it? In functions.php? currently it’s all over the place in diskcache.php, rssutils.php, and functions.php
  • should /x be used to comment the regex? python documentation if you don’t know what it is

E: just reading through my links I want less and less to do with this by the minute
E: here’s a (fairly self-contained) parser in JS if you prefer that… I don’t think I’m going to feel like implementing that in PHP any time soon and you probably wouldn’t even accept it because who wants to maintain that.

it’s all over the place because i thought it would be “split a string by comma”. now that i know how bad things actually are, i think a method should be added somewhere (rssutils?) to return a list of urls based on srcset string. that would use the regexp or something else.

i’m thinking that regexp would be good enough.

preg_split( '/(?<=[wx])\s*,/i', $srcset );

Not saying this is perfect, but I tried a few different srcset’s and it seemed to work just fine.

preg_split( ‘/(?<=[wx])\s*,/i’, $srcset );

It fares a lot better than I thought it would but if you want a simple counterexample: foo-800w,400h.jpg

it’s all over the place because i thought it would be “split a string by comma”

well yeah, I didn’t mean that in a bad way for the way it is now. I only meant

  • if it’s any more complicated than it is now it definitely wouldn’t duplicate it 2/3 times
  • I really wouldn’t know where to put it. I mean it doesn’t strictly have anything to do with RSS so why rssutils?

i think most of the stuff related to dealing with srcset is already there, anyway it doesn’t really matter. we can always move it later.