Unexpected behavior with <figure> tag

  • [X] I’m using stock docker compose setup, unmodified.
  • [ ] I’m using docker compose setup, with modifications (modified .yml files, third party plugins/themes, etc.) - if so, describe your modifications in your post. Before reporting, see if your issue can be reproduced on the unmodified setup.
  • [ ] I’m not using docker on my primary instance, but my issue can be reproduced on the aforementioned docker setup and/or official demo.

Unfortunately I’m not sure which version of TT-RSS I upgraded from, which could make this rather miserable. I can tell you it used the old git-based docker-compose.yml rather than the newer static file/method, and hadn’t been updated since before that change. Postgres was still on 12.

I have a bunch of feeds from cohost.org. I switched my docker-compose.yml to the new format on the wiki and upgraded to v23.12-14ad8b21 last night/today. Afterward, I noticed that all cohost feeds weren’t displaying images.

I checked the network tab in Firefox and Chrome and found that the images were not being sent in the content block to TT-RSS, nor were the figure tags that they’re enclosed in. They also do not appear in the Android app. I checked in ttrss_entries and determined that the HTML is present in the database. With a bit of a struggle, I tried to track down what might be causing the trouble, but could not find the exact culprit.

Cohost places images inside of a <figure> block. Their usage is basic and appears to be valid HTML, as far as I can tell. A random sample feed can be found at: silas of trees. Their entry titles suck a bit, but at the moment the second entry (“last day to submit for the Great SNAKE FARM Speedrunning Contest!”) contains an image that should be displayed and is not.

Sanitizer.php @ line 158 indicates that figure elements should be allowed by TT-RSS.

hide_images is false in the request, strip images is false in preferences, the feeds aren’t set to hide images, and I duplicated the issue with the demo instance. I checked with Safe Mode to make sure that TT-RSS plugins weren’t affecting it, and tried in Microsoft Edge with uBlock Origin disabled.

Cohost has issues with invalid characters in the feed tags, so I had previously slapped together a script that the feeds run through to sanitize those tags. Bad behavior, I know, but I contacted them to try to fix it first. When I used preg_replace in that script to remove <figure> and </figure>, the images returned. I also checked the feeds without sanitizing and on the demo instance, and the images were missing in both cases.
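For anyone who wants to try the same workaround, here is a minimal sketch of that kind of preg_replace step. The helper name unwrap_figures and the sample markup are made up for illustration; this is not the exact script I run, and it only drops the figure tags themselves while keeping whatever is inside them.

```php
<?php
// Hypothetical helper (not part of tt-rss): drop <figure ...> and </figure>
// tags but keep their children, so the <img> is no longer wrapped.
function unwrap_figures(string $html): string {
    // Matches opening tags with or without attributes, and closing tags.
    // The \b keeps the pattern from matching longer tag names.
    $result = preg_replace('#</?figure\b[^>]*>#i', '', $html);
    // preg_replace returns null on a regex engine error; fall back to input.
    return $result ?? $html;
}

$entry = '<figure class="x"><img src="a.png"></figure><p>text</p>';
echo unwrap_figures($entry), "\n";
```

A regex is only reasonable here because the tags are removed wholesale; anything that needs to inspect nesting should go through a real parser.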

I’m like 85% certain this all adds up to something going wrong with TT-RSS rendering the entry or stripping elements, but admit that I may be completely wrong because I can’t find the problem. If I missed something completely obvious, I apologize.

It looks like this is a result of https://gitlab.tt-rss.org/tt-rss/tt-rss/-/commit/d4da4dcc321ca65fb2cd19877f395cc5f75933ab .

<meta charset="UTF-8"> is (seemingly) getting identified as something belonging in <head></head>, so (using the entry with ID https://cohost.org/silasoftrees/post/3842668-post from the aforementioned feed) $doc->loadHTML('<meta charset="UTF-8">' . $res) results in something like:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
  <meta charset="UTF-8">
  <figure>...</figure>
</head>
<body>(remaining content that starts with <p>...</p>, so maybe attempted detection of body content)</body>
</html>

Sanitizer::strip_harmful_tags() then strips <head> (as a tag that’s not allowed) and you end up with missing content.
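A quick way to see where the parser puts the element, outside of TT-RSS, is something like the sketch below. It assumes PHP with the DOM extension; figure_container is a hypothetical helper name and the sample content is made up. The result depends on the libxml2 version PHP is built against, so I'm not listing expected output.

```php
<?php
// Hypothetical diagnostic (not tt-rss code): report which child of <html>
// (normally "head" or "body") ends up containing the first <figure>.
function figure_container(string $prefix, string $content): ?string {
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($prefix . $content);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $figure = $doc->getElementsByTagName('figure')->item(0);
    if ($figure === null) {
        return null; // the parser dropped the element entirely
    }

    // Walk up until we reach the node whose parent is <html>.
    $node = $figure;
    while ($node->parentNode !== null && $node->parentNode->nodeName !== 'html') {
        $node = $node->parentNode;
    }
    return $node->nodeName;
}

$content = '<figure><img src="a.png"></figure><p>text</p>';
foreach (['<?xml encoding="UTF-8">', '<meta charset="UTF-8">'] as $prefix) {
    echo $prefix, ' => ', (figure_container($prefix, $content) ?? 'no figure found'), "\n";
}
```

If the second line reports head, anything that later removes <head> will take the figure with it.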

It doesn’t look like this happens with div, span, etc. as the first element in the content. Haven’t done much testing beyond that.

Switching back to <?xml encoding="UTF-8"> fixes the issue, and would be easy, but maybe https://gitlab.tt-rss.org/tt-rss/tt-rss/-/merge_requests/12 is still a concern.

diff --git a/classes/Sanitizer.php b/classes/Sanitizer.php
index 7af92f249..a7bea9e5f 100644
--- a/classes/Sanitizer.php
+++ b/classes/Sanitizer.php
@@ -72,7 +72,7 @@ class Sanitizer {
                $res = trim($str); if (!$res) return '';
 
                $doc = new DOMDocument();
-               $doc->loadHTML('<meta charset="UTF-8">' . $res);
+               $doc->loadHTML('<?xml encoding="UTF-8">' . $res);
                $xpath = new DOMXPath($doc);
 
                // is it a good idea to possibly rewrite urls to our own prefix?

so the tags are being somehow rearranged into <head> out of <body>? this doesn’t look good. we should probably revert that MR then. :frowning:

Sorry for the regression and thanks for the investigation. I proposed a possible fix at https://gitlab.tt-rss.org/tt-rss/tt-rss/-/merge_requests/15

thanks, i’ll try to take a closer look tomorrow, likely @wn would also have some input :slight_smile:

You may want a solution that casts a less specific net, unfortunately. Today before I rolled my server back to 2c7e0001 I noticed that error messages from RSS-Bridge also weren’t being displayed in TT-RSS. They do appear in content-preview, but the “content” for that entry is empty. If I had to guess, it’s because the entire error message is in a “section” tag.

it would help if you could add a unit test that fails in your case.
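As a starting point for such a test, a standalone check could look roughly like this. It is only a sketch: parse_and_strip_head is a hypothetical stand-in for the parse-then-strip step, not the real Sanitizer code, and a proper unit test would call the Sanitizer class directly with the feed's actual content.

```php
<?php
// Hypothetical stand-in (NOT tt-rss's Sanitizer): parse with an encoding
// prefix, remove <head> the way a sanitizer that disallows it would, and
// return the serialized <body>.
function parse_and_strip_head(string $prefix, string $content): string {
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($prefix . $content);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Copy to an array first: the node list is live and shrinks on removal.
    foreach (iterator_to_array($doc->getElementsByTagName('head')) as $head) {
        $head->parentNode->removeChild($head);
    }

    $body = $doc->getElementsByTagName('body')->item(0);
    return $body === null ? '' : $doc->saveHTML($body);
}

// Made-up sample content with a <figure> as the first element.
$content = '<figure><img src="a.png"></figure><p>caption text</p>';
foreach (['<?xml encoding="UTF-8">', '<meta charset="UTF-8">'] as $prefix) {
    $kept = strpos(parse_and_strip_head($prefix, $content), '<img') !== false;
    echo $prefix, ' => ', $kept ? 'image kept' : 'image LOST', "\n";
}
```

The same shape should work for the RSS-Bridge case by swapping the sample content for something wrapped in a section tag.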

Here is a common pattern: tags like <figure> or <section> are new tags introduced in HTML5. On the other hand, libxml2 does not support HTML5 yet [1]. Maybe that’s the reason behind incorrectly-handled tags.

[1] HTML5 support (#211) · Issues · GNOME / libxml2 · GitLab

@yan12125 @linoth i’m going to revert this for the time being - https://gitlab.tt-rss.org/tt-rss/tt-rss/-/commit/67012f9dac7de22615b72be93fa360f53fefe3ec/pipelines?ref=master

  • we don’t ship affected libxml in the docker image (@wn_name we don’t, right? i haven’t checked alpine 3.19)
  • as far as i figured, libxml devs have fixed the bug which caused the issue in the first place
  • i don’t like tags randomly moving around in the parsed dom tree, this could cause hard to diagnose issues
  • i don’t see how meta tag is a better approach than xml declaration

Correct-- Alpine 3.19.x is currently on libxml2 2.11.6-r0.