Problem with a probably broken feed

raopheefah · September 16, 2022, 3:22pm

Hi! I am struggling with a feed which I really need.
I’m sure that the feed maintainer has messed things up, but given our political situation, the only thing I could achieve by contacting them is that the feed would be shut down completely. So please help me.
From the plaintext XML view the feed looks good, but it causes problems with tt-rss.
I wrote a plugin and it helped for some time, but since September 12 something got broken and I am receiving corrupt articles again.

[X] I’m using stock docker compose setup, unmodified.
Tiny Tiny RSS version (including git commit id): v22.09-d47b8c8
Platform (i.e. Linux distro, Docker, PHP, PostgreSQL, etc) versions: psql (PostgreSQL) 12.12
Linux ttrss-1 5.10.0-16-cloud-amd64 #1 SMP Debian 5.10.127-1 (2022-06-30) x86_64 GNU/Linux

The feed I am dealing with belongs to an election committee.
The “Test your feed with tt-rss parser” page says it is completely okay and parses correct articles.
But when I load this feed into my ttrss, I get corrupt article links:
http://mosgorizbirkom.ru/web/guest/http://mosgorizbirkom.ru/main?p_p_id=101&p_p_lifec...
As you can see, all the links that look correct in the RSS XML source get loaded into tt-rss preceded by this strange http://mosgorizbirkom.ru/web/guest/ prefix.
Where does it come from? Why is it there? I have no idea, but it has been there for all the years I’ve been using tt-rss.

Okay, in July I wrote a plugin which replaced the doubled http:// prefix with a single one, and it worked until September 08, when it broke. They changed the way they mangle the links, and I have now fixed my plugin.
Now I see both right and wrong URLs in my database.
Articles loaded after my fix have correct links.
How does that work? Why is the plugin not replacing links?
Does the plugin not fix previously loaded articles?
Is there a good way to fix the broken ones in the database?
Should I write an UPDATE statement and run it via docker exec -it ttrss-docker_db_1 psql -U postgres or is there a better way?
I checked all the boxes at the debugger page, still no changes in database.

Questions aggregated:

1. Why are the article links broken?
2. Does the plugin not repair links for previously loaded articles?
3. Is there a good way to fix the articles from question 2?

Please help. Thank you.

fox · September 16, 2022, 3:47pm

if you look at feed XML you’ll see this:

<link href="http&#x26;&#x23;x3a&#x3b;&#x26;&#x23;x2f&#x3b;&#x26;&#x23;x2f&#x3b;mosgorizbirkom&#x26;&#x23;x2e&#x3b;ru&#x26;&#x23;x2f&#x3b;main&#x26;&#x23;x3f&#x3b;p_p_id&#x26;&#x23;x3d&#x3b;101&#x26;&#x23;x26&#x3b;p_p_lifecycle&#x26;&#x23;x3d&#x3b;0&#x26;&#x23;x26&#x3b;p_p_state&#x26;&#x23;x3d&#x3b;maximized&#x26;&#x23;x26&#x3b;p_p_mode&#x26;&#x23;x3d&#x3b;view&#x26;&#x23;x26&#x3b;_101_struts_action&#x26;&#x23;x3d&#x3b;&#x26;&#x23;x25&#x3b;2Fasset_publisher&#x26;&#x23;x25&#x3b;2Fview_content&#x26;&#x23;x26&#x3b;_101_assetEntryId&#x26;&#x23;x3d&#x3b;24606985&#x26;&#x23;x26&#x3b;_101_type&#x26;&#x23;x3d&#x3b;content&#x26;&#x23;x26&#x3b;_101_urlTitle&#x26;&#x23;x3d&#x3b;obzor-pressy-za-10-sentabra-2022-go-1"/>

that’s completely broken, yes. you get links prefixed by site URL because whatever this is, it’s not a valid absolute URL.
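
To see why this href is not an absolute URL, it helps to decode it by hand. The Python sketch below assumes that the XML parser removes one layer of entity encoding and that a second layer is still left afterwards. The shortened href is the beginning of the link above; the variable names are illustrative only.

```python
import html
from urllib.parse import urlsplit

# Beginning of the href attribute exactly as it appears in the feed XML.
raw_href = (
    "http&#x26;&#x23;x3a&#x3b;&#x26;&#x23;x2f&#x3b;&#x26;&#x23;x2f&#x3b;"
    "mosgorizbirkom&#x26;&#x23;x2e&#x3b;ru&#x26;&#x23;x2f&#x3b;main"
)

# First pass: roughly what an XML parser hands over after resolving entities.
parsed_once = html.unescape(raw_href)

# Second pass: removes the remaining layer and yields a real URL.
decoded = html.unescape(parsed_once)

print(parsed_once)                    # still contains entities such as &#x3a;
print(urlsplit(parsed_once).scheme)   # empty: there is no "http:" scheme yet
print(decoded)
print(urlsplit(decoded).scheme)
```

After a single pass the value has no scheme, so a feed reader has good reason to treat it as a relative link and resolve it against the site URL.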

raopheefah · September 16, 2022, 4:40pm

Thank you!

What has to be there? Unescaped http://url instead of this http&#x3a;&#x2f;&#x2f; ?
Is there a good way to fix links for articles already downloaded in the database?
I have no access to the server config. Do you think that my approach of rewriting the link with
$article["link"] = str_replace( "http://mosgorizbirkom.ru/web/guest/http://mosgorizbirkom.ru/", "http://mosgorizbirkom.ru/", $article["link"] );
is good enough?

fox · September 16, 2022, 4:58pm

that’s not going to work. you’ll need to decode html entities in the URL and go from there. otherwise you’ll just get a broken link.

i’d make a plugin which would target this specific feed while using HOOK_FEED_FETCHED or something.
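
Fixing the feed text before it is parsed could look roughly like the following. This is a Python sketch of the transformation only, not tt-rss plugin code (a real plugin would be written in PHP). The function name `fix_link_hrefs` and the regex are hypothetical, and the sketch assumes the broken values only occur in double-quoted `href` attributes of `<link>` elements.

```python
import html
import re

# Matches <link ... href="VALUE" and captures the value separately.
LINK_HREF = re.compile(r'(<link\b[^>]*?\bhref=")([^"]*)(")')


def fix_link_hrefs(feed_xml: str) -> str:
    """Remove the extra entity layer from <link href="..."> values."""

    def repair(match: re.Match) -> str:
        url = html.unescape(match.group(2))
        # Decode a second time only if numeric entities are still present,
        # so that well-formed hrefs are left alone.
        if "&#" in url:
            url = html.unescape(url)
        # Escape once again so the attribute stays valid XML.
        return match.group(1) + html.escape(url, quote=True) + match.group(3)

    return LINK_HREF.sub(repair, feed_xml)


broken = (
    '<link href="http&#x26;&#x23;x3a&#x3b;&#x26;&#x23;x2f&#x3b;&#x26;&#x23;x2f&#x3b;'
    'example&#x26;&#x23;x2e&#x3b;org&#x26;&#x23;x2f&#x3b;a&#x26;&#x23;x3f&#x3b;'
    'x&#x26;&#x23;x3d&#x3b;1&#x26;&#x23;x26&#x3b;y&#x26;&#x23;x3d&#x3b;2"/>'
)
print(fix_link_hrefs(broken))
```

Limiting this to the one affected feed, as suggested above, keeps the rewrite away from feeds that are already correct.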

personally i would write a script (python or w/e) to do something like that.

article URL.

raopheefah · September 16, 2022, 5:04pm

Which type of script do you mean? An SQL UPDATE query run in psql against the database?
My str_replace in HOOK_ARTICLE_FILTER works fine for new articles, but the older ones stay corrupt.

fox · September 16, 2022, 5:08pm

yeah, maybe you could decode html entities with sql alone but i wouldn’t bother figuring out how.

i’m surprised that this ends up as something clickable.

raopheefah · September 16, 2022, 5:14pm

I can perfectly deal with decoding.
Could you please respond: is there a good way to update links for loaded articles without going to the database?
Articles can get updated, yes? Links, no?

fox · September 16, 2022, 5:19pm

there’s no way to update stuff that’s not in the feed XML. you’ll need to deal with the database.

everything in the XML would get updated if you use force rehash in the feed debugger.

raopheefah · September 16, 2022, 5:38pm

Yes, I did rehash+refetch. New articles arrive with corrected links, but older ones are still corrupt.

Thank you! I’ll do a database update.

fox · September 16, 2022, 5:49pm

no amount of rehashing would process data that’s not in the feed XML anymore.

raopheefah · September 16, 2022, 6:04pm

Thank you very much for your help!
Now I understand what’s going on.
I made a backup, and then this statement fixed everything that was wrong:
UPDATE ttrss_entries SET link=REPLACE(link, 'http://mosgorizbirkom.ru/web/guest/http://mosgoriz', 'http://mosgoriz') WHERE link LIKE '%guest/http%';
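
A statement like this UPDATE can be tried on throwaway data before it touches the real database. The Python sketch below runs the same REPLACE and LIKE logic against an in-memory SQLite table as a stand-in for PostgreSQL. The two-column table and the `dry_run` helper are assumptions made for illustration; the real ttrss_entries table has more columns. One known difference: SQLite's LIKE is case-insensitive for ASCII by default, while PostgreSQL's is case-sensitive.

```python
import sqlite3

BAD_PREFIX = "http://mosgorizbirkom.ru/web/guest/http://mosgoriz"
GOOD_PREFIX = "http://mosgoriz"


def dry_run(links):
    """Apply the REPLACE-based fix to sample links and return the results."""
    conn = sqlite3.connect(":memory:")
    # Simplified stand-in table: only the column the statement touches.
    conn.execute("CREATE TABLE ttrss_entries (id INTEGER PRIMARY KEY, link TEXT)")
    conn.executemany(
        "INSERT INTO ttrss_entries (link) VALUES (?)",
        [(link,) for link in links],
    )
    conn.execute(
        "UPDATE ttrss_entries SET link = REPLACE(link, ?, ?) "
        "WHERE link LIKE '%guest/http%'",
        (BAD_PREFIX, GOOD_PREFIX),
    )
    rows = [row[0] for row in conn.execute("SELECT link FROM ttrss_entries ORDER BY id")]
    conn.close()
    return rows


sample = [
    "http://mosgorizbirkom.ru/web/guest/http://mosgorizbirkom.ru/main?p_p_id=101",
    "http://mosgorizbirkom.ru/main?p_p_id=102",
]
for link in dry_run(sample):
    print(link)
```

The second sample link is already correct and should come back unchanged, which confirms that the WHERE clause leaves healthy rows alone.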