Tiny Tiny RSS: Community

[Tags] explode tags on comma

Hi all,

Some of my feeds behave strangely. I tried to work out why some of them end up with a very large number of tags, and I found one explanation.
As this request shows, a feed can put several tags for one item into a single element, separated by commas:

>>> wget -qO- 'https://www.mediapart.fr/articles/feed' | grep dc:subject
<dc:subject><![CDATA[harcèlement sexuel, agression sexuelle, Cinéma, Jean-Claude Brisseau, Noémie Kocher, A la Une]]></dc:subject>
<dc:subject><![CDATA[socialisme, Bernie Sanders, Elizabeth Warren, DSA, Etats-Unis, gauche, socialistes]]></dc:subject>
<dc:subject><![CDATA[Roquebrune-sur-Argens, David Rachline, Rassemblement national]]></dc:subject>
<dc:subject><![CDATA[social, Travailleurs détachés, inspection du travail]]></dc:subject>
<dc:subject><![CDATA[Etats-Unis, Arabie Saoudite, gaz de schiste, Mohammed ben Salmane, Changement climatique, pétrole, Aramco, OPEP]]></dc:subject>
<dc:subject><![CDATA[double peine, étrangers, violences sexuelles, Marlène Schiappa]]></dc:subject>
<dc:subject><![CDATA[Espagne, vox, santiago abascal, UE, extrême droite, Catalogne]]></dc:subject>
<dc:subject><![CDATA[Banque européenne dinvestissement, relance, Volkswagen, Italie, CounterBalance, Dieselgate, BEI, prêts, UE]]></dc:subject>
<dc:subject><![CDATA[émeutes, carburant, Iran]]></dc:subject>
<dc:subject><![CDATA[loi de mémoire historique, Chili, Franco, Pinochet, transition démocratique, Espagne]]></dc:subject>

So when the function get_categories returns, it returns a single tag per item, made up of comma-separated values, rather than (as expected?) an array of those values.

Moreover, in the Mediapart case, the field containing the tags (dc:subject) is generated dynamically, and not in the same order each time. So the tags for an article are updated on every fetch, and each time the new, huge combined tag is added to the list.
At the moment I have more than 1200 tags for the Mediapart feed alone and, because of the locking in MySQL, many warnings in my logs (something like “lock exists, transaction retry”).
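
To illustrate why the tag count keeps growing, here is a minimal sketch in Python (not tt-rss code). The four orderings are made up for the example, and store_raw / store_split are hypothetical helper names. It compares storing the whole dc:subject string as one tag with storing the comma-separated values individually:

```python
def store_raw(fetches):
    """Store each dc:subject string as a single tag (the behaviour described above)."""
    tags = set()
    for subject in fetches:
        tags.add(subject)
    return tags


def store_split(fetches):
    """Store each comma-separated value as its own tag."""
    tags = set()
    for subject in fetches:
        tags.update(part.strip() for part in subject.split(",") if part.strip())
    return tags


# Hypothetical: the same three tags, returned in a different order on each fetch.
fetches = [
    "émeutes, carburant, Iran",
    "Iran, émeutes, carburant",
    "carburant, Iran, émeutes",
    "Iran, carburant, émeutes",
]

print(len(store_raw(fetches)))      # one new tag per ordering
print(sorted(store_split(fetches))) # always the same three tags
```

With the raw strings, every new ordering counts as a new tag; once split, the set stays the same no matter how often the order changes.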

Maybe this is specific to this feed, because the tags are wrapped in a CDATA section?

Possible solution:

  • in classes/feeditem/{atom,rss}.php, after getting the tag list with the XPath query, split each tag on commas (is that standard?), trim the surrounding whitespace, and turn the result into an array.
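
The idea can be sketched like this, in Python rather than PHP and without touching the real tt-rss classes. split_tags and item_tags are hypothetical names, and the sample item is cut down from the feed output above:

```python
import xml.etree.ElementTree as ET

# Dublin Core namespace used by dc:subject.
DC_NS = {"dc": "http://purl.org/dc/elements/1.1/"}


def split_tags(raw):
    """Split on commas, trim whitespace, drop empty values and duplicates."""
    tags = []
    for part in raw.split(","):
        tag = part.strip()
        if tag and tag not in tags:
            tags.append(tag)
    return tags


def item_tags(item_xml):
    """Collect the tags of one item from all of its dc:subject elements."""
    item = ET.fromstring(item_xml)
    tags = []
    for subject in item.findall("dc:subject", DC_NS):
        # CDATA content is exposed as plain element text by the parser.
        for tag in split_tags(subject.text or ""):
            if tag not in tags:
                tags.append(tag)
    return tags


SAMPLE = """<item xmlns:dc="http://purl.org/dc/elements/1.1/">
<dc:subject><![CDATA[émeutes, carburant, Iran]]></dc:subject>
</item>"""

print(item_tags(SAMPLE))
```

The CDATA wrapper makes no difference here: the parser hands back the inner text, so the split works the same with or without it.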

What do you think?

(Thank you, fox and contributors, for maintaining such clear code. Without it, I would not have been able to understand so quickly why ttrss_tags was growing.)

dc:subject is supposed to be one tag per xml element, i think. i suppose we can split them, it won’t hurt.

i think this should take care of most aforementioned issues.

Nice! I rolled back my own (dirty) changes and tested yours, and it works great.

Thanks a lot @fox!