Regex filters - Examples and ideas

dariottolo · August 27, 2022, 2:03pm

I hope Fox would not mind if I open this discussion.

One of the features I like a lot in tt-rss is the ability to create filters with regular expressions, because it lets me weed out articles I am not interested in once I have identified a pattern.

Unfortunately I do not have an IT background and regular expressions are quite intimidating, so finding the right combination took me a lot of trial and error.

Below are some examples of the rules I use, with the expected results. I would really appreciate your feedback on how to improve them, and feel free to add your own so that others can benefit.

(best(.*)deals)
this matches anything that starts with best and ends with deals, no matter what is in between (for example “return to school” or “daily”). I use this to deal with what are usually sponsored posts.

(offert[a|e](.*)amazon)
this is a different take on the above. In Italian, deal is offerta, while deals (plural) is offerte, so this rule allows me to match any post about one offer (offerta) or more (offerte) on amazon

videogame(.s)
this is similar to the above, but for matching singular and plural words in English, because this matches both videogame and videogames

pok[eé]mon
this one matches either of two letters at the same position in a word. It is useful here because the filter still catches the word when it is spelled differently.

\bpixel\b (\bwatch\b|\bbuds\b)
with this rule you can match one word (pixel in this case) followed by another, chosen between two different ones (watch and buds). The result is that I can filter anything about pixel watch or pixel buds (which I am not interested in) and keep anything about pixel (the phone), or watch (either as verb or as noun).

[a-z]
this is triggered whenever there is any letter in the text. It seems quite extreme, but I needed something to activate the ttrss-catchup-old plugin

Please let me know what you think and if there is any additional regex you use. I think it would be great if you could also include some “real life” examples.

Thank you in advance for your time.

Regards

fox · August 27, 2022, 3:07pm

sounds like a great idea, i think it could be a sticky.

ekalin · August 27, 2022, 6:00pm

Well, since fox didn’t close this topic, let me point out some things:

No need for the parentheses - you can just use best.*deals. Similarly for the others.
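
To illustrate the point about parentheses, here is a small sketch in Python. Python's re module only stands in for the PCRE engine that tt-rss uses through PHP, and the sample headlines are made up for the example.

```python
import re

# Hypothetical headlines, invented for illustration only.
headlines = [
    "Best back to school deals this week",
    "Best deals on laptops",
    "A review of the new camera",
]

# Case-insensitive, to mimic how tt-rss applies filters.
with_groups = re.compile(r"(best(.*)deals)", re.IGNORECASE)
without_groups = re.compile(r"best.*deals", re.IGNORECASE)

# Both patterns select the same headlines; the groups would only matter
# if the captured text were used afterwards, which a filter does not do.
matched_with = [h for h in headlines if with_groups.search(h)]
matched_without = [h for h in headlines if without_groups.search(h)]

print(matched_with == matched_without)  # True
print(matched_without)
```

The parentheses are harmless, so this is about readability rather than correctness.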

[] is already an alternation, no need for |. This also would match offert| amazon. Harmless, but unnecessary.
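
A quick way to see the effect of the stray | inside the brackets, sketched in Python (its re module is used here as a stand-in for PHP's PCRE; the sample strings are invented):

```python
import re

original = re.compile(r"offert[a|e].*amazon", re.IGNORECASE)
simplified = re.compile(r"offert[ae].*amazon", re.IGNORECASE)

# Invented sample strings; the last one contains a literal "|".
samples = [
    "offerta lampo su amazon",
    "offerte di oggi su amazon",
    "offert| amazon",
]

# Map each sample to (matches original, matches simplified).
results = {
    text: (bool(original.search(text)), bool(simplified.search(text)))
    for text in samples
}
for text, (orig_hit, simple_hit) in results.items():
    print(f"{text!r}: [a|e] -> {orig_hit}, [ae] -> {simple_hit}")
```

Only the last string behaves differently: inside brackets the | is just another character to match, not an "or".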

This one won’t work: the . matches any single character, so it would match videogamers but not videogames (no character between e and s) nor videogame (it expects it to be followed by any character and then s). It might also match, unintentionally, something like videogame sale, because the space satisfies the dot.

What you want is videogames? (the ? makes the final s optional).
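
The difference can be checked with a short Python sketch. Python's re module is only an approximation of the PCRE engine behind tt-rss, and the third pattern with \b is an extra variant added here for comparison, not something suggested above.

```python
import re

samples = ["videogame", "videogames", "videogamers", "videogame sale"]

patterns = {
    "videogame(.s)": re.compile(r"videogame(.s)", re.IGNORECASE),
    "videogames?": re.compile(r"videogames?", re.IGNORECASE),
    # Extra variant (an assumption about what may be wanted): whole words only.
    r"\bvideogames?\b": re.compile(r"\bvideogames?\b", re.IGNORECASE),
}

# For each pattern, collect the samples it matches anywhere in the text.
results = {
    name: [text for text in samples if rx.search(text)]
    for name, rx in patterns.items()
}
for name, hits in results.items():
    print(name, "->", hits)
```

Note that videogames? still matches videogamers, because the filter only needs to find videogame somewhere in the text; the \b variant is one way to avoid that if it matters.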

This probably works, but since there’s already a space between pixel and the next word, the word boundary condition is redundant, so it can be simplified to:
\bpixel (watch\b|buds\b)
or even
\bpixel (watch|buds)\b
(which might be more legible).
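
As a sanity check that the simplification does not change behaviour, here is a Python sketch comparing the original and the shortest form on a few invented titles (Python's re is used as a stand-in for PCRE):

```python
import re

original = re.compile(r"\bpixel\b (\bwatch\b|\bbuds\b)", re.IGNORECASE)
simplified = re.compile(r"\bpixel (watch|buds)\b", re.IGNORECASE)

# Invented article titles for illustration.
titles = [
    "Pixel Watch review",
    "Pixel Buds on sale",
    "Pixel 7 camera test",
    "Watch this video",
    "Pixel watches compared",
]

hits_original = [t for t in titles if original.search(t)]
hits_simplified = [t for t in titles if simplified.search(t)]

print(hits_original == hits_simplified)  # True
print(hits_simplified)
```

The space between the two words already forces a word boundary on both sides of it, which is why the extra \b markers add nothing.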

There are many tutorials about regular expressions. Most are probably OK, even if not great.

shabble · August 27, 2022, 6:04pm

videogame(.s)
this is similar to the above, but for matching singular and plural words in English, because this matches both videogame and videogames

You want videogames* there. As pointed out, it doesn’t match the plural, but will match the singular followed by (eg) a space with the next word starting with s.

dariottolo · August 29, 2022, 10:34am

@ekalin @shabble, thank you very much for your precious input.
I really stand by my initial claim that regex are a powerful tool, but a very complicated one at the same time.

While I am editing all my rules to incorporate your input, may I ask you (or anyone else who wants to participate) to post some examples of the filters you are most proud of?

Thanks again for your time

shabble · August 29, 2022, 6:21pm

I don’t use them. (In fact my one attempt at trying to use them failed miserably and I didn’t pursue them any further.)

geo5ukr · April 20, 2023, 3:54pm

Hi everyone,

I found this thread while looking for a way to debug regular expressions in tt-rss.

To give an example:
^(-?\d+(?:.\d+)?),\s*(-?\d+(?:.\d+)?)$

This should find any decimal lat/long in articles; for instance, 46.167314, 34.811968 should match.

I say should because when I test it with other tools it gives good results, but in tt-rss filters it somehow tags articles that don’t match this kind of pattern at all.

The XML at https://file.io/P168kjyZrQ5X is one article that was tagged based on this regex, but the behaviour puzzles me…

shabble · April 21, 2023, 12:01pm

404

pahles · April 21, 2023, 12:41pm

Free plan of file.io…

fox · April 21, 2023, 5:00pm

i can suggest two things:

tt-rss uses case-insensitive preg_match() - you can check php docs to see if there’s something it doesn’t support
you can use feed debugger (f D) to see which regexp matches which article
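
To show what case-insensitive matching means for a filter, here is a minimal Python sketch. It uses Python's re.IGNORECASE as a rough analogue of a case-insensitive preg_match(); the exact flags tt-rss passes are not shown in this thread, so treat the correspondence as an assumption. The title is invented.

```python
import re

title = "Pokémon GO update announced"  # invented sample title

# Case-sensitive search fails because the title starts with a capital P.
sensitive = re.search(r"pok[eé]mon", title)

# Case-insensitive search, analogous to how tt-rss is described above.
insensitive = re.search(r"pok[eé]mon", title, re.IGNORECASE)

print(sensitive is None)    # True
print(insensitive.group())  # Pokémon
```

In practice this means a filter can be written in lowercase without worrying about how a feed capitalises its titles.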

geo5ukr · April 22, 2023, 7:29am

Interesting that the file gets deleted after 1 download…

Here is another attempt: CryptPad

I would appreciate some other regex examples from the community to test my setup and hopefully better understand why certain things don’t give the outcome I anticipated.

Suggestion 1: I will have a look at it.

Suggestion 2: I did that; no match happens on articles that definitely contain a grid. See below.

[07:25:26/1642] guid 1link deleted because forum says I’m a new user , (hash: {“ver”:2,“uid”:1,“hash”:“SHA1:16dea40f3447a62378ff3aeb02a61c4b571add47”} compat: SHA1:be1269d094c892c1f4dfbe48ac2d0d68b38680bb)
[07:25:26/1642] orig date: 1682062699 (2023-04-21 07:38:19)
[07:25:26/1642] title Re GeoLocation: link deleted because forum says I’m a new user
[07:25:26/1642] link https://twitter.com/GeoConfirmed/status/1649316618353233921
[07:25:26/1642] language zh
[07:25:26/1642] author GeoConfirmed
[07:25:26/1642] looking for tags…
[07:25:26/1642] tags found:
[07:25:26/1642] done collecting data.
[07:25:26/1642] looking for enclosures…
[07:25:26/1642] article hash: baa6ad6b82402eb1a25e355da9ba0f0a1cb8cc86 [stored=baa6ad6b82402eb1a25e355da9ba0f0a1cb8cc86]
[07:25:26/1642] hash differs, running HOOK_ARTICLE_FILTER handlers…
[07:25:26/1642] === 0.0002 (sec) Af_Psql_Trgm
[07:25:26/1642] === 0.0002 (sec) Af_Readability
[07:25:26/1642] === 0.0002 (sec) Af_Img_Phash
[07:25:26/1642] plugin data: af_psql_trgm,af_readability,af_img_phash,
[07:25:26/1642] date: 1682062699 (2023/04/21 07:38:19)
[07:25:26/1642] num_comments: 0
[07:25:26/1642] force catchup:
[07:25:26/1642] base guid found, checking for user record
[07:25:26/1642] initial score: 0 [including plugin modifier: 0]
[07:25:26/1642] user record FOUND: RID: 36849, IID: 36854
[07:25:26/1642] resulting RID: 36849, IID: 36854
[07:25:26/1642] article updated, but we’re forbidden to mark it unread.
[07:25:26/1642] assigning labels [other]…
[07:25:26/1642] assigning labels [filters]…
[07:25:26/1642] resulting article tags:
[07:25:26/1642] article processed.

fox · April 22, 2023, 8:03am

btw you should get more info on filters when you run the debugger at LOG_EXTENDED level.

geo5ukr · April 22, 2023, 11:20am

Ok thx, I see a different output on this. One part gives a match, see below. But the [reg_exp] => ^(-?\d+(?:.\d+)?),\s*(-?\d+(?:.\d+)?)$ did not give a match, although every post of this handle has lat/long in it. I checked this regex on regex101.com, where it matches perfectly. So I assume that tt-rss (read php pcre) has something in it that I haven’t figured out…

[11:11:59/2944] matched filters:
Array
(
    [0] => Array
        (
            [id] => 6
            [match_any_rule] =>
            [inverse] =>
            [rules] => Array
                (
                    [0] => Array
                        (
                            [reg_exp] => geoconfirmed
                            [type] => both
                            [inverse] =>
                        )

                )

            [actions] => Array
                (
                    [0] => Array
                        (
                            [type] => label
                            [param] => geoconfirmed
                        )

                )

        )

)
[11:11:59/2944] matched filter rules:
Array
(
    [0] => Array
        (
            [reg_exp] => geoconfirmed
            [type] => both
            [inverse] =>
        )

)
[11:11:59/2944] filter actions:
Array
(
    [0] => Array
        (
            [type] => label
            [param] => geoconfirmed
        )

)
[11:11:59/2944] date: 1682084437 (2023/04/21 13:40:37)
[11:11:59/2944] num_comments: 0
[11:11:59/2944] article labels:
Array
(
    [0] => Array
        (
            [0] => -1030
            [1] => geoconfirmed
            [2] =>
            [3] =>
        )

)
[11:11:59/2944] force catchup:

ekalin · April 22, 2023, 11:55am

You’re anchoring the regexp at the beginning (^) and end ($). This means it will match coordinates, but only if there is nothing else around the coordinates.

I don’t know what string is passed by tt-rss when matching, if it’s the whole text or line by line (and some regexp libraries treat ^ and $ as beginning and end of a line in multi-line texts), but even in the best scenario (line by line), it would only match if a line consisted only of the coordinates.

So try without the ^ and $.
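
The effect of the anchors can be reproduced with a short Python sketch. Python's re module stands in for PHP's PCRE here, the article texts are invented, and the pattern is copied as posted (note that its unescaped . matches any character, whereas \. would match only a literal dot).

```python
import re

coords = r"(-?\d+(?:.\d+)?),\s*(-?\d+(?:.\d+)?)"
anchored = re.compile("^" + coords + "$")
unanchored = re.compile(coords)

# Invented article text with coordinates in the middle of a sentence.
article = "Geolocation: 46.167314, 34.811968 near the bridge"

no_match = anchored.search(article)   # None: text surrounds the coordinates
found = unanchored.search(article)    # finds the pair inside the sentence
only_coords = anchored.search("46.167314, 34.811968")

print(no_match)        # None
print(found.groups())  # ('46.167314', '34.811968')

# Line-by-line behaviour only applies when multi-line mode is switched on.
multi_line_text = "Geolocated footage\n46.167314, 34.811968\nSource: thread"
default_mode = anchored.search(multi_line_text)
multiline_mode = re.search("^" + coords + "$", multi_line_text, re.MULTILINE)
print(default_mode is None, multiline_mode is not None)  # True True
```

Whether tt-rss enables a multi-line mode is not stated in this thread, so dropping ^ and $ is the safer choice for a filter that should find coordinates anywhere in an article.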

geo5ukr · April 24, 2023, 8:10am

will do, thanks for the feedback

geo5ukr · April 24, 2023, 8:14am

SUCCESS!