Search for URL in article contents

nodiscc · October 18, 2023, 2:29pm

[ ] I’m using stock docker compose setup, unmodified.
[ ] I’m using docker compose setup, with modifications (modified .yml files, third party plugins/themes, etc.) - if so, describe your modifications in your post. Before reporting, see if your issue can be reproduced on the unmodified setup.
[x] I’m not using docker on my primary instance, but my issue can be reproduced on the aforementioned docker setup and/or official demo.

Hi,

I often need to search for URLs or domain names in feed contents.

For example, I just added a filter to auto-mark as read articles containing v.redd.it in a certain category because I’m not interested in video posts for this category. No problem, Settings > Filters > Create filter > Add rule > v\.redd\.it on Content in MY_CATEGORY > Add action: Mark as read > Test > Filter works.
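A small illustration of why the dots in the rule are escaped, written in Python rather than the PHP regex engine tt-rss uses (for this pattern the two should behave the same, but treat that as an assumption). The sample strings are made up:

```python
import re

escaped = re.compile(r"v\.redd\.it")   # dots match only a literal "."
unescaped = re.compile(r"v.redd.it")   # dots match any single character

link = '<a href="https://v.redd.it/abc123">clip</a>'
lookalike = "see https://vxredd-it.example/ for details"

matches = {
    "escaped/link": bool(escaped.search(link)),
    "escaped/lookalike": bool(escaped.search(lookalike)),
    "unescaped/lookalike": bool(unescaped.search(lookalike)),
}
print(matches)
```

The escaped pattern matches the real link and ignores the lookalike host, while the unescaped one also matches the lookalike.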

Now, I want to mark as read all previous articles matching this rule (because the filter will only be applied to articles fetched in the future, see Possibility to run (selected) filters manually). So I figured out the simplest way to do this would be searching for v.redd.it in All articles and scrolling down until there are no more unread items.

But the search returns no results. I’ve tried various combinations including escaping \. dots, switching the Language of the search, etc. without success. Searching for redd alone returns too many false positives.

Other times I’m simply trying to search for articles referring to a specific website/domain name and I have the same problem. Searching for e.g. numerama.com seems to work when the text numerama.com is present in the article content, but not for example when it is in the href= of a link but the link text is something else.

So how would I search for all articles that contain a link to example.com?

Thanks

Tiny Tiny RSS version (including git commit id): master/89f5af62d8d6c71043eb855a333acaca333ef297
Platform (i.e. Linux distro, Docker, PHP, PostgreSQL, etc) versions: Debian 12, php-fpm 8.2.7-1~deb12u1, postgresql 15.3-0+deb12u1

fox · October 18, 2023, 4:48pm

i’m afraid i have bad news: https://gitlab.tt-rss.org/tt-rss/tt-rss/-/blob/master/classes/rssutils.php?ref_type=heads#L1185

tags are stripped from the text the full text search index is generated from.

e: maybe something like soundasleep/html2text (a PHP component to convert HTML into a plain text format) would work better than strip_tags().
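To make the difference concrete, here is a rough Python sketch, not tt-rss code. `TagStripper` and `LinkKeeper` are hypothetical stand-ins: the first drops all markup (similar in spirit to strip_tags()), the second appends each link's href after its text, which is roughly what an HTML-to-text converter does. The exact output format of html2text may well differ.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Keeps only text nodes, similar in spirit to stripping all tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def text(self):
        return "".join(self.parts)


class LinkKeeper(TagStripper):
    """Also emits the href of each <a> tag so the URL stays searchable."""

    def __init__(self):
        super().__init__()
        self.hrefs = []  # stack of pending link targets

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.append(dict(attrs).get("href"))

    def handle_endtag(self, tag):
        if tag == "a" and self.hrefs:
            href = self.hrefs.pop()
            if href:
                self.parts.append(f" ({href})")


def to_text(parser_cls, html):
    parser = parser_cls()
    parser.feed(html)
    parser.close()
    return parser.text()


# Made-up sample: the domain appears only in the href, not in the link text.
sample = '<p>New clip: <a href="https://v.redd.it/abc123">watch here</a></p>'
stripped = to_text(TagStripper, sample)
with_links = to_text(LinkKeeper, sample)
print(stripped)
print(with_links)
```

With tags stripped, the indexed text no longer contains v.redd.it at all, which matches the behaviour described in the first post.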

fox · October 18, 2023, 5:19pm

if someone is feeling brave, here’s the feature branch pipeline with strip_tags() replaced with html2text:
https://gitlab.tt-rss.org/tt-rss/tt-rss/-/pipelines/297

nodiscc · October 19, 2023, 12:14pm

Thanks for the reply, fox. I will try to test the patch sometime soon.

Do you think it’s worth adding a new composer dependency just for this? It looks simple enough (420 lines of PHP) but not particularly actively maintained. But yeah, I don’t see any other solution.

I see the search index generation runs as part of update_rss_feed(). Will already-fetched articles be reindexed once the feed is fetched again with the patch applied?

fox · October 19, 2023, 12:34pm

why not? it’s a third party library, i’d like to have a uniform approach for those, if possible, instead of stuffing everything into lib/ or w/e like before. i’m using composer autoloader anyway so.

there used to be a CLI command to do that

nodiscc · October 19, 2023, 12:41pm

fox said: "there used to be a CLI command to do that"

$ sudo -u tt-rss /usr/bin/php /var/www/rss.example.org/update.php --help|grep index
  --gen-search-idx                                 generate basic PostgreSQL fulltext search index

I will report back once I get some time to test the patch. Thanks again

fox · October 19, 2023, 12:44pm

wait

https://gitlab.tt-rss.org/tt-rss/tt-rss/-/blob/master/update.php?ref_type=heads#L320

this needs to use html2text instead of strip_tags(). oops.

fox · October 19, 2023, 12:54pm

fix: https://gitlab.tt-rss.org/tt-rss/tt-rss/-/commit/9d07d37b6dae01abb080ca2958bcf78054324fe9

you will also need to drop existing index:

update ttrss_entries set tsvector_combined = null;

fox · October 21, 2023, 7:53am

https://gitlab.tt-rss.org/tt-rss/tt-rss/-/commit/03e956132d4a4b880d4e4533aeab725b0b2b5b52

i’ve tested it a bit, it seems to work correctly, so it’s in master now.

nodiscc · October 25, 2023, 7:23pm

Also works for me, well done. All links started being indexed after upgrading to the latest version.

As expected, searching for URLs did not return old articles. I tried regenerating the index without dropping it first and, also as expected, nothing changed: old articles still did not show up in search results:

$ sudo -u tt-rss /usr/bin/php /var/www/rss.example.org/update.php --gen-search-idx
[15:34:37/52879] Lock: update.lock
Generating search index (stemming set to English)...
Articles to process: 0 (will limit to 500).
Processed 0 articles...
All done.

So I decided to drop and regenerate the index:

# slow disk, many articles
# takes about 7 minutes
$ sudo -u postgres psql --dbname=ttrss --command='update ttrss_entries set tsvector_combined = null;'
UPDATE 145233
# takes 4+ hours
$ sudo -u tt-rss /usr/bin/php /var/www/rss.example.org/update.php --gen-search-idx
[15:48:23/55432] Lock: update.lock
Generating search index (stemming set to English)...
Articles to process: 145233 (will limit to 500).
Processed 500 articles...
Processed 1000 articles...
Processed 1500 articles...
[... after a few hours ...]
Processed 601000 articles...
Processed 601500 articles...
^C

As you can see it goes well over the initial article count (600k while it found 145233 articles to index when I ran the command) and never seems to finish. I can now find old articles in search results, but I’m not sure indexing is actually finished. So I decided to stop it and run it again:

$ sudo -u tt-rss /usr/bin/php /var/www/rss.example.org/update.php --gen-search-idx
[19:10:43/133229] Lock: update.lock
Generating search index (stemming set to English)...
Articles to process: 144733 (will limit to 500).
Processed 500 articles...
Processed 1000 articles...
...

While it is still indexing I ran a few checks:

ttrss=# select count(*) from ttrss_entries;
 count  
--------
 145233

ttrss=# select count(*) from ttrss_entries where tsvector_combined is null;
 count  
--------
 144233
(1 row)

ttrss=# select count(*) from ttrss_entries where tsvector_combined is not null;
 count 
-------
  1000
(1 row)

So does (will limit to 500) imply that a maximum of 500 articles are indexed every time --gen-search-idx is run? If that is the case, how can I reindex all of the 145k articles? And what do the Processed N articles messages mean?

Edit: am now reading https://gitlab.tt-rss.org/tt-rss/tt-rss/-/blob/master/update.php?ref_type=heads#L296 and trying to understand what is happening.

Edit2: I think there may be something wrong with the loop in the indexing procedure. It seems to loop over the first 500 selected items indefinitely.
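For comparison, this is what a batched reindex loop that does terminate could look like. It is a self-contained Python/SQLite simulation, not the logic in update.php: the `entries` table and `search_text` column are hypothetical stand-ins for ttrss_entries and tsvector_combined, and lowercasing the content stands in for building the real search vector.

```python
import sqlite3


def reindex_in_batches(conn, batch_size=500):
    """Index rows whose search column is NULL, one batch at a time.

    Each pass re-runs the SELECT, so rows indexed in the previous pass
    drop out of the result and the loop ends when nothing is left.
    """
    processed = 0
    while True:
        rows = conn.execute(
            "SELECT id, content FROM entries "
            "WHERE search_text IS NULL ORDER BY id LIMIT ?",
            (batch_size,),
        ).fetchall()
        if not rows:
            break
        conn.executemany(
            "UPDATE entries SET search_text = ? WHERE id = ?",
            [(content.lower(), entry_id) for entry_id, content in rows],
        )
        conn.commit()
        processed += len(rows)
    return processed


# Throwaway in-memory database with 1234 unindexed rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entries (id INTEGER PRIMARY KEY, content TEXT, search_text TEXT)"
)
conn.executemany(
    "INSERT INTO entries (content) VALUES (?)",
    [(f"Article {n}",) for n in range(1234)],
)
total = reindex_in_batches(conn, batch_size=500)
remaining = conn.execute(
    "SELECT count(*) FROM entries WHERE search_text IS NULL"
).fetchone()[0]
print(total, remaining)
```

The important part is that the batch is selected again on every pass. A loop that fetches one batch up front and keeps iterating over it would show the symptom above: the "Processed N" counter keeps growing while the number of rows with a NULL index barely moves.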

Edit3: as a workaround I added a break; after the print "Processed $processed articles...\n"; line and will run the command in a loop until it finishes indexing everything (while true; do sudo -u tt-rss /usr/bin/php /var/www/rss.example.org/update.php --gen-search-idx; done). You might want to check the loop, but anyway thanks fox, this enhancement is much appreciated.
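The while true loop above never exits on its own. A possible wrapper that stops once the indexer reports nothing left to do is sketched below in Python. It assumes the break; patch from Edit3 is applied (so each run handles one batch) and that the "Articles to process: N" line shown in the logs above is written to stdout. The function names and the max_runs safety limit are my own additions.

```python
import re
import subprocess

# Command taken from the post; adjust user and path for your install.
REINDEX_CMD = [
    "sudo", "-u", "tt-rss", "/usr/bin/php",
    "/var/www/rss.example.org/update.php", "--gen-search-idx",
]

PENDING_RE = re.compile(r"^Articles to process: (\d+)", re.MULTILINE)


def pending_articles(output):
    """Return the count from the 'Articles to process: N' line, or None."""
    match = PENDING_RE.search(output)
    return int(match.group(1)) if match else None


def run_indexer():
    """Run the indexer once and return what it printed (assumed: stdout)."""
    result = subprocess.run(
        REINDEX_CMD, capture_output=True, text=True, check=True
    )
    return result.stdout


def reindex_until_done(run=run_indexer, max_runs=1000):
    """Re-run the indexer until it reports 0 pending articles.

    Returns the number of runs, including the final run that found nothing.
    """
    for attempt in range(1, max_runs + 1):
        pending = pending_articles(run())
        if pending is None:
            raise RuntimeError("could not find the 'Articles to process' line")
        if pending == 0:
            return attempt
    raise RuntimeError(f"still not finished after {max_runs} runs")
```

Calling reindex_until_done() with no arguments runs the real command. At 500 articles per run, 145k articles need roughly 290 runs, so the default max_runs leaves plenty of headroom.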

fox · October 26, 2023, 11:08am

the loop might be entirely broken, i’ll make a note to take a look.