Tiny Tiny RSS: Community

Is there an easy way to get the URLs of the unread articles?

I have a feed with 1k+ unread articles. My goal is to save all of them as HTML files to read later. I know that’s a long shot, but as a start, how can I get their URLs?

Thanks!


SELECT e.link
FROM ttrss_entries e
JOIN ttrss_user_entries ue ON e.id = ue.ref_id
WHERE ue.unread=1
AND ue.feed_id=166

Change that 166 to whatever number appears after f= in the URL when you view the feed:

[Screenshot from 2019-12-30: the f= number in the URL when viewing the feed]

ttrss_entries.content (or e.content in the context above) has some of the HTML that’s visible when you view articles, so if the full text normally shows for you, you may be able to get away with just pulling that instead of messing with the URLs.

There shouldn’t be a need to hit the source servers a second time … tt-rss already has the content. Something like this (untested) might work for you:

SELECT
	'<html><head><title>' || en.title || '</title><base href="' || en.link || '" /></head><body>' || en.content || '</body></html>' as HTML

FROM ttrss_user_entries ue

JOIN ttrss_entries en ON (en.id = ue.ref_id)

WHERE ue.feed_id = (SELECT id FROM ttrss_feeds WHERE title = '<your feed name>')
AND ue.owner_uid = (SELECT id FROM ttrss_users WHERE login = '<your user name>')
AND ue.unread IS NOT FALSE

ORDER BY en.date_entered, en.title

Using your favorite scripting language, spool each HTML column into a separate file.
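A minimal sketch of that spooling step, assuming the query output has already been dumped one HTML document per line into a file (the names `articles.txt` and `articles/` are made up for the example; the stand-in data replaces a real psql dump):

```shell
# Hypothetical sketch: split a dump file (one HTML document per line)
# into numbered files. Stand-in data replaces the real psql output.
printf '%s\n' '<html><body>first</body></html>' '<html><body>second</body></html>' > articles.txt

mkdir -p articles
n=0
while IFS= read -r html; do
  n=$((n + 1))
  printf '%s\n' "$html" > "articles/article-$n.html"
done < articles.txt
```

Any scripting language works equally well here; the only real requirement is one output file per row.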

@shabble
That works! The only change needed was “ue.unread=true”.

@rodneys_mission
Your SQL also works!

The problem is that the content of the feed is empty. I have to somehow open each URL in Firefox to get the full content and save it as an HTML file.

I am thinking about writing a script using AutoHotkey. Any better suggestions?

  1. Why is your content empty? Is there something wrong with the RSS your feed links to? Is your feed not compliant, or are you doing something “tricky”? If the latter, maybe look at the feediron plugin.
  2. I am a Linux and command-line guy; I don’t have any advice for Windows scripting … except maybe look at cygwin.com and do your scripting with an FSF/GNU tool stack. But pick whatever works for you!

Personally I would use perl or bash to run the SQL and save the links, then download them via wget or curl. Maybe AHK has all you need; I can’t say.

Inside tt-rss though, if you star or publish these unread articles, they shouldn’t get purged before you read them. If they are empty, I’d really recommend looking at feediron. I’d try to stay inside the primary tool (in this case tt-rss) before going through the hoops of writing something that may not be necessary.

YMMV.
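The bash route mentioned above might look something like this. An untested sketch: the feed_id, user, and database names are taken from elsewhere in the thread, and it assumes psql can reach the database directly.

```shell
# Untested sketch: pull the unread links with psql (-t: rows only,
# -A: unaligned output -- one bare URL per line) and fetch each one.
psql -U postgres -d ttrss -t -A -c "
  SELECT e.link FROM ttrss_entries e
  JOIN ttrss_user_entries ue ON e.id = ue.ref_id
  WHERE ue.unread = true AND ue.feed_id = 166;" |
while IFS= read -r url; do
  wget "$url"
done
```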

curl or wget, though as noted, it’s strange that content is empty.

A lot of sites are not public, or for whatever reason their RSS doesn’t have full content. For example, most private torrent sites provide RSS, but you have to log in, sometimes with two-step authentication, to view the content.

FeedIron, which I have never used before, seems like a very powerful plugin for fetching the full content, but I don’t know if it supports sites that need a login.

I know curl or wget can fetch an HTML file, but I don’t think they can download the images inside it. I do need to see the images in the HTML content. Sorry I didn’t mention that before.

On the other hand, Firefox can save HTML files along with the images in them with no problem. Also, you only need to log in manually once and stay logged in for a long time with Firefox. However, manipulating the browser from scripts like AutoHotkey is a challenge for me, so I am open to ideas.

Ok, that gives some clarity. But from here on out, this is no longer a tt-rss issue. And since this site is dedicated to that platform alone, you’d be better off asking in a scripting forum. Best of luck.

@rodneys_mission
Fair enough.

Right now, I am using pgAdmin 4 to connect to my PostgreSQL instance in a Docker container and run the query. After that, I have to save the result into a CSV file. It’s quite a pain.

However, I can connect to PostgreSQL through SSH on the Linux box like this:

docker exec -it postgres sh

psql -U postgres ttrss

And run the query:

SELECT e.link FROM ttrss_entries e

JOIN ttrss_user_entries ue ON e.id = ue.ref_id

WHERE ue.unread=true AND ue.feed_id=21;

Could you please help me write a script to dump the query result into a txt file? That would save me a lot of time! Thanks a lot!

Does this help?

@JustAMacUser It DOES!

I wonder if it is possible to wrap up everything below into a single script? These are the commands that work for me:

docker exec -it postgres sh
psql -U postgres ttrss
\o /var/lib/postgresql/dump.files/ttrss.url.txt

SELECT e.link FROM ttrss_entries e
JOIN ttrss_user_entries ue ON e.id = ue.ref_id
WHERE ue.unread=true
AND ue.feed_id=21;

\q
exit

You can generate the eventual command line in the SELECT statement and then grep the commands out. Basically, use psql to write your script. I’ve done a lot of similar things before. (Warning: only untested snippets; implementation is left up to the student.)

SELECT 'CMD=1 wget "' || e.link || '"' FROM ...

$ eval `grep CMD=1 /var/lib/postgresql/dump.files/ttrss.url.txt`

“CMD=1” is just a token to recognize the wget commands. You could also add a “-O” parameter with (title || '.html') or something like that to name the files better than what the URL may contain. “eval” runs all the wgets found in the dump file. Insta-scripting. :slightly_smiling_face:
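To make the trick concrete, here is a self-contained toy version: two fake dump lines stand in for the SELECT output, and echo stands in for wget so nothing is actually downloaded.

```shell
# Toy demo of the grep-and-eval trick. The dump file normally comes
# from psql; here two fake rows stand in, and "echo" replaces wget.
cat > ttrss.url.txt <<'EOF'
CMD=1 echo "https://example.com/article-1"
CMD=1 echo "https://example.com/article-2"
EOF

# Pull out only the generated commands and run them.
eval "$(grep CMD=1 ttrss.url.txt)" > fetched.log
```

After this, fetched.log holds one line per "downloaded" URL; swap echo back to wget for the real thing.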

Use your imagination and google.

@rodneys_mission

SELECT 'CMD=1 wget "' || e.link || '"' FROM …

$ eval `grep CMD=1 /var/lib/postgresql/dump.files/ttrss.url.txt`

My project has two steps:

  1. Save the unread articles’ URL into a file
  2. Open each URL in the list file, and save them as HTML files

I guess your script above is for step 2, using wget to download all the URLs, right? Well, as I mentioned before, that’s a no-go, because I need all the images and CSS inside the HTML, and Firefox’s “Save As” handles that.

For step 2, I found a DOS batch file that can open URLs in Firefox tabs:
for /F "eol=c tokens=1" %%i in (%1) do "C:\...\Firefox.exe" %%i
Save the above as multiurl.bat and run:
multiurl.bat urls.txt
where urls.txt is the list of URLs.

Then I have an AutoHotkey script that saves all the tabs to HTML files.
Not perfect, but it works.

For step 1, I need a single linux script to wrap up the commands, which is what I asked in the previous post.

@shabble Thanks!
I tried this:
wget -p -k https://www.newegg.ca/todays-deals/
It downloaded one file, index.html. It also converted all the links, so I can open the HTML file in a browser.
Without the -p -k options, wget still downloads the index.html file, but I cannot open it in a browser.
However, it’s hard for wget to support logins, CAPTCHAs, etc., so I will stick with my Firefox solution.
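For the login part specifically, wget can at least reuse an existing browser session via an exported cookies file. The flags below are real wget options; the cookies.txt file name and the URL are placeholders, and CAPTCHA-gated pages would still need the browser.

```shell
# -p grabs page requisites (images, CSS), -k rewrites links for local
# viewing, -E adds .html extensions where missing, and --load-cookies
# reuses a session exported from the browser as cookies.txt.
wget -p -k -E --load-cookies cookies.txt "https://example.com/article"
```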

I know it’s a Linux thing, but maybe the Linux gurus here can help me out.

My goal is to run the scripts below in one command.

docker exec -it postgres sh
psql -U postgres ttrss
\o /var/lib/postgresql/dump.files/ttrss.url.txt

SELECT e.link FROM ttrss_entries e
JOIN ttrss_user_entries ue ON e.id = ue.ref_id
WHERE ue.unread=true
AND ue.feed_id=21;

\q
exit

I know that once I run the first docker command, I need some way of passing the rest of the commands.

1st try:
put the 2nd command and the rest into the file pg.input.txt
docker exec -it postgres sh < pg.input.txt
Error:
the input device is not a TTY

2nd try:
remove -it option:
docker exec postgres sh < pg.input.txt
no error, but no output either

3rd try:
docker exec -it postgres /bin/sh -c "psql -U postgres ttrss;\q"
output:
psql (10.9 (Ubuntu 10.9-1.pgdg18.04+1))
Type "help" for help.
ttrss=#

Getting close. The first command, psql, got passed, but the 2nd one, “\q”, didn’t.

Any help please?
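For what it’s worth, the usual fix for both the TTY error and the swallowed “\q” is to skip the interactive shell entirely: drop -it, run psql directly, and pass the query with its -c flag, redirecting the output on the host side. An untested sketch using the container, database, and feed_id from this thread:

```shell
# Non-interactive: no TTY needed, so no "-it". psql -t -A prints one
# bare URL per row; the redirect happens on the host side of docker exec.
docker exec postgres psql -U postgres -d ttrss -t -A \
  -c "SELECT e.link FROM ttrss_entries e
      JOIN ttrss_user_entries ue ON e.id = ue.ref_id
      WHERE ue.unread = true AND ue.feed_id = 21;" > ttrss.url.txt
```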

Wouldn’t all this just be easier to set up as a TT-RSS plugin? I mean, you want to read the content and you have the links in the database. Just code a plugin to fetch the full content when it’s updating the feed. It might not work for your older articles* but once it’s set up that would never be an issue again.

*You could run a PHP CLI script that does the older articles, if you wanted.

All of this: TT-RSS to get the article links, a bash script that hooks into the PostgreSQL Docker instance to get the links, which then pushes them into a file, which Firefox then reads, and then a Firefox plugin gets the content and saves it. This seems like the most convoluted process for something that’s relatively simple.

update:
I have reduced the 6 commands to 3: the first gets into the Docker container, the 2nd runs the query and saves the result, and the 3rd exits the container:

docker exec -it postgres sh
/var/lib/postgresql/pg3.sh
exit

In order to do that, I created 3 scripts in /var/lib/postgresql/:
pg1.sh
psql -U postgres ttrss

pg2.sh
\o /var/lib/postgresql/dump.files/ttrss.url.txt
SELECT e.link FROM ttrss_entries e
JOIN ttrss_user_entries ue ON e.id = ue.ref_id
WHERE ue.unread=true
AND ue.feed_id=21;
\q
exit

pg3.sh
/var/lib/postgresql/pg1.sh < /var/lib/postgresql/pg2.sh
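An alternative to the three-script chain is a single heredoc fed to psql through docker exec -i. An untested sketch using the same paths and names as above:

```shell
#!/bin/sh
# One script replacing pg1/pg2/pg3: "-i" keeps stdin open (no TTY, so
# no "-it"), and the heredoc supplies everything psql would otherwise
# read interactively, including the \o redirect inside the container.
docker exec -i postgres psql -U postgres ttrss <<'SQL'
\o /var/lib/postgresql/dump.files/ttrss.url.txt
SELECT e.link FROM ttrss_entries e
JOIN ttrss_user_entries ue ON e.id = ue.ref_id
WHERE ue.unread = true AND ue.feed_id = 21;
\q
SQL
```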

@JustAMacUser
Wouldn’t all this just be easier to set up as a TT-RSS plugin?

I don’t know much about PHP. I tried the Readability plugin, and it’s hit and miss. But the biggest reason is that it doesn’t support private sites that need a sign-in (I don’t think any of the full-content plugins can), such as private torrent sites or other CAPTCHA-protected sites.

This is straying way outside the TT-RSS domain. Pick up a Linux or PHP for Dummies book, or an O’Reilly book.

Why not, instead of saving and then opening the files one by one, just type “O” or click the article headline and read them one by one from Firefox/TT-RSS? I don’t see an advantage to “bulk” downloading your articles. Given this is for torrent sites, it sounds more like scraping directory listings or something we shouldn’t get mixed up in. :/