Readability got me banned from a server?

Hey all,

Fox, if this post is inappropriate, I apologize. I wasn’t sure where else to ask.

I use Readability on many of my feeds without any problems. However, I’ve recently found myself locked out of one feed in particular. I reached out to the site on Twitter, and they said that they banned my IP because I was making too many requests. All my feeds are set to fetch once per hour, so I was a bit confused by that. Then I thought maybe it was Readability that was hitting their server so much.

Has anyone else experienced being banned/rate limited because of Readability?

this sounds incredibly petty. if these people even noticed, you might be their only visitor in the first place, and one they obviously don’t care for enough.

It makes sense: if the site published, for example, 20 new articles and you use Readability, that would be at least 21 requests within a few seconds, and more if you have media caching enabled.

But… it’s not like any of this can kill a server. If a server has difficulty handling that, it’s probably too low-spec for that website anyway.

There are some solutions, but none are elegant. You might be able to get another IP from your provider and rotate them, or copy the plugin and add some code to put a delay between requests.

e: The number of requests TT-RSS would make is well below that of a DoS attack, so this person’s reaction is odd.

stuff like that is easily handled via frontend rate-limiting.

not that you would ever notice this in any actual IRL scenario unless your only passion in life is reading httpd logs of your website that nobody ever visits.

personally i would :face_with_raised_eyebrow: and unsubscribe.

Some sites are looking for content scrapers and that might be their concern with this kind of limit.

Hi all,

This is the site owner here :smiley: I thought I’d clarify what the issue is, in the hope that there are some settings that can be tweaked within Readability both for the OP and for any potential future issues on the site - and also to double-check my feed is being generated correctly!

The issue I was having was that the server was under a lot of pressure and, due to the nature of my site, I suspected that it was being targeted by scrapers. While looking through the access logs to see what was potentially causing the issue, I noticed over 150 requests from a single IP address within a very short space of time - far too many for it to be a human clicking links in an RSS reader (the first hit was always the feed and the User-Agent was TT-RSS). This seemed like a lot of requests, so I looked through the full month’s logs for October and noticed that there were over 22,000 - more than 700 page loads per day. This seemed excessive, so I blocked the IP address (along with a few others that looked to be programmatically scraping the site’s content, based on the nature of the URLs being attempted).

The OP contacted me, so we started looking into the cause. Now, to me, an RSS reader would usually keep a log of which articles it has ‘seen’ and only look for newer ones, whereas in this case it seemed to be pulling in all 140 or so articles each hour. So my question is: is this the expected behaviour of the Readability plugin, or should it, as I believe, keep a cache or log of which GUIDs it has already processed?

I am also wondering if it is an issue with my RSS feed. The first couple of entries are below (I’ve edited out the site details and URLs so it’s not classed as advertising on a first post!).

<channel>
<title>Site Name</title>
<link>https://the.website.info</link>
<description>RSS feed for new additions over the last 5 days to the website</description>
<language>en-us</language>
<pubDate>Tue, 12 Nov 2019 01:07:09 GMT</pubDate>
<lastBuildDate>Tue, 12 Nov 2019 01:07:09 GMT</lastBuildDate>
<managingEditor>[email protected] (Name)</managingEditor>
<webMaster>[email protected] (Name)</webMaster>
	<item>
		<title>12th Nov: Title and Info</title>
		<link>https://the.website.info/info/123456</link>
		<description>Snippet from the post.<![CDATA[<br /><img src="https://the.website.info/images/theimage.jpg" />]]></description>
		<pubDate>Tue, 12 Nov 2019 01:07:09 GMT</pubDate>
		<guid>https://the.website.info/info/123456</guid>
	</item>
	<item>
		<title>12th Nov: Another Title and Info</title>
		<link>https://the.website.info/info/654321</link>
		<description>Snippet from this post.<![CDATA[<br /><img src="https://the.website.info/images/theimage2.jpg" />]]></description>
		<pubDate>Mon, 11 Nov 2019 21:09:11 GMT</pubDate>
		<guid>https://the.website.info/info/654321</guid>
	</item>

etc etc

The main pubDate and lastBuildDate are set to match the most recent item in the feed, and each item has its own relevant pubDate.

Thanks in advance for any information.

post your actual feed URL. neither readability nor any other plugin should run unless ttrss thinks the article has been updated.

however, ttrss doesn’t just rely on pub date because it’s unreliable; instead it calculates a hash of the entire article.

if something in your articles is changing dynamically, you might run into this problem.
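
roughly, the idea behind the hash check is something like this (a simplified illustration only, not the actual tt-rss code):

// simplified sketch: hash the collected article fields so changes can be detected cheaply
$article_hash = sha1($title . $content . $link . $language . $author);

if ($article_hash !== $stored_hash) {
    // article is new or changed: run filter plugins (af_readability etc.), store the new hash
} else {
    // article is considered up to date: plugins are skipped entirely
}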

Hi Fox,

It’s not letting me share a link - I assume because I’m new here? I’m sure you can decode:

The feed is: usa dot newonnetflix dot info slash feed

The data in the articles/items doesn’t change. From 1pm GMT any new items are added every 2 minutes; this could be between 0 and 100 per day. Usually by 2pm GMT these will be complete and the feed won’t change at all until the following day. Even if the feed does change between TT-RSS polls, there will only be new items - the previous items in the feed will not have changed in any way (although anything older than 5 days would no longer be in the feed).

If a new item in the feed were detected, would that tell the app/service to check everything or just the new item(s)?

Hope that makes sense?

@Matthew,

I’ve updated your account, so you should be able to post links now.

TT-RSS uses the guid tag to uniquely identify a feed item, or the link tag if guid is missing. It absolutely does check this each time the feed is fetched for the express purpose of knowing whether it has already added each article to its database.
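
Roughly speaking, that identification works something like this (a sketch of the idea, not the actual TT-RSS code; article_exists() is just a made-up helper for illustration):

// prefer <guid>, fall back to <link> when the feed doesn't provide one
$entry_guid = !empty($item_guid) ? $item_guid : $item_link;

// the article is only inserted when this identifier isn't already in the database
if (!article_exists($entry_guid)) {
    // new article: process and store it
}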

I took a look at the feed you posted above and everything seems in order. There should be no reason for TT-RSS to hit your site and fetch the same articles over and over again.

I agree that’s a little ridiculous for a single user.

@Reader_Refugee,

What do you have your purge articles setting at? Please check the individual feed, the global setting, and your config.php file. If the purge setting is too low, TT-RSS will drop articles from the database only to fetch them again later.
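
For reference, the config.php setting I mean is the forced purge interval; if I remember right it looks like the line below, where 0 means no global override and the per-feed/per-user preferences apply:

// in config.php (from memory - please double-check against config.php-dist)
define('FORCE_ARTICLE_PURGE', 0); // 0 = no forced purge; non-zero = purge articles older than that many days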

Also, do you have any other plugins enabled? If so, which ones?

only new items, of course. unless the previous articles were also modified somehow during this update OR their guid (in your feed’s case, link) has changed.

so far i’m not seeing anything wrong with your feed.

one minor issue is that you’re not returning HTTP Last-Modified, which prevents conditional requests (HTTP If-Modified-Since) from working; you might want to look into that if you want to minimize RSS reader traffic.
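
in the script that generates the feed, sending the header could look something like this (just a sketch; $latest_item_ts stands in for however you already know the newest item’s timestamp):

// sketch: emit Last-Modified based on the newest item in the feed
header('Last-Modified: ' . gmdate('D, d M Y H:i:s', $latest_item_ts) . ' GMT');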

Do you mean the RSS feed or the article pages themselves?

RSS feed, of course.

Cheers - now added :slight_smile:

yep, last-modified is there now but you’re not returning HTTP 304 not modified on a conditional request:

[13:13:41/19351] last unconditional update request: 2019-11-13 13:13:36
[13:13:41/19351] not using CURL due to open_basedir restrictions
[13:13:41/19351] stored last modified for conditional request: Tue, 12 Nov 2019 22:07:09 GMT
[13:13:41/19351] fetching [https://usa.newonnetflix.info/feed] (force_refetch: )...
[13:13:41/19351] fetch done.
[13:13:41/19351] source last modified: Tue, 12 Nov 2019 22:07:09 GMT

the transcript should look like this instead:

[13:14:38/20146] start
[13:14:38/20146] local cache will not be used for this feed
[13:14:38/20146] last unconditional update request: 2019-11-13 07:57:56
[13:14:38/20146] not using CURL due to open_basedir restrictions
[13:14:38/20146] stored last modified for conditional request: Thu, 15 Aug 2019 11:07:30 GMT
[13:14:38/20146] fetching [https://fakecake.org/testfeeds/random.xml] (force_refetch: )...
[13:14:38/20146] fetch done.
[13:14:38/20146] source last modified: Thu, 15 Aug 2019 11:07:30 GMT
[13:14:38/20146] unable to fetch: HTTP/1.1 304 Not Modified [304]
[13:14:38/20146] source claims data not modified, nothing to do.

by the way, i’ve just run your feed manually again and there was one new (or updated) item:

[13:11:52/18280] start
[13:11:53/18280] local cache will not be used for this feed
...
[13:11:53/18280] processing articles...
[13:11:53/18280] guid 2,https://usa.newonnetflix.info/info/81034946 / SHA1:429215afba4703cb273af127ef31d3b8f95be216
[13:11:53/18280] orig date: 1573596429
[13:11:53/18280] title 13th Nov: Maradona in Mexico (2020), Limited Series [TV-MA] (6/10)
[13:11:54/18280] link https://usa.newonnetflix.info/info/81034946
[13:11:54/18280] language en
[13:11:54/18280] author 
[13:11:54/18280] looking for tags...
[13:11:54/18280] tags found: 
[13:11:54/18280] done collecting data.
[13:11:54/18280] article hash: 22c03ba0648e517b7778b90d1a4c035558a0fb30 [stored=]
[13:11:54/18280] hash differs, applying plugin filters:
[13:11:54/18280] ... Af_Comics
[13:11:54/18280] === 0.0000 (sec)
[13:11:54/18280] ... Af_GoodShowSir
[13:11:54/18280] === 0.0003 (sec)
[13:11:54/18280] ... Af_Psql_Trgm
[13:11:54/18280] === 0.0011 (sec)
[13:11:54/18280] ... Af_Readability
[13:11:54/18280] === 0.0000 (sec)
[13:11:54/18280] ... Af_RedditImgur
[13:11:54/18280] === 0.0000 (sec)
[13:11:54/18280] ... Af_Tumblr_1280
[13:11:54/18280] === 0.0000 (sec)
[13:11:54/18280] ... Auto_Assign_Labels
[13:11:54/18280] === 0.0016 (sec)
[13:11:54/18280] ... Af_Img_Phash
[13:11:54/18280] === 0.0000 (sec)
[13:11:54/18280] ... Af_Video_Fill_Poster
[13:11:54/18280] === 0.0002 (sec)
[13:11:54/18280] plugin data: af_comics,af_goodshowsir,af_psql_trgm,af_readability,af_redditimgur,af_tumblr_1280,auto_assign_labels,af_img_phash,af_video_fill_poster,
[13:11:54/18280] date 1573596429 [2019/11/12 22:07:09]
[13:11:54/18280] num_comments: 0
[13:11:54/18280] force catchup: 
[13:11:54/18280] base guid [2,https://usa.newonnetflix.info/info/81034946 or SHA1:429215afba4703cb273af127ef31d3b8f95be216] not found, creating...
[13:11:54/18280] base guid found, checking for user record
[13:11:54/18280] initial score: 0 [including plugin modifier: 0]
[13:11:54/18280] user record not found, creating...
[13:11:54/18280] resulting RID: 9952107, IID: 6596795
[13:11:54/18280] article updated, but we're forbidden to mark it unread.
[13:11:54/18280] assigning labels [other]...
[13:11:54/18280] assigning labels [filters]...
[13:11:54/18280] looking for enclosures...
[13:11:54/18280] article processed
[13:11:54/18280] guid 2,https://usa.newonnetflix.info/info/81078466 / SHA1:8c2c97526f4c5cd17b27b6ef9d1a1a5b398cdcd8
[13:11:54/18280] orig date: 1573520829
[13:11:54/18280] title 12th Nov: Jeff Garlin: Our Man In Chicago (2019), 58m [TV-MA] (6/10)
[13:11:54/18280] link https://usa.newonnetflix.info/info/81078466
[13:11:54/18280] language en
[13:11:54/18280] author 
[13:11:54/18280] looking for tags...
[13:11:54/18280] tags found: 
[13:11:54/18280] done collecting data.
[13:11:54/18280] article hash: ef23215fddee52f6045b878eab54ed7627c4c435 [stored=ef23215fddee52f6045b878eab54ed7627c4c435]
[13:11:54/18280] stored article seems up to date [IID: 9951104], updating timestamp only
[13:11:54/18280] guid 2,https://usa.newonnetflix.info/info/80178941 / SHA1:c307fd8089b0314282140a649d57caeed532ba2d
[13:11:54/18280] orig date: 1573506551
[13:11:54/18280] title 12th Nov: Harvey Girls Forever! (2019), 3 Seasons [TV-Y7] - New Episodes (6.35/10)
[13:11:54/18280] link https://usa.newonnetflix.info/info/80178941
[13:11:54/18280] language en
[13:11:54/18280] author 
[13:11:54/18280] looking for tags...
[13:11:54/18280] tags found: 
[13:11:54/18280] done collecting data.
[13:11:54/18280] article hash: db7a7edf7f33cc4b075356b9b05c3b70209ed633 [stored=db7a7edf7f33cc4b075356b9b05c3b70209ed633]
[13:11:54/18280] stored article seems up to date [IID: 9951105], updating timestamp only
[13:11:54/18280] guid 2,https://usa.newonnetflix.info/info/81070963 / SHA1:1ecac424811f3af42cba53aa2c62fe69f68b005f
[13:11:54/18280] orig date: 1573456151
[13:11:54/18280] title 11th Nov: Chief of Staff (2019), 2 Seasons [TV-14] - New Episodes (6.95/10)
[13:11:54/18280] link https://usa.newonnetflix.info/info/81070963
[13:11:54/18280] language en
[13:11:54/18280] author 
[13:11:54/18280] looking for tags...
[13:11:54/18280] tags found: 
[13:11:54/18280] done collecting data.
[13:11:54/18280] article hash: d64547edb5e3011415f788cc6b18d8aed9110ebb [stored=d64547edb5e3011415f788cc6b18d8aed9110ebb]
[13:11:54/18280] stored article seems up to date [IID: 9951106], updating timestamp only
...

as you can see, plugins like Readability are only applied to the new item; everything else is skipped. so it looks like your feed is working properly.

Hmm, I’ll have to look into that as I’m not sure, off-hand, how I would do that programmatically when the feed is generated.

Yes, one new addition today.

I’m guessing then that it could be down to settings in Readability - the caching that was mentioned earlier in the thread.

Thanks for the info and checking things.

readability doesn’t do any caching. plugins are simply not run, at all, if the article is considered up to date. it’s just skipped during the update process.

whatever issue OP is having is not related to the normal update process or your feed contents. like @JustAMacUser posted above, it could be his aggressive purging settings.

if (if-modified-since request header == timestamp of latest article)
     return http 304 (and don't generate any content)
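
in PHP terms that could be something along these lines (a sketch only, with $latest_item_ts again standing in for the newest item’s timestamp):

// sketch: answer a conditional request with 304 and no body when nothing changed
$since = $_SERVER['HTTP_IF_MODIFIED_SINCE'] ?? '';
if ($since !== '' && strtotime($since) >= $latest_item_ts) {
    http_response_code(304); // Not Modified
    exit;                    // don't generate any feed content
}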

Sorry, I meant purging, not caching :slight_smile:

I’ve now implemented the 304 status based on the if-modified-since header. Thanks for the advice on that.

One other question on that though: how is that header sent in TT-RSS? Does it use the date it last checked or the date of the most recent article it logged in the database?

oh, at first i thought i could just send the latter. and then i encountered all the broken servers.

which is why tt-rss now stores Last-Modified verbatim and sends it back to the server on the next request. which seems to work alright, for the most part. it’s also why tt-rss forces unconditional requests periodically, just in case the server is broken or misconfigured.
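
as a rough illustration of that flow (not the actual tt-rss code):

// on a successful fetch, keep the server's Last-Modified header verbatim
$stored_last_modified = $response_headers['last-modified'] ?? '';

// on the next poll, send it back unchanged - unless it's time for one of the
// periodic unconditional requests that guard against broken/misconfigured servers
if ($stored_last_modified !== '' && !$force_unconditional) {
    $request_headers[] = 'If-Modified-Since: ' . $stored_last_modified;
}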

it’s really terrible when you think about it.

@JustAMacUser

My purge settings are 30 in prefs and 0 in config. My thinking is that I’m reading daily, not trying to start a library. Should I be purging less frequently?

@fox Thanks for putting in the extra effort to get this sorted. I know you’re busy and I’m grateful for your time.

Something is causing TT-RSS to refetch all these articles. What other plugins are you using?