Earth Notes: On Website Technicals (2024-05)

Updated 2024-05-16.
Tech updates: RED 406 and 429, conditional paranoia, mod_dumpio, time hash, mod_log_forensic, Sec-CH-UA-Mobile, regex remove, slow rate limit, NTP...
The RSS efficiency side-quest continues...

2024-05-16: ntp1 is Dead, Long Live ntp2!

I continue to get a huge amount of NTP traffic, ie many hits per second, though my server has not been stratum-1 for years. I suspect that my server name has been baked into software in routers, etc. So I have removed ntp1.exnet.com from DNS to see if that reduces the flow a little. I may rename ntp.exnet.com too, if the deluge continues!

A tiny snippet from tcpdump:

09:25:06.712266 IP XXXX > 79.135.97.79.123: NTPv4, Client, length 48
09:25:06.727073 IP XXXX > 79.135.97.78.123: NTPv4, Client, length 48
09:25:06.729648 IP XXXX > 79.135.97.78.123: NTPv3, Client, length 48
09:25:06.730250 IP 79.135.97.78.123 > XXXX NTPv3, Server, length 48
09:25:06.734552 IP XXXX > 79.135.97.79.123: NTPv4, Client, length 48
09:25:06.750494 IP XXXX > 79.135.97.78.123: NTPv4, Client, length 48
09:25:06.750879 IP 79.135.97.78.123 > XXXX NTPv4, Server, length 48
09:25:06.797435 IP XXXX > 79.135.97.79.123: NTPv4, Client, length 48

2024-05-13: Slow Rate Limiting

For a slow-update podcast feed such as mine, polling maybe once or twice per day is reasonable. Polling once per hour is customary, though still far too much. Polling much more than once per hour is foolish. One poll per hour for a week is 168 polls (8 days' worth is 192). It would be relatively easy to automate an analysis of the previous week's feed polls, and block with 429s outside the magic noon free-for-all window, regardless of (say) battery level, all or nearly all requests from bots that made more than ~200 requests and/or the top handful by request count or bytes. (Rejections could still be restricted to unconditional requests...)

For this last week that would have included:

1711 11259905 200:304:406:429:SH 242 0 567 902 696 "iTMS"
1647 8372482 200:304:406:429:SH 835 271 0 541 686 "Spotify/1.0"
1232 11778050 200:304:406:429:SH 997 0 8 227 393 "Amazon Music Podcast"
752 6162639 200:304:406:429:SH 524 0 0 228 165 "Podbean/FeedUpdate 2.1"
474 612922 200:304:406:429:SH 20 0 225 229 176 "-"
237 369734 200:304:406:429:SH 27 149 0 61 100 "Mozilla/5.0 (Windows NT 10.0; Win
64; x64) ..."
6"
219 1846087 200:304:406:429:SH 154 0 0 65 88 "Mozilla/5.0 (Linux;) AppleWebKit/
Chrome/ Safari - iHeartRadio"

A txt RewriteMap keyed on an escaped or hashed User-Agent might be enough, with only the very greedy bots being listed in the map at all.

I suspect that iTunes for one might throw a fit if reined in that hard.

There are complications around multiple independent users of one feed reader, polling of more than one feed (eg the podcast feed has two variants), polling of a feed in more than one step (eg HEAD then GET), spoofing (I do not believe that the first Mozilla User-Agent I have truncated above is an authentic browser), and single User-Agents with many random-ish cloud IP addresses. So maybe I should manage the map manually initially. Or set the threshold significantly higher and keep an eye on behaviour.

(My default cache expiry is a little over 4h, so any one client simply obeying that would only be making ~42 polls per week.)

When rejecting one of these with 429 it may be better to have any random rejection element varying slowly, eg by hour, so that naughty quick retries will not work. Then no goes on meaning no.

2024-05-14: First pass

Here is a first pass map file, with UAs shown truncated to reduce the chance of disclosing any PII:

# Greedy podcast feed pullers: keys are MD5 hashed User-Agent.
# A non-empty txt map lookup of the %{md5:%{HTTP:User-Agent}} means bad!
# Built: 2024-05-14T08:53+00:00
# MAXHITS: 400
# MAXUAS: 10
#----------------
# 1711 iTMS
97f76eb7e02c5ff923e1198ff1c288cd 4
# 1647 Spotify/1.0
4582d9bdbcef42af27d89da91c6eb804 4
# 1232 Amazon Music Po
d69be2563c9f1929edf2906d41809aea 3
# 752 Podbean/FeedUpda
54e0e9df937b06cc83fab29f44c02b7f 1
# 474 -
d41d8cd98f00b204e9800998ecf8427e 1

And here is sample Apache configuration to enforce it:

# Aggressively reject unconditional greedy podcast bots, other than noon.
RewriteMap greedyRSSBotMap "txt:/path/to/rss/greedy-podcast-bot-map.txt"
RewriteCond "%{TIME_HOUR}" "!=12"
RewriteCond %{HTTP_REFERER} ^$
# DHD20240512: using bogus very old IMS date is rude!
RewriteCond %{HTTP:If-Modified-Since} " 202[0123] " [NV,OR]
RewriteCond %{HTTP:If-Modified-Since} ^$ [NV]
#RewriteCond %{HTTP:If-None-Match} ^$ [NV]
# Reject almost all for random entire hours, to defeat quick retries.
RewriteCond expr "'%{md5:%{TIME_DAY}%{TIME_HOUR}}' =~ /^[^01]/"
# Have any interaction with the filesystem as late as possible.
# Bot's UA hash is in the greedyRSSBotMap.
RewriteCond "${greedyRSSBotMap:%{md5:%{HTTP:User-Agent}}}" !^$ [NV]
RewriteRule "^/rss/podcast.*\.rss$" - [L,R=429,E=RSS_RATE_LIMIT:1]

This does not work because it seems that only constants and some variables (eg %{HTTP:User-Agent}) but not environment variables can appear as the key for the map. So it is all commented out again for now. I could possibly do something based on masked IP, which should be less easily spoofable, but would not catch cloud bots.

Here is a rather more brutal manual version:

# Aggressively reject unconditional greedy podcast bots, other than noon.
RewriteCond "%{TIME_HOUR}" "!=12"
RewriteCond %{HTTP_REFERER} ^$
# DHD20240512: using bogus very old IMS date is rude!
RewriteCond %{HTTP:If-Modified-Since} " 202[0123] " [NV,OR]
RewriteCond %{HTTP:If-Modified-Since} ^$ [NV]
#RewriteCond %{HTTP:If-None-Match} ^$ [NV]
## Reject almost all for random entire hours, to defeat quick retries.
#RewriteCond expr "'%{md5:%{TIME_DAY}%{TIME_HOUR}}' =~ /^[^01]/"
# Have any interaction with the filesystem as late as possible.
# DHD20240514: Bad bot manually selected by UA.
RewriteCond "%{HTTP:User-Agent}" "^iTMS$" [NV,OR]
RewriteCond "%{HTTP:User-Agent}" "^Spotify/1.0$" [NV,OR]
RewriteCond "%{HTTP:User-Agent}" "^Amazon Music Podcast$" [NV,OR]
RewriteCond "%{HTTP:User-Agent}" "^Podbean/FeedUpdate 2.1$" [NV,OR]
RewriteCond "%{HTTP:User-Agent}" ^$ [NV]
RewriteRule "^/rss/podcast.*\.rss$" - [L,R=429,E=RSS_RATE_LIMIT:2]

This could be adjusted to allow all requests when battery is VHIGH, and/or a small fraction of requests as commented out in the above.

Here is a better dynamic version that only needs hash.flag files set to reject access, and is nice when VHIGH:

# Aggressively reject unconditional greedy podcast bots, other than noon.
RewriteCond "%{TIME_HOUR}" "!=12"
RewriteCond %{HTTP_REFERER} ^$
# DHD20240512: using bogus very old IMS date is rude!
RewriteCond %{HTTP:If-Modified-Since} " 202[0123] " [NV,OR]
RewriteCond %{HTTP:If-Modified-Since} ^$ [NV]
#RewriteCond %{HTTP:If-None-Match} ^$ [NV]
## Reject almost all random entire hours, to defeat naughty quick retries.
#RewriteCond expr "'%{md5:%{TIME_DAY}%{TIME_HOUR}}' =~ /^[^01]/"
# DHD20240514: greedy/bad bots manually selected by UA, or no UA.
#RewriteCond "%{HTTP:User-Agent}" "^iTMS$" [NV,OR]
#RewriteCond "%{HTTP:User-Agent}" "^Spotify/1.0$" [NV,OR]
#RewriteCond "%{HTTP:User-Agent}" "^Amazon Music Podcast$" [NV,OR]
#RewriteCond "%{HTTP:User-Agent}" "^Podbean/FeedUpdate 2.1$" [NV,OR]
RewriteCond "%{HTTP:User-Agent}" ^$ [NV,OR]
# Have any interaction with the filesystem as late as possible.
# Bot's hashed UA appears in flags dir?
RewriteCond expr "-f '%{DOCUMENT_ROOT}/rss/greedybot/%{md5:%{HTTP:User-Agent}}.flag'" [NV]
# Relent and allow request if battery SoC is (very) high.
RewriteCond /run/EXTERNAL_BATTERY_VHIGH.flag !-f
RewriteRule "^/rss/podcast.*\.rss$" - [L,R=429,E=RSS_RATE_LIMIT:2]

Note that this only blocks requests for the podcast RSS feeds, and the request count used to drive this only looks at their fetches.

2024-05-12: If-Modified-Since Regex Error

I think that my regex to try to validate If-Modified-Since is broken, so I am reverting to check that it is just non-empty for now. I have seen essentially none that have been badly syntactically broken, when looking at the forensics logs, so I should cut the complexity.

Also it seems that Apache 2.4 is using If-Modified-Since for an inequality check, not a validator or exact match only as one might hope, which is good.

Instead as a temporary measure I am treating bogus very old If-Modified-Since dates as if they were in fact absent for the purposes of random 429 rejections. This is temporary since it might be hard to keep when/if ETags are back, and would also need manual updating to match too-old-to-be-real. This mainly affects one not-too-greedy client, though is drafted more generally.

More bugs from complexity

Whoops! My attempt to serve lite HTML pages to clients that would not accept compression went wrong and was preventing me sending precompressed HTML content to some specialised clients, eg that only offer br (this tester for example). A fix to make it more robust is:

-  # No precompression usable, or client does not support even gzip compression.
-  RewriteCond %{HTTP:Accept-Encoding} !gzip [OR]
+  # No precompression usable, or client does not support gzip or br compression.
+  RewriteCond %{HTTP:Accept-Encoding} !((gzip)|(br)) [OR]
   RewriteCond %{HTTP:Sec-CH-UA-Mobile} "[?]1" [OR]
   RewriteCond %{HTTP:Save-Data} on [NC]
   RewriteCond %{DOCUMENT_ROOT}/m%{REQUEST_FILENAME} -s

The tester still gives misleading results from being sent a lite page where it offers no compression model at all for a desktop page, but this is a very special case. Almost no real client will offer just br, or indeed any single method other than gzip or identity.

GET / HTTP/2.0|Accept:*/*|User-Agent:Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.3497.100 Safari/537.36|Host:www.earth.org.uk
GET / HTTP/2.0|Accept:*/*|User-Agent:Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.3497.100 Safari/537.36|Accept-Encoding:gzip|Host:www.earth.org.uk
GET / HTTP/2.0|Accept:*/*|User-Agent:Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.3497.100 Safari/537.36|Accept-Encoding:br|Host:www.earth.org.uk
GET / HTTP/2.0|Accept:*/*|User-Agent:Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.3497.100 Safari/537.36|Accept-Encoding:zstd|Host:www.earth.org.uk

This tester also suggests that there is no point in using zstd for static compression, and little for dynamic over gzip -6, at least in terms of bandwidth and CPU at my end.

Compressible extensions

I have also being extending the list of file/URL extensions that should get on-the-fly gzip if not already covered by something else:

AddOutputFilter DEFLATE bib
AddOutputFilter DEFLATE csv
AddOutputFilter DEFLATE dat
AddOutputFilter DEFLATE log
AddOutputFilter DEFLATE md
AddOutputFilter DEFLATE mid
AddOutputFilter DEFLATE rss
AddOutputFilter DEFLATE sh
AddOutputFilter DEFLATE svg
AddOutputFilter DEFLATE tsv
AddOutputFilter DEFLATE txt
AddOutputFilter DEFLATE vtt
AddOutputFilter DEFLATE xml

This should catch a few more stragglers.

Apple, what are you doing?

This sort of nonsense in my logs annoys me, suggesting that some Apple engineer wants to pretend that HTTP is a generic RPC mechanism and ignore most of HTTP's conventions and etiquette thus wasting lots of CPU and bandwidth (edited Apache mod_log_forensic extract):

HEAD /rss/podcast.rss HTTP/1.1|User-Agent:iTMS|Host:www.earth.org.uk|grpc-timeout:30000m
HEAD /rss/podcast.rss HTTP/1.1|User-Agent:iTMS|Host:www.earth.org.uk|grpc-timeout:30000m
GET /rss/podcast.rss HTTP/1.1|User-Agent:iTMS|Host:www.earth.org.uk|grpc-timeout:30000m
HEAD /img/wordcloud/podcast-1.png HTTP/1.1|User-Agent:iTMS|Host:www.earth.org.uk|grpc-timeout:30000m
HEAD /img/meet/RCK/people-posterised-sq-800w.png HTTP/1.1|User-Agent:iTMS|Host:www.earth.org.uk|grpc-timeout:30000m
GET /img/meet/RCK/people-posterised-sq-800w.png HTTP/1.1|User-Agent:iTMS|Host:www.earth.org.uk|grpc-timeout:60000m
HEAD /img/audio/podcast-furniture/title/diarycast-1.png HTTP/1.1|User-Agent:iTMS|Host:www.earth.org.uk|grpc-timeout:30000m
GET /img/audio/podcast-furniture/title/diarycast-1.png HTTP/1.1|User-Agent:iTMS|Host:www.earth.org.uk|grpc-timeout:60000m
HEAD /img/OpenTRV/DORM1-wall-scene-sq-800w.jpg HTTP/1.1|User-Agent:iTMS|Host:www.earth.org.uk|grpc-timeout:30000m
GET /img/OpenTRV/DORM1-wall-scene-sq-800w.jpg HTTP/1.1|User-Agent:iTMS|Host:www.earth.org.uk|grpc-timeout:60000m
HEAD /img/audio/podcast-furniture/title/fieldrecording-1.png HTTP/1.1|User-Agent:iTMS|Host:www.earth.org.uk|grpc-timeout:30000m
GET /img/audio/podcast-furniture/title/fieldrecording-1.png HTTP/1.1|User-Agent:iTMS|Host:www.earth.org.uk|grpc-timeout:60000m
...

Mastodon not allowing Brotli

One small factor that I reported as a issue that could reduce the impact of Mastodon preview stampedes would be allowing Brotli compression of the HTML which would reduce data volume a little. I was told it did. Looking at a couple of forensics log entries suggests not, for fairly recent versions:

GET / HTTP/1.1|User-Agent:http.rb/5.1.1 (Mastodon/4.2.8; +https%3a//0rb.it/)|Host:www.earth.org.uk|Date:Sun, 12 May 2024 13%3a34%3a08 GMT|Accept-Encoding:gzip|Accept:text/html|Connection:close
GET / HTTP/1.1|User-Agent:http.rb/5.1.1 (Mastodon/4.2.7; +https%3a//m.trisweb.com/)|Host:www.earth.org.uk|Date:Sun, 12 May 2024 13%3a34%3a17 GMT|Accept-Encoding:gzip|Accept:text/html|Connection:close

Non-skipHours 429s

The non-skipHours random-fraction 429 rejections will now only happen during some sort of resource constraint, such as high-grid intensity or low battery. So now:

  • during skipHours, no-Referer unconditional requests, when grid intensity is high or battery is low, anything other than squeaky clean, may be rejected with 429
  • other than during noon UTC, no-Referer unconditional requests, when grid intensity is high or battery is low, will have a small fraction randomly rejected with 429
  • other than during noon UTC, no-Referer unconditional incompressible requests, will almost all be rejected with 406 to reduce bandwidth to closer to what allowing compression would have yielded

The success Cache-Control max-age and failure 429 Retry-After vary by time of day to try to steer hits away from skipHours, and are always many hours.

2024-05-11: Liter Podcast Feed

Given that the result of me pointing out to Spotify that its If-Modified-Since header was slightly broken seems to have been for it to be removed entirely, preventing 304 results, I have reduced the maximum number of entries in the podcast 'lite' feed to 5. That reduces wasted bandwidth a little.

100547 rss/podcast.rss
 11035 rss/podcast.rssgz
  9097 rss/podcast.rssbr
  5491 rss/podcast-lite.rss
  1745 rss/podcast-lite.rssgz
  1461 rss/podcast-lite.rssbr

Spotify seems to have reinstated conditional GETs (it is getting 304 responses again) but the deed is done.

I am capturing a little more forensics to see what is happening. Yes, it seems to be back, though the date is now double-digit anyway.

GET /rss/podcast-lite.rss HTTP/1.1|User-Agent:Spotify/1.0|If-Modified-Since:Sat, 11 May 2024 18%3a06%3a34 GMT|Accept-Encoding:gzip, x-gzip, deflate|Host:www.earth.org.uk|Connection:keep-alive

2024-05-12: If-Modified-Since stopped again

For no obvious reason, this afternoon:

GET /rss/podcast-lite.rss HTTP/1.1|User-Agent:Spotify/1.0|If-Modified-Since:Sat, 11 May 2024 18%3a06%3a34 GMT|Accept-Encoding:gzip, x-gzip, deflate|Host:www.earth.org.uk|Connection:keep-alive
GET /rss/podcast-lite.rss HTTP/1.1|User-Agent:Spotify/1.0|If-Modified-Since:Sat, 11 May 2024 18%3a06%3a34 GMT|Accept-Encoding:gzip, x-gzip, deflate|Host:www.earth.org.uk|Connection:keep-alive
GET /rss/podcast-lite.rss HTTP/1.1|User-Agent:Spotify/1.0|Accept-Encoding:gzip, x-gzip, deflate|Host:www.earth.org.uk|Connection:keep-alive
GET /rss/podcast-lite.rss HTTP/1.1|User-Agent:Spotify/1.0|Accept-Encoding:gzip, x-gzip, deflate|Host:www.earth.org.uk|Connection:keep-alive

2024-05-10: Date Regex Relax

I have relaxed the regexes checking If-Modified-Since to accept 7 May as well as 07 May (see earlier) as a benign deviation from the spec that will allow a few more 304 responses.

# The official HTTP/1.1 (and newer) date format is fixed length of the form:
#     Tue, 07 May 2024 09:46:11 GMT
# and can be matched by the following:
#RewriteCond %{HTTP:If-Modified-Since} !"^(Mon|Tue|Wed|Thu|Fri|Sat|Sun),\ ([0-3][0-9])\ (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\ (20[0-9][0-9])\ ([01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9]\ GMT$" [NV]
# In practice a few otherwise reasonably well-behaved clients drop
# any leading zero on the date, which Apache seems to be able to parse, eg:
#     Tue, 7 May 2024 09:46:11 GMT
# This should be able to be matched by the following:
#RewriteCond %{HTTP:If-Modified-Since} !"^(Mon|Tue|Wed|Thu|Fri|Sat|Sun),\ ([0-3]?[0-9])\ (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\ (20[0-9][0-9])\ ([01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9]\ GMT$" [NV]

2024-05-08: Sec-CH-UA-Mobile: ?1

For automatically selecting the lite/mobile version of main pages (but not necessarily of images for example) I now treat Sec-CH-UA-Mobile: ?1 as equivalent to Save-Data: on.

Note that Sec-CH-UA-Mobile is 'experimental', but adding/removing it is easy in this case (in three places):

+  RewriteCond %{HTTP:Sec-CH-UA-Mobile} "[?]1" [OR]
   RewriteCond %{HTTP:Save-Data} on [NC]

(I saw this header in the forensics, though it only was on 6 requests out of ~1100, and none of those were HTML pages.)

This mechanism seems to be forcing lite EOU pages for my Fairphone 3 running Android 13 and Chrome 124.

Googlebot's mobile / smartphone fetch does not seem to set this flag.

2024-05-07: mod_log_forensic

This evening's fun is to try Apache 2.4 module mod_log_forensic, enabling it with a2enmod log_forensic and in the EOU config something like:

<IfModule log_forensic_module>
ForensicLog ${APACHE_LOG_DIR}/forensic.log
</IfModule>

Immediately I see a snippet such as this:

|GET /rss/podcast-lite.rss HTTP/1.1|User-Agent:Spotify/1.0|If-Modified-Since:Tue, 7 May 2024 09%3a46%3a22 GMT|

The 7 May rather than 07 May seems to be technically malformed for example, compared to:

|GET /rss/note-on-site-technicals.rss HTTP/1.1|Host:www.earth.org.uk|Connection:Keep-Alive|Accept-Encoding:gzip|If-Modified-Since:Tue, 07 May 2024 09%3a46%3a26 GMT|User-Agent:SpaceCowboys Android RSS Reader / 2.6.24(309)

Should I really get into editing malformed headers? I think that Apache probably accepts both forms. But I think that I can treat this variant as malformed to reject some requests randomly. Nominally three formats are acceptable, and extra whitespace should be tolerated.

Sun, 06 Nov 1994 08:49:37 GMT  ; RFC 822, updated by RFC 1123
Sunday, 06-Nov-94 08:49:37 GMT ; RFC 850, obsoleted by RFC 1036
Sun Nov  6 08:49:37 1994       ; ANSI C's asctime() format

Apache itself generates the 07 format for Date and Expires and Last-Modified.

% wget --compression=auto --debug -S -O out https://www.earth.org.uk/sitemap.atom
...
---response begin---
HTTP/1.1 200 OK
Date: Tue, 07 May 2024 18:15:00 GMT
Server: Apache
Upgrade: h2
Connection: Upgrade, Keep-Alive
Last-Modified: Tue, 07 May 2024 15:38:44 GMT
Vary: Accept-Encoding,Referer
Content-Encoding: gzip
Cache-Control: max-age=3600
Expires: Tue, 07 May 2024 19:15:00 GMT
X-Frame-Options: DENY
access-control-allow-origin: *
Content-Length: 1384
Keep-Alive: timeout=5, max=100
Content-Type: application/atom+xml

---response end---

Some Spotify requests get 304 responses, while others are rejected by the bad-unconditional-requests logic:

[07/May/2024:18:03:46 +0000] "GET /rss/podcast-lite.rss HTTP/1.1" 429 4348 "-" "Spotify/1.0"
[07/May/2024:18:10:46 +0000] "GET /rss/podcast-lite.rss HTTP/1.1" 304 3625 "-" "Spotify/1.0"
[07/May/2024:18:17:46 +0000] "GET /rss/podcast-lite.rss HTTP/1.1" 304 3625 "-" "Spotify/1.0"
[07/May/2024:18:24:46 +0000] "GET /rss/podcast-lite.rss HTTP/1.1" 304 3625 "-" "Spotify/1.0"
[07/May/2024:18:31:46 +0000] "GET /rss/podcast-lite.rss HTTP/1.1" 429 4348 "-" "Spotify/1.0"

gPodder

Interestingly this poking around seems to have found a minor issue in gPodder where it was retaining a stale ETag that had not been present in responses for several days.

Hurrah for forensics!

2024-05-06: 406/429 Noon Hole

There has just been more than 24h with high grid carbon intensity, and all iTMS (iTunes) poll attempts rejected with 429 or 406. I have added a noon hole, for an hour, when there is most likely to be some solar power off-grid and on, in the 406 defence. Even some of the semi-bad bots should be able to make a successful poll then. iTunes made it through the hole today!

...
[06/May/2024:11:21:26 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 3452 "-" "iTMS"
[06/May/2024:11:21:26 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 265 "-" "iTMS"
[06/May/2024:12:10:38 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 200 3571 "-" "iTMS"
[06/May/2024:12:10:38 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 200 384 "-" "iTMS"
[06/May/2024:12:10:38 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 101134 "-" "iTMS"
[06/May/2024:13:04:16 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 3452 "-" "iTMS"
[06/May/2024:13:04:16 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 265 "-" "iTMS"
...

I have also reinstated an all-day 429 rule for unconditional GETs, and given it a matching noon hole. So the common brute-force, every-hour, but compression-enabled bots should be able to get through it at least daily.

ETag paranoia

I have also added to the top of the EOU config unsetting any If-None-Match request header to avoid interfering with If-Modified-Since, so now the block is:

# When upgrading to a new-enough (eg 2.5) Apache, enable below
# and re-allow ETags elsewhere eg for large static image / audio / video.
#DeflateAlterETag Remove
# Until Apache 2.5, disable ETag entirely for this site.
FileETag none
Header unset ETag
RequestHeader unset If-None-Match

If-Modified-Since paranoia

I am testing validating If-Modified-Since with a regex, for this and the 429 case, rather than not just being empty or missing, something like:

^(Mon|Tue|Wed|Thu|Fri|Sat|Sun), ([0-3][0-9]) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) (20[0-9][0-9]) ([01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9] GMT$

mod_dumpio

I am trying out Apache's mod_dumpio which seems to be available in my server's build. I expect to need to issue a sudo a2enmod dump_io to enable it, or the equivalent in my ansible setup if I would like the module permanently available (I probably do not). (And a2dismod to disable when I have finished...)

In any case, I am making my use conditional on the module being present, which should avoid breaking things either way.

<IfModule mod_dumpio.c>
DumpIOInput On
LogLevel alert dumpio:trace7
</IfModule>

Seemingly this has to be at server level, not virtual-host level...

I made all that work, but the output does not seem very helpful! So I have turned it all off again for now.

[Mon May 06 12:29:24.857553 2024] [dumpio:trace7] [pid 20553:tid 1962402864] mod_dumpio.c(140): [client 79.135.97.65:63570] mod_dumpio: dumpio_in [init-blocking] 0 readbytes
[Mon May 06 12:29:24.912990 2024] [dumpio:trace7] [pid 20553:tid 1962402864] mod_dumpio.c(140): [client 79.135.97.65:63570] mod_dumpio: dumpio_in [readbytes-nonblocking] 65536 readbytes
[Mon May 06 12:29:24.913114 2024] [dumpio:trace7] [pid 20553:tid 1962402864] mod_dumpio.c(151): [client 79.135.97.65:63570] mod_dumpio: dumpio_in - 11
[Mon May 06 12:29:24.913302 2024] [dumpio:trace7] [pid 20553:tid 1962402864] mod_dumpio.c(140): [client 79.135.97.65:63570] mod_dumpio: dumpio_in [readbytes-nonblocking] 65536 readbytes
[Mon May 06 12:29:24.913359 2024] [dumpio:trace7] [pid 20553:tid 1962402864] mod_dumpio.c(151): [client 79.135.97.65:63570] mod_dumpio: dumpio_in - 11
[Mon May 06 12:29:24.913444 2024] [dumpio:trace7] [pid 20553:tid 1962402864] mod_dumpio.c(140): [client 79.135.97.65:63570] mod_dumpio: dumpio_in [readbytes-blocking] 65536 readbytes
[Mon May 06 12:29:24.922025 2024] [dumpio:trace7] [pid 20553:tid 1962402864] mod_dumpio.c(63): [client 79.135.97.65:63570] mod_dumpio:  dumpio_in (data-TRANSIENT): 148 bytes
[Mon May 06 12:29:24.922107 2024] [dumpio:trace7] [pid 20553:tid 1962402864] mod_dumpio.c(103): [client 79.135.97.65:63570] mod_dumpio:  dumpio_in (data-TRANSIENT): PRI * HTTP/2.0\r\n\r\nSM\r\n\r\n
[Mon May 06 12:29:24.922193 2024] [dumpio:trace7] [pid 20553:tid 1962402864] mod_dumpio.c(140): [client 79.135.97.65:63570] mod_dumpio: dumpio_in [readbytes-nonblocking] 65536 readbytes
[Mon May 06 12:29:24.922461 2024] [dumpio:trace7] [pid 20553:tid 1962402864] mod_dumpio.c(63): [client 79.135.97.65:63570] mod_dumpio:  dumpio_in (data-TRANSIENT): 1521 bytes
[Mon May 06 12:29:24.922498 2024] [dumpio:trace7] [pid 20553:tid 1962402864] mod_dumpio.c(103): [client 79.135.97.65:63570] mod_dumpio:  dumpio_in (data-TRANSIENT):
[Mon May 06 12:29:24.922905 2024] [dumpio:trace7] [pid 20553:tid 1962402864] mod_dumpio.c(140): [client 79.135.97.65:63570] mod_dumpio: dumpio_in [readbytes-nonblocking] 65536 readbytes
[Mon May 06 12:29:24.923008 2024] [dumpio:trace7] [pid 20553:tid 1962402864] mod_dumpio.c(63): [client 79.135.97.65:63570] mod_dumpio:  dumpio_in (data-TRANSIENT): 9 bytes
...

Time hash

I have adjusted the %{md5:%{TIME}} 406 randomiser to instead use TIME_MIN for its fastest-changing part so that a very quick (naughty) retry is more likely to get an unchanged result. (Eg from a bot that does a HEAD before each GET.) Thus bad behaviour should be less likely to get an undeserved 200. I could fold into the hash the IP address and/or User-Agent at a tiny extra CPU cost in order to reduce the risk of stampedes, but maybe also reducing (OS) cache effectiveness a tad.

This would make things harder for real humans attempting to sign up to the feed, or attempting a manual forced refresh, so is not used for 429 rejections, including daytime, of unconditional GETs.

Podbean in particular with its bad fast retry after 429 will evade this defence for now.

2024-05-05: Glossary Lite

I added a heat-pump entry to the glossary.

I also now have a lighter-weight version of the glossary built for non-desktop versions, showing at most one image per entry. Other trimming might make sense too.

2024-05-02: Random Early Drop 429

I now also randomly reject about half of unconditional GETs with 429s during skipHours even when everything else (eg battery) is OK. Half so that a human being manually signing up for the first time, or refreshing their feed, need typically retry once only. But even so, about halving the rest of the useless overnight bytes, and signalling more unhappiness clues for any engineer that looks. This should trim traffic from Amazon in particular.

RewriteCond expr "'%{md5:%{TIME}}' =~ /^[0-7]/" [OR]

I further extended this by just rearranging the clauses so that for any Referer-less unconditional fetch of an RSS feed file during skipHours, and for about 50% out of skipHours, the remainder of the 429 rejection rules come into play. So this should trim some daytime waste also.

RewriteCond %{HTTP_REFERER} ^$
RewriteCond %{HTTP:If-Modified-Since} ^$ [NV]
#RewriteCond %{HTTP:If-None-Match} ^$ [NV]
# Potentially reject ~50% at random to trim bandwidth, but allow manual retries.
RewriteCond expr "'%{md5:%{TIME}}' =~ /^[0-7]/" [OR]
# During skipHours.
RewriteCond "%{TIME_HOUR}" "<08" [OR]
RewriteCond "%{TIME_HOUR}" ">21"
# Not allowing compression is rude.
RewriteCond %{HTTP:Accept-Encoding} !gzip [OR]
# Not saying who you are (no User-Agent) is rude.
RewriteCond %{HTTP:User-Agent} ^$ [OR]
# Have any interaction with the filesystem as late as possible.
# Battery is low.
RewriteCond /run/EXTERNAL_BATTERY_LOW.flag -f [OR]
# Grid is high-intensity.
RewriteCond %{DOCUMENT_ROOT}/_gridCarbonIntensityGB.7d.red.flag -f
RewriteRule "^/rss/.*\.rss$" - [L,R=429,E=RSS_RATE_LIMIT:1]

I adjusted the 406 random rejection percentage down to compensate.

I also now have the Header always set Retry-After instruction set alongside (and to the same duration as) the ExpiresByType application/rss+xml dependent on the hour of day, in case the client actually reacts properly to 429! (Note the REDIRECT_ prefix on the variable name.)

<If "%{TIME_HOUR} -lt 8 || %{TIME_HOUR} -gt 21">
    # This should be long enough to jump out of skipHours in one go.
    ExpiresByType application/rss+xml "access plus 10 hours 7 minutes"
    Header always set Retry-After "36420" env=REDIRECT_RSS_RATE_LIMIT
</If>
<ElseIf "%{TIME_HOUR} -gt 17">
    # Jump expiry right over coming skipHours block.
    ExpiresByType application/rss+xml "access plus 14 hours 7 minutes"
    Header always set Retry-After "50620" env=REDIRECT_RSS_RATE_LIMIT
</ElseIf>
<Else>
    # Give podcast RSS and similar feeds a default expiry time of ~4h.
    ExpiresByType application/rss+xml "access plus 4 hours 7 minutes"
    Header always set Retry-After "14720" env=REDIRECT_RSS_RATE_LIMIT
</Else>

This is what 429 rejection headers for wget look like:

HTTP/1.1 429 Too Many Requests
Date: Thu, 02 May 2024 12:25:37 GMT
Server: Apache
Retry-After: 14720
Vary: Accept-Encoding,Referer
Upgrade: h2
Connection: Upgrade, Keep-Alive
Last-Modified: Sun, 28 Apr 2024 11:05:28 GMT
Accept-Ranges: bytes
Content-Length: 925
X-Frame-Options: DENY
Keep-Alive: timeout=5, max=100
Content-Type: text/html

And others (406):

HTTP/1.1 406 Not Acceptable
Date: Thu, 02 May 2024 12:32:36 GMT
Server: Apache
Vary: Accept-Encoding,Referer
Upgrade: h2
Connection: Upgrade, Keep-Alive
Last-Modified: Sun, 28 Apr 2024 11:05:22 GMT
Accept-Ranges: bytes
Content-Length: 928
X-Frame-Options: DENY
Keep-Alive: timeout=5, max=100
Content-Type: text/html

And success (200):

HTTP/1.1 200 OK
Date: Thu, 02 May 2024 12:34:15 GMT
Server: Apache
Upgrade: h2
Connection: Upgrade, Keep-Alive
Last-Modified: Wed, 01 May 2024 12:04:22 GMT
Accept-Ranges: bytes
Content-Length: 100547
Vary: Accept-Encoding,Referer
Cache-Control: max-age=14820
Expires: Thu, 02 May 2024 16:41:15 GMT
X-Frame-Options: DENY
access-control-allow-origin: *
Keep-Alive: timeout=5, max=100
Content-Type: application/rss+xml

This is not so far trimming Amazon daytime traffic, however, because the the grid and battery SoC are fine!

It also seems (though I have yet to take detailed header dumps) that some clients may be faking or messing up conditional requests, thus evading some of my controls...

I have now separated out another 429 block that rejects 50% of Refer-less unconditional requests, and removed the 50% rejection from the main 429 rule. This more aggressively rejects traffic from such as Amazon and fyyd at all times.

iTunes panic

iTunes (iTMS) really dislikes being asked to cool it! After being shut out for a while it then unconditionally demands copies of all the (immutable) cover art, often duplicates in quick succession. Apple should be embarrassed about this poor implementation, I think.

[02/May/2024:16:32:26 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 200 3571 "-" "iTMS"
[02/May/2024:16:32:26 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 200 384 "-" "iTMS"
[02/May/2024:16:32:26 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 101134 "-" "iTMS"
[02/May/2024:16:32:27 +0000] "HEAD /img/wordcloud/podcast-1.png HTTP/1.1" 200 339 "-" "iTMS"
[02/May/2024:16:32:28 +0000] "HEAD /img/meet/RCK/people-posterised-sq-800w.png HTTP/1.1" 200 340 "-" "iTMS"
[02/May/2024:16:32:28 +0000] "GET /img/meet/RCK/people-posterised-sq-800w.png HTTP/1.1" 200 105606 "-" "iTMS"
[02/May/2024:16:32:29 +0000] "HEAD /img/audio/podcast-furniture/title/diarycast-1.png HTTP/1.1" 200 338 "-" "iTMS"
[02/May/2024:16:32:29 +0000] "GET /img/audio/podcast-furniture/title/diarycast-1.png HTTP/1.1" 200 4180 "-" "iTMS"
[02/May/2024:16:32:29 +0000] "HEAD /img/OpenTRV/DORM1-wall-scene-sq-800w.jpg HTTP/1.1" 200 340 "-" "iTMS"
[02/May/2024:16:32:29 +0000] "GET /img/OpenTRV/DORM1-wall-scene-sq-800w.jpg HTTP/1.1" 200 30479 "-" "iTMS"
[02/May/2024:16:32:30 +0000] "HEAD /img/audio/podcast-furniture/title/fieldrecording-1.png HTTP/1.1" 200 338 "-" "iTMS"
[02/May/2024:16:32:30 +0000] "GET /img/audio/podcast-furniture/title/fieldrecording-1.png HTTP/1.1" 200 5207 "-" "iTMS"
[02/May/2024:16:32:30 +0000] "GET /img/audio/podcast-furniture/title/fieldrecording-1.png HTTP/1.1" 200 5207 "-" "iTMS"
[02/May/2024:16:32:30 +0000] "GET /img/OpenTRV/DORM1-wall-scene-sq-800w.jpg HTTP/1.1" 200 30479 "-" "iTMS"
[02/May/2024:16:32:30 +0000] "HEAD /img/16WW/mkaudio/PV-generation-63M-Audacity-screenshot-2-sq-300w.png HTTP/1.1" 200 338 "-" "iTMS"
[02/May/2024:16:32:31 +0000] "GET /img/16WW/mkaudio/PV-generation-63M-Audacity-screenshot-2-sq-300w.png HTTP/1.1" 200 1963 "-" "iTMS"
[02/May/2024:16:32:31 +0000] "HEAD /img/16WW/measuring-mains-water-temperature-at-kitchen-sink-with-N84FR-1800w.jpg HTTP/1.1" 200 340 "-" "iTMS"
[02/May/2024:16:32:31 +0000] "GET /img/audio/podcast-furniture/title/diarycast-1.png HTTP/1.1" 200 4180 "-" "iTMS"
[02/May/2024:16:32:31 +0000] "GET /img/audio/podcast-furniture/title/fieldrecording-1.png HTTP/1.1" 200 5207 "-" "iTMS"
[02/May/2024:16:32:31 +0000] "GET /img/audio/podcast-furniture/title/fieldrecording-1.png HTTP/1.1" 200 5207 "-" "iTMS"
[02/May/2024:16:32:32 +0000] "HEAD /img/3rdParty/TTK-logo-sq-300w.jpg HTTP/1.1" 200 340 "-" "iTMS"
[02/May/2024:16:32:32 +0000] "GET /img/3rdParty/TTK-logo-sq-300w.jpg HTTP/1.1" 200 54330 "-" "iTMS"
[02/May/2024:16:32:32 +0000] "HEAD /img/storage/storage-sq-640w.png HTTP/1.1" 200 338 "-" "iTMS"
[02/May/2024:16:32:32 +0000] "GET /img/storage/storage-sq-640w.png HTTP/1.1" 200 3103 "-" "iTMS"
...

I have Amazon Music Podcast pull down (equally badly) a full set of cover art images...

Cold feet

After an overnight run I got cold feet about this 429 rule and I disabled it, bumping up the 406 rejection percentage a little instead.

I reinstated a milder version of this random rejection rule, so that it only happens in skipHours to unconditional no-Referer requests, and it is now one of the OR-ed subconditions that allows a 429:

  • no User-Agent
  • no gzip compression allowed by Accept-Encoding
  • ~50% random chance
  • battery low
  • grid intensity high

I cranked up a separate streamlined version of this 429 rule to apply at any time other than noon (noon UTC is a hole in the defences to match 406), and so the skipHours-only version can lose its random part.

The mixed 4xx status messages to clients, and occasional wrong delivery to potential new listeners is sub-optimal. Though it does correctly hint that there is more than one thing wrong with the effected requests.

The 429s should slow Amazon down a bit, immediately.

2024-05-01: Random Early Drop 406

For those RSS feed clients that do not allow compression, outside skipHours times I now aim to (kinda) Random Early Drop (RED) a high enough fraction of requests to roughly compensate for the wasted bandwidth in those allowed, ie about 75% or above (typically over 4x compression with gzip).

Effective randomness is achieved by matching a single digit of the MD5 hash of the full request date/time. This fairly cheap single line in the 406 rule can be tweaked in various ways if needed.

RewriteCond expr "'%{md5:%{TIME}}' =~ /^[^abc]/" [OR]

For example, successively running wget --debug -S -O out https://www.earth.org.uk/rss/podcast.rss, which does not offer/allow gzip compression, gives:

[02/May/2024:08:03:34 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 104560 "-" "Wget/1.24.5"
[02/May/2024:08:03:52 +0000] "GET /rss/podcast.rss HTTP/1.1" 406 4635 "-" "Wget/1.24.5"
[02/May/2024:08:03:55 +0000] "GET /rss/podcast.rss HTTP/1.1" 406 4635 "-" "Wget/1.24.5"
[02/May/2024:08:03:58 +0000] "GET /rss/podcast.rss HTTP/1.1" 406 4635 "-" "Wget/1.24.5"
[02/May/2024:08:04:00 +0000] "GET /rss/podcast.rss HTTP/1.1" 406 4635 "-" "Wget/1.24.5"
[02/May/2024:08:04:01 +0000] "GET /rss/podcast.rss HTTP/1.1" 406 4635 "-" "Wget/1.24.5"
[02/May/2024:08:04:03 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 104560 "-" "Wget/1.24.5"
[02/May/2024:08:04:13 +0000] "GET /rss/podcast.rss HTTP/1.1" 406 4635 "-" "Wget/1.24.5"
[02/May/2024:08:04:14 +0000] "GET /rss/podcast.rss HTTP/1.1" 406 4635 "-" "Wget/1.24.5"

More pertinently on a sample of Apple iTunes requests:

[02/May/2024:08:22:38 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 3452 "-" "iTMS"
[02/May/2024:08:22:38 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 265 "-" "iTMS"
[02/May/2024:08:37:47 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 3452 "-" "iTMS"
[02/May/2024:08:37:48 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 265 "-" "iTMS"
[02/May/2024:08:51:40 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 3452 "-" "iTMS"
[02/May/2024:08:51:41 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 265 "-" "iTMS"
[02/May/2024:09:04:25 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 200 3571 "-" "iTMS"
[02/May/2024:09:04:25 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 200 384 "-" "iTMS"
[02/May/2024:09:04:25 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 101134 "-" "iTMS"
[02/May/2024:09:15:59 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 3452 "-" "iTMS"
[02/May/2024:09:15:59 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 265 "-" "iTMS"