Earth Notes: On Website Technicals (2024-05)
Updated 2024-06-23.
2024-05-26: CORS Redux
I was not getting the right CORS-friendly header set for RSS feed files. I guessed that was because the Apache FilesMatch is based on the literal name of the file served, not the request URL, so if we serve a precompressed file its extension would not match. Fixed, maybe:
# Allow CORS to work for RSS and Atom feeds and transcripts.
# This allows for precompressed versions also.
# This allows browsers to access them from non-EOU pages.
<IfModule mod_headers.c>
<FilesMatch "\.(rss|vtt|atom|xml)(gz|br)?$">
Header set access-control-allow-origin *
# Remove unhelpful header to save bandwidth.
Header unset Accept-Ranges
</FilesMatch>
</IfModule>
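A quick spot check that the header now arrives even when a precompressed variant is what actually gets served might look like this (a hypothetical check, not part of the site tooling, and the feed's anti-bot defences described later on this page may answer with 429/406 instead):

# Offer Brotli so that the precompressed .rssbr variant is likely to be served,
# then look for the CORS header in the response.
curl -sI -H 'Accept-Encoding: br' https://www.earth.org.uk/rss/podcast.rss | grep -i '^access-control-allow-origin'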
2024-05-24: Googlebot Behaving Badly
...
www.earth.org.uk:443 66.249.65.164 - - [24/May/2024:06:49:40 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 4168 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
www.earth.org.uk:443 66.249.65.164 - - [24/May/2024:06:51:07 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 4168 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
www.earth.org.uk:443 66.249.65.164 - - [24/May/2024:06:52:08 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 4168 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
www.earth.org.uk:443 66.249.65.166 - - [24/May/2024:06:53:08 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 4168 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
www.earth.org.uk:443 66.249.65.166 - - [24/May/2024:06:54:09 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 4168 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
www.earth.org.uk:443 66.249.65.166 - - [24/May/2024:06:56:27 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 4168 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
www.earth.org.uk:443 66.249.65.164 - - [24/May/2024:06:57:27 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 4168 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
www.earth.org.uk:443 66.249.65.164 - - [24/May/2024:06:58:28 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 4168 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
www.earth.org.uk:443 66.249.65.166 - - [24/May/2024:06:59:29 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 4168 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
...
www.earth.org.uk:443 66.249.70.1 - - [24/May/2024:10:12:48 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 4168 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
www.earth.org.uk:443 66.249.70.8 - - [24/May/2024:10:13:49 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 13322 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
www.earth.org.uk:443 66.249.70.8 - - [24/May/2024:10:17:22 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 4168 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
www.earth.org.uk:443 66.249.70.2 - - [24/May/2024:10:18:22 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 4168 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
www.earth.org.uk:443 66.249.70.2 - - [24/May/2024:10:19:23 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 13322 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
...
More than 200 such requests before 08:00Z today! (~270 by 10:20Z, ~400 by 15:20Z, 814 by end of day, and only 439 other fetches (194 from EOU) in the day (bye-bye crawl budget!); still grinding away and over 300 by 06:30Z the next morning...) That will get it into the greedy bots list for maybe a week or two!
Yep, there it is at number 7:
# Built: 2024-05-24T09:40+00:00
# MAXHITSPERUAPERDAY: 25
# MAXUAS: 10
#----------------
# request-count User-Agent
# MD5hash approx-hits-per-day
# 201 iTMS
97f76eb7e02c5ff923e1198ff1c288cd 201
# 190 Spotify/1.0
4582d9bdbcef42af27d89da91c6eb804 190
...
# 34 Mozilla/5.0 (comp
af26a291b02611b619f29585bbb19ac9 34
...
I have filed feedback via the GSC. I have also DMed a couple of Google people via Mastodon, and had a long chat in DMs!
By the next day's list, Googlebot was in 4th place, above Amazon!
# Greedy podcast feed pullers: keys are MD5 hashed User-Agent.
# A non-empty txt map lookup of the %{md5:%{HTTP:User-Agent}} means bad!
# Built: 2024-05-25T09:49+00:00
# MAXHITSPERUAPERDAY: 25
# MAXUAS: 10
#----------------
# request-count User-Agent
# MD5hash approx-hits-per-day
# 201 iTMS
97f76eb7e02c5ff923e1198ff1c288cd 201
# 191 Spotify/1.0
4582d9bdbcef42af27d89da91c6eb804 191
# 180 Podbean/FeedUpda
54e0e9df937b06cc83fab29f44c02b7f 180
# 111 Mozilla/5.0 (com
af26a291b02611b619f29585bbb19ac9 111
# 93 Amazon Music Podc
d69be2563c9f1929edf2906d41809aea 93
# 51 -
d41d8cd98f00b204e9800998ecf8427e 51
# 47 Mozilla/5.0 (Linu
e30b15ba56c2a3b272646f0a386feb95 47
# 29 Mozilla/5.0 (Wind
6b9a00393fb1607b0ada13520f814ab5 29
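The map keys are just the MD5 of the raw User-Agent value, so the '-' (absent UA) entry is the well-known empty-string digest. A key can be reproduced at the shell, assuming GNU coreutils md5sum:

# No trailing newline, or the hash will not match.
printf '%s' 'iTMS' | md5sum
# Should reproduce the iTMS key shown above: 97f76eb7e02c5ff923e1198ff1c288cd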
2024-05-26: still flailing!
Podcast RSS feed poll hits today/recent (26/May/2024) at 2024-05-26T08:42Z.
Count | Partial User-Agent |
---|---|
422 | Mozilla/5.0 (comp |
88 | iTMS |
75 | Spotify/1.0 |
37 | Gofeed/1.0 |
34 | - |
28 | Podbean/FeedUpdat |
19 | Amazon Music Podc |
14 | Mozilla/5.0 (Wind |
9 | Overcast/1.0 Podc |
9 | fyyd-poll-1/0.5 |
Googlebot is now in top spot and continuing to retry roughly every minute...
2024-05-23: RSS Time Series HTML
I have started writing scripts to generate HTML time-series analysis of the RSS data.
I added a new simple measure of the interval in days covered by each block of data, which tends to flip between 6 and 8 days even though it is nominally 7. This makes normalisation easier. I also hand-crafted, for older data blocks, some of the subsidiary data extract files that newer runs of the extraction code generate, to keep the time-series analysis scripts simple.
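The day count is just date arithmetic over each block's interval.txt (its layout is visible in the stats output further down this page); a minimal sketch, assuming GNU date and rounding to the nearest whole day:

# Days covered by one stats block, from the first line of its interval.txt.
read -r START _ END _ < 20240519/interval.txt
echo $(( ( $(date -d "$END" +%s) - $(date -d "$START" +%s) + 43200 ) / 86400 ))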
Here is a sample table:
Interval ending | Days | Feed hits/day | Feed MBytes/day | % of EOU site hits | % of EOU site bytes | Note |
---|---|---|---|---|---|---|
2024-04-01T06:25Z | 8 | 1077 | 16.8 | 7.5 | 1.5 | |
2024-04-15T06:25Z | 8 | 1205 | 21.9 | 8.3 | 2.3 | |
2024-04-21T06:25Z | 6 | 1027 | 11.6 | 4.2 | 0.9 | |
2024-04-29T06:25Z | 8 | 1204 | 12.8 | 7.6 | 0.7 | 2024-04-23: added Spotify with lite feed |
2024-05-05T06:25Z | 6 | 1285 | 12.6 | 8.9 | 1.4 | |
2024-05-13T06:24Z | 8 | 1070 | 7.1 | 6.4 | 0.8 | |
2024-05-19T06:25Z | 6 | 1144 | 6.1 | 7.5 | 0.6 |
With efficient clients there would be about 1 hit per day per distinct client, thus ~20 hits/day, at ~10kB/hit on a day when the feed changes (and much less otherwise): so ~20 hits and ~0.2MB per day is what should appear here.
2024-05-20: Faster Rate Limiting
I have sped up the 'slow' rate limiting to potentially update each day (when the sun is out), with a nominal limit for each User-Agent of one hit per hour. A bit tighter than before, thus catching a couple of extra User-Agents already. Still loose enough to allow the conventional 1-poll-per-hour mob through, or what should correspond to several humans behind any one client User-Agent that respects Cache-Control max-age and/or skipHours, etc...
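The daily pass itself is not shown here, but a minimal sketch of the sort of thing implied, assuming combined-format logs in access.log, GNU coreutils, and the MAXHITSPERUAPERDAY threshold used in the greedy-bots maps elsewhere on this page, might be:

# Sketch only: count podcast feed hits per User-Agent in the day's log and
# emit map entries (MD5(UA) and hits/day) for any UA over the daily threshold.
MAXHITSPERUAPERDAY=25
grep ' /rss/podcast' access.log |
  sed -e 's/.*" "\(.*\)"$/\1/' |
  sort | uniq -c | sort -rn |
  while read -r count ua; do
    [ "$count" -gt "$MAXHITSPERUAPERDAY" ] || continue
    hash=$(printf '%s' "$ua" | md5sum | cut -d' ' -f1)
    printf '# %s %.16s\n%s %s\n' "$count" "$ua" "$hash" "$count"
  done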
2024-05-19: Feed Diet
The logs rolled today, so here is a stats summary from this 6-day block:
% sh ./prepareStats.sh
INFO: /tmp/stats.out/interval.txt: 2024-05-13T06:25:12 to 2024-05-19T06:25:10 inclusive log data
INFO: hits: all 222476, site 92016, feed 6864
INFO: bytes: all 17563828813, site 6051922187, feed 36625901
INFO: /tmp/stats.out/allHitsByHour.log: all hits by hour (UTC)...
7541 705596235 00
7797 601362129 01
9212 848530101 02
8891 693462632 03
7931 1161997126 04
INFO: /tmp/stats.out/siteHitsByHour.log: site hits by hour (UTC)...
3019 247595118 00
2730 97503469 01
3802 213266786 02
4797 187066576 03
3305 586177279 04
INFO: /tmp/stats.out/feedHits.log: RSS feed hits...
www.earth.org.uk:80 52.39.X.X - - [13/May/2024:06:27:19 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 11588 "-" "Amazon Music Podcast"
www.earth.org.uk:80 139.177.X.X - - [13/May/2024:06:27:42 +0000] "GET /rss/podcast.rss HTTP/1.1" 304 183 "-" "Aggrivator (PodcastIndex.org)/v0.1.7"
www.earth.org.uk:80 107.23.X.X - - [13/May/2024:06:29:25 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 11588 "-" "Amazon Music Podcast"
www.earth.org.uk:443 34.78.X.X - - [13/May/2024:06:31:30 +0000] "GET /rss/podcast-lite.rss HTTP/1.1" 200 5573 "-" "Spotify/1.0"
www.earth.org.uk:80 18.185.X.X - - [13/May/2024:06:32:00 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 9561 "-" "axios/1.5.1"
INFO: /tmp/stats.out/feedHitsByUA.log: feed hits by UA...
6864 36625901 ALL
1351 7312529 "iTMS"
1234 5467259 "Spotify/1.0"
1086 2844616 "Podbean/FeedUpdate 2.1"
722 6124269 "Amazon Music Podcast"
INFO: /tmp/stats.out/feedHitsByHour.log: feed hits by hour (UTC)...
270 1145460 00
276 1089997 01
235 1024117 02
239 1199946 03
224 1067031 04
INFO: /tmp/stats.out/feedStatusByUA.log: feed hits and status by UA...
6864 36625901 200:304:406:429:SH 2078 1028 328 3391 2538 ALL
1351 7312529 200:304:406:429:SH 144 0 216 991 557 "iTMS"
1234 5467259 200:304:406:429:SH 206 202 0 826 513 "Spotify/1.0"
1086 2844616 200:304:406:429:SH 191 0 0 895 231 "Podbean/FeedUpdate 2.1"
722 6124269 200:304:406:429:SH 511 0 2 209 245 "Amazon Music Podcast"
INFO: /tmp/stats.out/feedStatusByHour.log: feed hits and status by hour (UTC)...
270 1145460 200:304:406:429:SH 67 41 0 162 270 00
276 1089997 200:304:406:429:SH 56 45 0 175 276 01
235 1024117 200:304:406:429:SH 54 33 0 146 235 02
239 1199946 200:304:406:429:SH 62 37 0 132 239 03
224 1067031 200:304:406:429:SH 62 32 0 128 224 04
A fragment of live site stats (including RSS) from yesterday morning:
Stats updated: 2024-05-18T11:32Z
Stat | Value |
---|---|
Fraction bot hits (target <0.5) | 0.575 |
Fraction GET 200s | 0.715 |
Fraction GET 206s | 0.001 |
Fraction GET 301s | 0.026 |
Fraction GET 302s | 0.061 |
Fraction GET 304s | 0.067 |
Fraction GET 400s | 0.000 |
Fraction GET 403s | 0.000 |
Fraction GET 404s | 0.021 |
Fraction GET 406s | 0.001 |
Fraction GET 410s | 0.000 |
Fraction GET 416s | 0.000 |
Fraction GET 429s | 0.017 |
Fraction GETs | 0.909 |
Fraction HEADs | 0.073 |
Fraction HTTP hits | 0.205 |
Fraction HTTPS hits | 0.795 |
Fraction human main-page GET 200s | 0.043 |
Fraction human m/(m+www) main-page GET 200s | 0.147 |
Fraction RSS podcast feed bytes RSS/www (target <0.01) | 0.007 |
Fraction RSS podcast feed hits RSS/www (target <0.045) | 0.068 |
Fraction site hits amp/(amp+www) | 0.019 |
Fraction site hits m/(m+www) | 0.068 |
Mean human main-page transfer bytes m | 21436 |
Mean human main-page transfer bytes www | 27088 |
Mean transfer bytes amp | 474 |
Mean transfer bytes m | 31961 |
Mean transfer bytes www | 59668 |
Unique days seen | 14 |
Unique hours seen | 318 |
Unique human IPs/day equivalent | 152 |
Podcast RSS feed poll hits today/recent (18/May/2024) at 2024-05-18T11:32Z.
Count | Partial User-Agent |
---|---|
112 | iTMS |
99 | Spotify/1.0 |
51 | Podbean/FeedUpdat |
23 | Amazon Music Podc |
22 | - |
20 | Mozilla/5.0 (Linu |
16 | Mozilla/5.0 (Wind |
16 | itms |
12 | fyyd-poll-1/0.5 |
11 | iVoox Global Podc |
Amazon pays some attention to 429 rejections and so has moved down the list. The rejection of almost all iTunes (iTMS) requests to compensate for the lack of compression trims bandwidth as planned. But at first blush, overall hit and byte totals are not as low as hoped. Possibly the addition of Spotify in particular has made feed dieting harder.
20240417/siteHitsByHour.log:115578 7636896047 ALL
20240421/siteHitsByHour.log:145238 7919447459 ALL
20240429/siteHitsByHour.log:125957 14722083002 ALL
20240505/siteHitsByHour.log:86713 5429515419 ALL
20240513/siteHitsByHour.log:134388 7239370999 ALL
20240519/siteHitsByHour.log:92016 6051922187 ALL
From the earlier block (8d):
% cat 20240417/interval.txt
2024-04-07T06:25:14 to 2024-04-15T06:25:10 inclusive log data
% head 20240417/feedHitsByUA.log
9643 175368483 ALL
2806 34128111 "Amazon Music Podcast"
2401 73456937 "iTMS"
542 6380012 "Podbean/FeedUpdate 2.1"
483 8501504 "-"
360 4300947 "Mozilla/5.0 (Linux;) AppleWebKit/ Chrome/ Safari - iHeartRadio"
250 3799126 "itms"
242 2815836 "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36"
196 192996 "FeedBurner/1.0 (http://www.FeedBurner.com)"
192 2258465 "fyyd-poll-1/0.5"
To this latest block (6d):
% cat 20240519/interval.txt
2024-05-13T06:25:12 to 2024-05-19T06:25:10 inclusive log data
% head 20240519/feedHitsByUA.log
6864 36625901 ALL
1351 7312529 "iTMS"
1234 5467259 "Spotify/1.0"
1086 2844616 "Podbean/FeedUpdate 2.1"
722 6124269 "Amazon Music Podcast"
317 444963 "-"
243 2528919 "Mozilla/5.0 (Linux;) AppleWebKit/ Chrome/ Safari - iHeartRadio"
188 359888 "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36"
144 1401495 "fyyd-poll-1/0.5"
128 34775 "Aggrivator (PodcastIndex.org)/v0.1.7"
Updated stats for the first hour or so since the log roll this morning:
Stats updated: 2024-05-19T07:52Z
Stat | Value |
---|---|
Fraction bot hits (target <0.5) | 0.651 |
Fraction GET 200s | 0.652 |
Fraction GET 206s | 0.001 |
Fraction GET 301s | 0.051 |
Fraction GET 302s | 0.080 |
Fraction GET 304s | 0.091 |
Fraction GET 400s | 0.000 |
Fraction GET 403s | 0.000 |
Fraction GET 404s | 0.044 |
Fraction GET 406s | 0.001 |
Fraction GET 410s | 0.000 |
Fraction GET 429s | 0.022 |
Fraction GETs | 0.943 |
Fraction HEADs | 0.030 |
Fraction HTTP hits | 0.238 |
Fraction HTTPS hits | 0.762 |
Fraction human main-page GET 200s | 0.044 |
Fraction human m/(m+www) main-page GET 200s | 0.087 |
Fraction RSS podcast feed bytes RSS/www (target <0.01) | 0.006 |
Fraction RSS podcast feed hits RSS/www (target <0.045) | 0.074 |
Fraction site hits amp/(amp+www) | 0.029 |
Fraction site hits m/(m+www) | 0.073 |
Mean human main-page transfer bytes m | 15754 |
Mean human main-page transfer bytes www | 32294 |
Mean transfer bytes amp | 776 |
Mean transfer bytes m | 29038 |
Mean transfer bytes www | 64480 |
Unique days seen | 7 |
Unique hours seen | 146 |
Unique human IPs/day equivalent | 131 |
Podcast RSS feed poll hits today/recent (19/May/2024) at 2024-05-19T07:52Z.
Count | Partial User-Agent |
---|---|
13 | Spotify/1.0 |
12 | iTMS |
6 | Mozilla/5.0 (Linux |
6 | - |
4 | Podbean/FeedUpdate |
3 | Amazon Music Podca |
2 | Mozilla/5.0 (Macin |
2 | itms |
2 | gPodder/3.11.1 (+h |
2 | Aggrivator (Podcas |
Note that there have been no new podcast episodes since , and no substantive metadata updates recently either.
These heavy defences, erected as some feed users will not do the right thing, result in bystander casualties. For example, about 10% of distinct User-Agents failed to make any successful (200 or 304) fetches at all, such as these (top 3):
29 47125 200:304:406:429:SH 0 0 0 0 16 "Mozilla/5.0 (compatible; YaK/1.0; http://linkfluence.com/; bot@linkfluence.com)"
8 9886 200:304:406:429:SH 0 0 3 5 4 "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) Gecko/20100101 Firefox/125.0"
8 18732 200:304:406:429:SH 0 0 1 7 6 "Akkoma 3.11.0-0-ga03f3a9; https://social.wedistribute.org <hello@wedistribute.org>, Akkoma 3.11.0-0-ga03f3a9; https://social.wedistribute.org <hello@wedistribute.org>; Bot"
Most of these rejections are accesses in skipHours.
A new episode was published around 4pm.
2024-05-17: Nice I/O
The main server (sencha) is somewhat I/O constrained for its filesystem on microSD card. Some actions, such as a sweeping check/fix of permissions in the Web tree, are a little intense and slow other activity down. Until now! The new command as run from cron is in part:
nice ionice -c idle make perms
which in theory gets out of the way of most other CPU and I/O demand! It at least does not seem to break anything...
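For context, the cron entry might look something like this (the schedule and working directory here are illustrative, not the real ones):

# Hypothetical crontab line: run the permissions sweep nightly, politely.
17 3 * * * cd /path/to/EOU/site && nice ionice -c idle make perms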
Given that I am using the mq-deadline scheduler, ionice probably is not doing anything:
% mount
/dev/mmcblk0p2 on / type ext4 (rw,noatime,commit=300)
...
% grep "" /sys/block/*/queue/scheduler
...
/sys/block/loop7/queue/scheduler:[mq-deadline] kyber none
/sys/block/mmcblk0/queue/scheduler:[mq-deadline] kyber none
/sys/block/ram0/queue/scheduler:none
...
2024-05-16: ntp1 is Dead, Long Live ntp2!
I continue to get a huge amount of NTP traffic, ie many hits per second, though my server has not been stratum-1 for years. I suspect that my server name has been baked into software in routers, etc. So I have removed ntp1.exnet.com from DNS to see if that reduces the flow a little. I may rename ntp.exnet.com too, if the deluge continues!
A tiny snippet from tcpdump:
09:25:06.712266 IP XXXX > 79.135.97.79.123: NTPv4, Client, length 48
09:25:06.727073 IP XXXX > 79.135.97.78.123: NTPv4, Client, length 48
09:25:06.729648 IP XXXX > 79.135.97.78.123: NTPv3, Client, length 48
09:25:06.730250 IP 79.135.97.78.123 > XXXX NTPv3, Server, length 48
09:25:06.734552 IP XXXX > 79.135.97.79.123: NTPv4, Client, length 48
09:25:06.750494 IP XXXX > 79.135.97.78.123: NTPv4, Client, length 48
09:25:06.750879 IP 79.135.97.78.123 > XXXX NTPv4, Server, length 48
09:25:06.797435 IP XXXX > 79.135.97.79.123: NTPv4, Client, length 48
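Captured with something along these lines (the interface name is an assumption):

# Watch NTP traffic in both directions; -n avoids reverse DNS lookups.
tcpdump -n -i eth0 udp port 123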
2024-06-23: maybe a little quieter
Traffic may have dropped a little, but is still at several packets per second. So today ntp.exnet.com is going away, just leaving ntp2.
There are likely many long-running processes and devices that have already internalised the IP, so I am also considering changing the IP address to shake off that traffic.
2024-05-13: Slow Rate Limiting
For a slow-update podcast feed such as mine, polling maybe once or twice per day is reasonable. Polling once per hour is customary, though still far too much. Polling much more than once per hour is foolish. One poll per hour for a week is 168 polls (8 days' worth is 192). It would be relatively easy to automate an analysis of the previous week's feed polls, and then, outside the magic noon free-for-all window and regardless of (say) battery level, block with 429s all or nearly all requests from bots that made more than ~200 requests and/or the top handful by request count or bytes. (Rejections could still be restricted to unconditional requests...)
For this last week that would have included:
1711 11259905 200:304:406:429:SH 242 0 567 902 696 "iTMS"
1647 8372482 200:304:406:429:SH 835 271 0 541 686 "Spotify/1.0"
1232 11778050 200:304:406:429:SH 997 0 8 227 393 "Amazon Music Podcast"
752 6162639 200:304:406:429:SH 524 0 0 228 165 "Podbean/FeedUpdate 2.1"
474 612922 200:304:406:429:SH 20 0 225 229 176 "-"
237 369734 200:304:406:429:SH 27 149 0 61 100 "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ..."
219 1846087 200:304:406:429:SH 154 0 0 65 88 "Mozilla/5.0 (Linux;) AppleWebKit/ Chrome/ Safari - iHeartRadio"
A txt RewriteMap keyed on an escaped or hashed User-Agent might be enough, with only the very greedy bots being listed in the map at all.
I suspect that iTunes for one might throw a fit if reined in that hard.
There are complications around multiple independent users of one feed reader, polling of more than one feed (eg the podcast feed has two variants), polling of a feed in more than one step (eg HEAD then GET), spoofing (I do not believe that the first Mozilla User-Agent I have truncated above is an authentic browser), and single User-Agents with many random-ish cloud IP addresses. So maybe I should manage the map manually initially. Or set the threshold significantly higher and keep an eye on behaviour.
(My default cache expiry is a little over 4h, so any one client simply obeying that would only be making ~42 polls per week.)
When rejecting one of these with 429 it may be better to have any random rejection element varying slowly, eg by hour, so that naughty quick retries will not work. Then no goes on meaning no.
2024-05-14: First pass
Here is a first pass map file, with UAs shown truncated to reduce the chance of disclosing any PII:
# Greedy podcast feed pullers: keys are MD5 hashed User-Agent.
# A non-empty txt map lookup of the %{md5:%{HTTP:User-Agent}} means bad!
# Built: 2024-05-14T08:53+00:00
# MAXHITS: 400
# MAXUAS: 10
#----------------
# 1711 iTMS
97f76eb7e02c5ff923e1198ff1c288cd 4
# 1647 Spotify/1.0
4582d9bdbcef42af27d89da91c6eb804 4
# 1232 Amazon Music Po
d69be2563c9f1929edf2906d41809aea 3
# 752 Podbean/FeedUpda
54e0e9df937b06cc83fab29f44c02b7f 1
# 474 -
d41d8cd98f00b204e9800998ecf8427e 1
And here is sample Apache configuration to enforce it:
# Aggressively reject unconditional greedy podcast bots, other than noon.
RewriteMap greedyRSSBotMap "txt:/path/to/rss/greedy-podcast-bot-map.txt"
RewriteCond "%{TIME_HOUR}" "!=12"
RewriteCond %{HTTP_REFERER} ^$
# DHD20240512: using bogus very old IMS date is rude!
RewriteCond %{HTTP:If-Modified-Since} " 202[0123] " [NV,OR]
RewriteCond %{HTTP:If-Modified-Since} ^$ [NV]
#RewriteCond %{HTTP:If-None-Match} ^$ [NV]
# Reject almost all for random entire hours, to defeat quick retries.
RewriteCond expr "'%{md5:%{TIME_DAY}%{TIME_HOUR}}' =~ /^[^01]/"
# Have any interaction with the filesystem as late as possible.
# Bot's UA hash is in the greedyRSSBotMap.
RewriteCond "${greedyRSSBotMap:%{md5:%{HTTP:User-Agent}}}" !^$ [NV]
RewriteRule "^/rss/podcast.*\.rss$" - [L,R=429,E=RSS_RATE_LIMIT:1]
This does not work because it seems that only constants and some variables (eg %{HTTP:User-Agent}) but not environment variables can appear as the key for the map. So it is all commented out again for now. I could possibly do something based on masked IP, which should be less easily spoofable, but would not catch cloud bots.
Here is a rather more brutal manual version:
# Aggressively reject unconditional greedy podcast bots, other than noon.
RewriteCond "%{TIME_HOUR}" "!=12"
RewriteCond %{HTTP_REFERER} ^$
# DHD20240512: using bogus very old IMS date is rude!
RewriteCond %{HTTP:If-Modified-Since} " 202[0123] " [NV,OR]
RewriteCond %{HTTP:If-Modified-Since} ^$ [NV]
#RewriteCond %{HTTP:If-None-Match} ^$ [NV]
## Reject almost all for random entire hours, to defeat quick retries.
#RewriteCond expr "'%{md5:%{TIME_DAY}%{TIME_HOUR}}' =~ /^[^01]/"
# Have any interaction with the filesystem as late as possible.
# DHD20240514: Bad bot manually selected by UA.
RewriteCond "%{HTTP:User-Agent}" "^iTMS$" [NV,OR]
RewriteCond "%{HTTP:User-Agent}" "^Spotify/1.0$" [NV,OR]
RewriteCond "%{HTTP:User-Agent}" "^Amazon Music Podcast$" [NV,OR]
RewriteCond "%{HTTP:User-Agent}" "^Podbean/FeedUpdate 2.1$" [NV,OR]
RewriteCond "%{HTTP:User-Agent}" ^$ [NV]
RewriteRule "^/rss/podcast.*\.rss$" - [L,R=429,E=RSS_RATE_LIMIT:2]
This could be adjusted to allow all requests when battery is VHIGH, and/or a small fraction of requests as commented out in the above.
Here is a better dynamic version that only needs hash.flag files set to reject access, and is nice when VHIGH:
# Aggressively reject unconditional greedy podcast bots, other than noon.
RewriteCond "%{TIME_HOUR}" "!=12"
RewriteCond %{HTTP_REFERER} ^$
# DHD20240512: using bogus very old IMS date is rude!
RewriteCond %{HTTP:If-Modified-Since} " 202[0123] " [NV,OR]
RewriteCond %{HTTP:If-Modified-Since} ^$ [NV]
#RewriteCond %{HTTP:If-None-Match} ^$ [NV]
## Reject almost all random entire hours, to defeat naughty quick retries.
#RewriteCond expr "'%{md5:%{TIME_DAY}%{TIME_HOUR}}' =~ /^[^01]/"
# DHD20240514: greedy/bad bots manually selected by UA, or no UA.
#RewriteCond "%{HTTP:User-Agent}" "^iTMS$" [NV,OR]
#RewriteCond "%{HTTP:User-Agent}" "^Spotify/1.0$" [NV,OR]
#RewriteCond "%{HTTP:User-Agent}" "^Amazon Music Podcast$" [NV,OR]
#RewriteCond "%{HTTP:User-Agent}" "^Podbean/FeedUpdate 2.1$" [NV,OR]
RewriteCond "%{HTTP:User-Agent}" ^$ [NV,OR]
# Have any interaction with the filesystem as late as possible.
# Bot's hashed UA appears in flags dir?
RewriteCond expr "-f '%{DOCUMENT_ROOT}/rss/greedybot/%{md5:%{HTTP:User-Agent}}.flag'" [NV]
# Relent and allow request if battery SoC is (very) high.
RewriteCond /run/EXTERNAL_BATTERY_VHIGH.flag !-f
RewriteRule "^/rss/podcast.*\.rss$" - [L,R=429,E=RSS_RATE_LIMIT:2]
Note that this only blocks requests for the podcast RSS feeds, and the request count used to drive this only looks at requests for them.
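The hash.flag files themselves might be dropped in place by the daily analysis with something like this sketch (the document root path is an assumption; the UA is one of those listed above):

# Sketch: flag a greedy bot by User-Agent so that the rule above starts rejecting it.
UA='Podbean/FeedUpdate 2.1'
DOCROOT=/path/to/document/root
touch "$DOCROOT/rss/greedybot/$(printf '%s' "$UA" | md5sum | cut -d' ' -f1).flag"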
2024-05-12: If-Modified-Since Regex Error
I think that my regex to try to validate If-Modified-Since is broken, so I am reverting to checking that it is just non-empty for now. Looking at the forensics logs, I have seen essentially none that were badly syntactically broken, so I should cut the complexity.
Also, it seems that Apache 2.4 uses If-Modified-Since for an inequality check, not just as a validator or exact match as one might expect, which is good.
Instead, as a temporary measure, I am treating bogus very old If-Modified-Since dates as if they were in fact absent for the purposes of random 429 rejections. This is temporary since it might be hard to keep when/if ETags are back, and it would also need manual updating to keep matching what counts as too-old-to-be-real. This mainly affects one not-too-greedy client, though it is drafted more generally.
More bugs from complexity
Whoops! My attempt to serve lite HTML pages to clients that would not accept compression went wrong, and was preventing me from sending precompressed HTML content to some specialised clients, eg those that only offer br (this tester for example). A fix to make it more robust is:
- # No precompression usable, or client does not support even gzip compression.
- RewriteCond %{HTTP:Accept-Encoding} !gzip [OR]
+ # No precompression usable, or client does not support gzip or br compression.
+ RewriteCond %{HTTP:Accept-Encoding} !((gzip)|(br)) [OR]
RewriteCond %{HTTP:Sec-CH-UA-Mobile} "[?]1" [OR]
RewriteCond %{HTTP:Save-Data} on [NC]
RewriteCond %{DOCUMENT_ROOT}/m%{REQUEST_FILENAME} -s
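A quick spot check of the fix from the command line, offering only Brotli, should now show the precompressed variant being served (a hypothetical check, not the tester itself):

# Expect Content-Encoding: br rather than a fallback to the lite/identity page.
curl -sI -H 'Accept-Encoding: br' https://www.earth.org.uk/ | grep -i '^content-encoding'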
The tester still gives misleading results from being sent a lite page where it offers no compression method at all for a desktop page, but this is a very special case. Almost no real client will offer just br, or indeed any single method other than gzip or identity.
GET / HTTP/2.0|Accept:*/*|User-Agent:Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.3497.100 Safari/537.36|Host:www.earth.org.uk
GET / HTTP/2.0|Accept:*/*|User-Agent:Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.3497.100 Safari/537.36|Accept-Encoding:gzip|Host:www.earth.org.uk
GET / HTTP/2.0|Accept:*/*|User-Agent:Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.3497.100 Safari/537.36|Accept-Encoding:br|Host:www.earth.org.uk
GET / HTTP/2.0|Accept:*/*|User-Agent:Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.3497.100 Safari/537.36|Accept-Encoding:zstd|Host:www.earth.org.uk
This tester also suggests that there is no point in using zstd for static compression, and little for dynamic over gzip -6, at least in terms of bandwidth and CPU at my end.
Compressible extensions
I have also been extending the list of file/URL extensions that should get on-the-fly gzip if not already covered by something else:
AddOutputFilter DEFLATE bib
AddOutputFilter DEFLATE csv
AddOutputFilter DEFLATE dat
AddOutputFilter DEFLATE log
AddOutputFilter DEFLATE md
AddOutputFilter DEFLATE mid
AddOutputFilter DEFLATE rss
AddOutputFilter DEFLATE sh
AddOutputFilter DEFLATE svg
AddOutputFilter DEFLATE tsv
AddOutputFilter DEFLATE txt
AddOutputFilter DEFLATE vtt
AddOutputFilter DEFLATE xml
This should catch a few more stragglers.
Apple, what are you doing?
This sort of nonsense in my logs annoys me, suggesting that some Apple engineer wants to pretend that HTTP is a generic RPC mechanism and to ignore most of HTTP's conventions and etiquette, thus wasting lots of CPU and bandwidth (edited Apache mod_log_forensic extract):
HEAD /rss/podcast.rss HTTP/1.1|User-Agent:iTMS|Host:www.earth.org.uk|grpc-timeout:30000m
HEAD /rss/podcast.rss HTTP/1.1|User-Agent:iTMS|Host:www.earth.org.uk|grpc-timeout:30000m
GET /rss/podcast.rss HTTP/1.1|User-Agent:iTMS|Host:www.earth.org.uk|grpc-timeout:30000m
HEAD /img/wordcloud/podcast-1.png HTTP/1.1|User-Agent:iTMS|Host:www.earth.org.uk|grpc-timeout:30000m
HEAD /img/meet/RCK/people-posterised-sq-800w.png HTTP/1.1|User-Agent:iTMS|Host:www.earth.org.uk|grpc-timeout:30000m
GET /img/meet/RCK/people-posterised-sq-800w.png HTTP/1.1|User-Agent:iTMS|Host:www.earth.org.uk|grpc-timeout:60000m
HEAD /img/audio/podcast-furniture/title/diarycast-1.png HTTP/1.1|User-Agent:iTMS|Host:www.earth.org.uk|grpc-timeout:30000m
GET /img/audio/podcast-furniture/title/diarycast-1.png HTTP/1.1|User-Agent:iTMS|Host:www.earth.org.uk|grpc-timeout:60000m
HEAD /img/OpenTRV/DORM1-wall-scene-sq-800w.jpg HTTP/1.1|User-Agent:iTMS|Host:www.earth.org.uk|grpc-timeout:30000m
GET /img/OpenTRV/DORM1-wall-scene-sq-800w.jpg HTTP/1.1|User-Agent:iTMS|Host:www.earth.org.uk|grpc-timeout:60000m
HEAD /img/audio/podcast-furniture/title/fieldrecording-1.png HTTP/1.1|User-Agent:iTMS|Host:www.earth.org.uk|grpc-timeout:30000m
GET /img/audio/podcast-furniture/title/fieldrecording-1.png HTTP/1.1|User-Agent:iTMS|Host:www.earth.org.uk|grpc-timeout:60000m
...
Mastodon not allowing Brotli
One small improvement, which I reported as an issue, that could reduce the impact of Mastodon preview stampedes would be allowing Brotli compression of the HTML, which would reduce data volume a little. I was told that it did allow it. Looking at a couple of forensics log entries suggests not, even for fairly recent versions:
GET / HTTP/1.1|User-Agent:http.rb/5.1.1 (Mastodon/4.2.8; +https%3a//0rb.it/)|Host:www.earth.org.uk|Date:Sun, 12 May 2024 13%3a34%3a08 GMT|Accept-Encoding:gzip|Accept:text/html|Connection:close
GET / HTTP/1.1|User-Agent:http.rb/5.1.1 (Mastodon/4.2.7; +https%3a//m.trisweb.com/)|Host:www.earth.org.uk|Date:Sun, 12 May 2024 13%3a34%3a17 GMT|Accept-Encoding:gzip|Accept:text/html|Connection:close
Non-skipHours 429s
The non-skipHours random-fraction 429 rejections will now only happen during some sort of resource constraint, such as high grid intensity or low battery. So now:
- during skipHours, no-Referer unconditional requests, when grid intensity is high or battery is low or anything else is less than squeaky clean, may be rejected with 429
- other than during noon UTC, no-Referer unconditional requests, when grid intensity is high or battery is low, will have a small fraction randomly rejected with 429
- other than during noon UTC, no-Referer unconditional incompressible requests will almost all be rejected with 406 to reduce bandwidth to closer to what allowing compression would have yielded
The success Cache-Control max-age and the failure 429 Retry-After vary by time of day to try to steer hits away from skipHours, and are always many hours.
2024-05-11: Liter Podcast Feed
Given that the result of my pointing out to Spotify that its If-Modified-Since header was slightly broken seems to have been its removal entirely, preventing 304 results, I have reduced the maximum number of entries in the podcast 'lite' feed to 5. That reduces wasted bandwidth a little.
100547 rss/podcast.rss
11035 rss/podcast.rssgz
9097 rss/podcast.rssbr
5491 rss/podcast-lite.rss
1745 rss/podcast-lite.rssgz
1461 rss/podcast-lite.rssbr
Spotify seems to have reinstated conditional GETs (it is getting 304 responses again) but the deed is done.
I am capturing a little more forensics to see what is happening. Yes, it seems to be back, though the date is now double-digit anyway.
GET /rss/podcast-lite.rss HTTP/1.1|User-Agent:Spotify/1.0|If-Modified-Since:Sat, 11 May 2024 18%3a06%3a34 GMT|Accept-Encoding:gzip, x-gzip, deflate|Host:www.earth.org.uk|Connection:keep-alive
2024-05-12: If-Modified-Since stopped again
For no obvious reason, this afternoon:
GET /rss/podcast-lite.rss HTTP/1.1|User-Agent:Spotify/1.0|If-Modified-Since:Sat, 11 May 2024 18%3a06%3a34 GMT|Accept-Encoding:gzip, x-gzip, deflate|Host:www.earth.org.uk|Connection:keep-alive
GET /rss/podcast-lite.rss HTTP/1.1|User-Agent:Spotify/1.0|If-Modified-Since:Sat, 11 May 2024 18%3a06%3a34 GMT|Accept-Encoding:gzip, x-gzip, deflate|Host:www.earth.org.uk|Connection:keep-alive
GET /rss/podcast-lite.rss HTTP/1.1|User-Agent:Spotify/1.0|Accept-Encoding:gzip, x-gzip, deflate|Host:www.earth.org.uk|Connection:keep-alive
GET /rss/podcast-lite.rss HTTP/1.1|User-Agent:Spotify/1.0|Accept-Encoding:gzip, x-gzip, deflate|Host:www.earth.org.uk|Connection:keep-alive
2024-05-10: Date Regex Relax
I have relaxed the regexes checking If-Modified-Since to accept 7 May as well as 07 May (see earlier) as a benign deviation from the spec that will allow a few more 304 responses.
# The official HTTP/1.1 (and newer) date format is fixed length of the form:
# Tue, 07 May 2024 09:46:11 GMT
# and can be matched by the following:
#RewriteCond %{HTTP:If-Modified-Since} !"^(Mon|Tue|Wed|Thu|Fri|Sat|Sun),\ ([0-3][0-9])\ (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\ (20[0-9][0-9])\ ([01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9]\ GMT$" [NV]
# In practice a few otherwise reasonably well-behaved clients drop
# any leading zero on the date, which Apache seems to be able to parse, eg:
# Tue, 7 May 2024 09:46:11 GMT
# This should be able to be matched by the following:
#RewriteCond %{HTTP:If-Modified-Since} !"^(Mon|Tue|Wed|Thu|Fri|Sat|Sun),\ ([0-3]?[0-9])\ (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\ (20[0-9][0-9])\ ([01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9]\ GMT$" [NV]
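As a quick sanity check of the relaxed pattern outside Apache (grep -E standing in for Apache's regex engine, with the backslash-escaped spaces written as plain spaces), both day-of-month forms should match:

# Both the zero-padded and unpadded forms should match, so this prints 2.
printf '%s\n' 'Tue, 07 May 2024 09:46:11 GMT' 'Tue, 7 May 2024 09:46:11 GMT' |
  grep -Ec '^(Mon|Tue|Wed|Thu|Fri|Sat|Sun), ([0-3]?[0-9]) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) (20[0-9][0-9]) ([01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9] GMT$'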
2024-05-08: Sec-CH-UA-Mobile: ?1
For automatically selecting the lite/mobile version of main pages (but not necessarily of images for example) I now treat Sec-CH-UA-Mobile: ?1 as equivalent to Save-Data: on.
Note that Sec-CH-UA-Mobile is 'experimental', but adding/removing it is easy in this case (in three places):
+ RewriteCond %{HTTP:Sec-CH-UA-Mobile} "[?]1" [OR]
RewriteCond %{HTTP:Save-Data} on [NC]
(I saw this header in the forensics, though it only was on 6 requests out of ~1100, and none of those were HTML pages.)
This mechanism seems to be forcing lite EOU pages for my Fairphone 3 running Android 13 and Chrome 124.
Googlebot's mobile / smartphone fetch does not seem to set this flag.
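A crude way to eyeball the switch from the command line is to compare response sizes with and without the hint, offering gzip both times so that the hint is the only difference; a noticeably smaller Content-Length with the hint suggests the lite variant was served (this check is my assumption, not part of the site tooling):

# Full page vs lite page, selected purely by the Sec-CH-UA-Mobile hint.
curl -sI -H 'Accept-Encoding: gzip' https://www.earth.org.uk/ | grep -i '^content-length'
curl -sI -H 'Accept-Encoding: gzip' -H 'Sec-CH-UA-Mobile: ?1' https://www.earth.org.uk/ | grep -i '^content-length'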
2024-05-07: mod_log_forensic
This evening's fun is to try Apache 2.4 module mod_log_forensic, enabling it with a2enmod log_forensic and in the EOU config something like:
<IfModule log_forensic_module>
ForensicLog ${APACHE_LOG_DIR}/forensic.log
</IfModule>
Immediately I see a snippet such as this:
|GET /rss/podcast-lite.rss HTTP/1.1|User-Agent:Spotify/1.0|If-Modified-Since:Tue, 7 May 2024 09%3a46%3a22 GMT|
The 7 May rather than 07 May seems to be technically malformed for example, compared to:
|GET /rss/note-on-site-technicals.rss HTTP/1.1|Host:www.earth.org.uk|Connection:Keep-Alive|Accept-Encoding:gzip|If-Modified-Since:Tue, 07 May 2024 09%3a46%3a26 GMT|User-Agent:SpaceCowboys Android RSS Reader / 2.6.24(309)
Should I really get into editing malformed headers? I think that Apache probably accepts both forms. But I think that I can treat this variant as malformed to reject some requests randomly. Nominally three formats are acceptable, and extra whitespace should be tolerated.
Sun, 06 Nov 1994 08:49:37 GMT  ; RFC 822, updated by RFC 1123
Sunday, 06-Nov-94 08:49:37 GMT ; RFC 850, obsoleted by RFC 1036
Sun Nov  6 08:49:37 1994       ; ANSI C's asctime() format
Apache itself generates the 07 format for Date and Expires and Last-Modified.
% wget --compression=auto --debug -S -O out https://www.earth.org.uk/sitemap.atom
...
---response begin---
HTTP/1.1 200 OK
Date: Tue, 07 May 2024 18:15:00 GMT
Server: Apache
Upgrade: h2
Connection: Upgrade, Keep-Alive
Last-Modified: Tue, 07 May 2024 15:38:44 GMT
Vary: Accept-Encoding,Referer
Content-Encoding: gzip
Cache-Control: max-age=3600
Expires: Tue, 07 May 2024 19:15:00 GMT
X-Frame-Options: DENY
access-control-allow-origin: *
Content-Length: 1384
Keep-Alive: timeout=5, max=100
Content-Type: application/atom+xml
---response end---
Some Spotify requests get 304 responses, while others are rejected by the bad-unconditional-requests logic:
[07/May/2024:18:03:46 +0000] "GET /rss/podcast-lite.rss HTTP/1.1" 429 4348 "-" "Spotify/1.0"
[07/May/2024:18:10:46 +0000] "GET /rss/podcast-lite.rss HTTP/1.1" 304 3625 "-" "Spotify/1.0"
[07/May/2024:18:17:46 +0000] "GET /rss/podcast-lite.rss HTTP/1.1" 304 3625 "-" "Spotify/1.0"
[07/May/2024:18:24:46 +0000] "GET /rss/podcast-lite.rss HTTP/1.1" 304 3625 "-" "Spotify/1.0"
[07/May/2024:18:31:46 +0000] "GET /rss/podcast-lite.rss HTTP/1.1" 429 4348 "-" "Spotify/1.0"
gPodder
Interestingly this poking around seems to have found a minor issue in gPodder, where it was retaining a stale ETag that had not been present in responses for several days.
Hurrah for forensics!
2024-05-06: 406/429 Noon Hole
There has just been more than 24h with high grid carbon intensity, and all iTMS (iTunes) poll attempts rejected with 429 or 406. I have added a noon hole, for an hour, in the 406 defence, when there is most likely to be some solar power both off-grid and on. Even some of the semi-bad bots should be able to make a successful poll then. iTunes made it through the hole today!
...
[06/May/2024:11:21:26 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 3452 "-" "iTMS"
[06/May/2024:11:21:26 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 265 "-" "iTMS"
[06/May/2024:12:10:38 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 200 3571 "-" "iTMS"
[06/May/2024:12:10:38 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 200 384 "-" "iTMS"
[06/May/2024:12:10:38 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 101134 "-" "iTMS"
[06/May/2024:13:04:16 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 3452 "-" "iTMS"
[06/May/2024:13:04:16 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 265 "-" "iTMS"
...
I have also reinstated an all-day 429 rule for unconditional GETs, and given it a matching noon hole. So the common brute-force, every-hour, but compression-enabled bots should be able to get through it at least daily.
ETag paranoia
I have also added to the top of the EOU config unsetting any If-None-Match request header to avoid interfering with If-Modified-Since, so now the block is:
# When upgrading to a new-enough (eg 2.5) Apache, enable below
# and re-allow ETags elsewhere eg for large static image / audio / video.
#DeflateAlterETag Remove
# Until Apache 2.5, disable ETag entirely for this site.
FileETag none
Header unset ETag
RequestHeader unset If-None-Match
If-Modified-Since paranoia
I am testing validating If-Modified-Since with a regex, for this and the 429 case, rather than just checking that it is not empty or missing, something like:
^(Mon|Tue|Wed|Thu|Fri|Sat|Sun), ([0-3][0-9]) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) (20[0-9][0-9]) ([01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9] GMT$
mod_dumpio
I am trying out Apache's mod_dumpio, which seems to be available in my server's build. I expect to need to issue a sudo a2enmod dump_io to enable it, or the equivalent in my ansible setup if I would like the module permanently available (I probably do not). (And a2dismod to disable when I have finished...)
In any case, I am making my use conditional on the module being present, which should avoid breaking things either way.
<IfModule mod_dumpio.c>
DumpIOInput On
LogLevel alert dumpio:trace7
</IfModule>
Seemingly this has to be at server level, not virtual-host level...
I made all that work, but the output does not seem very helpful! So I have turned it all off again for now.
[Mon May 06 12:29:24.857553 2024] [dumpio:trace7] [pid 20553:tid 1962402864] mod_dumpio.c(140): [client 79.135.97.65:63570] mod_dumpio: dumpio_in [init-blocking] 0 readbytes
[Mon May 06 12:29:24.912990 2024] [dumpio:trace7] [pid 20553:tid 1962402864] mod_dumpio.c(140): [client 79.135.97.65:63570] mod_dumpio: dumpio_in [readbytes-nonblocking] 65536 readbytes
[Mon May 06 12:29:24.913114 2024] [dumpio:trace7] [pid 20553:tid 1962402864] mod_dumpio.c(151): [client 79.135.97.65:63570] mod_dumpio: dumpio_in - 11
[Mon May 06 12:29:24.913302 2024] [dumpio:trace7] [pid 20553:tid 1962402864] mod_dumpio.c(140): [client 79.135.97.65:63570] mod_dumpio: dumpio_in [readbytes-nonblocking] 65536 readbytes
[Mon May 06 12:29:24.913359 2024] [dumpio:trace7] [pid 20553:tid 1962402864] mod_dumpio.c(151): [client 79.135.97.65:63570] mod_dumpio: dumpio_in - 11
[Mon May 06 12:29:24.913444 2024] [dumpio:trace7] [pid 20553:tid 1962402864] mod_dumpio.c(140): [client 79.135.97.65:63570] mod_dumpio: dumpio_in [readbytes-blocking] 65536 readbytes
[Mon May 06 12:29:24.922025 2024] [dumpio:trace7] [pid 20553:tid 1962402864] mod_dumpio.c(63): [client 79.135.97.65:63570] mod_dumpio: dumpio_in (data-TRANSIENT): 148 bytes
[Mon May 06 12:29:24.922107 2024] [dumpio:trace7] [pid 20553:tid 1962402864] mod_dumpio.c(103): [client 79.135.97.65:63570] mod_dumpio: dumpio_in (data-TRANSIENT): PRI * HTTP/2.0\r\n\r\nSM\r\n\r\n
[Mon May 06 12:29:24.922193 2024] [dumpio:trace7] [pid 20553:tid 1962402864] mod_dumpio.c(140): [client 79.135.97.65:63570] mod_dumpio: dumpio_in [readbytes-nonblocking] 65536 readbytes
[Mon May 06 12:29:24.922461 2024] [dumpio:trace7] [pid 20553:tid 1962402864] mod_dumpio.c(63): [client 79.135.97.65:63570] mod_dumpio: dumpio_in (data-TRANSIENT): 1521 bytes
[Mon May 06 12:29:24.922498 2024] [dumpio:trace7] [pid 20553:tid 1962402864] mod_dumpio.c(103): [client 79.135.97.65:63570] mod_dumpio: dumpio_in (data-TRANSIENT):
[Mon May 06 12:29:24.922905 2024] [dumpio:trace7] [pid 20553:tid 1962402864] mod_dumpio.c(140): [client 79.135.97.65:63570] mod_dumpio: dumpio_in [readbytes-nonblocking] 65536 readbytes
[Mon May 06 12:29:24.923008 2024] [dumpio:trace7] [pid 20553:tid 1962402864] mod_dumpio.c(63): [client 79.135.97.65:63570] mod_dumpio: dumpio_in (data-TRANSIENT): 9 bytes
...
Time hash
I have adjusted the %{md5:%{TIME}} 406 randomiser to instead use TIME_MIN for its fastest-changing part so that a very quick (naughty) retry is more likely to get an unchanged result. (Eg from a bot that does a HEAD before each GET.) Thus bad behaviour should be less likely to get an undeserved 200. I could fold into the hash the IP address and/or User-Agent at a tiny extra CPU cost in order to reduce the risk of stampedes, but maybe also reducing (OS) cache effectiveness a tad.
This would make things harder for real humans attempting to sign up to the feed, or attempting a manual forced refresh, so it is not used for 429 rejections, including daytime ones, of unconditional GETs. Podbean in particular, with its bad fast retry after 429, will evade this defence for now.
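For what it is worth, folding the client identity into the randomiser as mooted above might look something like this untested sketch, in place of the existing hash line (the exact mix of extra variables is my assumption):

# Untested sketch: fold client IP and User-Agent into the slowly-varying hash,
# so each client gets its own stable accept/reject window within the hour.
RewriteCond expr "'%{md5:%{TIME_DAY}%{TIME_HOUR}%{TIME_MIN}%{REMOTE_ADDR}%{HTTP:User-Agent}}' =~ /^[^abc]/" [OR]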
2024-05-05: Glossary Lite
2024-05-02: Random Early Drop 429
I now also randomly reject about half of unconditional GETs with 429s during skipHours, even when everything else (eg battery) is OK. Half, so that a human manually signing up for the first time, or refreshing their feed, should typically need to retry only once. But even so, this about halves the rest of the useless overnight bytes, and signals more unhappiness clues for any engineer that looks. This should trim traffic from Amazon in particular.
RewriteCond expr "'%{md5:%{TIME}}' =~ /^[0-7]/" [OR]
I further extended this by just rearranging the clauses so that, for any Referer-less unconditional fetch of an RSS feed file during skipHours, and for about 50% of those outside skipHours, the remainder of the 429 rejection rules come into play. So this should trim some daytime waste also.
RewriteCond %{HTTP_REFERER} ^$
RewriteCond %{HTTP:If-Modified-Since} ^$ [NV]
#RewriteCond %{HTTP:If-None-Match} ^$ [NV]
# Potentially reject ~50% at random to trim bandwidth, but allow manual retries.
RewriteCond expr "'%{md5:%{TIME}}' =~ /^[0-7]/" [OR]
# During skipHours.
RewriteCond "%{TIME_HOUR}" "<08" [OR]
RewriteCond "%{TIME_HOUR}" ">21"
# Not allowing compression is rude.
RewriteCond %{HTTP:Accept-Encoding} !gzip [OR]
# Not saying who you are (no User-Agent) is rude.
RewriteCond %{HTTP:User-Agent} ^$ [OR]
# Have any interaction with the filesystem as late as possible.
# Battery is low.
RewriteCond /run/EXTERNAL_BATTERY_LOW.flag -f [OR]
# Grid is high-intensity.
RewriteCond %{DOCUMENT_ROOT}/_gridCarbonIntensityGB.7d.red.flag -f
RewriteRule "^/rss/.*\.rss$" - [L,R=429,E=RSS_RATE_LIMIT:1]
I adjusted the 406 random rejection percentage down to compensate.
I also now have the Header always set Retry-After instruction set alongside (and to the same duration as) the ExpiresByType application/rss+xml, dependent on the hour of day, in case the client actually reacts properly to 429! (Note the REDIRECT_ prefix on the variable name.)
<If "%{TIME_HOUR} -lt 8 || %{TIME_HOUR} -gt 21"> # This should be long enough to jump out of skipHours in one go. ExpiresByType application/rss+xml "access plus 10 hours 7 minutes" Header always set Retry-After "36420" env=REDIRECT_RSS_RATE_LIMIT </If> <ElseIf "%{TIME_HOUR} -gt 17"> # Jump expiry right over coming skipHours block. ExpiresByType application/rss+xml "access plus 14 hours 7 minutes" Header always set Retry-After "50620" env=REDIRECT_RSS_RATE_LIMIT </ElseIf> <Else> # Give podcast RSS and similar feeds a default expiry time of ~4h. ExpiresByType application/rss+xml "access plus 4 hours 7 minutes" Header always set Retry-After "14720" env=REDIRECT_RSS_RATE_LIMIT </Else>
This is what 429 rejection headers for wget look like:
HTTP/1.1 429 Too Many Requests
Date: Thu, 02 May 2024 12:25:37 GMT
Server: Apache
Retry-After: 14720
Vary: Accept-Encoding,Referer
Upgrade: h2
Connection: Upgrade, Keep-Alive
Last-Modified: Sun, 28 Apr 2024 11:05:28 GMT
Accept-Ranges: bytes
Content-Length: 925
X-Frame-Options: DENY
Keep-Alive: timeout=5, max=100
Content-Type: text/html
And others (406):
HTTP/1.1 406 Not Acceptable
Date: Thu, 02 May 2024 12:32:36 GMT
Server: Apache
Vary: Accept-Encoding,Referer
Upgrade: h2
Connection: Upgrade, Keep-Alive
Last-Modified: Sun, 28 Apr 2024 11:05:22 GMT
Accept-Ranges: bytes
Content-Length: 928
X-Frame-Options: DENY
Keep-Alive: timeout=5, max=100
Content-Type: text/html
And success (200):
HTTP/1.1 200 OK
Date: Thu, 02 May 2024 12:34:15 GMT
Server: Apache
Upgrade: h2
Connection: Upgrade, Keep-Alive
Last-Modified: Wed, 01 May 2024 12:04:22 GMT
Accept-Ranges: bytes
Content-Length: 100547
Vary: Accept-Encoding,Referer
Cache-Control: max-age=14820
Expires: Thu, 02 May 2024 16:41:15 GMT
X-Frame-Options: DENY
access-control-allow-origin: *
Keep-Alive: timeout=5, max=100
Content-Type: application/rss+xml
This is not so far trimming Amazon daytime traffic, however, because the grid and battery SoC are fine!
It also seems (though I have yet to take detailed header dumps) that some clients may be faking or messing up conditional requests, thus evading some of my controls...
I have now separated out another 429 block that rejects 50% of Referer-less unconditional requests, and removed the 50% rejection from the main 429 rule. This more aggressively rejects traffic from the likes of Amazon and fyyd at all times.
iTunes panic
iTunes (iTMS) really dislikes being asked to cool it! After being shut out for a while it then unconditionally demands copies of all the (immutable) cover art, often duplicates in quick succession. Apple should be embarrassed about this poor implementation, I think.
[02/May/2024:16:32:26 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 200 3571 "-" "iTMS"
[02/May/2024:16:32:26 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 200 384 "-" "iTMS"
[02/May/2024:16:32:26 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 101134 "-" "iTMS"
[02/May/2024:16:32:27 +0000] "HEAD /img/wordcloud/podcast-1.png HTTP/1.1" 200 339 "-" "iTMS"
[02/May/2024:16:32:28 +0000] "HEAD /img/meet/RCK/people-posterised-sq-800w.png HTTP/1.1" 200 340 "-" "iTMS"
[02/May/2024:16:32:28 +0000] "GET /img/meet/RCK/people-posterised-sq-800w.png HTTP/1.1" 200 105606 "-" "iTMS"
[02/May/2024:16:32:29 +0000] "HEAD /img/audio/podcast-furniture/title/diarycast-1.png HTTP/1.1" 200 338 "-" "iTMS"
[02/May/2024:16:32:29 +0000] "GET /img/audio/podcast-furniture/title/diarycast-1.png HTTP/1.1" 200 4180 "-" "iTMS"
[02/May/2024:16:32:29 +0000] "HEAD /img/OpenTRV/DORM1-wall-scene-sq-800w.jpg HTTP/1.1" 200 340 "-" "iTMS"
[02/May/2024:16:32:29 +0000] "GET /img/OpenTRV/DORM1-wall-scene-sq-800w.jpg HTTP/1.1" 200 30479 "-" "iTMS"
[02/May/2024:16:32:30 +0000] "HEAD /img/audio/podcast-furniture/title/fieldrecording-1.png HTTP/1.1" 200 338 "-" "iTMS"
[02/May/2024:16:32:30 +0000] "GET /img/audio/podcast-furniture/title/fieldrecording-1.png HTTP/1.1" 200 5207 "-" "iTMS"
[02/May/2024:16:32:30 +0000] "GET /img/audio/podcast-furniture/title/fieldrecording-1.png HTTP/1.1" 200 5207 "-" "iTMS"
[02/May/2024:16:32:30 +0000] "GET /img/OpenTRV/DORM1-wall-scene-sq-800w.jpg HTTP/1.1" 200 30479 "-" "iTMS"
[02/May/2024:16:32:30 +0000] "HEAD /img/16WW/mkaudio/PV-generation-63M-Audacity-screenshot-2-sq-300w.png HTTP/1.1" 200 338 "-" "iTMS"
[02/May/2024:16:32:31 +0000] "GET /img/16WW/mkaudio/PV-generation-63M-Audacity-screenshot-2-sq-300w.png HTTP/1.1" 200 1963 "-" "iTMS"
[02/May/2024:16:32:31 +0000] "HEAD /img/16WW/measuring-mains-water-temperature-at-kitchen-sink-with-N84FR-1800w.jpg HTTP/1.1" 200 340 "-" "iTMS"
[02/May/2024:16:32:31 +0000] "GET /img/audio/podcast-furniture/title/diarycast-1.png HTTP/1.1" 200 4180 "-" "iTMS"
[02/May/2024:16:32:31 +0000] "GET /img/audio/podcast-furniture/title/fieldrecording-1.png HTTP/1.1" 200 5207 "-" "iTMS"
[02/May/2024:16:32:31 +0000] "GET /img/audio/podcast-furniture/title/fieldrecording-1.png HTTP/1.1" 200 5207 "-" "iTMS"
[02/May/2024:16:32:32 +0000] "HEAD /img/3rdParty/TTK-logo-sq-300w.jpg HTTP/1.1" 200 340 "-" "iTMS"
[02/May/2024:16:32:32 +0000] "GET /img/3rdParty/TTK-logo-sq-300w.jpg HTTP/1.1" 200 54330 "-" "iTMS"
[02/May/2024:16:32:32 +0000] "HEAD /img/storage/storage-sq-640w.png HTTP/1.1" 200 338 "-" "iTMS"
[02/May/2024:16:32:32 +0000] "GET /img/storage/storage-sq-640w.png HTTP/1.1" 200 3103 "-" "iTMS"
...
I have had Amazon Music Podcast pull down (equally badly) a full set of cover art images too...
Cold feet
After an overnight run I got cold feet about this 429 rule and I disabled it, bumping up the 406 rejection percentage a little instead.
I reinstated a milder version of this random rejection rule, so that it only happens in skipHours to unconditional no-Referer requests, and it is now one of the OR-ed subconditions that allows a 429:
- no User-Agent
- no gzip compression allowed by Accept-Encoding
- ~50% random chance
- battery low
- grid intensity high
I cranked up a separate streamlined version of this 429 rule to apply at any time other than noon (noon UTC is a hole in the defences to match 406), and so the skipHours-only version can lose its random part.
The mixed 4xx status messages to clients, and occasional wrong delivery to potential new listeners, are sub-optimal, though they do correctly hint that there is more than one thing wrong with the affected requests.
The 429s should slow Amazon down a bit, immediately.
2024-05-01: Random Early Drop 406
For those RSS feed clients that do not allow compression, outside skipHours times I now aim to (kinda) Random Early Drop (RED) a high enough fraction of requests to roughly compensate for the wasted bandwidth in those allowed, ie about 75% or above (typically over 4x compression with gzip).
Effective randomness is achieved by matching a single leading hex digit of the MD5 hash of the full request date/time: /^[^abc]/ matches 13 of the 16 possible digits, so roughly 80% of such requests get rejected. This fairly cheap single line in the 406 rule can be tweaked in various ways if needed.
RewriteCond expr "'%{md5:%{TIME}}' =~ /^[^abc]/" [OR]
For example, successively running wget --debug -S -O out https://www.earth.org.uk/rss/podcast.rss, which does not offer/allow gzip compression, gives:
[02/May/2024:08:03:34 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 104560 "-" "Wget/1.24.5"
[02/May/2024:08:03:52 +0000] "GET /rss/podcast.rss HTTP/1.1" 406 4635 "-" "Wget/1.24.5"
[02/May/2024:08:03:55 +0000] "GET /rss/podcast.rss HTTP/1.1" 406 4635 "-" "Wget/1.24.5"
[02/May/2024:08:03:58 +0000] "GET /rss/podcast.rss HTTP/1.1" 406 4635 "-" "Wget/1.24.5"
[02/May/2024:08:04:00 +0000] "GET /rss/podcast.rss HTTP/1.1" 406 4635 "-" "Wget/1.24.5"
[02/May/2024:08:04:01 +0000] "GET /rss/podcast.rss HTTP/1.1" 406 4635 "-" "Wget/1.24.5"
[02/May/2024:08:04:03 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 104560 "-" "Wget/1.24.5"
[02/May/2024:08:04:13 +0000] "GET /rss/podcast.rss HTTP/1.1" 406 4635 "-" "Wget/1.24.5"
[02/May/2024:08:04:14 +0000] "GET /rss/podcast.rss HTTP/1.1" 406 4635 "-" "Wget/1.24.5"
More pertinently on a sample of Apple iTunes requests:
[02/May/2024:08:22:38 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 3452 "-" "iTMS"
[02/May/2024:08:22:38 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 265 "-" "iTMS"
[02/May/2024:08:37:47 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 3452 "-" "iTMS"
[02/May/2024:08:37:48 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 265 "-" "iTMS"
[02/May/2024:08:51:40 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 3452 "-" "iTMS"
[02/May/2024:08:51:41 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 265 "-" "iTMS"
[02/May/2024:09:04:25 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 200 3571 "-" "iTMS"
[02/May/2024:09:04:25 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 200 384 "-" "iTMS"
[02/May/2024:09:04:25 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 101134 "-" "iTMS"
[02/May/2024:09:15:59 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 3452 "-" "iTMS"
[02/May/2024:09:15:59 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 265 "-" "iTMS"