Earth Notes: On Website Technicals (2024-04)
Updated 2024-05-06.

2024-04-30: 429 then 406
For EOU RSS traffic I have rearranged the defences to return a 429 in preference to a 406 if both are applicable, since a 429 does seem to slow down Amazon, for example. And the Retry-After header may provide more control (than 406) with better-behaved clients.
The iTunes iTMS bot continues to get 406s for now (before 07:00Z, battery and grid OK). But a no-User-Agent bot has just received a 429 where previously it might have had a 406.
I also added not allowing gzip compression to the list of sins that may result in a 429 during skipHours.
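A minimal sketch of the reordering (conditions abbreviated to the battery flag only; the full rule sets appear in later entries below): the 429 rule and its Retry-After header now come first, so a client that would trigger both defences gets the 429.

```apache
# Sketch only: test the rate-limit (429) conditions first, so a client
# that triggers both defences gets the 429 and its Retry-After hint.
RewriteCond /run/EXTERNAL_BATTERY_LOW.flag -f
RewriteRule "^/rss/.*\.rss$" - [L,R=429,E=RSS_RATE_LIMIT:1]
Header always set Retry-After "25620" env=RSS_RATE_LIMIT
# Only if the 429 rule did not fire does the 406 compression check apply.
RewriteCond %{HTTP:Accept-Encoding} !gzip
RewriteRule "^/rss/.*\.rss$" - [L,R=406]
```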
2024-04-29: RSS Stats
The logs have rolled; time for new stats:
% sh ./prepareStats.sh
INFO: /tmp/stats.out/interval.txt: 2024-04-21T06:25:13 to 2024-04-29T06:25:10 inclusive
11477 829577444 02
11539 779247358 03
9576 766272405 04
INFO: /tmp/stats.out/siteHitsByHour.log: site hits by hour (UTC)...
4816 326837195 00
4965 227080097 01
3814 158796652 02
5100 223972908 03
3961 168507529 04
INFO: /tmp/stats.out/feedHits.log: RSS feed hits...
www.earth.org.uk:443 17.58.X.X - - [21/Apr/2024:06:25:16 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 3506 "-" "iTMS"
www.earth.org.uk:443 17.58.X.X - - [21/Apr/2024:06:25:16 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 319 "-" "iTMS"
www.earth.org.uk:443 162.19.X.X - - [21/Apr/2024:06:25:41 +0000] "GET /rss/podcast.rss HTTP/2.0" 304 124 "-" "Wget/1.21.3"
www.earth.org.uk:443 104.237.X.X - - [21/Apr/2024:06:29:35 +0000] "GET /rss/podcast.rss HTTP/1.1" 304 3421 "-" "Overcast/1.0 Podcast Sync (3 subscribers; feed-id=XXXX; +http://overcast.fm/)"
www.earth.org.uk:80 54.200.X.X - - [21/Apr/2024:06:30:27 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 11547 "-" "Amazon Music Podcast"
INFO: /tmp/stats.out/feedHitsByUA.log: feed hits by UA...
9632 102051873 ALL
2318 26378397 "Amazon Music Podcast"
1696 25305017 "iTMS"
1202 5814066 "Spotify/1.0"
640 5829187 "Podbean/FeedUpdate 2.1"
INFO: /tmp/stats.out/feedHitsByHour.log: feed hits by hour (UTC)...
384 2484922 00
391 2633980 01
380 2889315 02
405 3060896 03
317 2103272 04
INFO: /tmp/stats.out/feedStatusByUA.log: feed hits and status by UA...
9632 102051873 200:304:406:429 6441 1207 1518 429 ALL
2318 26378397 200:304:406:429 2243 0 4 71 "Amazon Music Podcast"
1696 25305017 200:304:406:429 672 6 1014 0 "iTMS"
1202 5814066 200:304:406:429 576 588 0 38 "Spotify/1.0"
640 5829187 200:304:406:429 501 0 0 139 "Podbean/FeedUpdate 2.1"
INFO: /tmp/stats.out/feedStatusByHour.log: feed hits and status by hour (UTC)...
384 2484922 200:304:406:429 191 51 90 52 00
391 2633980 200:304:406:429 199 56 89 44 01
380 2889315 200:304:406:429 229 50 92 9 02
405 3060896 200:304:406:429 251 50 96 8 03
317 2103272 200:304:406:429 167 44 86 19 04
Spotify and the 'lite' feed were added during this stats interval.
The 406 and 429 defences are trimming some waste. Most of the iTunes (iTMS) requests are being rejected with 406, and without a detailed check I suspect that those lonely six 304s were actually a feed validator pretending to be a well-behaved version of it! Amazon does back off somewhat when fed 429s, which is good, though not enough.
feedStatusByHour.log demonstrates the waste. In this 8-day interval no new podcast episodes were added, I fiddled with (eg re-arranged) metadata at most a handful of times, there are still at most a handful of listeners, and yet the feed file was polled 9632 times, the vast majority of those unconditionally (some of which have been rejected with 406/429).

9632 102051873 200:304:406:429 6441 1207 1518 429 ALL
A more sensible result, for 20 imperfect clients and one feed change per day (rather than just the less-than-monthly appearance of new episodes), might have been (cue dreamy music):
240 2000000 200:304:406:429 120 120 0 0 ALL
The feed file is ~100kB uncompressed and ~11kB gzip compressed (~9kB br compressed), and there is per-request overhead.
Here is one client, Feeder, with at least a couple of separate users, showing near-optimal behaviour:
14 80589 200:304:406:429 7 7 0 0 "SpaceCowboys Android RSS Reader / 2.6.22(307)"
Feeder got an upgrade during the sampling interval, so there is also a fab all-304s entry:
3 315 200:304:406:429 0 3 0 0 "SpaceCowboys Android RSS Reader / 2.6.23(308)"
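The prepareStats.sh script itself is not shown here, but a per-UA tally like feedHitsByUA.log above can be produced with a short awk pass over the access log. A sketch, assuming the combined-log layout in the samples (the function name is mine):

```shell
# Tally RSS feed hits and response bytes by User-Agent from an access log.
# Splitting on double quotes makes the UA the sixth field even when it
# contains spaces; a "-" byte count is silently treated as zero by awk.
feed_hits_by_ua() {
  grep '/rss/podcast\.rss' "$1" |
  awk -F'"' '{
    split($3, sb, " ")            # sb[1]=status, sb[2]=bytes
    hits[$6]++
    bytes[$6] += sb[2]
  }
  END {
    for (ua in hits) printf "%d %d \"%s\"\n", hits[ua], bytes[ua], ua
  }' | sort -rn
}
```

The sort -rn puts the heaviest pollers first, matching the ordering in the stats above.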
2024-04-28: Cacheing Tweak
I have modified the cacheing for RSS feeds to be 4h07m by default, but 10h07m within the skipHours block so as to jump out of the block in one go, and just before the skipHours block starts, 14h07m to jump right over the whole block, reducing wasted polls.
A feed consumer correctly following cacheing hints should barely need to implement skipHours where it forms a single large block like this, and where the consumer is fairly continuously connected, though the declaration remains useful.
# Set cache time, ie minimum poll interval.
# Give podcast RSS and similar feeds longer expiry out of work hours.
<If "%{TIME_HOUR} -lt 8 || %{TIME_HOUR} -gt 21">
    # This should be long enough to jump out of skipHours in one go.
    ExpiresByType application/rss+xml "access plus 10 hours 7 minutes"
</If>
<ElseIf "%{TIME_HOUR} -gt 17">
    # Jump expiry right over coming skipHours block.
    ExpiresByType application/rss+xml "access plus 14 hours 7 minutes"
</ElseIf>
<Else>
    # Give podcast RSS and similar feeds a default expiry time of ~4h.
    ExpiresByType application/rss+xml "access plus 4 hours 7 minutes"
</Else>
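A quick sanity check of the arithmetic, as a shell sketch: taking skipHours as 22:00-07:59 (matching the <If> above), an expiry of 10h07m from any minute inside the block, and 14h07m from the 18:00-21:59 pre-block window, should both land outside the block.

```shell
# Check that each expiry choice lands outside the 22:00-07:59 skipHours block.
in_skip() { h=$(( $1 / 60 % 24 )); [ "$h" -lt 8 ] || [ "$h" -gt 21 ]; }

check=0
for m in $(seq 0 1439); do        # every access minute in the day
  h=$(( m / 60 ))
  if [ "$h" -lt 8 ] || [ "$h" -gt 21 ]; then
    add=$(( 10 * 60 + 7 ))        # 10h07m for accesses inside the block
  elif [ "$h" -gt 17 ]; then
    add=$(( 14 * 60 + 7 ))        # 14h07m for the 18:00-21:59 run-up
  else
    continue                      # default 4h07m case not checked here
  fi
  if in_skip $(( m + add )); then check=1; echo "fails at minute $m"; fi
done
[ "$check" -eq 0 ] && echo "all expiries clear skipHours"
```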
Bing activity timing
In the Bing console I have tweaked most of Bing's spidering of my sites to be around noon UTC, when there is most likely to be available solar power (off-grid and grid-tied) to cover the network and CPU load directly rather than from battery.
2024-04-29: Site ExpiresDefault
Looking again at the default expiry time for the desktop site I think that it is too harsh at ~11 days, so I have modified it to be at that level in winter when conserving is key, but at ~1 day otherwise. This could be just Dec/Jan.
# Default ~1 day to optimise Cache-Control max-age to 5 digits and for HPACK.
ExpiresDefault "access plus 92222 seconds"
<If "%{TIME_MON} -lt 3 || %{TIME_MON} -gt 10">
    # Winter ~11 days to reduce load a little.
    ExpiresDefault "access plus 922222 seconds"
</If>
Data cacheing
Everything under /data/ has a cache life of 1 day. Much of the data is effectively immutable.

For winter I have made the default 2 days for /data/, using similar code to the above with TIME_MON.
As a next step I am giving the solid (usually yearly) .xz archives, and also the (typically monthly) .gz archives, a year's cache life:
<LocationMatch "/data/.*\.(xz|gz|zip)$">
    ExpiresDefault "access plus 1 year"
</LocationMatch>
% find data -name '*.xz' | wc -l
243
% find data -name '*.gz' | wc -l
1102
% find data -name '*.zip' | wc -l
9
I am considering changing the /data/ default to ~30 days, and then reducing back to a day those items likely to update more often:
- any directory, or at least any called live
- any file whose name contains the current year (or near-future years)
% find data | wc -l
15713
% find data -type d | wc -l
238
% find data -name live -type d | wc -l
6
% find data -name '*2024*' | wc -l
1122
% find data -name '*2025*' | wc -l
0
% find data -name '*2026*' | wc -l
0
...
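If I do go that way, a hypothetical sketch of the config might look like the following (the year alternation is illustrative and would need regenerating, or templating, each January; later sections override earlier ones for matching URLs):

```apache
# ~30 day default for /data/ (much of it is effectively immutable).
<LocationMatch "^/data/">
    ExpiresDefault "access plus 30 days"
</LocationMatch>
# Back down to ~1 day for likely-to-change items: directory listings,
# anything under a 'live' directory, and current/near-future-year files.
<LocationMatch "^/data/.*(/$|/live/|202[456])">
    ExpiresDefault "access plus 1 day"
</LocationMatch>
```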
2024-04-27: Big Bad Clients
There seem to be services from some of the biggest tech companies that cannot be bothered to allow even gzip compression (eg Meta's facebookexternalhit), lazily wasting oodles of bandwidth for everyone. For the Gallery I am going to disallow unconditional GETs with 406s where compression is not accepted.
Here is a slightly interesting case in which it would have been nice to have sent a 200...
gallery.hd.org:80 185.15.X.X - - [28/Apr/2024:12:19:21 +0000] "HEAD /_c/places-and-sights/_more2003/_more08/Turkey-Alaja-Huyuk-Hittite-temple-carving-of-two-headed-eagle-with-two-rabbits-in-its-claws-SEW.jpg.html HTTP/1.1" 406 129 "-" "IABot/2.0 (+https://meta.wikimedia.org/wiki/InternetArchiveBot/FAQ_for_sysadmins) (Checking if link from Wikipedia is broken and needs removal)"
2024-04-24: AMP Be Going Moar
While waiting for a train I am trimming a few bits of AMP crud, generating 410s ("gone") in their place:
- the ancient experimental /ext/e/ img and out bridge inj workers
- the ancient /amp/XXX AMP pages view in EOU
Pages with m-dot counterparts are still redirected to them, but with the redirect strength upgraded from 302 (temporary) to 301 (permanent). About 12 such 301 redirects happened in the first half hour or so after the change... I could instead generate a 410 (gone) if the Referer is absent, ie when this appears to be a spider/bot rather than a human, but I already have robots.txt set to forbid all spidering...
2024-04-23: Podcast Lite
I have created a stripped-back item-count-limited 'lite' version of the podcast feed beside the primary. The lite version omits videos and metadata that most readers and aggregators do not use, though they should! The uncompressed 'lite' feed is not much bigger than the compressed full feed.
96985 rss/podcast.rss
11235 rss/podcast.rssgz
9241 rss/podcast.rssbr
12937 rss/podcast-lite.rss
2879 rss/podcast-lite.rssgz
2402 rss/podcast-lite.rssbr
Slightly against my better judgement I have handed this feed to Spotify, only.
Spotify is polling about every 7 minutes, does seem to support at least gzip compression, and is doing conditional GETs. The first point is stupidly too fast, like Amazon and Apple; the other points are good.
(Spotify previously rejected the full feed because it contained two videos. Spotify's automated systems have rejected a couple of episodes that it says are music tracks (they are indeed a couple of my short generative music clips) and thus are not in line with Spotify podcast policy. I have created a new 'music' tag to mark (and exclude) such items. This podcast may not last long on Spotify!)
I have extended the op3.dev enclosure URL prefixes to make it clearer where any download traffic is coming from. This extended URL now also contains the source feed GUID; no personal tracking information.
I may redirect to this feed those bots that might otherwise get a 406 for not supporting even gzip compression on the primary feed.
I have adjusted the lite feed to use the lower-fi MP3 (.mp3L) audio where available, and to carry a maximum of 10 items. Definitely 'lighter' all round.
96997 rss/podcast.rss
11244 rss/podcast.rssgz
9251 rss/podcast.rssbr
9167 rss/podcast-lite.rss
2272 rss/podcast-lite.rssgz
1898 rss/podcast-lite.rssbr
(I am using the same guid for the episode/item in the lite feed as in the main one, even though the former uses a different (lo-fi) enclosure. Maybe this is wrong...)
I have also adjusted lite feed page links to point to pages on the lite/m-dot site.
2024-04-22: lastBuildDate
I am switching the RSS feeds from using pubDate at channel (ie top) level to lastBuildDate.
I doubt that anything much cares, and it is still not logically entirely right, but it is probably better. (lastBuildDate is the more popular in extant feeds, in ~70% of RSS podcast feeds vs ~30% for pubDate, IIRC.)
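For concreteness, the change amounts to swapping one channel-level tag for another (the dates here are illustrative):

```xml
<!-- Before: roughly, when the newest content was published. -->
<pubDate>Mon, 22 Apr 2024 08:00:00 GMT</pubDate>

<!-- After: when the feed file itself was last rebuilt. -->
<lastBuildDate>Mon, 22 Apr 2024 08:00:00 GMT</lastBuildDate>
```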
2024-04-21: ETag Be Gone!
I have disabled ETags for the whole of the EOU site, plus the Gallery and ExNet. I also invoke FileETag none, which should save Apache from even calculating ETag values.
This should enable all on-the-fly compressed material to be cached better, eg including slow-changing directory listings, data files, and logs. It may result in some cache misses from clients that next time present an If-None-Match with no If-Modified-Since.
To enable Last-Modified for directories I need to add +TrackModified to IndexOptions. Note the caveat in the documentation that "Changes to the size or date stamp of an existing file will not update the Last-Modified header on all Unix platforms." Thus this may only reflect changes (add/delete entries) to the directory itself, which I am happy with. Changes to items in a directory should be tracked on those items; the essence of the directory is the list of entries. As a fallback I could restrict this to just /img, which should not have (many) changes to existing files.
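Pulled together, the directives described above amount to something like this (a sketch; the real config spreads them across the relevant vhost and directory contexts):

```apache
# Stop Apache calculating or emitting ETags at all.
FileETag None
Header unset ETag
# Let mod_autoindex emit Last-Modified for directory listings; on many
# Unix platforms this tracks entry add/delete, not edits to existing files.
IndexOptions +TrackModified
```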
(Note: I have disabled the If-None-Match tests in the 406 and 429 rules, since we cannot use those conditionals when the site is not generating ETags.)
AMP be going
While trimming the size of generated error pages (such as for 404s) I also stripped out a bit of AMP complexity and cruft.
Bad wasteful bots
There seem to be wasteful clients (often "Go") that do not implement even gzip encoding/compression. I now treat that as equivalent to Save-Data for main/top pages, ie they cannot have the best stuff, wasting bandwidth! They get the smaller 'lite' pages.
All bona fide browsers and mainstream search engine bots that I know of, including lynx, support gzip encoding/compression.
I am taking this as a prompt to make some 'lite' page versions even smaller, eg:
428905 energy-series-dataset.html
120362 m/energy-series-dataset.html
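The no-gzip-means-lite treatment might be sketched like this (hypothetical: the real Save-Data handling is more involved, and the m-dot path mapping here is purely illustrative):

```apache
# Clients not accepting gzip get the smaller m-dot 'lite' page variant,
# as if they had sent Save-Data: on. (Illustrative path mapping.)
RewriteCond %{HTTP:Accept-Encoding} !gzip
RewriteCond %{DOCUMENT_ROOT}/m%{REQUEST_URI} -f
RewriteRule "^/([^/]+\.html)$" "/m/$1" [L]
```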
I have also taken the opportunity to slightly rationalise the code for Save-Data, including not adding the header to Vary unless the outcome actually depended on the presence of Save-Data. That makes me a bit uncomfortable, ie not always adding to Vary for appropriate objects, but it saves some bytes and is probably correct.
2024-04-20: ClaudeBot Be Gone!
When I exclude a badly-behaved bot from a site with robots.txt, that is not an invitation for the bot to re-check every few seconds just in case I want to be friends now.
Given the log-filling and CPU-wasting nonsense below for gallery.hd.org, I have also excluded ClaudeBot from EOU:
[20/Apr/2024:08:07:46 +0000] "GET /robots.txt HTTP/1.1" 200 1566 "http://mirror-us-ga1.gallery.hd.org/robots.txt" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)"
[20/Apr/2024:08:07:47 +0000] "GET /robots.txt HTTP/1.1" 200 1566 "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)"
[20/Apr/2024:08:07:49 +0000] "GET /robots.txt HTTP/1.1" 200 1566 "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)"
[20/Apr/2024:08:07:49 +0000] "GET /robots.txt HTTP/1.1" 301 509 "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)"
[20/Apr/2024:08:07:49 +0000] "GET /robots.txt HTTP/1.1" 200 1566 "http://mirror-us-ga1.gallery.hd.org/robots.txt" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)"
That is a block of successive hits on all my static sites, though omitting a couple scraping EOU...
ClaudeBot has been greedily busy on my sites lately, as has ChatGPT, plus the usual cohort of seeming scrapers and spiders.
I do not particularly object to the AI usage; it is part of the reason for having semantic markup everywhere. But when a bot is greedy enough to effectively perform denial of service (DoS), eg obstructing me in my own use of my sites and logs, that is a reason for a robots.txt ban. And those bans tend to be permanent, since it is not obvious when to manually check and trim robots.txt.
(A few hours after I sent a slightly angry email about the above to the only anthropic.com email address I could find (the press team), the ClaudeBot activity stopped.)
2024-04-19: Precompressed Podcast RSS
To save a little bandwidth, and CPU time on each fetch, the podcast RSS file now has precompressed Brotli and Gzip (zopfli) versions generated whenever the RSS feed file is updated.
89914 19 Apr 17:16 rss/podcast.rss
10993 19 Apr 17:16 rss/podcast.rssgz
9001 19 Apr 17:16 rss/podcast.rssbr
% gzip -6 < rss/podcast.rss | wc -c
11490
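The regeneration step can be sketched in shell (an assumption on my part: the site's actual build script is not shown; gzip -9 stands in where zopfli is absent, at a small size cost):

```shell
# Sketch: regenerate precompressed feed variants whenever the feed changes.
# Assumes zopfli and brotli are installed; falls back to plain gzip -9
# (slightly larger output than zopfli) where zopfli is not available.
make_precompressed() {
  f="$1"
  if command -v zopfli >/dev/null 2>&1; then
    zopfli -c "$f" > "${f}gz"
  else
    gzip -9 -c "$f" > "${f}gz"
  fi
  if command -v brotli >/dev/null 2>&1; then
    brotli -f -q 11 -o "${f}br" "$f"
  fi
}
```

Note that zopfli output is ordinary gzip-format data, just packed harder, so any gzip-capable client can decode it.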
(When the precompressed versions are not available, normal on-the-fly gzip compression by mod_deflate remains available, equivalent to gzip -6.)
The Apache configuration has been updated to serve the precompressed versions to capable clients.
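That serving config is not reproduced in this entry; a minimal sketch of one way to do it, using the .rssgz/.rssbr naming above, might be:

```apache
# Prefer the Brotli variant, then the zopfli-gzip one, for capable clients.
RewriteCond %{HTTP:Accept-Encoding} br
RewriteCond %{REQUEST_FILENAME}br -f
RewriteRule "^(.+\.rss)$" "$1br" [L]
RewriteCond %{HTTP:Accept-Encoding} gzip
RewriteCond %{REQUEST_FILENAME}gz -f
RewriteRule "^(.+\.rss)$" "$1gz" [L]
# Label the precompressed files correctly.
AddType application/rss+xml .rssgz .rssbr
AddEncoding gzip .rssgz
AddEncoding br .rssbr
```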
The first log entry is from before the precompressed versions were (manually, this time) put in place, and the others after:
[19/Apr/2024:16:40:22 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 12064 "-" "Amazon Music Podcast"
[19/Apr/2024:16:45:20 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 11550 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:125.0) Gecko/20100101 Firefox/125.0"
[19/Apr/2024:16:46:18 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 11551 "-" "Amazon Music Podcast"
[19/Apr/2024:17:01:50 +0000] "GET /rss/podcast.rss HTTP/2.0" 200 9385 "-" "PodcastAddict/v5W (+https://podcastaddict.com/; Android podcast app)"
This represents an apparent ~5% saving for gzip-capable clients, and an apparent ~25% saving for br-capable clients. (~90% saving for the latter vs uncompressed...)
2024-04-17: CORS and ETag
I have slightly adjusted the Apache configuration to work with CORS, and to drop ETag, for all .rss, .atom (and .xml and .vtt) files, rather than just everything under /rss/. In particular this now includes /sitemap.atom and /sitemap.xml. I hope that this will improve cacheability (and other usability) of Atom feeds a little.
<IfModule mod_headers.c>
<FilesMatch "\.(rss|vtt|atom|xml)$">
    Header set access-control-allow-origin *
    # Avoid Apache ETag / mod_deflate bug.
    Header unset ETag
    # DHD20240413: DeflateAlterETag is unsupported for sencha.
    #DeflateAlterETag Remove
</FilesMatch>
</IfModule>
... and the first few 304s for /sitemap.atom have come through:
[17/Apr/2024:09:48:23 +0000] "GET /sitemap.atom HTTP/1.1" 304 3571 "-" "Feedbin feed-id:XXXX - 1 subscribers"
[17/Apr/2024:09:54:54 +0000] "GET /sitemap.atom HTTP/1.1" 304 3374 "-" "Mozilla/5.0 (compatible; theoldreader.com; 1 subscribers; feed-id=XXXX)"
[17/Apr/2024:10:10:07 +0000] "GET /sitemap.atom HTTP/1.1" 304 185 "-" "NewsBlur Feed Fetcher - 1 subscriber - https://www.newsblur.com/site/XXXX/earth-notes-basic-feed (\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0.1 Safari/605.1.15\")"
[17/Apr/2024:10:11:50 +0000] "GET /sitemap.atom HTTP/1.1" 304 185 "-" "Feedbin feed-id:XXXX - 1 subscribers"
[17/Apr/2024:10:18:35 +0000] "GET /sitemap.atom HTTP/1.1" 304 3571 "-" "Feedbin feed-id:XXXX - 1 subscribers"
(I also zapped a defunct /sitemap.xml for the m-dot site; everything is handled in the main site sitemap now.)
2024-04-16: RSS Stats
I built a script to gather a standard set of RSS-related stats from the last ~week of EOU logs, and that data has been captured for later, mwhahahah!
This is what a run of it looks like:
INFO: /tmp/stats.out/interval.txt: 2024-04-07T06:25:14 to 2024-04-15T06:25:10 inclusive log data
INFO: hits: all 235499, site 115578, feed 9643
INFO: /tmp/stats.out/allHitsByHour.log: all hits by hour (UTC)...
9643 175368483 ALL
460 9125257 17
459 7685543 12
446 7737445 18
434 7820128 15
INFO: /tmp/stats.out/feedHits.log: RSS feed hits...
www.earth.org.uk:80 34.220.118.X - - [07/Apr/2024:06:27:16 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 11526 "-" "Amazon Music Podcast"
www.earth.org.uk:443 17.58.59.X - - [07/Apr/2024:06:27:47 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 200 3599 "-" "iTMS"
www.earth.org.uk:443 17.58.59.X - - [07/Apr/2024:06:27:47 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 200 412 "-" "iTMS"
www.earth.org.uk:443 17.58.59.X - - [07/Apr/2024:06:27:47 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 85016 "-" "iTMS"
www.earth.org.uk:443 104.237.137.X - - [07/Apr/2024:06:29:17 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 14780 "-" "Overcast/1.0 Podcast Sync (3 subscribers; feed-id=XXXXXXX; +http://overcast.fm/)"
INFO: /tmp/stats.out/feedHitsByUA.log: feed hits by UA...
9643 175368483 ALL
2806 34128111 "Amazon Music Podcast"
2401 73456937 "iTMS"
542 6380012 "Podbean/FeedUpdate 2.1"
483 8501504 "-"
INFO: /tmp/stats.out/feedHitsByHour.log: feed hits by hour (UTC)...
9643 175368483 ALL
460 9125257 17
459 7685543 12
446 7737445 18
434 7820128 15
2024-04-15: Greenlink support
I added support for the coming INTGRNL 'Greenlink' Irish interconnector, ready for go-live on 2024-08-01.
2024-04-14: Podcasting 2.0, TTL, 406, 429
I have added a little more Podcasting 2.0 metadata to my RSS feeds. The non-podcast feeds now include these channel tags:
<podcast:medium>blog</podcast:medium>
<podcast:location geo="geo:51.406696,-0.288789,16">16WW, Kingston-upon-Thames, UK</podcast:location>
<podcast:podroll><podcast:remoteItem feedGuid="02b2185f-3173-5e6f-bdda-cc60fb797f84"/></podcast:podroll>
<podcast:updateFrequency rrule="FREQ=MONTHLY">monthly</podcast:updateFrequency>
That is: medium, location, podroll, updateFrequency.
The main podcast RSS is not using medium or podroll, but is already using the item tags transcript and alternateEnclosure.
TTL
I have pushed up all the RSS feed TTL values to a little over 3 days (4327 minutes).
406 Not Acceptable
In an attempt to push back on some of the more badly-behaved bots, I have added this to the 'overnight' Apache configuration block covering skipHours:
# Reject (bot) attempts to unconditionally fetch without compression.
# 406 Unacceptable.
RewriteCond %{HTTP_REFERER} ^$
RewriteCond %{HTTP:If-Modified-Since} ^$ [NV]
RewriteCond %{HTTP:If-None-Match} ^$ [NV]
RewriteCond %{HTTP:Accept-Encoding} ^$
RewriteRule "^/rss/.*\.rss$" - [L,R=406]
This is saying that an empty/missing Accept-Encoding, eg precluding a ~7x bandwidth reduction through gzip compression, is not reasonable.
Trying it out in daylight yielded these early 406s (yes, the last two are from Apple...):
[14/Apr/2024:13:44:31 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 148 "-" "-"
[14/Apr/2024:13:44:31 +0000] "GET /rss/podcast.rss HTTP/1.0" 406 418 "-" "-"
[14/Apr/2024:13:53:13 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 3311 "-" "iTMS"
[14/Apr/2024:13:53:13 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 158 "-" "iTMS"
Everything is now in place for when the weekly logs roll tomorrow morning!
The defensive Apache config for RSS is now:
# Allow CORS to work for RSS feeds and transcripts.
# This allows browsers to access them from non-EOU pages.
<IfModule mod_headers.c>
<FilesMatch "\.(rss|vtt)$">
    Header set access-control-allow-origin *
</FilesMatch>
</IfModule>
# Help conditional requests work by removing the unhelpful XXX-gzip ETag.
# https://httpd.apache.org/docs/current/mod/mod_deflate.html#deflatealteretag
<Location /rss>
    Header unset ETag
    # DHD20240413: DeflateAlterETag is unsupported for sencha.
    #DeflateAlterETag Remove
</Location>
<If "%{TIME_HOUR} -lt 8 || %{TIME_HOUR} -gt 21">
    # Give podcast RSS and similar feeds longer expiry out of work hours.
    ExpiresByType application/rss+xml "access plus 7 hours 7 minutes"
    #
    # Reject (bot) attempts to unconditionally fetch without compression.
    # 406 Unacceptable.
    RewriteCond %{HTTP_REFERER} ^$
    RewriteCond %{HTTP:If-Modified-Since} ^$ [NV]
    RewriteCond %{HTTP:If-None-Match} ^$ [NV]
    RewriteCond %{HTTP:Accept-Encoding} ^$
    RewriteRule "^/rss/.*\.rss$" - [L,R=406]
    #
    # For RSS files (which will have skipHours matching the above),
    # if there is no Referer and no conditional fetching, back off
    # when battery is low.
    # 429 Too Many Requests
    RewriteCond %{HTTP_REFERER} ^$
    RewriteCond %{HTTP:If-Modified-Since} ^$ [NV]
    RewriteCond %{HTTP:If-None-Match} ^$ [NV]
    RewriteCond /run/EXTERNAL_BATTERY_LOW.flag -f
    RewriteRule "^/rss/.*\.rss$" - [L,R=429,E=RSS_RATE_LIMIT:1]
    Header always set Retry-After "25620" env=RSS_RATE_LIMIT
</If>
<Else>
    # Give podcast RSS and similar feeds an expiry time of ~4h.
    ExpiresByType application/rss+xml "access plus 4 hours 7 minutes"
</Else>
428 Precondition Required is an alternative plausible status in place of 406 or 429, though any client has to be able to make the first fetch and, at least occasionally, an unconditional fetch.
2024-04-15: no go
Oh dear, that did not seem to be generating 406s at ~05:00Z. Reformulated:
# Allow CORS to work for RSS feeds and transcripts.
# This allows browsers to access them from non-EOU pages.
<IfModule mod_headers.c>
<FilesMatch "\.(rss|vtt)$">
    Header set access-control-allow-origin *
</FilesMatch>
</IfModule>
# Help conditional requests work by removing the unhelpful XXX-gzip ETag.
# https://httpd.apache.org/docs/current/mod/mod_deflate.html#deflatealteretag
<Location /rss>
    Header unset ETag
    # DHD20240413: DeflateAlterETag is unsupported for sencha.
    #DeflateAlterETag Remove
</Location>
# Reject (bot) attempts to unconditionally fetch without compression.
# 406 Unacceptable.
RewriteCond "%{TIME_HOUR}" "<08" [OR]
RewriteCond "%{TIME_HOUR}" ">21"
RewriteCond %{HTTP_REFERER} ^$
RewriteCond %{HTTP:If-Modified-Since} ^$ [NV]
RewriteCond %{HTTP:If-None-Match} ^$ [NV]
#RewriteCond %{HTTP:Accept-Encoding} ^$ [OR]
RewriteCond %{HTTP:Accept-Encoding} !gzip
RewriteRule "^/rss/.*\.rss$" - [L,R=406]
#
# For RSS files (which will have skipHours matching the above),
# if there is no Referer and no conditional fetching, back off
# when battery is low.
# 429 Too Many Requests
RewriteCond "%{TIME_HOUR}" "<08" [OR]
RewriteCond "%{TIME_HOUR}" ">21"
RewriteCond %{HTTP_REFERER} ^$
RewriteCond %{HTTP:If-Modified-Since} ^$ [NV]
RewriteCond %{HTTP:If-None-Match} ^$ [NV]
RewriteCond /run/EXTERNAL_BATTERY_LOW.flag -f
RewriteRule "^/rss/.*\.rss$" - [L,R=429,E=RSS_RATE_LIMIT:1]
Header always set Retry-After "25620" env=RSS_RATE_LIMIT
<If "%{TIME_HOUR} -lt 8 || %{TIME_HOUR} -gt 21">
    # Give podcast RSS and similar feeds longer expiry out of work hours.
    ExpiresByType application/rss+xml "access plus 7 hours 7 minutes"
</If>
<Else>
    # Give podcast RSS and similar feeds an expiry time of ~4h.
    ExpiresByType application/rss+xml "access plus 4 hours 7 minutes"
</Else>
For the 406 case I now reject a lack of gzip support, not just an empty/missing Accept-Encoding header.
Sample rejections (which stopped by 08:00Z as intended):
[15/Apr/2024:06:56:47 +0000] "GET /rss/podcast.rss HTTP/1.1" 406 428 "-" "Go-http-client/1.1"
[15/Apr/2024:06:56:47 +0000] "GET /rss/podcast.rss HTTP/1.1" 406 428 "-" "Go-http-client/1.1"
[15/Apr/2024:07:02:25 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 3311 "-" "iTMS"
[15/Apr/2024:07:02:25 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 158 "-" "iTMS"
[15/Apr/2024:07:10:45 +0000] "GET /rss/podcast.rss HTTP/1.1" 406 584 "-" "taddy.org/developers 1.0"
[15/Apr/2024:07:11:26 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 3311 "-" "iTMS"
[15/Apr/2024:07:11:26 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 158 "-" "iTMS"
A small tweak to the 406 part will also reject non-compressed fetches when the GB grid has high carbon intensity compared to the last week, since the Internet upstream of me is at least in part GB-grid powered.
-RewriteCond "%{TIME_HOUR}" ">21"
+RewriteCond "%{TIME_HOUR}" ">21" [OR]
+RewriteCond %{DOCUMENT_ROOT}/_gridCarbonIntensityGB.7d.red.flag -f
Another line could reject non-compressed fetches if local battery was low, though doing compression may cost more CPU and battery than encrypting the longer non-compressed response, if I do not pre-compress them.
Providing pre-compressed Brotli RSS feed versions might (from a quick test) save ~20% bandwidth for unconditional transfers, and for when there is a feed change. But cutting the number of unconditional polls would save much more bandwidth. (Note that any byte saving is diminished by https overheads.)
I estimate that ~50% of 'bad' unconditional requests without compression support will be rejected with 406s.
More 429
For the 429 case I have added an "if GB grid intensity is high" ORed with the existing "if battery is low" clause.
 RewriteCond %{HTTP_REFERER} ^$
 RewriteCond %{HTTP:If-Modified-Since} ^$ [NV]
 RewriteCond %{HTTP:If-None-Match} ^$ [NV]
+# Have any interaction with the filesystem as late as possible.
+RewriteCond %{DOCUMENT_ROOT}/_gridCarbonIntensityGB.7d.red.flag -f [OR]
 RewriteCond /run/EXTERNAL_BATTERY_LOW.flag -f
 RewriteRule "^/rss/.*\.rss$" - [L,R=429,E=RSS_RATE_LIMIT:1]
 Header always set Retry-After "25620" env=RSS_RATE_LIMIT
So if during skipHours an unconditional feed request is made and either of those is the case, the client will now get a 429. So Amazon, Apple, PodBean, and Deezer will be getting more 429s in their futures. Let us see if my feed is dropped, I receive a complaint, or an intrigued engineer works out what is going on and improves things for all parties. I would like the last, but do not hold out too much hope!
2024-04-19: since high up in a feed-puller bad-boys list comes some anonymous thing(s) (with no User-Agent), I have added a clause to treat that as a sin on a par with low battery and high grid carbon intensity during skipHours:
 RewriteCond %{HTTP:If-None-Match} ^$ [NV]
+# Not saying who you are (no User-Agent) and ignoring skipHours is rude.
+RewriteCond %{HTTP:User-Agent} ^$ [NV,OR]
2024-04-29: also now for 406: no User-Agent is on a par with high grid carbon intensity, for no-Referer unconditional requests not allowing compression:
 RewriteCond "%{TIME_HOUR}" "<08" [OR]
 RewriteCond "%{TIME_HOUR}" ">21" [OR]
+# Not saying who you are (no User-Agent) and not allowing compression is rude.
+RewriteCond %{HTTP:User-Agent} ^$ [NV,OR]
2024-04-16: 406 and 429 custom error pages
(This evening, now that GB grid intensity is relatively high vs the last 7 days, my server is starting to reject some of the clownishly-bad RSS feed polling, eg by iTunes: ~1000x too often, ignoring Cache-Control, with no If-None-Match, no If-Modified-Since, and no Accept-Encoding to allow a gzip ~7x bytes saving. Come on Apple, you can engineer better than this!)
To try and give that intrigued engineer a clue, I have added custom error pages for 406 and 429, with helpful pointers. I may have to update these as and when I update my defences...
Here is the current 406 text:
406: Not Acceptable
Bad request Accept headers
Please:
- allow at least gzip compression in Accept-Encoding
- where possible use conditional requests with If-None-Match or If-Modified-Since
- where possible honour Cache-Control or Expires and similar refresh hints such as RSS skipHours; help save bandwidth, CPU and climate
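Wiring the custom pages in is a one-liner per status (file paths illustrative, not the site's actual layout):

```apache
# Serve the small hand-written explanation pages for these rejections.
ErrorDocument 406 /406.html
ErrorDocument 429 /429.html
```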
Small irony: the new messages are each a couple of hundred bytes longer on the wire (less than 10%, given https overheads), especially given that compression is often not being supported! I am trimming them (and all noindex pages) a little. Almost none will be read by humans, so elegant prose is largely wasted!
Log-of-shame sample:
[16/Apr/2024:18:22:06 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 3506 "-" "iTMS"
[16/Apr/2024:18:22:06 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 319 "-" "iTMS"
[16/Apr/2024:18:30:28 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 3506 "-" "iTMS"
[16/Apr/2024:18:30:29 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 319 "-" "iTMS"
[16/Apr/2024:18:34:27 +0000] "GET /rss/podcast.rss HTTP/1.1" 406 5101 "-" "Podchaser (https://www.podchaser.com)"
[16/Apr/2024:18:42:10 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 3506 "-" "iTMS"
[16/Apr/2024:18:42:10 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 319 "-" "iTMS"
[16/Apr/2024:18:45:52 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 309 "-" "-"
[16/Apr/2024:18:45:52 +0000] "GET /rss/podcast.rss HTTP/1.0" 406 1875 "-" "-"
[16/Apr/2024:18:54:32 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 3506 "-" "iTMS"
[16/Apr/2024:18:54:32 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 319 "-" "iTMS"
[16/Apr/2024:19:08:05 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 3506 "-" "iTMS"
[16/Apr/2024:19:08:05 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 319 "-" "iTMS"
2024-04-13: If-None-Match
The Feeder podcast reader is paying attention to HTTP cache control, but although it is apparently using If-None-Match it is not seeing 304 results.
The Apache 2.4 mod_deflate DeflateAlterETag documentation points out that the new AddSuffix default "prevents serving 'HTTP Not Modified' (304) responses to conditional requests for compressed content".
This does not affect my pre-compressed Gzip and Brotli page responses, which correctly serve an ETag based on the actual file served, ie different for the uncompressed, Gzip and Brotli response variants.
I am trying to fix this by removing the unhelpful XXX-gzip ETag for these feed files. Header unset ETag is used because DeflateAlterETag Remove is unsupported in my server:
<Location /rss>
    Header unset ETag
</Location>
I have added the same Header unset ETag for stuff under /img since If-Modified-Since should be enough (no races possible) for immutable content. A slightly better workaround might be RequestHeader edit "If-None-Match" '^"((.*)-gzip)"$' '"$1", "$2"' to allow ETags to work again as intended.
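The effect of that RequestHeader edit can be demonstrated with sed, using a made-up ETag value (any real tag would be an opaque hash):

```shell
# The edit rewrites a "-gzip"-suffixed ETag into a two-entry list, so the
# conditional request can match either the suffixed or the plain variant.
echo '"abc123-gzip"' | sed -E 's/^"((.*)-gzip)"$/"\1", "\2"/'
# prints: "abc123-gzip", "abc123"
```
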
This is effectively an Apache 2.4 mod_deflate ETag bug, I think; the ETag should be modified for the compressed variant, but that modified tag should then be correctly matched in a subsequent conditional request.
(DeflateAlterETag Remove should be used rather than Header unset ETag, to avoid losing ETags where they may still be helpful, such as on audio and image files.)
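On a server whose mod_deflate actually implements that directive (mine does not, per above), the preferred fix would be a config fragment along these lines (a sketch, not my live configuration):

```
# Hypothetical: stop mod_deflate appending -gzip to ETags so that
# conditional requests keep matching; scoped here to the feed files.
<IfModule mod_deflate.c>
    <Location /rss>
        DeflateAlterETag Remove
    </Location>
</IfModule>
```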
This seems to have increased the number of 304s, and the variety of clients getting them, from a trailing sample:
[14/Apr/2024:05:16:40 +0000] "GET /rss/podcast.rss HTTP/2.0" 304 93 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36"
[14/Apr/2024:05:19:59 +0000] "GET /rss/podcast.rss HTTP/1.1" 304 222 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:124.0) Gecko/20100101 Firefox/124.0"
[14/Apr/2024:05:20:54 +0000] "GET /rss/podcast.rss HTTP/1.1" 304 3565 "-" "NRCAudioIndexer/1.1"
[14/Apr/2024:05:46:45 +0000] "GET /rss/podcast.rss HTTP/1.1" 304 223 "-" "PocketCasts/1.0 (Pocket Casts Feed Parser; +http://pocketcasts.com/)"
[14/Apr/2024:05:54:31 +0000] "GET /rss/podcast.rss HTTP/1.1" 304 3377 "-" "Overcast/1.0 Podcast Sync (3 subscribers; feed-id=2522513; +http://overcast.fm/)"
[14/Apr/2024:06:19:11 +0000] "GET /rss/podcast.rss HTTP/2.0" 304 93 "-" "Wget/1.21.3"
[14/Apr/2024:07:01:55 +0000] "GET /rss/podcast.rss HTTP/1.1" 304 223 "-" "PocketCasts/1.0 (Pocket Casts Feed Parser; +http://pocketcasts.com/)"
[14/Apr/2024:07:04:05 +0000] "GET /rss/podcast.rss HTTP/1.1" 304 3377 "-" "Overcast/1.0 Podcast Sync (3 subscribers; feed-id=2522513; +http://overcast.fm/)"
[14/Apr/2024:07:04:58 +0000] "GET /rss/podcast.rss HTTP/1.1" 304 167 "-" "Aggrivator (PodcastIndex.org)/v0.1.7"
[14/Apr/2024:07:07:49 +0000] "GET /rss/podcast.rss HTTP/1.1" 304 3565 "-" "NRCAudioIndexer/1.1"
...
[14/Apr/2024:08:42:11 +0000] "GET /rss/podcast.rss HTTP/1.1" 304 223 "-" "SpaceCowboys Android RSS Reader / 2.6.21(306)"
That last is possibly the first-ever 304
for SpaceCowboys / Feeder, which uses OkHttp.
2024-04-11: Moar Transcripts
I am making my way through the remaining missing WebVTT transcripts!
(The last three were hammered out the following morning, first thing...)
2024-04-09: Like and Subscribe Boilerplate
I have added standard like-and-subscribe (and "here are some podcast players") links to each normal desktop podcast page (as an aside). The same information is also added to the main podcast section page.
2024-04-04: Podcast Episode SQTNs
Since the Feeder podcast app seems as if it will show them, I have begun adding some square 'thumbnail' images to selected podcast episodes. They will be added to the RSS podcast feed as item (ie episode) itunes:images. They are probably not big enough to technically meet Apple's spec. I have made sure that there is at least a lo-fi .jpgL / .pngL version of each such image so that non-smart readers presenting no Referer will eat less bandwidth.
These will not be visible on the podcast pages.
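In the feed itself each such episode item should end up with roughly this shape (illustrative values only; the URLs, length and guid are placeholders, not a real episode):

```
<item>
  <title>Example episode</title>
  <enclosure url="https://www.earth.org.uk/audio/example.mp3"
             length="1234567" type="audio/mpeg"/>
  <guid isPermaLink="false">example-episode</guid>
  <!-- Hypothetical per-episode square thumbnail (SQTN): -->
  <itunes:image href="https://www.earth.org.uk/img/example-sqtn.png"/>
</item>
```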
Podcast episode text icons
I am creating a set of standard cover 'art' icons with a text-to-PNG converter: 400x400, horizontally and vertically centred, Helvetica 96px, black on white.
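Something like the following ImageMagick one-liner would produce such an icon (a sketch only; my actual converter differs, the "EOU" text and output filename are placeholders, and the Helvetica font name is platform-dependent):

```shell
# 400x400 white canvas, centred black 96px Helvetica text.
convert -size 400x400 canvas:white \
    -font Helvetica -pointsize 96 -fill black \
    -gravity center -annotate 0 "EOU" \
    sqtn-example.png
```
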
Transcripts on Apple Podcasts
The WebVTT transcripts that I have provided are visible in the macOS Podcasts application on my MacBook Air now.
They do not seem to do anything very useful, eg highlight the current text, but they are there.
"Automatically generated" transcripts seem to work too, though are completely blank for pure music, eg not even a [MUSIC]
!
I see that in one case the automated transcription cleverly linked up a spoken domain name, EOU in this case.
Update: finished the last of the 60 podcast transcripts.
2024-04-02: ORCID Byline
For those articles that I have flagged as 'research', an ORCID logo linked to my record is now being added to the by-line.
I have copied the appropriate small logo to the EOU site so as not to add load (or inadvertent tracking) to the main ORCID site.
The original does not seem to be efficiently compressed, though my copy now is, so there is a bunch more wasted bandwidth...
% zopflipng -m -m ~/Downloads/5008697/ORCID-iD_icon-16x16.png img/3rdParty/ORCID-iD_icon-16x16.png
Optimizing /Users/dhd/Downloads/5008697/ORCID-iD_icon-16x16.png
Input size: 1261 (1K)
Result size: 218 (0K). Percentage of original: 17.288%
Result is smaller
RSS work storage
I have adjusted the makefile to avoid rebuilding the RSS feed files if the 24h GB grid intensity is high/red because updated files may result in more Internet traffic (200s, not 304s). Parts of the Internet traffic near me use that GB grid power.
Also, the local power status has to be HIGH for most RSS feeds to be rebuilt, and not LOW for the podcast RSS feed file to be rebuilt.
% ls -al _gridCarbonIntensityGB.red.flag
0 Apr 2 05:31 _gridCarbonIntensityGB.red.flag
% make rss/*.built
make: Nothing to be done for 'rss/note-on-site-technicals.rss.built'.
make: Nothing to be done for 'rss/podcast.rss.built'.
make: Nothing to be done for 'rss/saving-electricity.rss.built'.
Today it has been red since 05:31Z (~6:30am), up until ~9pm so far. So this may need to be relaxed a little. The feed can easily be manually built with the script if need be.
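The gating itself amounts to a flag-file test in front of the rebuild; hypothetically, stripped of makefile machinery, it looks something like this (the flag file name is as listed above, and the rebuild command is a placeholder):

```shell
# Defer the RSS rebuild while GB grid carbon intensity is flagged red.
# _gridCarbonIntensityGB.red.flag is maintained by a separate process.
if [ -e _gridCarbonIntensityGB.red.flag ]; then
    echo "Grid is red: deferring RSS feed rebuild."
else
    echo "Rebuilding RSS feeds..."   # eg: make rss/podcast.rss.built
fi
```
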
I have applied similar build restrictions to other 'feed' files.
This is a form of work storage or deferral until better times.
(See previous work storage note.)