Earth Notes: On Website Technicals (2024-04)

Updated 2024-04-17.
Tech updates: ORCID, RSS feed work storage, podcast episode images, transcripts, like and subscribe, Apache 2.4 ETag bug, 406 and more 429...
I am struggling a bit with progressing my PhD currently, but now I have global RSS efficiency as a new side-quest to ensure that I remain appropriately distracted...

2024-04-17: CORS and ETag

I have slightly adjusted the Apache configuration to work with CORS, and drop ETag, for all .rss, .atom (and .xml and .vtt) files, rather than everything under /rss/. In particular this now includes /sitemap.atom and /sitemap.xml. I hope that this will improve cacheability (and other usability) of Atom feeds a little.

<IfModule mod_headers.c>
  <FilesMatch "\.(rss|vtt|atom|xml)$">
    Header set access-control-allow-origin *
    # Avoid Apache ETag / mod_deflate bug.
    Header unset ETag
    # DHD20240413: DeflateAlterETag is unsupported for sencha.
    #DeflateAlterETag Remove
  </FilesMatch>
</IfModule>

... and the first few 304s for /sitemap.atom have come through:

[17/Apr/2024:09:48:23 +0000] "GET /sitemap.atom HTTP/1.1" 304 3571 "-" "Feedbin feed-id:XXXX - 1 subscribers"
[17/Apr/2024:09:54:54 +0000] "GET /sitemap.atom HTTP/1.1" 304 3374 "-" "Mozilla/5.0 (compatible; theoldreader.com; 1 subscribers; feed-id=XXXX)"
[17/Apr/2024:10:10:07 +0000] "GET /sitemap.atom HTTP/1.1" 304 185 "-" "NewsBlur Feed Fetcher - 1 subscriber - https://www.newsblur.com/site/XXXX/earth-notes-basic-feed (\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0.1 Safari/605.1.15\")"
[17/Apr/2024:10:11:50 +0000] "GET /sitemap.atom HTTP/1.1" 304 185 "-" "Feedbin feed-id:XXXX - 1 subscribers"
[17/Apr/2024:10:18:35 +0000] "GET /sitemap.atom HTTP/1.1" 304 3571 "-" "Feedbin feed-id:XXXX - 1 subscribers"

(I also zapped a defunct ~ /sitemap.xml for the m-dot site; everything is handled in the main site sitemap now.)

2024-04-16: RSS Stats

I built a script to gather a standard set of RSS-related stats from the last ~week of EOU logs, and that data has been captured for later, mwhahahah!

This is what a run of it looks like:

INFO: /tmp/stats.out/interval.txt: 2024-04-07T06:25:14 to 2024-04-15T06:25:10 inclusive log data
INFO: hits: all 235499, site 115578, feed 9643
INFO: /tmp/stats.out/allHitsByHour.log: all hits by hour (UTC)...
9643 175368483 ALL
460 9125257 17
459 7685543 12
446 7737445 18
434 7820128 15
INFO: /tmp/stats.out/feedHits.log: RSS feed hits...
www.earth.org.uk:80 34.220.118.X - - [07/Apr/2024:06:27:16 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 11526 "-" "Amazon Music Podcast"
www.earth.org.uk:443 17.58.59.X - - [07/Apr/2024:06:27:47 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 200 3599 "-" "iTMS"
www.earth.org.uk:443 17.58.59.X - - [07/Apr/2024:06:27:47 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 200 412 "-" "iTMS"
www.earth.org.uk:443 17.58.59.X - - [07/Apr/2024:06:27:47 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 85016 "-" "iTMS"
www.earth.org.uk:443 104.237.137.X - - [07/Apr/2024:06:29:17 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 14780 "-" "Overcast/1.0 Podcast Sync (3 subscribers; feed-id=XXXXXXX; +http://overcast.fm/)"
INFO: /tmp/stats.out/feedHitsByUA.log: feed hits by UA...
9643 175368483 ALL
2806 34128111 "Amazon Music Podcast"
2401 73456937 "iTMS"
542 6380012 "Podbean/FeedUpdate 2.1"
483 8501504 "-"
INFO: /tmp/stats.out/feedHitsByHour.log: feed hits by hour (UTC)...
9643 175368483 ALL
460 9125257 17
459 7685543 12
446 7737445 18
434 7820128 15
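
As a rough illustration of how a by-UA summary like the one above can be produced, here is a minimal Python sketch. It is not the actual EOU stats script (which is not shown here): the function name feed_hits_by_ua and the regular expression are assumptions based only on the sample log lines above.

```python
import re
from collections import defaultdict

# Assumed log layout, inferred from the sample lines above:
# vhost:port ip - - [timestamp] "METHOD path PROTO" status bytes "referer" "UA"
LOG_RE = re.compile(
    r'^\S+ \S+ \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<ua>.*)"$'
)

def feed_hits_by_ua(lines, prefix="/rss/"):
    """Return [(hits, bytes, ua), ...] for feed requests, busiest first.

    An extra "ALL" row carries the totals, as in the output above.
    """
    totals = defaultdict(lambda: [0, 0])
    for line in lines:
        m = LOG_RE.match(line.strip())
        if not m or not m.group("path").startswith(prefix):
            continue  # not parseable, or not a feed request
        size = 0 if m.group("bytes") == "-" else int(m.group("bytes"))
        for key in ("ALL", '"%s"' % m.group("ua")):
            totals[key][0] += 1
            totals[key][1] += size
    return sorted(((h, b, ua) for ua, (h, b) in totals.items()), reverse=True)
```

Each returned row can be printed as "hits bytes UA" to match the layout of feedHitsByUA.log.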

2024-04-15: Greenlink support

I added support for the coming INTGRNL 'Greenlink' Irish interconnector ready for go-live on 2024-08-01.

2024-04-14: Podcasting 2.0, TTL, 406

I have added a little more Podcasting 2.0 metadata to my RSS feeds. The non-podcast feeds now include these channel tags:

<podcast:medium>blog</podcast:medium>
<podcast:location geo="geo:51.406696,-0.288789,16">16WW, Kingston-upon-Thames, UK</podcast:location>
<podcast:podroll><podcast:remoteItem feedGuid="02b2185f-3173-5e6f-bdda-cc60fb797f84"/></podcast:podroll>
<podcast:updateFrequency rrule="FREQ=MONTHLY">monthly</podcast:updateFrequency>

That is: medium, location, podroll, updateFrequency.

The main podcast RSS is not using medium or podroll, but is already using item tags transcript and alternateEnclosure.

TTL

I have pushed up all the RSS feed TTL values to a little over 3 days (4327 minutes).
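
A quick check of that arithmetic, as a minimal Python sketch (the helper name ttl_span is hypothetical; the RSS ttl value is in minutes):

```python
from datetime import timedelta

def ttl_span(minutes: int) -> timedelta:
    """Interpret an RSS <ttl> value (in minutes) as a time span."""
    return timedelta(minutes=minutes)

# 3 days is 4320 minutes, so 4327 minutes is 3 days plus 7 minutes.
print(ttl_span(4327))  # 3 days, 0:07:00
```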

406 Not Acceptable

In an attempt to push back on some of the more badly-behaved bots, I have added to the 'overnight' Apache configuration block covering skipHours:

# Reject (bot) attempts to unconditionally fetch without compression.
# 406 Unacceptable.
RewriteCond %{HTTP_REFERER} ^$
RewriteCond %{HTTP:If-Modified-Since} ^$ [NV]
RewriteCond %{HTTP:If-None-Match} ^$ [NV]
RewriteCond %{HTTP:Accept-Encoding} ^$
RewriteRule "^/rss/.*\.rss$" - [L,R=406]

This is saying that an empty/missing Accept-Encoding, eg precluding ~7x bandwidth reduction through gzip compression, is not reasonable.
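
To get a feel for what gzip can save, a compression ratio can be measured along these lines. This is only a sketch: gzip_ratio is a hypothetical helper, and the sample below is a synthetic stand-in rather than a real EOU feed, so its ratio will not match the ~7x figure for the actual feeds.

```python
import gzip

def gzip_ratio(data: bytes, level: int = 6) -> float:
    """Return uncompressed size divided by gzip-compressed size."""
    return len(data) / len(gzip.compress(data, compresslevel=level))

# Synthetic stand-in for a feed body: repetitive XML compresses well.
sample = b"".join(
    b"<item><title>Episode %d</title><link>https://example.org/%d</link></item>\n" % (i, i)
    for i in range(200)
)
print("gzip ratio: %.1fx" % gzip_ratio(sample))
```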

Trying it out in daylight yielded these early 406s (yes, the last two are from Apple...):

[14/Apr/2024:13:44:31 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 148 "-" "-"
[14/Apr/2024:13:44:31 +0000] "GET /rss/podcast.rss HTTP/1.0" 406 418 "-" "-"
[14/Apr/2024:13:53:13 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 3311 "-" "iTMS"
[14/Apr/2024:13:53:13 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 158 "-" "iTMS"

Everything is now in place for when the weekly logs roll tomorrow morning!

The defensive Apache config for RSS is now:

# Allow CORS to work for RSS feeds and transcripts.
# This allows browsers to access them from non-EOU pages.
<IfModule mod_headers.c>
  <FilesMatch "\.(rss|vtt)$">
    Header set access-control-allow-origin *
  </FilesMatch>
</IfModule>
# Help conditional requests work by removing the unhelpful XXX-gzip ETag.
# https://httpd.apache.org/docs/current/mod/mod_deflate.html#deflatealteretag
<Location /rss>
    Header unset ETag
    # DHD20240413: DeflateAlterETag is unsupported for sencha.
    #DeflateAlterETag Remove
</Location>
<If "%{TIME_HOUR} -lt 8 || %{TIME_HOUR} -gt 21">
    # Give podcast RSS and similar feeds longer expiry out of work hours.
    ExpiresByType application/rss+xml "access plus 7 hours 7 minutes"
    #
    # Reject (bot) attempts to unconditionally fetch without compression.
    # 406 Unacceptable.
    RewriteCond %{HTTP_REFERER} ^$
    RewriteCond %{HTTP:If-Modified-Since} ^$ [NV]
    RewriteCond %{HTTP:If-None-Match} ^$ [NV]
    RewriteCond %{HTTP:Accept-Encoding} ^$
    RewriteRule "^/rss/.*\.rss$" - [L,R=406]
    #
    # For RSS files (which will have skipHours matching the above),
    # if there is no Referer and no conditional fetching, back off
    # when battery is low.
    # 429 Too Many Requests
    RewriteCond %{HTTP_REFERER} ^$
    RewriteCond %{HTTP:If-Modified-Since} ^$ [NV]
    RewriteCond %{HTTP:If-None-Match} ^$ [NV]
    RewriteCond /run/EXTERNAL_BATTERY_LOW.flag -f
    RewriteRule "^/rss/.*\.rss$" - [L,R=429,E=RSS_RATE_LIMIT:1]
    Header always set Retry-After "25620" env=RSS_RATE_LIMIT
</If>
<Else>
    # Give podcast RSS and similar feeds an expiry time of ~4h.
    ExpiresByType application/rss+xml "access plus 4 hours 7 minutes"
</Else>

428 Precondition Required is an alternative plausible status in place of 406 or 429, though any client has to be able to make the first fetch and, at least occasionally, an unconditional fetch.

2024-04-15: no go

Oh dear, that did not seem to be generating 406s at ~05:00Z. Reformulated:

# Allow CORS to work for RSS feeds and transcripts.
# This allows browsers to access them from non-EOU pages.
<IfModule mod_headers.c>
  <FilesMatch "\.(rss|vtt)$">
    Header set access-control-allow-origin *
  </FilesMatch>
</IfModule>
# Help conditional requests work by removing the unhelpful XXX-gzip ETag.
# https://httpd.apache.org/docs/current/mod/mod_deflate.html#deflatealteretag
<Location /rss>
    Header unset ETag
    # DHD20240413: DeflateAlterETag is unsupported for sencha.
    #DeflateAlterETag Remove
</Location>
# Reject (bot) attempts to unconditionally fetch without compression.
# 406 Unacceptable.
RewriteCond "%{TIME_HOUR}" "<08" [OR]
RewriteCond "%{TIME_HOUR}" ">21"
RewriteCond %{HTTP_REFERER} ^$
RewriteCond %{HTTP:If-Modified-Since} ^$ [NV]
RewriteCond %{HTTP:If-None-Match} ^$ [NV]
#RewriteCond %{HTTP:Accept-Encoding} ^$ [OR]
RewriteCond %{HTTP:Accept-Encoding} !gzip
RewriteRule "^/rss/.*\.rss$" - [L,R=406]
#
# For RSS files (which will have skipHours matching the above),
# if there is no Referer and no conditional fetching, back off
# when battery is low.
# 429 Too Many Requests
RewriteCond "%{TIME_HOUR}" "<08" [OR]
RewriteCond "%{TIME_HOUR}" ">21"
RewriteCond %{HTTP_REFERER} ^$
RewriteCond %{HTTP:If-Modified-Since} ^$ [NV]
RewriteCond %{HTTP:If-None-Match} ^$ [NV]
RewriteCond /run/EXTERNAL_BATTERY_LOW.flag -f
RewriteRule "^/rss/.*\.rss$" - [L,R=429,E=RSS_RATE_LIMIT:1]
Header always set Retry-After "25620" env=RSS_RATE_LIMIT
<If "%{TIME_HOUR} -lt 8 || %{TIME_HOUR} -gt 21">
# Give podcast RSS and similar feeds longer expiry out of work hours.
ExpiresByType application/rss+xml "access plus 7 hours 7 minutes"
</If>
<Else>
# Give podcast RSS and similar feeds an expiry time of ~4h.
ExpiresByType application/rss+xml "access plus 4 hours 7 minutes"
</Else>

For the 406 case I now reject a lack of gzip support, not just an empty/missing Accept-Encoding header.
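
The logic of the reformulated 406 rule can be modelled roughly as below. This is a Python sketch for reasoning about which requests get rejected, not a substitute for mod_rewrite: the function name rejects_with_406 is hypothetical, and it assumes that the [OR]-joined hour conditions are grouped together and then ANDed with the rest.

```python
import re

def rejects_with_406(hour, headers, path="/rss/podcast.rss"):
    """Approximate model of the reformulated 406 rule above.

    hour is the server-local hour (0-23); headers is a dict of
    request headers, looked up case-insensitively.
    """
    h = {k.lower(): v for k, v in headers.items()}
    hh = "%02d" % hour                   # TIME_HOUR is a two-digit string
    overnight = hh < "08" or hh > "21"   # lexical comparison, as in RewriteCond
    no_referer = not h.get("referer")
    unconditional = not h.get("if-modified-since") and not h.get("if-none-match")
    no_gzip = "gzip" not in h.get("accept-encoding", "")
    is_feed = re.match(r"^/rss/.*\.rss$", path) is not None
    return is_feed and overnight and no_referer and unconditional and no_gzip
```

For example, rejects_with_406(6, {"User-Agent": "iTMS"}) is True, while the same request with Accept-Encoding: gzip, or made at 12:00, is let through.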

Sample rejections (which stopped by 08:00Z as intended):

[15/Apr/2024:06:56:47 +0000] "GET /rss/podcast.rss HTTP/1.1" 406 428 "-" "Go-http-client/1.1"
[15/Apr/2024:06:56:47 +0000] "GET /rss/podcast.rss HTTP/1.1" 406 428 "-" "Go-http-client/1.1"
[15/Apr/2024:07:02:25 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 3311 "-" "iTMS"
[15/Apr/2024:07:02:25 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 158 "-" "iTMS"
[15/Apr/2024:07:10:45 +0000] "GET /rss/podcast.rss HTTP/1.1" 406 584 "-" "taddy.org/developers 1.0"
[15/Apr/2024:07:11:26 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 3311 "-" "iTMS"
[15/Apr/2024:07:11:26 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 158 "-" "iTMS"

A small tweak to the 406 part will reject non-compressed fetches when the GB grid has high intensity compared to the last week, since the Internet upstream of me is at least in part GB-grid powered.

-RewriteCond "%{TIME_HOUR}" ">21"
+RewriteCond "%{TIME_HOUR}" ">21" [OR]
+RewriteCond %{DOCUMENT_ROOT}/_gridCarbonIntensityGB.7d.red.flag -f

Another line could reject non-compressed fetches if local battery was low, though doing compression may cost more CPU and battery than encrypting the longer non-compressed response, if I do not pre-compress them.

Providing pre-compressed Brotli RSS feed versions might (from a quick test) save ~20% bandwidth for unconditional transfers, and for when there is a feed change. But cutting the number of unconditional polls would save much more bandwidth. (Note that any byte saving is diminished by https overheads.)

I estimate that ~50% of 'bad' unconditional requests without compression support will be rejected with 406s.

More 429

For the 429 case I have added an "if GB grid intensity is high" or-ed with the existing "if battery is low" clause.

 RewriteCond %{HTTP_REFERER} ^$
 RewriteCond %{HTTP:If-Modified-Since} ^$ [NV]
 RewriteCond %{HTTP:If-None-Match} ^$ [NV]
+# Have any interaction with the filesystem as late as possible.
+RewriteCond %{DOCUMENT_ROOT}/_gridCarbonIntensityGB.7d.red.flag -f [OR]
 RewriteCond /run/EXTERNAL_BATTERY_LOW.flag -f
 RewriteRule "^/rss/.*\.rss$" - [L,R=429,E=RSS_RATE_LIMIT:1]
 Header always set Retry-After "25620" env=RSS_RATE_LIMIT

So if during skipHours an unconditional feed request is made and either of those is the case, the client will now get a 429. So Amazon, Apple, PodBean, and Deezer will be getting more 429s in their futures. Let us see if my feed is dropped, I receive a complaint, or an intrigued engineer works out what is going on and improves things for all parties. I would like the last, but do not hold out too much hope!
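
In the same spirit, here is a rough Python model of the 429 decision for a feed request, with the flag-file checks done last as in the config. The function name rate_limit_response and its flag_paths parameter are hypothetical, and the URL path match is omitted for brevity.

```python
import os

RETRY_AFTER_S = 25620  # 7h07m, as in the Retry-After header above

def rate_limit_response(hour, headers, flag_paths):
    """Return (429, extra_headers) if the modelled rule would fire, else None."""
    h = {k.lower(): v for k, v in headers.items()}
    hh = "%02d" % hour
    if not (hh < "08" or hh > "21"):
        return None  # outside skipHours
    if h.get("referer") or h.get("if-modified-since") or h.get("if-none-match"):
        return None  # has a Referer, or is a conditional request
    # Touch the filesystem as late as possible, mirroring the config comment.
    if not any(os.path.isfile(p) for p in flag_paths):
        return None  # neither grid-intensity nor battery-low flag is set
    return 429, {"Retry-After": str(RETRY_AFTER_S)}
```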

2024-04-16: 406 and 429 custom error pages

(This evening, now that GB grid intensity is relatively high vs the last 7 days, my server is starting to reject some of the clownishly-bad RSS feed polling, eg by iTunes: ~1000x too often, ignoring Cache-Control, with no If-None-Match, no If-Modified-Since, and no Accept-Encoding to allow a gzip ~7x bytes saving. Come on Apple, you can engineer better than this!)

To try and give that intrigued engineer a clue, I have added custom error pages for 406 and 429, with helpful pointers. I may have to update these as and when I update my defences...

Here is the current 406 text:

406: Not Acceptable

Bad request Accept headers

Please:

  • allow at least gzip compression in Accept-Encoding
  • where possible use conditional requests with If-None-Match or If-Modified-Since
  • where possible honour Cache-Control or Expires and similar refresh hints such as RSS skipHours; help save bandwidth, CPU and climate
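
From the client side, the first two requests in that list amount to something like the following. This is a minimal Python sketch with a hypothetical helper name, polite_feed_headers; a real poller would also store the validators from each response and respect the cache lifetime before polling again.

```python
def polite_feed_headers(etag=None, last_modified=None):
    """Build request headers for a feed poll.

    etag and last_modified are the ETag and Last-Modified values
    remembered from the previous successful fetch, if any.
    """
    headers = {"Accept-Encoding": "gzip"}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers
```

The very first fetch has no validators to send, so it carries only Accept-Encoding; every later poll should be conditional.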

Small irony: the new messages are each a couple of hundred bytes longer on the wire (less than 10%, given https overheads), which matters all the more given that compression is often not being supported! I am trimming them (and all noindex pages) a little. Almost none will be read by humans, so elegant prose is almost entirely wasted!

Log-of-shame sample:

[16/Apr/2024:18:22:06 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 3506 "-" "iTMS"
[16/Apr/2024:18:22:06 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 319 "-" "iTMS"
[16/Apr/2024:18:30:28 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 3506 "-" "iTMS"
[16/Apr/2024:18:30:29 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 319 "-" "iTMS"
[16/Apr/2024:18:34:27 +0000] "GET /rss/podcast.rss HTTP/1.1" 406 5101 "-" "Podchaser (https://www.podchaser.com)"
[16/Apr/2024:18:42:10 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 3506 "-" "iTMS"
[16/Apr/2024:18:42:10 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 319 "-" "iTMS"
[16/Apr/2024:18:45:52 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 309 "-" "-"
[16/Apr/2024:18:45:52 +0000] "GET /rss/podcast.rss HTTP/1.0" 406 1875 "-" "-"
[16/Apr/2024:18:54:32 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 3506 "-" "iTMS"
[16/Apr/2024:18:54:32 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 319 "-" "iTMS"
[16/Apr/2024:19:08:05 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 3506 "-" "iTMS"
[16/Apr/2024:19:08:05 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 319 "-" "iTMS"

2024-04-13: If-None-Match

The Feeder podcast reader is paying attention to HTTP cache control, but although it is apparently using If-None-Match it is not seeing 304 results.

The Apache 2.4 mod_deflate DeflateAlterETag documentation points out that the new AddSuffix default prevents serving "HTTP Not Modified" (304) responses to conditional requests for compressed content.

This does not affect my pre-compressed Gzip and Brotli page responses which correctly serve an ETag based on the actual file served, ie different for the uncompressed, Gzip and Brotli response variants.

I am trying to fix this by removing the unhelpful XXX-gzip ETag for these feed files. Header unset ETag is used because DeflateAlterETag Remove is unsupported in my server.

<Location /rss>
    Header unset ETag
</Location>

I have added the same Header unset ETag for stuff under /img since If-Modified-Since should be enough (no races possible) for immutable content. A slightly better workaround might be RequestHeader edit "If-None-Match" '^"((.*)-gzip)"$' '"$1", "$2"' to allow ETags to work again as intended.
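
To see what that RequestHeader edit expression would do to an incoming If-None-Match value, the same regular expression can be tried in Python. This is only an illustration of the pattern (with Apache's $1 and $2 written as \1 and \2); the function name widen_if_none_match and the sample tag values are made up.

```python
import re

def widen_if_none_match(value):
    """Mimic: RequestHeader edit "If-None-Match" '^"((.*)-gzip)"$' '"$1", "$2"'."""
    return re.sub(r'^"((.*)-gzip)"$', r'"\1", "\2"', value)

print(widen_if_none_match('"abc123-gzip"'))  # "abc123-gzip", "abc123"
print(widen_if_none_match('"abc123"'))       # "abc123" (unchanged)
```

So a client echoing back the suffixed tag would be offering both the suffixed and the original tag, and the original one can then match.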

This is effectively an Apache 2.4 mod_deflate ETag bug I think; the ETag should be modified for the compressed variant, but that modified tag should be correctly matched for a subsequent conditional request.

(DeflateAlterETag Remove should be used rather than Header unset ETag where it is supported, to avoid losing ETags where they may still be helpful, such as on audio and image files.)

This seems to have increased the number of 304s, and the variety of clients getting them, from a trailing sample:

[14/Apr/2024:05:16:40 +0000] "GET /rss/podcast.rss HTTP/2.0" 304 93 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36"
[14/Apr/2024:05:19:59 +0000] "GET /rss/podcast.rss HTTP/1.1" 304 222 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:124.0) Gecko/20100101 Firefox/124.0"
[14/Apr/2024:05:20:54 +0000] "GET /rss/podcast.rss HTTP/1.1" 304 3565 "-" "NRCAudioIndexer/1.1"
[14/Apr/2024:05:46:45 +0000] "GET /rss/podcast.rss HTTP/1.1" 304 223 "-" "PocketCasts/1.0 (Pocket Casts Feed Parser; +http://pocketcasts.com/)"
[14/Apr/2024:05:54:31 +0000] "GET /rss/podcast.rss HTTP/1.1" 304 3377 "-" "Overcast/1.0 Podcast Sync (3 subscribers; feed-id=2522513; +http://overcast.fm/)"
[14/Apr/2024:06:19:11 +0000] "GET /rss/podcast.rss HTTP/2.0" 304 93 "-" "Wget/1.21.3"
[14/Apr/2024:07:01:55 +0000] "GET /rss/podcast.rss HTTP/1.1" 304 223 "-" "PocketCasts/1.0 (Pocket Casts Feed Parser; +http://pocketcasts.com/)"
[14/Apr/2024:07:04:05 +0000] "GET /rss/podcast.rss HTTP/1.1" 304 3377 "-" "Overcast/1.0 Podcast Sync (3 subscribers; feed-id=2522513; +http://overcast.fm/)"
[14/Apr/2024:07:04:58 +0000] "GET /rss/podcast.rss HTTP/1.1" 304 167 "-" "Aggrivator (PodcastIndex.org)/v0.1.7"
[14/Apr/2024:07:07:49 +0000] "GET /rss/podcast.rss HTTP/1.1" 304 3565 "-" "NRCAudioIndexer/1.1"
...
[14/Apr/2024:08:42:11 +0000] "GET /rss/podcast.rss HTTP/1.1" 304 223 "-" "SpaceCowboys Android RSS Reader / 2.6.21(306)"

That last is possibly the first-ever 304 for SpaceCowboys / Feeder, which uses OkHttp.

2024-04-11: Moar Transcripts

I am making my way through the remaining missing WebVTT transcripts!

(The last three were hammered out the following morning, first thing...)

2024-04-09: Like and Subscribe Boilerplate

I have added standard like-and-subscribe (and "here are some podcast players") links to each normal desktop podcast page (as an aside). The same information is added to the main podcast section page also.

2024-04-04: Podcast Episode SQTNs

Since the Feeder podcast app seems as if it will show them, I have begun adding some square 'thumbnail' images to selected podcast episodes. They will be added to the RSS podcast feed as item (ie episode) itunes:images. They are probably not big enough to technically meet Apple's spec. I have made sure that there is at least a lo-fi .jpgL / .pngL version of each such image so that non-smart readers presenting no Referer will eat less bandwidth.

These will not be visible on the podcast pages.

Podcast episode text icons

I am creating a set of standard cover 'art' icons with a text-to-PNG converter: 400x400, text horizontally and vertically centred, Helvetica 96px, black on white.

Transcripts on Apple Podcasts

The WebVTT transcripts that I have provided are visible in the macOS Podcasts application on my MacBook Air now.

They do not seem to do anything very useful, eg highlight the current text, but they are there.

"Automatically generated" transcripts seem to work too, though are completely blank for pure music, eg not even a [MUSIC]!

I see that in one case the automated transcription cleverly linked up a spoken domain name, EOU in this case.

2024-04-02: ORCID Byline

For those articles that I have flagged as 'research', an ORCID logo linked to my record is now being added to the by-line.

I have copied the appropriate small logo to the EOU site so as not to add load (or inadvertent tracking) to the main ORCID site.

The original does not seem to be efficiently compressed, though my copy now is, so there is a bunch more wasted bandwidth...

% zopflipng -m -m ~/Downloads/5008697/ORCID-iD_icon-16x16.png img/3rdParty/ORCID-iD_icon-16x16.png
Optimizing /Users/dhd/Downloads/5008697/ORCID-iD_icon-16x16.png
Input size: 1261 (1K)
Result size: 218 (0K). Percentage of original: 17.288%
Result is smaller

RSS work storage

I have adjusted the makefile to avoid rebuilding the RSS feed files if the 24h GB grid intensity is high/red because updated files may result in more Internet traffic (200s, not 304s). Parts of the Internet traffic near me use that GB grid power.

Also the local power status has to be HIGH for most RSS feeds to be rebuilt, and not LOW for the podcast RSS feed file to be.

% ls -al _gridCarbonIntensityGB.red.flag
0 Apr  2 05:31 _gridCarbonIntensityGB.red.flag
% make rss/*.built
make: Nothing to be done for 'rss/note-on-site-technicals.rss.built'.
make: Nothing to be done for 'rss/podcast.rss.built'.
make: Nothing to be done for 'rss/saving-electricity.rss.built'.

Today it has been red since 05:31Z (~6:30am), up until ~9pm so far. So this may need to be relaxed a little. The feed can easily be manually built with the script if need be.

I have applied similar build restrictions to other 'feed' files.

This is a form of work storage or deferral until better times.
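
The rebuild gating described above can be summarised as a small decision function. This is a Python sketch of the stated rules only, not the makefile logic itself: the function name may_rebuild_feed is hypothetical, and any power status other than HIGH or LOW (written "OK" here) is an assumed intermediate value.

```python
def may_rebuild_feed(is_podcast_feed, grid_red, power_status):
    """Return True if a feed rebuild should go ahead now rather than be deferred."""
    if grid_red:
        return False                   # GB grid intensity high: defer the work
    if is_podcast_feed:
        return power_status != "LOW"   # podcast feed: anything but LOW will do
    return power_status == "HIGH"      # other feeds: only when power is HIGH
```

With the red flag present, as in the transcript above, every feed is left alone regardless of local power status.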

(See previous work storage note.)