Earth Notes: On Website Technicals (2017/08)

Tech updates: Atom sitemaps (un)pending, Googlebot bandwidth, HTML improvements, big beast hunting with regex, heroes, Cache-Control, restart drill, minifying.

2017/08/29: Sitemap Stealth and Protocol Relative

Just because I can, I am performing a little experiment with sitemaps. I'll disclose results when I have any!

Also, I'm doing a little cleanup today. Nearly all manually-written absolute links, ie those that need to work between the www and m versions of the site (such as for static images referred to by the mobile site without redirection), are being made protocol-relative, ie starting with //. This should save a few bytes per page now, and make any transition to, or parallel support of, https a little less painful.
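As a sketch of the sort of rewrite involved (the domain here is a hypothetical stand-in for the real site's hostnames):

```shell
# Rewrite absolute http:// links on the site's own hosts to
# protocol-relative form, so the same markup works over http and https.
echo '<img src="http://www.example.org/img/solar.png">' |
  sed -e 's|http://www\.example\.org/|//www.example.org/|g' \
      -e 's|http://m\.example\.org/|//m.example.org/|g'
# -> <img src="//www.example.org/img/solar.png">
```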

Not quite all in fact, specifically not those in HTML head link and meta tags and in JSON-LD markup in particular.

(A few were absolute when they could be entirely relative: now fixed...)

Also I have cleaned up a couple of references to immutable content (mainly images) under /img/ that were not using the correct static-site URL, and were thus forcing a redirect for mobile users. Even though I'd been careful, these included a couple of in-line img URLs, thus delaying page display.

2017/08/28: Last Updated

I find it very frustrating when informational documents are undated, from scientific papers to blog pages; lots of facts need to be set in date context. Partly to keep Googlebot happy I already put the first publication date and last update in the footer of each page, but I have just added the last update date to the floating nav bar for desktop as well, indicated with a unicode 'pencil' symbol (✎).

2017/08/25: HTML vs XHTML, Node.js, HTMLMinifier

I would like to keep the raw hand-written page source HTML for this site in the more restrictive and tag-semantics-agnostic XHTML format where possible, eg li opening tags balanced with closing ones, modulo small tweaks such as removing a few attribute quotes where those bytes may make a significant difference. Why? Future-proofing for changes in HTML, and for the possibility of quick syntax checking (eg for unbalanced/missing tags) with a simple plain XML parser.

However, the output is definitely intended for HTML5 browser consumption, ie is not XHTML, so the automatically-generated wrapper output need not do things the XHTML way. In particular dropping the trailing slash on auto-generated void tags such as meta should save a few bytes, and can always be redone for a different (XHTML) consumer in essentially one place, rather than all over tens or hundreds of manually-maintained HTML source files.
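As a crude illustration of the saving (a sketch only: a naive pass like this could mangle a literal '/>' inside scripted content, which a real minifier is careful to avoid):

```shell
# Drop XHTML-style trailing slashes from void tags; HTML5 parsers
# treat <br> and <br /> identically, so the extra bytes are pure overhead.
echo '<meta charset="utf-8" /><br /><hr/>' | sed 's| */>|>|g'
# -> <meta charset="utf-8"><br><hr>
```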

Use of an aggressive (but well-tested) HTML minifier such as the Kangax HTMLMinifier where available can automatically and consistently post-process the extra bytes out of the hand-written HTML as it is wrapped. Such a minifier may also be able to make other changes that may further help a compressor such as gzip by sorting attributes and classes.

Here are sample 'before' file sizes (gz versions c/o zopfli):

18255 index.html
 7275 index.htmlgz
15595 m/index.html
 6327 m/index.htmlgz
72086 m/note-on-Raspberry-Pi-setup.html
26326 m/note-on-Raspberry-Pi-setup.htmlgz
73370 note-on-Raspberry-Pi-setup.html
26819 note-on-Raspberry-Pi-setup.htmlgz

After adjusting the wrappers to be fairly minimal HTML (not XHTML):

18224 index.html
 7268 index.htmlgz
15568 m/index.html
 6320 m/index.htmlgz
72062 m/note-on-Raspberry-Pi-setup.html
26318 m/note-on-Raspberry-Pi-setup.htmlgz
73342 note-on-Raspberry-Pi-setup.html
26808 note-on-Raspberry-Pi-setup.htmlgz

So not a huge difference, but not a negative one, and seems intellectually clear and consistent!

Node.js / Npm

To test the Kangax HTMLMinifier I am trying it off-line on my Mac first, starting by installing node.js and npm as described in Installing Node.js Tutorial: macOS, then installing HTMLMinifier itself for command-line use with npm install html-minifier -g.

My current RPi Raspbian seems only to support an ancient node.js/npm that does not support HTMLMinifier, but maybe I can resolve that later, if the savings (especially after zopfli) seem worthwhile.

Using a set of options that should work well for this site's typical raw and wrapper HTML shows a ~3% saving in uncompressed HTML size, nearly as great as zopfli's saving over gzip for the compressed HTML:

% wc -c index.html
   18224 index.html
% /usr/local/bin/html-minifier --html5 --collapse-whitespace --minify-css --remove-attribute-quotes --remove-optional-tags --remove-redundant-attributes --remove-script-type-attributes --remove-style-link-type-attributes --sort-class-name index.html | wc -c
   17700

(I have played with tuning the flags in the script that does the site minifying a little more since then.)

However, the minifier and zopfli are eliminating much of the same redundancy, so the effects cannot be expected to be cumulative in general, but here the compressed output is ~2% smaller also. Win-win!

18224 index.html
 7268 index.htmlgz
17700 index.min.html
 7094 index.min.htmlgz
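The percentages quoted can be checked directly from the byte counts above:

```shell
# Savings from minification, uncompressed and after zopfli compression:
awk 'BEGIN {
  printf "raw:  %.1f%%\n", 100 * (18224 - 17700) / 18224;
  printf "gzip: %.1f%%\n", 100 * (7268 - 7094) / 7268;
}'
# -> raw:  2.9%
#    gzip: 2.4%
```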

Node 6.x on RPi 2?

The version of node that I now have on my Mac is v6.11.2.

From looking at Beginner’s Guide to Installing Node.js on a Raspberry Pi - Install Node.js, I suspect that this may be good for v6.x in principle:

curl -sL | sudo -E bash -
sudo apt install nodejs
sudo apt-get install npm
sudo npm install npm

Piping stuff straight from the network to a root shell is unwise, so it is better to download it, see what it's up to, and edit it down if need be, before letting it loose. Anyhow, the setup_6.x file does seem to be available.
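A sketch of the safer pattern, using a local file as a stand-in for the script's (elided) real URL:

```shell
# Fetch the installer to a file and read it before running anything.
printf '#!/bin/sh\necho pretend-installer\n' > /tmp/setup_6.x  # stand-in
SETUP_URL="file:///tmp/setup_6.x"
curl -fsSL "$SETUP_URL" -o setup.sh
sed -n '1,40p' setup.sh     # skim the start; edit it down if need be
# ...and only once happy:  sudo -E bash setup.sh
```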

And the sudo apt install nodejs above seems not to put node on the path, and to still have fetched an ancient version of nodejs (v0.6.19).

The installation suggested at Setup Node.js on Raspberry Pi 2 B looked a bit more hopeful to me, so after the nth go of:

apt-get purge nodejs npm

I tried his formula:

sudo dpkg -i node_latest_armhf.deb

But when I tried to run node -v to check the version installed I got complaints about missing dependencies:

% node -v
node: /usr/lib/arm-linux-gnueabihf/ version `GLIBCXX_3.4.20' not found (required by node)
node: /lib/arm-linux-gnueabihf/ version `GLIBC_2.16' not found (required by node)

so I cleaned up with:

sudo dpkg -r node

Examining the page behind the download shows that it is basically now forcing an upgrade of the entire OS from wheezy to jessie, which is somewhat heavy-handed...

However, the installed node does work and shows version v0.12.6, which may be an advance, though still six major versions behind, so I removed it again.

Success: Download

I revisited the Download page, where my Mac package had come from, and selected the Linux ARMv7 binary.

I manually copied the contents of the supplied lib directory under /usr/local/lib and copied node into place under /usr/local/bin, and linked npm to ../lib/node_modules/npm/bin/npm-cli.js. I patched up permissions manually. I left share alone for the time being.

Running node -v returns a happy v6.11.2.
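A sketch of that layout, using a scratch prefix and a stand-in binary (for the real install, the prefix is /usr/local, the files come from the ARMv7 tarball, and the copies need sudo):

```shell
PREFIX=/tmp/nodetest
mkdir -p "$PREFIX/bin" "$PREFIX/lib/node_modules/npm/bin"
# Stand-in for the real node binary copied from the tarball's bin/:
printf '#!/bin/sh\necho v6.11.2\n' > "$PREFIX/bin/node"
chmod 755 "$PREFIX/bin/node"
# npm is just a symlink to the CLI script shipped under lib/:
printf '// npm-cli.js placeholder\n' > "$PREFIX/lib/node_modules/npm/bin/npm-cli.js"
ln -sf ../lib/node_modules/npm/bin/npm-cli.js "$PREFIX/bin/npm"
"$PREFIX/bin/node" -v
# -> v6.11.2
```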

Attempting to get npm to update itself with npm update npm -g fails with "not a package" which is fair enough.

After cleaning up a bad value I'd set for the NPM repo in its registry:

% sudo npm set registry
% sudo npm config set registry

I then was able to install html-minifier (version 3.5.3), hurrah!

% sudo npm install html-minifier -g

Mobile pages are now minified after being wrapped. Looking again at a sample of page sizes (recalling that the mobile html and htmlgz files above were 72062 and 26318 bytes respectively):

69820 m/note-on-Raspberry-Pi-setup.html
25811 m/note-on-Raspberry-Pi-setup.htmlgz
73436 note-on-Raspberry-Pi-setup.html
26831 note-on-Raspberry-Pi-setup.htmlgz

Thus a couple of percent off the compressed size, as most clients will fetch, and more (~3%) off the uncompressed size.

Minifying a page takes at least a couple of seconds on the RPi, so I'm now spending several seconds per page wrapping, minifying and compressing each one off-line, to minimise delays when downloading.

For now the full minification step is only being applied to mobile pages to save time, and in case anything breaks!

Note: I have to add /usr/local/bin to PATH before calling html-minifier from a cron-driven task, as evidenced by a slew of failed minifications in my log file!
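The failure mode is easy to reproduce: cron's default PATH (typically just /usr/bin:/bin) does not include /usr/local/bin. Demonstrated here with a scratch directory standing in for /usr/local/bin and a stand-in html-minifier:

```shell
mkdir -p /tmp/fakebin
printf '#!/bin/sh\necho minified\n' > /tmp/fakebin/html-minifier  # stand-in
chmod 755 /tmp/fakebin/html-minifier
# The one-line fix at the top of the cron-driven script:
PATH="/tmp/fakebin:$PATH"; export PATH   # really /usr/local/bin
html-minifier                            # now found on the PATH
# -> minified
```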

2017/08/22: HTML Minifying

I've been working on reducing the size of my very largest HTML pages, compressed and uncompressed, to reduce maximum download time.

I've also introduced some pagination where logical, tied together with prev and next links.

A work in progress is using CSS to replace program-generated and highly-repetitive inline styles; that has knocked maybe 15% off the uncompressed page size, though has had relatively little impact on the compressed size thanks to the wonders of zlib/zopfli. Not done yet. This one will probably get some pagination too...

I've squeezed a few more bytes out of the HTML head in particular, by aggressively unquoting HTML tag attribute values that don't need it (essentially those that do not contain a space, '>', or quotes).

2017/08/19: GSC Atom Sitemaps Delay

It seems to take as much as a week or two for changes in the "Indexed" column under GSC -> Crawl -> Sitemaps to come through, and Atom sitemaps seem especially slow for me. This may be deliberate on Google's part, to stop people getting too accurate a view of which exact URLs are in the index at any given moment. In any case I'm pleased to see that most of the things that I expect to be indexed, by count, are.

In contrast, the "Submitted" column updates virtually instantly after (say) an external ping of a sitemap; a refresh of the page immediately after the ping shows the new value.

2017/08/18: Restart Drill

We practiced the semi-annual-ish emergency 'smart-hands' power-cycling-because-the-server-seems-dead thing today, and it worked! As does my hyphen key.

2017/08/17: Front-page Cache-Control

Since the home pages (both m and www versions) now have some dynamic content, ie may change even when I don't explicitly edit them, I've adjusted the Apache config to set their expiry (Cache-Control, and on www Expires also) to about a day (geekery alert: with a value relatively efficient to transmit for HTTP/1.1 and HTTP/2 HPACK):

<LocationMatch "/index.html">
ExpiresDefault "access plus 92221 seconds"
</LocationMatch>

Note that the match is on /index.html even though the externally-visible URL tail is /.
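For reference, that oddly specific value is indeed "about a day":

```shell
# The configured expiry, in more familiar units:
awk 'BEGIN { printf "%.1f hours\n", 92221 / 3600 }'
# -> 25.6 hours
```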

I made a couple of other tweaks including reducing the expiry time on the top-level _* status pages to 10 minutes to match their typical re-generation interval.

2017/08/15: Front-page Heroes

Now the front page is automatically adorned with a responsive 3, 2 or 1 column box of Newest/Popular/Updated articles. The first available hero image for each is shown.

While there are general byte weight limits for hero images for desktop and mobile, for mobile a hero image is also not allowed to be heavier than the (uncompressed) HTML of the page it is for, to help keep pages loading quickly over the air.

2017/08/14: Zopfli Faster

I am liking my zopfli and zopflipng, but they are eating significant time. Boosting zopfli performance shows some ways to get significant speed-ups, not all requiring code changes.

As a simple test I have adjusted the stock Makefile, for each of the C and C++ compilations (on the RPi) changing -O2 to -O3 and adding the flag -DNDEBUG=1 to try to disable assert()s, and -flto -fuse-linker-plugin to try to encourage cross-file optimisation. I certainly seem to get some extra warnings implying that gcc/g++ is looking rather harder at the code.

I did a trivial test in the build directory to check that the zopfli output is not completely broken (no diff output is good):

zopfli -c Makefile | gzip -d | diff - Makefile

An initial quick attempt to time old vs new was inconclusive because the variability between runs seemed to be bigger than the difference between the old and new zopfli!

Similarly, a test of zopflipng old vs new on a reasonable size (already zopflipng-ed) PNG yielded very close times, maybe 25.5s vs 23.5s, but with a lot of jitter, so possibly not worth the risk of broken outputs!

2017/08/13: Holding out for a Hero

I've added code to choose a suitable hero image for each page (based on the declared og:image) that is not explicitly providing one (or another image near the top of the page that might clash). The algorithm looks for the largest image version (amongst different -nNNNw versions) that isn't too heavy. There are some fixed weight (ie file size) limits for desktop and mobile for what is definitely too heavy, and a preferred limit. On mobile also the hero image is not allowed to be heavier than the source HTML of the page.
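A minimal sketch of that selection (the file naming, candidate widths and limits here are illustrative, not the site's actual code; for mobile the caller would pass the smaller of the fixed limit and the page's HTML size):

```shell
# Hypothetical sketch: pick the widest -nNNNw variant whose byte
# weight fits within the given limit.
pick_hero() { # usage: pick_hero <basename> <max_bytes>
  base=$1; max=$2; best=none
  for w in 256 512 800 1024; do        # candidate widths, ascending
    f="$base-n${w}w.png"
    [ -f "$f" ] || continue
    [ "$(wc -c < "$f")" -le "$max" ] && best=$f
  done
  echo "$best"
}
```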

This implementation does not attempt to build custom images, and so relies on me to build suitable candidates manually.

2017/08/11: Hunting For Big Beasts With Regexes

Checking for large file downloads from the site I tried a log search:

egrep '^(m|(www))[.]earth[.].*" 200 [0-9]{7} ' logfile

which looks for entries with a 200 response and a 7-digit byte count, ie 1MB and up (there is one at 8 digits on the www. main site, but none at 6 on the m. mobile site).

That quickly revealed one JPEG image that I was able to swap out for a 90kB version, ie 10x smaller, in two minutes c/o my favourite tinypng (thank you)!

A more selective search for large JPEG and PNG images on the main site (that tinypng might be able to help with for example):

egrep '^www[.]earth[.].*[.]((png)|(jpg)) HTTP/1.1" 200 [0-9]{7} ' logfile

revealed a 4.7MB monster JPEG that tinypng was able to shrink to 680kB, for example.

A quick bit of poking about with a variant for 500kB+ images:

egrep '^www[.]earth[.].*[.]((png)|(jpg)) HTTP/1.1" 200 [5-9][0-9]{5} ' logfile

immediately yielded another 800kB+ JPEG that was not really worth the candle.

It may be useful to refine such a search to large things being downloaded by real humans (not bots/spiders/crawlers) as part of a page display, rather than electively and only rarely. The power of regex can do that!
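For example (with made-up sample log lines standing in for the real logfile, whose format may differ slightly), the bot exclusion can ride on the User-Agent field:

```shell
# The filter drops any large-download entry whose User-Agent
# self-identifies as a bot/spider/crawler.
cat > logfile <<'EOF'
www.example.org "GET /img/big.jpg HTTP/1.1" 200 1234567 "-" "Mozilla/5.0"
www.example.org "GET /img/big.jpg HTTP/1.1" 200 2345678 "-" "Googlebot/2.1"
EOF
egrep '" 200 [0-9]{7} ' logfile | egrep -v -i 'bot|spider|crawl'
# -> only the Mozilla/5.0 line survives
```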

2017/08/09: All Be Unpending

Finally, my last Atom sitemap feed came out of "Pending" in GSC. That took a while! In this case it claims to have indexed all ten files mentioned in this specialised/narrow sitemap.

I've started work on auto-injecting hero images into articles where there is nothing already manually placed, the image size (pixels and bytes) is reasonable, etc. My hidden ulterior motive is to be able later to provide a pictorial new/popular story listing on the front page, a bit like (say)

2017/08/07: We Don't Need No Stinkin' HTML Improvements

Hurrah! Today GSC (Google Search Console) reports no suggested "HTML Improvements" for the main site. It may have helped that I fixed the straggler a while ago and manually resubmitted the URL for recrawling a few days ago.

And the main-site "Time spent downloading a page (in milliseconds)" is currently hovering around 260ms, compared to the 3-month average of 301ms. All good.

If only my "Structured Data" would stop slowly bleeding away, one day my Rich Card* might come...

*Historically one might have wished for a rich cad, ie a prince.


Having whined, gently, in the Webmaster Central Help Forum in the morning of the 12th about my Structured Data page count and other stats unchanged since 2017/08/02, in the afternoon I was able to report that "The Googler Fairy is watching again! My Structured Data page count has jumped by over 10% right now to be more like 2/3rds of all pages showing up in this report (while sitemap counts of indexed pages have not jumped)."

2017/08/05: Backup Time

This time of year, or mid-December when I feel unmotivated to do much else and a tidy end of calendar year is approaching, is when my mind turns to backups.

So as well as writing it up, I'll actually be doing some this weekend, on-site and off, spinning rust and cloud.

2017/08/04: Googlebot Bandwidth

I have been on a mission to reduce the time to serve (particularly the first bytes of) anything on the critical path to rendering pages for a visitor. That particularly means the HTTP and HTML headers, and then the body of the HTML itself. As far as possible, CSS is reduced to a bare minimum and inlined, images are given width and height attributes, and JavaScript is largely done without or made async, to keep the spotlight on the HTML itself.

Watching the Google Search Console "Crawl Stats", especially for the mobile site which is essentially only HTML pages, I am fairly confident that I have knocked ~20ms off typical download time to ~200ms now.

(As far as I can tell this notion of a 'page' also includes images, CSS, data, and anything else that is crawled.)

Now that I am statically pre-compressing the mobile HTML pages, and the gzip/mod_deflate code is not fighting for CPU for them, the download time is much less spiky/volatile, even with more than an order of magnitude fluctuation in how much is downloaded by Googlebot each day.

[Image: mobile page download time]
Mobile site page download time: more consistent and a little lower since mid-July when pre-compression was set up (and indeed since mid-June when Apache was re-tuned); latest 204ms@2017/08/01, recent worst 1066ms@2017/06/17.

However, even more interesting over the last few days on the main (www) site has been observing a natural experiment or three where for example the number of items downloaded per day hasn't changed much but the mean kB weight of them has. In particular, where the mean size per download went up ~90%/200kB, download time also went up ~200ms, implying 1ms/kB or 1s/MB time, or ~8Mbps effective bandwidth (with ~100ms minimum download time). Given that the FTTC uplink from my RPi is not much higher than that, the implication is that the RPi2 is managing to near-saturate the line. (I see the same bandwidth internally, over WiFi, though other throughput measures reported on the Web suggest that the RPi2 can pump out more like 80Mbps over HTTP, especially if a better Ethernet connection is used, ie not over the board's USB.)
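For the record, the back-of-envelope arithmetic:

```shell
# ~200kB of extra mean download size against ~200ms of extra time:
awk 'BEGIN {
  MBps = 200 / 200;                 # 1 kB/ms = 1 MB/s
  printf "%.0f Mbit/s\n", MBps * 8  # 1 MB/s = 8 Mbit/s
}'
# -> 8 Mbit/s
```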

This also implies that maybe I have too many big objects on the site still, and looking for anything over ~200kB yields images in the MB range that could almost certainly be usefully (and non-visibly) compacted at some point!
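In the same regex style as before (sample lines standing in for the real log), 200kB-and-up responses are 6-digit sizes starting 2-9, or anything with 7+ digits:

```shell
cat > logfile <<'EOF'
"GET /img/a.png HTTP/1.1" 200 150000 -
"GET /img/b.jpg HTTP/1.1" 200 250000 -
"GET /img/c.jpg HTTP/1.1" 200 2500000 -
EOF
egrep '" 200 ([2-9][0-9]{5}|[0-9]{7,}) ' logfile
# -> matches the 250000- and 2500000-byte entries only
```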

I also should work on the couple of HTML files whose uncompressed size is well over 100kB and which may be hard for spiders to fully digest and for mobile browsers to cache and otherwise manage. I have made a start on one of them already.

(Note that a little while before setting up pre-compression I expanded the number of connections/users that Apache could handle at once; I was less constrained on memory than when I originally configured it (for 512MB main memory), and I think that some of the delays seen before were queueing to be serviced rather than service time itself.)


Within ~2.5h of posing a question in the Webmaster Central Help Forum Should I be worried about two of my Atom sitemap feeds sticking at "Pending"? one of the two Atom feeds (the general sitemap.atom) came out of "Pending". So maybe my fairy godmother, or slightly more likely a friendly Googler, is watching, though the remaining one was still pending by end of day!

Note that even the remaining "Pending" Atom feed can very quickly (within a few minutes) update the 'Submitted' column value (for the number of URLs in the feed file) in response to an external sitemap ping. That ping does get the Googlebot to fetch the feed file immediately (unlike Bing or Yandex), which allows/drives the GSC update.

2017/08/01: Atom Sitemaps Pending

I submitted three Atom-based sitemaps on the same day. One (for the data feed) came out of "Pending" and showed the number of files indexed from it after about three days. The other two, even though all have been updated at least daily, are still showing "Pending" after a week or so. Why? Possibly because the "Pending Two", as they may be known in infamy and legend, have been regularly 'pinged' from the Web rather than letting GSC (Google Search Console) choose when to update them? Infamy, infamy, does Google have it in for me? (Apologies to Kenneth Williams!)