Earth Notes: On Website Technicals (2017/08)

Tech updates: Atom sitemaps (un)pending, Googlebot Bandwidth, backups, HTML Improvements, hunting with regexes, front-page heroes, Cache-Control, restart drill, minifying...

2017/08/22: HTML Minifying

I've been working on reducing the size of my very largest HTML pages, compressed and uncompressed, to reduce maximum download time.

I've also introduced some pagination where logical, tied together with prev and next links.

A work in progress is using CSS to replace program-generated and highly-repetitive inline styles; that has knocked maybe 15% off the uncompressed page size, though it has had relatively little impact on the compressed size thanks to the wonders of zlib/zopfli. Not done yet. This one will probably get some pagination too...

I've squeezed a few more bytes out of the HTML head in particular, by aggressively unquoting HTML tag attribute values that don't need it (essentially those that do not contain any of spaces, '>' or quotes).
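As a rough sketch of that sort of pass (not my actual build step, and the file names are hypothetical), a crude sed one-liner can strip the quotes from attribute values made up only of clearly-safe characters, leaving anything containing spaces, '>' or quotes untouched; a real minifier has to be rather more careful than this:

# Unquote attribute values consisting only of safe characters (crude sketch).
sed -E 's|([a-zA-Z-]+)="([A-Za-z0-9_:.,;/#+-]+)"|\1=\2|g' page.html > page.unquoted.html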


2017/08/19: GSC Atom Sitemaps Delay

It seems to take as much as a week or two for changes in the "Indexed" column under GSC -> Crawl -> Sitemaps to come through, and Atom sitemaps seem especially slow for me. This may be deliberate on Google's part, to stop people getting too accurate a view of which exact URLs are in the index at any given moment. In any case I'm pleased to see that most of the things that I expect to be indexed, by count, are.

In contrast, the "Submitted" column updates virtually instantly after (say) an external ping of a sitemap; a refresh of the page immediately after the ping shows the new value.
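For reference, the external 'ping' in question is just an HTTP GET against Google's sitemap ping endpoint with the feed URL passed as a parameter; the hostname below is a placeholder rather than my real one:

# Ask Google to re-fetch a sitemap/Atom feed straight away.
curl -sG "https://www.google.com/ping" --data-urlencode "sitemap=https://www.example.org/sitemap.atom"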

2017/08/18: Restart Drill

We practiced the semi-annual-ish emergency 'smart-hands' power-cycling-because-the-server-seems-dead thing today, and it worked! As does my hyphen key.

2017/08/17: Front-page Cache-Control

Since the home pages (both m and www versions) now have some dynamic content, ie may change even when I don't explicitly edit them, I've adjusted the Apache config to set their expiry (Cache-Control, and on www Expires also) to about a day (geekery alert: with a value relatively efficient to transmit for HTTP/1.1 and HTTP/2 HPACK):

<LocationMatch "/index.html">
ExpiresDefault "access plus 92221 seconds"
</LocationMatch>

Note that the match is on /index.html even though the externally-visible URL tail is /.
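A quick external sanity check of the resulting headers (placeholder hostname again) looks something like:

# Confirm the ~1-day freshness lifetime on the front page.
curl -sI https://www.example.org/ | egrep -i '^(cache-control|expires):'
# Expect something like: Cache-Control: max-age=92221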

I made a couple of other tweaks including reducing the expiry time on the top-level _* status pages to 10 minutes to match their typical re-generation interval.

2017/08/15: Front-page Heroes

Now the front page is automatically adorned with a responsive 3, 2 or 1 column box of Newest/Popular/Updated articles. The first available hero image for each is shown.

While there are general byte weight limits for hero images for desktop and mobile, for mobile a hero image is also not allowed to be heavier than the (uncompressed) HTML of the page it is for, to help keep pages loading quickly over the air.
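That mobile rule is no more than a byte-count comparison; a minimal sketch, with hypothetical file names:

# Allow the hero on mobile only if it is no heavier than the page's uncompressed HTML.
hero=$(stat -c %s img/hero-640w.jpg)
html=$(stat -c %s article.html)
[ "$hero" -le "$html" ] && echo "hero OK for mobile"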

2017/08/14: Zopfli Faster

I am liking my zopfli and zopflipng, but they are eating significant time. Boosting zopfli performance shows some ways to get significant speed-ups, not all requiring code changes.

As a simple test I have adjusted the stock Makefile, for each of the C and C++ compilations (on the RPi) changing -O2 to -O3 and adding the flag -DNDEBUG=1 to try to disable assert()s, and -flto -fuse-linker-plugin to try to encourage cross-file optimisation. I certainly seem to get some extra warnings implying that gcc/g++ is looking rather harder at the code.
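The kind of change I mean, sketched against a hypothetical flags line rather than the exact stock Makefile text:

# Before (roughly the stock optimisation settings):
#   CFLAGS = -W -Wall -Wextra -ansi -pedantic -lm -O2
# After (the experiment):
#   CFLAGS = -W -Wall -Wextra -ansi -pedantic -lm -O3 -DNDEBUG=1 -flto -fuse-linker-plugin
# (and the equivalent change for CXXFLAGS)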

I did a trivial test in the build directory to check that the zopfli output is not completely broken (no diff output is good):

zopfli -c Makefile | gzip -d | diff - Makefile

An initial quick attempt to time old vs new was inconclusive because the variability between runs seemed to be bigger than the difference between the old and new zopfli!

Similarly, a test of zopflipng old vs new on a reasonably-sized (already zopflipng-ed) PNG yielded very close times, maybe 25.5s vs 23.5s, but with a lot of jitter, so possibly not worth the risk of broken outputs!
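A fairer comparison probably needs several timed runs of each binary back to back to average out that jitter, something like the following (bash, with hypothetical paths and test file):

# Repeat each timing a few times and compare the results by eye.
for i in 1 2 3 4 5; do time ./zopfli.old -c testfile > /dev/null; done
for i in 1 2 3 4 5; do time ./zopfli.new -c testfile > /dev/null; done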

2017/08/13: Holding out for a Hero

I've added code to choose a suitable hero image (based on the declared og:image) for each page that does not explicitly provide one (or have another image near the top of the page that might clash). The algorithm looks for the largest image version (amongst different -nNNNw versions) that isn't too heavy. There are some fixed weight (ie file size) limits for desktop and mobile for what is definitely too heavy, and a preferred limit. On mobile the hero image is also not allowed to be heavier than the source HTML of the page.

This implementation does not attempt to build custom images, and so relies on me to build suitable candidates manually.
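A sketch of that selection logic, assuming image variants named like foo-320w.png and foo-640w.png, and with a purely illustrative byte limit:

# Pick the largest -NNNw variant that stays under the hard weight limit.
MAX_BYTES=100000   # illustrative 'definitely too heavy' threshold for this device class
best=""
for img in $(ls -1 img/foo-*w.png | sort -t- -k2 -n); do
    [ "$(stat -c %s "$img")" -le "$MAX_BYTES" ] && best="$img"
done
echo "chosen hero: ${best:-none}"

The real code also applies the preferred limit and the mobile HTML-size rule described above.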

2017/08/11: Hunting For The Big One with Regexes

Checking for large file downloads from the site I tried a log search:

egrep '^(m|(www))[.]earth[.].*" 200 [0-9]{7} ' logfile

which looks for entries with a 200 response and a 7-digit size field, ie roughly 1MB and up (there is one at 8 digits on the www. main site, but none at 6 on the m. mobile site).

That quickly revealed one JPEG image that I was able to swap out for a 90kB version, ie 10x smaller, in two minutes c/o my favourite tinypng (thank you)!

A more selective search for large JPEG and PNG images on the main site (that tinypng might be able to help with for example):

egrep '^www[.]earth[.].*[.]((png)|(jpg)) HTTP/1.1" 200 [0-9]{7} ' logfile

revealed a 4.7MB monster JPEG that tinypng was able to shrink to 680kB, for example.

A quick bit of poking about with a variant for 500kB+ images:

egrep '^www[.]earth[.].*[.]((png)|(jpg)) HTTP/1.1" 200 [5-9][0-9]{5} ' logfile

immediately yielded another 800kB+ JPEG that was not really worth the candle.

It may be useful to refine such a search to large things being downloaded by real humans (not bots/spiders/crawlers) as part of a page display, rather than electively and only rarely. The power of regex can do that!
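For example (with purely illustrative bot-name fragments), chaining a negative match on the user-agent part of each log line gets most of the way there:

# Big image downloads with 200 responses, excluding obvious bots/spiders/crawlers.
egrep '^www[.]earth[.].*[.]((png)|(jpg)) HTTP/1.1" 200 [5-9][0-9]{5} ' logfile | egrep -v -i '(bot|spider|crawl)'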

2017/08/09: All Be Unpending

Finally, my last Atom sitemap feed came out of "Pending" in GSC. That took a while! In this case it claims to have indexed all ten of the files mentioned in this specialised/narrow sitemap.

I've started work on auto-injecting hero images into articles where there is nothing already manually placed, the image size (pixels and bytes) is reasonable, etc. My hidden ulterior motive is to be able later to provide a pictorial new/popular story listing on the front page, a bit like (say) Treehugger.com.

2017/08/07: We Don't Need No Stinkin' HTML Improvements

Hurrah! Today GSC (Google Search Console) reports no suggested "HTML Improvements" for the main site. It may have helped that I fixed the straggler a while ago and manually resubmitted the URL for recrawling a few days ago.

And the main-site "Time spent downloading a page (in milliseconds)" is currently hovering around 260ms compared to the 3-month average of 301ms. All good.

If only my "Structured Data" would stop slowly bleeding away, one day my Rich Card* might come...

*Historically one might have wished for a rich cad, ie a prince.

Jump

Having whined, gently, in the Webmaster Central Help Forum on the morning of the 12th about my Structured Data page count and other stats being unchanged since 2017/08/02, in the afternoon I was able to report that "The Googler Fairy is watching again! My Structured Data page count has jumped by over 10% right now to be more like 2/3rds of all pages showing up in this report (while sitemap counts of indexed pages have not jumped)."

2017/08/05: Backup Time

This time of year, or mid-December when I feel unmotivated to do much else and a tidy end of calendar year is approaching, is when my mind turns to backups.

So as well as writing it up, I'll actually be doing some this weekend, on-site and off, spinning rust and cloud.

2017/08/04: Googlebot Bandwidth

I have been on a mission to reduce the time to serve (particularly the first bytes of) anything on the critical path to rendering pages for a visitor. So that particularly means the HTTP and HTML headers, and then the body of the HTML itself. As far as possible, CSS is reduced to a bare minimum and inlined, images are given width and height attributes, and JavaScript is largely done without or made async, to keep the spotlight on the HTML itself.

Watching the Google Search Console "Crawl Stats", especially for the mobile site which is essentially only HTML pages, I am fairly confident that I have knocked ~20ms off typical download time to ~200ms now.

(As far as I can tell this notion of a 'page' also includes images, CSS, data, and anything else that is crawled.)

Now that I am statically pre-compressing the mobile HTML pages, and the gzip/mod_deflate code is not fighting for CPU for them, the download time is much less spiky/volatile, even with more than an order of magnitude fluctuation in how much is downloaded by Googlebot each day.
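My exact Apache configuration is not shown here, but a common mod_rewrite/mod_headers pattern for serving such pre-compressed files, as a sketch rather than necessarily what I use, looks something like this:

# Serve foo.html.gz in place of foo.html to clients that accept gzip.
RewriteEngine on
RewriteCond %{HTTP:Accept-Encoding} gzip
RewriteCond %{REQUEST_FILENAME}\.gz -f
RewriteRule ^(.+\.html)$ $1.gz [L]
<FilesMatch "\.html\.gz$">
ForceType text/html
Header set Content-Encoding gzip
Header append Vary Accept-Encoding
</FilesMatch>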

[Chart] Mobile site page download time: more consistent and a little lower since mid-chart (~July 14) when pre-compression was set up; latest 204ms@2017/08/01.

However, even more interesting over the last few days on the main (www) site has been observing a natural experiment or three where for example the number of items downloaded per day hasn't changed much but the mean kB weight of them has. In particular, where the mean size per download went up ~90%/200kB, download time also went up ~200ms, implying 1ms/kB or 1s/MB time, or ~8Mbps effective bandwidth (with ~100ms minimum download time). Given that the FTTC uplink from my RPi is not much higher than that, the implication is that the RPi2 is managing to near-saturate the line. (I see the same bandwidth internally, over WiFi, though other throughput measures reported on the Web suggest that the RPi2 can pump out more like 80Mbps over HTTP, especially if a better Ethernet connection is used, ie not over the board's USB.)
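Spelling out that arithmetic:

200 kB extra / 200 ms extra = 1 kB/ms = 1 MB/s ≈ 8 Mbit/s (at 8 bits per byte)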

This also implies that maybe I have too many big objects on the site still, and looking for anything over ~200kB yields images in the MB range that could almost certainly be usefully (and non-visibly) compacted at some point!

I also should work on the couple of HTML files whose uncompressed size is well over 100kB and which may be hard for spiders to fully digest and for mobile browsers to cache and otherwise manage. I have made a start on one of them already.

(Note that a little while before setting up pre-compression I expanded the number of connections/users that Apache could handle at once; I was less constrained on memory than when I originally configured it (for 512MB main memory), and I think that some of the delays seen before were queueing to be serviced rather than service time itself.)

Unpending

Within ~2.5h of posing a question in the Webmaster Central Help Forum ("Should I be worried about two of my Atom sitemap feeds sticking at Pending?"), one of the two Atom feeds (the general sitemap.atom) came out of "Pending". So maybe my fairy godmother, or slightly more likely a friendly Googler, is watching, though the remaining one was still pending by end of day!

Note that even the remaining "Pending" Atom feed can very quickly (within a few minutes) update the 'Submitted' column value (for the number of URLs in the feed file) in response to an external sitemap ping. That ping does get the Googlebot to fetch the feed file immediately (unlike Bing or Yandex), which allows/drives the GSC update.

2017/08/01: Atom Sitemaps Pending

I submitted three Atom-based sitemaps on the same day. One (for the data feed) came out of "Pending" and showed the number of files indexed from it after about three days. The other two, even though all have been updated at least daily, are still showing "Pending" after a week or so. Why? Possibly because the "Pending Two", as they may be known in infamy and legend, have been regularly 'pinged' from the Web rather than let GSC (Google Search Console) choose when to update them? Infamy, infamy, does Google have it in for me? (Apologies to Kenneth Williams!)