Earth Notes: On Website Technicals (2017-06)

Updated 2024-03-30.

Tech updates: CDN revoked, structured data, 10 years old, XML sitemap at long last and lastmod, HTML5 conformance, PageSpeed.

Kicking off a series of notes on tech issues arising from creating and extending the EOU "Earth Notes" site. From HTML5 conformance to EOU's 10th birthday!

2017-06-30: Check My Links and PageSpeed

In the auto-generated Historical Stats page I found not only some dead external links, but also an uncorrected bad internal link to the related live auto-generated data page (no '_' prefix added), all of this prompted by playing with the Check My Links Chrome extension.

I have killed off a number of dead links and removed some redirections (eg where the external site has moved to https://) with this tool. Much much nicer than the linkchecker tool in terms of interactivity, and ability to check all https links.

(2017-12-31: this tool has proven very useful and robust, and I usually run it over any page I am doing other incremental maintenance on, if I am not in a dreadful hurry. I have just given it a 4-star rating with the comment "Suggestion: it would be nice to have a "take me to the first bad or redirecting link" button, since sometimes a pale green can be hard to spot on a busy page, for example!")

PageSpeed Apache Module

While avoiding important stuff that I should be doing, I came across Google's PageSpeed Module (for Apache 2.2 in my case) that appears to dynamically optimise (eg minify) pages and other objects as they are served, sometimes tuned to the particular browser requesting the object. However, given the limited resources (especially memory) for my RPi server, and the more subtle reported problems that this module can induce, and the fact that this site's pages are getting up to 100/100 on PageSpeed anyway, I think I'll give this a miss for now.

Prefetch No-go

Even with just 'prefetch' as set up a few days ago the icons file seems to be competing with more important traffic on Chrome, using an extra connection on the server, and not making a significant difference to page speed. So it's off again.

2017-06-25: HTML5 Fixes Done

HTML5 fixes finished in the wee hours. I have re-adjusted the makefile this time to batch test all the main (www) pages in one go, along with one sample mobile page version to catch wrapping errors, since checking ~140 pages barely takes longer than checking one.

Tweaked a few page descriptions to a better length, and moved some key words nearer the start of their pages' titles.

I have added a (low-priority) 'prefetch' header link for the Share42.com social sharing buttons sprite for desktop pages to try to ensure that on first load the page top navigation can render a little quicker given that it contains a copy of those sharing buttons. (The sprite file is invisible to the browser's preloader otherwise.) In earlier testing 'preload' seemed to be too high a priority and this file was out-competing more important content.

Note that this sprite prefetch seems to start an extra concurrent HTTP/1.1 connection on Chrome (rising from 4 to 5, as tested with WebPageTest.com), though it may save opening one at the end after keep-alives time out if (say) ads take a long time to finish loading and displaying. Firefox seems not to be acting on the prefetch hint.

Manually removed unnecessary attribute quotes from the boilerplate wrapper HTML and saved ~100 bytes from a sample generated file's size. However it makes only a few bytes' difference to the gzipped output, which indicates how good a job gzip/deflate does on HTML, and that it's simply not worth sweating over or adding unnecessary confusion.
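The effect can be reproduced with a minimal sketch like the one below. The markup is invented boilerplate, not this site's wrapper, and the regular expression only unquotes values that contain no whitespace, quotes or '>'. It compares the bytes saved before and after gzip at level 6.

```python
import gzip
import re


def sizes(html):
    """Return (raw byte count, gzip -6 byte count) for an HTML string."""
    raw = html.encode("utf-8")
    return len(raw), len(gzip.compress(raw, compresslevel=6))


# Hypothetical boilerplate: 25 lines, each with four quoted attributes.
quoted = "".join(
    f'<div class="c{i % 5}" id="s{i}"><a href="/p{i}.html" rel="next">page {i}</a></div>\n'
    for i in range(25)
)
# Drop the quotes only where the value is safe to leave unquoted.
unquoted = re.sub(r'="([^"\s>]+)"', r"=\1", quoted)

raw_quoted, gz_quoted = sizes(quoted)
raw_unquoted, gz_unquoted = sizes(unquoted)

print("raw bytes saved: ", raw_quoted - raw_unquoted)
print("gzip bytes saved:", gz_quoted - gz_unquoted)
```

The raw saving is two bytes per attribute, while the compressed saving is much smaller because repeated quote characters are cheap for deflate to encode.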

2017-06-24: HTML5 Conformance

Still going on the HTML5 conformance fixes. Maybe my oppressors will let me out to see daylight soon and it'll all have been worthwhile...

Validating each HTML file separately is so hideously slow that I have changed the system to validate all HTML outputs of one type (ie www or m) in one batch, which basically takes no more time than checking one. So bad files could sneak out again, as they always have, but can be checked for quickly and are by default with (say) make pages and thus make all.

2017-06-22: vnu.jar

I have integrated the vnu.jar stand-alone command-line version of the Nu Html Checker HTML5 validator (and outliner) into my site/page build, so that I can ensure that all my pages are technically conformant for a better and nominally more robust user experience.

The v.Nu checker is huge (~24MB of goodness) and slow (~30--120s to check one small document) on my RPi2 server, so I have for example turned off page (natural) language detection to speed things up a bit.
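As a rough sketch of how such a check might be invoked from a build script: the helper below only assembles the command line. The function name, the jar location and the file names are assumptions for illustration; --no-langdetect is the checker's own option for skipping language detection, and the checker accepts several files per invocation so the JVM start-up cost is paid once.

```python
import shlex


def vnu_command(html_files, jar="vnu.jar", langdetect=False):
    """Build one java invocation that checks every listed file in a batch."""
    cmd = ["java", "-jar", str(jar)]
    if not langdetect:
        # Skip natural-language detection to save time on a slow machine.
        cmd.append("--no-langdetect")
    cmd.extend(str(f) for f in html_files)
    return cmd


if __name__ == "__main__":
    # Hypothetical page names; print the command rather than running it.
    print(shlex.join(vnu_command(["index.html", "about.html"])))
```

The returned list can be handed to subprocess.run(), where a non-zero exit status from the checker would indicate that at least one page failed validation.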

Also, to avoid breaking my edit-with-vi / rebuild-and-check cycle, the validation is only performed for the mobile version of the pages; they should now only ever go out pristine, while the desktop pages may occasionally be a bit quick and dirty. Not that browser breakage seems to be a huge issue at all in practice anyway.

Unfortunately nearly every page has some issues, partly from the years of cruft, and writing for older browsers and HTML versions.

I will gradually manually fix all the issues and tidy up some features of page structure while I am at it, such as making use of aside and nav and footer in particular.

One of the few irritations is that the validator treats <table border="1"> as an error not a warning, and there is no really simple small non-invasive CSS replacement; I am currently using <table class="tb1"> with this CSS.

2017-06-21: Nibbler

Having been egged on by Nibbler: a free tool for testing websites I now have simple print CSS support; basically it hides anything not appropriate for a printed copy such as site navigation, ads and search. Simples.

2017-06-18: Apache Server Beefed Up: Next/Prev Also

The Apache configuration had been trimmed right down to conserve memory; I have beefed it up a bit just in case the site suddenly becomes popular. Also, I discovered a config error in passing (trying to serve a site for which the IP address is not even local) which I cleaned up!

I linked up with 'prev' and 'next' headers the RPi 2 piece (via RPi and Sheevaplug) back to the Linux laptop piece, and did some tidy-up for ads and HTML5 conformance.

2017-06-17: XML Sitemap 'lastmod'

I have been wondering whether the XML sitemap lastmod element should reflect (significant) content updates to page content, or the actual timestamp of the page which may change for purely stylistic updates or even just to keep make happy in effect.

I would prefer the former, like the HTTP ETag 'weak' validation semantics, and happily Google's Webmaster Central Blog (Oct 2014) Best practices for XML sitemaps & RSS/Atom feeds says that the lastmod value should reflect "the last time the content of the page changed meaningfully."

So I have updated my sitemap generator to use the source file date rather than the output file date, which also means that it can depend on the input rather than output files in the makefile (ahoy extra parallelism).
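The idea can be sketched in a few lines. This is an illustration rather than the generator actually used here: the URL prefix and function names are made up, and the date is taken from the source file's modification time, in UTC, with day resolution only.

```python
import os
import tempfile
from datetime import datetime, timezone
from xml.sax.saxutils import escape

URL_PREFIX = "https://example.org/"  # hypothetical; stands in for the real site prefix


def lastmod_from_source(source_path):
    """Return the source file's modification date as YYYY-MM-DD in UTC."""
    mtime = os.path.getmtime(source_path)
    return datetime.fromtimestamp(mtime, tz=timezone.utc).strftime("%Y-%m-%d")


def url_entry(page, source_path, prefix=URL_PREFIX):
    """Build one sitemap <url> element dated from the source, not the output."""
    loc = escape(prefix + page)
    lastmod = lastmod_from_source(source_path)
    return f"<url><loc>{loc}</loc><lastmod>{lastmod}</lastmod></url>"


if __name__ == "__main__":
    # Demonstrate with a throwaway file standing in for a page source.
    with tempfile.NamedTemporaryFile(suffix=".html") as src:
        print(url_entry("example.html", src.name))
```

Because the entry depends only on the source file, regenerating the output page for a purely stylistic change leaves the lastmod value untouched.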

Even when sticking with dates (no timestamps, since intra-day changes are not hugely meaningful for this site), the size of the compressed data (ie gzipped over the wire) can be expected to go up as there will usually be more date variation now.

Before:

% ls -al sitemap.xml
15813 Jun 17 17:19 sitemap.xml
% gzip -v6 < sitemap.xml | wc -c
 85.7%
2285

After:

% ls -al sitemap.xml
15813 Jun 17 17:35 sitemap.xml
% gzip -v6 < sitemap.xml | wc -c
 84.6%
2457
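The same behaviour shows up in a synthetic test. The sketch below uses invented URLs and dates, not the real sitemap: it builds two sitemaps of identical raw size, one with a single repeated lastmod date and one with varied dates, and compresses both at gzip level 6.

```python
import gzip


def sitemap(dates, prefix="https://example.org/"):
    """Build a minimal sitemap body (bytes) with one <url> per date."""
    rows = [
        f"<url><loc>{prefix}page{i}.html</loc><lastmod>{d}</lastmod></url>"
        for i, d in enumerate(dates)
    ]
    return ("<urlset>\n" + "\n".join(rows) + "\n</urlset>\n").encode("utf-8")


PAGE_COUNT = 140  # roughly the number of main pages mentioned in these notes

# Every page stamped with the same date, as when output files are rebuilt together.
uniform = sitemap(["2017-06-17"] * PAGE_COUNT)
# Deterministic but varied dates, as when each source file has its own history.
varied = sitemap(
    [f"2017-{(i * 5) % 12 + 1:02d}-{(i * 7) % 28 + 1:02d}" for i in range(PAGE_COUNT)]
)

gz_uniform = len(gzip.compress(uniform, compresslevel=6))
gz_varied = len(gzip.compress(varied, compresslevel=6))
print(len(uniform), gz_uniform)
print(len(varied), gz_varied)
```

The raw sizes match because the dates are fixed-width, so any difference in the compressed sizes comes purely from the extra variety in the date strings.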

I note that the best practices document suggests pinging Google (and presumably other search engines too) after updating the sitemap. That could be automated to be done (say) overnight at most once per day, to avoid multiple pings as I do a stream of micro-updates, though I think that Google typically does recheck the sitemap daily anyway, from recent observations.
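The once-per-day throttle could be as simple as the sketch below. It covers the decision logic only, with made-up names; how the last ping date is stored and how the ping itself is sent are left out.

```python
from datetime import date
from typing import Optional


def should_ping(sitemap_changed: bool, last_ping: Optional[date], today: date) -> bool:
    """Ping only if the sitemap changed and no ping has gone out yet today."""
    if not sitemap_changed:
        return False
    return last_ping is None or last_ping < today


if __name__ == "__main__":
    # A second update on the same day is suppressed.
    print(should_ping(True, date(2017, 6, 17), date(2017, 6, 17)))
```

Run from an overnight job, this collapses a stream of micro-updates into at most one notification per day.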

I have also been working on improving the semantic structure of the generated HTML pages, eg with main and aside, and trying to ensure that 'outliner' output looks sensible too. That should help both search engines and anyone with a screen reader.

2017-06-15: XML Sitemap

Today's displacement activity has been extending the makefile to create/update an XML sitemap whenever one of the main HTML pages is updated.

At the moment because this is in the main edit-generate-edit cycle while I am hacking a page, and is not instant, and because Google seems to be refusing explicitly to index my mobile/alternate pages anyway, I am only doing this for the desktop/canonical pages for now.

# XML sitemap with update times (for generated HTML files).
# Main site; core pages + auto-updated.
sitemap.xml: makefile $(PAGES)
    @echo "Rebuilding $@"
    @lockfile -r 1 -l 120 $@.lock
    @/bin/rm -f $@.tmp
    @echo>$@.tmp '<?xml version="1.0" encoding="utf-8"?>'
    @echo>>$@.tmp '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">'
    @for f in $(URLLISTEXT); do \
        echo '<url><loc>'$(URLLISTPREFIX)$$f'</loc><changefreq>hourly</changefreq></url>'; \
        done >>$@.tmp
    @for f in $(PAGES); do \
        echo '<url><loc>'$(URLLISTPREFIX)$$f'</loc><lastmod>'`date -r$$f -u +'%Y-%m-%d'`'</lastmod></url>'; \
        done | (export LC_ALL=C; sort) >>$@.tmp
    @echo>>$@.tmp '</urlset>'
    @-chmod -f u+w $@
    @chmod -f 644 $@.tmp
    @/bin/mv $@.tmp $@
    @chmod a+r,a-wx $@
    @/bin/rm -f $@.lock $@.tmp
all:: sitemap.xml

The main intent is to use the lastmod element to quickly and efficiently signal to crawlers and search engines that particular pages have been updated and should be recrawled soon, rather than them having to guess when to respider and check.

For the auto-updating pages (GB grid carbon intensity) I am using changefreq (at 'hourly', even though it's really every 10 minutes) instead of lastmod, in part so that the XML sitemap does not need to be updated whenever one of them is changed.

Although the raw XML file is much larger than the simple URL list, after compression the difference is much less marked.

% ls -al urllist.txt sitemap.xml
15612 Jun 15 09:10 sitemap.xml
 8285 Jun 15 09:10 urllist.txt
% gzip -v6 < urllist.txt | wc -c
 75.7%
 2032
% gzip -v6 < sitemap.xml | wc -c
 85.4%
 2298

The XML file should be updated after every page refresh to capture the lastmod signal for crawlers, whereas the urllist.txt file only needs updating when the set of HTML pages changes, eg when a new article is created.

I also added a robots 'noindex' meta tag to the site guide (aka HTML site map) to try to keep it out of the search engines, since it's not very useful to a visitor direct from such an engine. Likewise for the 'other links' page.

2017-06-11: Glorious Decade

It suddenly occurred to me in the bath that EOU / Earth.Org.UK / Earth Notes really is 10 (and a bit) years old.

And yes, the page is a bit heavy, but we can blow 120kB on then-and-now screenshots every decade; party like it's 2099!

(Another of my sites, the gallery, is 20 years old, and ExNet's has been going more like 22 years since we offered dial-up Internet access.)

And yes, there is a broken link in that 2007 screenshot. Here's the missing image!

The basic site structure has been kept fairly simple, with all the main pages at top level, which thus made creation of the parallel mobile (m.) site relatively easy. Once upon a time when operating systems scanned directories linearly the ~320 current entries in that master directory might have resulted in a speed penalty to serve pages, but with filesystem cacheing and other smarts, less so.

Almost everything other than HTML objects has now been moved out of the top directory, for example images and other immutable stuff under img/, updating graphs and the like under out/, and data sets (growing/static) under data/. The HTTP server provides extended expiries for objects under img/ and slower-updating objects under out/ to help cacheing.

Growing from a 1-page site to a more complicated 100+ page site with consistent headers and footers and look and feel has required increasing use of CSS (currently kept very small, and inlined), and other meta-data at/near each raw page header.

<h1>Earth Notes is 10!</h1>
<div class="pgdescription">10 years of getting greener...</div>
<!-- meta itemprop="datePublished" content="2017-06-11" -->
<!-- SQTN img/EOUis10/10.png -->
<!-- EXTCSS img/css/fullw-20170606.css -->

These lines are extracted from the raw internal HTML source, stripped out, and reconstituted into:

The page title used in various ways and a wrapped-up new H1 tag.
A description and sub-head and other uses.
A first-publication date in various places, including the footer.
The page 'image' for social media and structured data / microdata.
Some extra CSS injected into the page head.

The first four of those allow better support for various forms of page markup, social media and microdata for search engines.

The last allows me to inject a tiny bit of extra (and versioned) CSS into the page header to allow the screenshots to expand out of the normal page container to up to the full viewport for newer browsers.
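As an illustration of the extract-and-strip step, here is a minimal sketch using regular expressions over the sample header lines above. The key names and the one-directive-per-line assumption are mine, and the real page builder may well work differently.

```python
import re

RAW = """<h1>Earth Notes is 10!</h1>
<div class="pgdescription">10 years of getting greener...</div>
<!-- meta itemprop="datePublished" content="2017-06-11" -->
<!-- SQTN img/EOUis10/10.png -->
<!-- EXTCSS img/css/fullw-20170606.css -->
"""

# One pattern per header line; each must match a whole (stripped) line.
PATTERNS = {
    "title": r"<h1>(.*?)</h1>",
    "description": r'<div class="pgdescription">(.*?)</div>',
    "date_published": r'<!-- meta itemprop="datePublished" content="([^"]+)" -->',
    "image": r"<!-- SQTN (\S+) -->",
    "extra_css": r"<!-- EXTCSS (\S+) -->",
}


def extract_meta(raw):
    """Split raw page source into (metadata dict, remaining body text)."""
    meta = {}
    body_lines = []
    for line in raw.splitlines():
        for key, pattern in PATTERNS.items():
            match = re.fullmatch(pattern, line.strip())
            if match:
                meta[key] = match.group(1)
                break
        else:
            # Not a recognised header line, so it stays in the page body.
            body_lines.append(line)
    return meta, "\n".join(body_lines)


if __name__ == "__main__":
    meta, body = extract_meta(RAW + "<p>Body text.</p>\n")
    print(meta)
    print(body)
```

The returned dictionary can then feed the title, description, publication date, social image and extra CSS into whatever wrapper template builds the final page.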

Ah yes, page containers.

Until recently EOU was a fully fluid layout which did not work very well on either very wide or very narrow devices. So first I added the standard boilerplate meta viewport content="width=device-width,initial-scale=1.0,user-scalable=yes" header. Then I wrapped up the body in a div container with an eye-friendly max-width. I also made images responsive in a number of ways, from max-width of 100% for big images (or 50% or 33% for floats) up to playing with srcset and sizes.

To also optimise the mobile version of the site there are directives to select bits of the HTML for only desktop (or mobile), usually to omit some of the heavier and less-important stuff for mobile. Also, for a couple of things such as favicon.ico and some of the social media buttons support, to minimise round-trips during page-load for mobile eg for redirects, there are copies of a couple of key objects on the m. site.

Oh, and today's playtime is splitting up my sitemap into HTML pages and data directory indexes so that I can track search engine indexing better (and because it keeps my makefile simpler), which means that I also now have two Sitemap entries in my robots.txt.

(PS. The British Library is busy crawling its nominally annual copy of the site at about an object per second or a little less. That could cover the ~3000 current data files in under an hour, but somehow is taking much longer!)

2017-06-10: Microdata HTML vs JSON-LD

With much of the page microdata markup there is a choice of adding/extending the HTML tags, or adding JSON-LD script elements.

An advantage of the HTML route is that it is potentially easier to ensure that it is kept in sync with what is being marked up, if it is on the page.

For data sets not on the page, JSON-LD may be better by allowing more detail to be provided than makes sense to display in the HTML page, and Google et al are unlikely to assume 'cloaking', ie showing the search engines something different than the user sees, which used to be a staple of "WebSPAM" and "Made-For-AdSense" ("MFA").
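For comparison, a JSON-LD block for a page can be generated from a handful of values. The sketch below is an assumption-laden example rather than this site's markup: the function name and the example.org image URL are invented, and only three schema.org Article properties are shown.

```python
import json


def article_jsonld(headline, date_published, image_url):
    """Return a <script> element carrying schema.org Article data as JSON-LD."""
    data = {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": headline,
        "datePublished": date_published,
        "image": image_url,
    }
    # Compact separators keep page weight down; escaping "</" stops a stray
    # "</script>" inside a value from ending the element early.
    payload = json.dumps(data, separators=(",", ":")).replace("</", "<\\/")
    return f'<script type="application/ld+json">{payload}</script>'


if __name__ == "__main__":
    print(article_jsonld("Earth Notes is 10!", "2017-06-11",
                         "https://example.org/img/EOUis10/10.png"))
```

Because the result is a self-contained script element, it can be emitted anywhere in the document, including right at the end of the body.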

In all cases I want to meet the intent of the ExNet style guide which is to have the above-the-fold content/information rendered within the first ~10kB delivered to the browser so that the user perceives speed. To this end, whichever format allows me to move some of the meta-data later in the page text delivery, below the fold or at the end, is potentially better, since the meta-data is not needed at page load, but off-line in the search engines' secret lairs.
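One way to keep an eye on that budget is to mark the fold in the generated page and measure its byte offset. The sketch below assumes a hypothetical marker comment placed in the page at the fold and measures uncompressed bytes; neither is something the style guide specifies.

```python
FOLD_MARKER = b"<!-- FOLD -->"  # hypothetical marker emitted where the fold is judged to be
BUDGET_BYTES = 10 * 1024        # the ~10kB target, taken here as uncompressed bytes


def fold_offset(page_bytes, marker=FOLD_MARKER):
    """Return the byte offset of the fold marker, or -1 if it is absent."""
    return page_bytes.find(marker)


def fold_within_budget(page_bytes, marker=FOLD_MARKER, budget=BUDGET_BYTES):
    """True if the marker exists and everything before it fits in the budget."""
    offset = fold_offset(page_bytes, marker)
    return 0 <= offset <= budget


if __name__ == "__main__":
    page = b"<html>" + b"x" * 5000 + FOLD_MARKER + b"y" * 20000 + b"</html>"
    print(fold_offset(page), fold_within_budget(page))
```

A check like this could run in the build alongside validation, flagging any page whose head and top-of-page content have grown past the target.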

(Hmm, now I am adding link prev/next items to the head for clear sequences of pages. Not many pages, and the extra early page weight is not high...)

2017-06-09: CDN Traffic Brought Home

Having performed a month-long experiment using Cloudflare as a CDN for this site's static content, I have redirected that traffic back to the main (www) server, as only maybe 50--70% was being cached by Cloudflare, and the rest was slower and probably taking more resources (time, energy, carbon) overall to serve. A quick WebPageTest suggests that there is no apparent performance penalty for doing so for the typical visitor. There are a few extra connections to this server to support HTTP/1.1, especially for pages with lots of inline objects (eg images) that would otherwise be multiplexed down a single (Cloudflare) HTTP/2 connection. I may need to tweak the local Apache to support a few more concurrent connections.

2017-06-08: Rich Cards Marked?

I have been larding the site up with structured data, including converting all pages to instances of schema.org/Article, which should be candidates for Google's "rich cards" (an extension of "rich snippets") in search, though there is no sniff of that yet. Google is reporting the structured data in the Webmaster tools console.

I am not convinced that any of this microdata helps in any clear way, though I do like the idea of making meta-data less ambiguous and consistently available, and data sets more discoverable. (I don't see much evidence of direct benefit for SEO/SERP, other than appearance and helping the search engines understand content.)

Nominally the site now has 133 articles, including the home page and the HTML site-map page.

I relented and added site maps for the main and mobile sites a couple of weeks ago. Google finally reports having indexed nearly all the main-site pages, creeping up slowly from the initial (~90%) level already in place when the map was added, but for mobile pages it is still stuck at 8 (out of 133). Having put all the canonical/alternate header links in before, I don't understand why the mobile count is neither everything nor zero. Maybe it is effectively zero. Still, I am getting ~25% of organic searches coming into the mobile pages, so...

(2017-12-31: a 'job' posting page in 2017-11 finally made it to "Rich Card" status.)