Earth Notes: On Website Technicals (2018-12)

Updated 2024-03-11.
Tech updates: feeds, IMG beyond AMP, Gallery CMS, test cases, random rebuild order, speakable structured data, lighter 404, moar AMPy, featured snippet.
Amongst AMP-y and other things, this month I introduced some early 'speakable' structured data support to the site. It may not get used for a while, if at all!

2018-12-30: AMP Cache Oddities

As reported in "<amp-img> (at least via cdn.ampproject.org) inserts bizarre non-optimal srcset" (#20104), some of the things done in the AMP Cache seem unhelpful, even if most seem sensible.

Here is an extract from the report, made today:

What's the issue?

The AMP cache (at least for cdn.ampproject.org which I can observe) deoptimises image access if presizing has already been done. A srcset is added that specifies several (fictitious) image versions all larger than the original.

How do we reproduce the issue?

For example, in http://amp.earth.org.uk/note-on-survey-results.html the line:

<a href=http://gallery.hd.org/_c/mechanoids/UK-Millennium-Dome-voting-ticket-credit-card-sized-uniquely-coded-tweaked-1-DHD.jpg.html><amp-img src=http://www.earth.org.uk/img/a/b/UK-Millennium-Dome-voting-ticket-credit-card-sized-uniquely-coded-tweaked-1-DHD.l95176.211x330.jpg layout=intrinsic class=respfloatrsml width=211 height=330 alt="vote/survey" title="vote/survey"></amp-img></a>

gets expanded in the AMP cache in https://amp-earth-org-uk.cdn.ampproject.org/c/amp.earth.org.uk/note-on-survey-results.html to:

<a href=http://gallery.hd.org/_c/mechanoids/UK-Millennium-Dome-voting-ticket-credit-card-sized-uniquely-coded-tweaked-1-DHD.jpg.html target=_top><amp-img alt=vote/survey class=respfloatrsml height=330 layout=intrinsic src=https://www-earth-org-uk.cdn.ampproject.org/i/www.earth.org.uk/img/a/b/UK-Millennium-Dome-voting-ticket-credit-card-sized-uniquely-coded-tweaked-1-DHD.l95176.211x330.jpg srcset="https://www-earth-org-uk.cdn.ampproject.org/ii/w220/www.earth.org.uk/img/a/b/UK-Millennium-Dome-voting-ticket-credit-card-sized-uniquely-coded-tweaked-1-DHD.l95176.211x330.jpg 220w, https://www-earth-org-uk.cdn.ampproject.org/ii/w470/www.earth.org.uk/img/a/b/UK-Millennium-Dome-voting-ticket-credit-card-sized-uniquely-coded-tweaked-1-DHD.l95176.211x330.jpg 470w, https://www-earth-org-uk.cdn.ampproject.org/ii/w680/www.earth.org.uk/img/a/b/UK-Millennium-Dome-voting-ticket-credit-card-sized-uniquely-coded-tweaked-1-DHD.l95176.211x330.jpg 680w" title=vote/survey width=211></amp-img></a>

with these entirely spurious srcset entries all nominally larger than the 211x330 original.

Why?

It's a waste of HTML and processing time at best on the client, and at worst makes for images that are sent larger than necessary and require client CPU memory and time to resize.

I had a quick response from 'Gregable' 2019-01-02:

None of these images are actually larger than the original. The /ii/w680 is indicating a maximum width, not the actual width. The cache doesn't actually return an image that large, it returns an image of min(original width, indicated width).

You are correct in that in this case, the srcset is not actually helping any. All of the images in the srcset are essentially the same. So in theory it's adding bytes to the HTML document and some very minimal CPU for parsing the srcset string. That said, I think these are probably not worth worrying much about. The extra bytes in the document are going to get gzip compressed away generally.

The reason for this is that the image and document are cached independently. When the srcset is generated, the cache doesn't know the image dimensions. This is done for a few reasons:

  • It's possible that the image could be updated and change dimensions for example. Updating all referencing document when the image changes is possible, but increase cpu costs on the server.
  • It also means that in a cold cache cache, the document cannot be returned to the user until the image has been fetched, which slows down delivery of the document. This has a far more significant affect on user experience than the CPU cost of unnecessary srcset parsing.
...
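
As a rough illustration of the min(original width, indicated width) rule quoted above, here is a minimal Python sketch. The function name is made up for this example, and it assumes that the /ii/wNNN path component is purely a maximum width, as Gregable describes. It shows why all three inserted srcset candidates collapse to the same image for the 211px-wide original:

```python
def served_width(original_width: int, indicated_width: int) -> int:
    """Width the cache would return under the quoted min() rule (an assumption)."""
    return min(original_width, indicated_width)


ORIGINAL_WIDTH = 211             # from the 211x330 image in the report
SRCSET_WIDTHS = [220, 470, 680]  # the w220/w470/w680 entries inserted by the cache

served = [served_width(ORIGINAL_WIDTH, w) for w in SRCSET_WIDTHS]
print(served)  # [211, 211, 211]
```

So the extra entries cost some HTML bytes and parsing, but should not cause a larger image to be sent.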

There is also a comment in this amp-img issue discussion suggesting that what should really happen is to have <amp-img> generate the srcset in javascript with the precise width that the page requires, rather than a few specific possible widths.

It turns out that the only current way to suppress this inserted srcset is to have one of my own already there. Apparently srcset=" " should do, but I could instead insert a genuinely useful srcset with a new smallest entry for both desktop and mobile, maybe the size of the smallest carousel entry (200px wide), as long as it is (say) 20% narrower than the existing smallest entry. That entry would also benefit automatically from an 'L' version for Save-Data...
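
The "(say) 20% narrower" test could be sketched as below. This is only an illustration of the arithmetic: the function name and the 640px example width are hypothetical, and only the 200px candidate and the 211px image come from the text above.

```python
def worth_adding(candidate_width: int, smallest_existing: int,
                 min_saving: float = 0.2) -> bool:
    """True if the candidate is at least min_saving narrower than the
    current smallest srcset entry (the 20% threshold is the 'say' figure)."""
    return candidate_width <= smallest_existing * (1.0 - min_saving)


print(worth_adding(200, 640))  # True: 200px is far narrower than a 640px image
print(worth_adding(200, 211))  # False: too close to the 211px original to help
```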

PageSpeed

Oddities notwithstanding, PageSpeed Insights rates the AMP (and m.) version of this page 100/100! (The www. version is 57/100 because of Google ads!)

Decoding Async

I just became aware of the new decoding=async option for img tags, to try to help performance by deferring the decode step. AMP apparently applies it to all amp-imgs. I am testing applying it to all body IMGs, most of which will not even be above the fold for mobile.
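
A minimal sketch of adding the attribute to every img tag that lacks it, assuming a simple regex pass over already-generated HTML is acceptable (this is not the site's actual build code, and the function name is invented):

```python
import re

# Matches the opening of an img tag that has no decoding attribute
# later in the same tag.
_IMG_WITHOUT_DECODING = re.compile(r"<img\b(?![^>]*\bdecoding\s*=)", re.IGNORECASE)


def add_async_decoding(html: str) -> str:
    """Insert decoding=async into img tags that do not already set decoding."""
    return _IMG_WITHOUT_DECODING.sub(lambda m: m.group(0) + " decoding=async", html)


print(add_async_decoding("<img src=a.jpg alt=x>"))
# <img decoding=async src=a.jpg alt=x>
```

Tags that already carry a decoding value are left alone, so an explicit sync can still be used where needed.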

One of the older discussions gives the best description of apparent intent for this declarative usage, better than the standards! As put by vmpstr:

Give the async attribute three values: async, auto, and sync. sync behaves as today's image elements do without any attribute specified, where once an image has been loaded it will appear immediately (and be decoded synchronously) if inserted into the document. Images marked async can get loaded/decoded best-effort without janking. auto would involve browser heuristics to decide if the image could and should be loaded async or if it needs to be sync. async is therefore just a more aggressive version of auto in practice. sync is mostly just there as a safety valve for developers in case browser heuristics get it wrong.

2018-12-28: Featured Snippet!

When I did a Google search from mobile for one of my key terms, my "Why XXX?" heading, followed by the start of the following para, showed up as a featured snippet. The same (m.) page shows up as a normal SERP entry a little further down with rich text (for a review) and the same snippet.

Hurrah!

Note that there is no special (eg schema.org) markup around the snippet, just a clear short question in the (h2) heading, and a simple short and sweet para immediately underneath answering it.

2018-12-27: AMP Social Media Buttons, and Tests

Tentatively, I have added sharing with amp-social-share.

In all the recent upheaval, AMP support included, a number of things seem to have been silently broken, such as the og:description meta tag being dropped.

I have added unit tests for the test page to cover variants of some of the issues found today.

2018-12-23: AMP Live Today

The amp.earth.org.uk site went live!

(Incidentally that home page can be accessed also via the Google AMP Cache.)

There are a few pages that are not supported for AMP. Most of them may never be, eg because they provide on-page JavaScript-based demos. Some just need a little further housekeeping to join the AMP universe.

In such a case, no AMP page is created nor linked to, and any attempt to access it will be redirected to the vanilla mobile/lite 'm.' page.

An annoying issue: the AMP cache knows that it is loading from an http: source, and rewrites relative links to absolute (http:) links. But it does not rewrite protocol-relative (scheme-relative?) links (eg //WWW.earth.org.uk) to absolute, so they will fail trying to reach a currently non-existent https: server. So I am now introducing pseudo-protocol-relative //WWW.earth.org.uk links to match the existing //STATIC.earth.org.uk links, that get re-written to the primary absolute form, though remain syntactically valid as raw HTML.
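
The rewriting step might look something like this Python sketch. It assumes that the upper-case host is what marks a pseudo-protocol-relative link, and that http://www.earth.org.uk is the primary absolute form (taken from the http: URLs quoted elsewhere in this page). The mapping table and function name are illustrative only, and the real target for the //STATIC links is not given here so is left out:

```python
import re

# Assumed mapping from pseudo-protocol-relative marker to primary absolute form.
PRIMARY_FORMS = {"//WWW.earth.org.uk": "http://www.earth.org.uk"}


def rewrite_pseudo_relative(html: str) -> str:
    """Rewrite marker links to absolute form, leaving other links untouched."""
    for marker, absolute in PRIMARY_FORMS.items():
        # (?<![:\w]) skips links that already carry a scheme such as http:
        html = re.sub(r"(?<![:\w])" + re.escape(marker), absolute, html)
    return html


print(rewrite_pseudo_relative("<a href=//WWW.earth.org.uk/note.html>"))
# <a href=http://www.earth.org.uk/note.html>
```

The match is deliberately case-sensitive, so a genuine protocol-relative //www.earth.org.uk link would pass through unchanged.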

2018-12-24: with some prodding by me, Googlebot is sucking in the AMP pages and starting to report in the Search Console. Interestingly I've just had a complaint about omitting some metadata (embedded video schema.org markup) from the AMP page that is in the desktop page. I've never had such a complaint about a linked m-dot/lite page which does the same. So I've fixed the build script to show all of that related metadata in all versions now.

Now all but 15 (out of ~207) pages have AMP versions. I manually fixed up nearly 600 links to deal with the protocol-relative issue also...

2018-12-21: AMP and Inline CSS Styles

I had somehow convinced myself that AMP did not allow inline CSS style in the page body. So I did a lot of work to eliminate common inline styling, partly because doing so can also make the page smaller. But I was sure that it was going to be a big problem for many remaining pages.

(Maybe this would indeed have been an issue until recently, see eg "You can't use inline 'style="..."' tags", but "AMP: Supported CSS" says that inline styling is allowed, as I read it today.)

However, inline styling seems not to be a significant problem since only ~40 of the ~200 main pages are failing to validate in AMP form. And many of those have simple/known img and script issues...

The validator uses the latest published set of rules to apply across the network, which means that details and summary are already passing validation (hurrah!). But it also means that attempting to build and validate AMP pages fully off-line, as I can with desktop and vanilla mobile, will result in:

ERROR: validation: Unable to fetch https://cdn.ampproject.org/v0/validator.js - getaddrinfo ENOTFOUND cdn.ampproject.org cdn.ampproject.org:443

I am not sure that I want to be prodding the CDN for every single page rebuild, for a number of reasons.

Maybe amphtml-validator-rules would be part of a mechanism to help me work more locally.

2018-12-23: Google's Search Console still objects to the details and summary tags, declaring the AMP page to be 'invalid'.

2018-12-20: Lighter Error Page

I have reduced the size of the custom 404 error page to 894 bytes of body when pre-GZIP-compressed, plus ~340 bytes of desktop HTTP/1.1 headers. Thus the whole HTTP response for it should be able to fit in a single TCP segment (ie one packet) to most clients. The mobile/lite page is even smaller.
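
As a quick check of that arithmetic, here is a small Python sketch. The segment sizes are assumptions: they are the usual maximum payloads for a 1500-byte Ethernet MTU with no TCP options, and real paths may allow a little less.

```python
BODY_BYTES = 894    # compressed body size quoted above
HEADER_BYTES = 340  # approximate desktop HTTP/1.1 header size quoted above

# Assumed typical maximum segment sizes for a 1500-byte MTU:
# IPv4: 1500 - 20 (IP header) - 20 (TCP header)
# IPv6: 1500 - 40 (IP header) - 20 (TCP header)
MSS_IPV4 = 1500 - 20 - 20
MSS_IPV6 = 1500 - 40 - 20

response_bytes = BODY_BYTES + HEADER_BYTES
print(response_bytes, MSS_IPV4 - response_bytes, MSS_IPV6 - response_bytes)
# 1234 226 206
```

So under these assumptions there are roughly 200 bytes to spare, which leaves room for common TCP options.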

All informational footers, and social-media header support such as og:image and twitter:card, are omitted for such an error page. This saves ~200 bytes from the GZIPed size.

There is a little more that could be stripped out, eg a little residue of social media button support (~80 bytes uncompressed) that could go, and would benefit all desktop pages not needing such support.

2018-12-15: Speakable Markup

Though it is unlikely to be used any time soon (ie probably only for US-originated Google News searches for now), I am starting to fold in some support for the 'pending' Schema.org 'speakable'.

I was partly spurred on by the relevant parts of the discussion in What to Expect in 2019 with Google's John Mueller.

This may in future help screen readers and voice searches. Being another site providing this data may in a tiny way speed its adoption.

Google's documentation has this firmly marked as BETA for now. (Also see "Add vocabulary to indicate which sections of a document are particularly 'speakable'".)

Note that Google's docs say not to use both xpath and cssSelector, but I am picking out title and description with the former and an optional intro para with the latter. I have split the structured microdata for the latter into its own itemscope and Google's Structured Data Testing Tool seems OK with that, showing the implied value for each item correctly.

All the new meta/structured data is at the very end of the document, so not in the CRP (Critical Rendering Path). Hurrah!

Note that HTML minification for the m-dot version rearranges (sorts) tag attributes to try to improve compression. However this seems to silently break extraction by Twitter of some header meta data such as description under some circumstances. So I have stopped doing that particular sorting.

The m-dot minifier also omits the inferrable head tag, and this minified HTML apparently defeated parsing when the xpath was in its full precise form:

/html/head/meta[@property='og:title']/@content

All is happy again when I slightly generalise the xpath, with minimal risk of picking up stray tags later!

//meta[@property='og:title']/@content
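
The difference between the two forms can be shown with a small Python sketch. This uses the standard library's ElementTree on a well-formed XML stand-in for the minified page, as a model of a consumer that does not synthesise the missing head element (a full HTML parser normally would). ElementTree supports only a subset of XPath, so the attribute is read with get() rather than /@content, and the sample title is made up:

```python
import xml.etree.ElementTree as ET

# Stand-in for the minified page: no <head>, meta directly under <html>.
# Written as well-formed XML only so that ElementTree can parse it.
DOC = (
    "<html>"
    "<meta property='og:title' content='Example title'/>"
    "<body><p>text</p></body>"
    "</html>"
)
root = ET.fromstring(DOC)

# Precise form: needs a head element as a child of the root html element.
precise = root.find("head/meta[@property='og:title']")
# Generalised form: matches the meta element anywhere below the root.
general = root.find(".//meta[@property='og:title']")

print(precise)                 # None
print(general.get("content"))  # Example title
```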

Schema.org SpeakableSpecification Example

Grabbed from the EOU home page (with some wrapping for readability):

<span itemprop=speakable itemscope itemtype=http://schema.org/SpeakableSpecification>
  <meta itemprop=xpath content="//meta[@property='og:title']/@content">
  <meta itemprop=xpath content="//meta[@property='og:description']/@content">
</span>
<span itemprop=speakable itemscope itemtype=http://schema.org/SpeakableSpecification>
  <meta itemprop=cssSelector content=.pgintro>
</span>

Note that this is largely fixed because it refers to existing pieces of text. The .pgintro part is left out if the page does not have a pgintro chunk of text. The title and description are always present, however.
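
A minimal Python sketch of generating that markup, with the .pgintro item emitted only when the page has such a chunk. The function name and the has_pgintro flag are illustrative, not taken from the real build script:

```python
def speakable_markup(has_pgintro: bool) -> str:
    """Build SpeakableSpecification microdata like the example above."""
    spec = ("<span itemprop=speakable itemscope "
            "itemtype=http://schema.org/SpeakableSpecification>")
    lines = [
        spec,
        "  <meta itemprop=xpath content=\"//meta[@property='og:title']/@content\">",
        "  <meta itemprop=xpath content=\"//meta[@property='og:description']/@content\">",
        "</span>",
    ]
    if has_pgintro:
        # Separate itemscope so xpath and cssSelector are not mixed in one item.
        lines += [spec, "  <meta itemprop=cssSelector content=.pgintro>", "</span>"]
    return "\n".join(lines)


print(speakable_markup(has_pgintro=False))
```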

2018-12-09: Random Page Build Order

At times there may be more than one make running to try to rebuild EOU. For example, while the battery charge is high a make -k all may be run every hour.

In particular, the rebuild of each page has a lock around it. Two or more make processes may end up continually trying to make the same page next, with one of them being excluded by the lock after a timeout, and moving on. The multiple processes tend to stay in lockstep, and all that lock contention wastes time and reduces parallelism.

In general make tries to be reasonably dependable and consistent, and shaking that up is hard. A reasonable solution for my gmake on *nx is, given my list of main pages in a "simply expanded variable" PAGES:

PAGES := pageA.html pageB.html ... another.html

... and given that each page's build is independent of the others, then adding this line afterwards mixes things up:

PAGES := $(shell echo $(PAGES) | xargs -n1 | sort -R | xargs)

This works fine with independent make runs with or without -j to add parallelism.

The cost is the execution of the shell command once per make invocation.
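
To see why shuffling helps, here is a simplified Python model. It ignores lock timeouts and build times and just counts how often two builders would want the same page at the same step. The page names and function names are hypothetical.

```python
import random


def shuffled(pages, rng):
    """Return a shuffled copy, leaving the input list untouched."""
    out = list(pages)
    rng.shuffle(out)
    return out


def same_step_clashes(order_a, order_b):
    """Count steps at which two builders would want the same page."""
    return sum(1 for a, b in zip(order_a, order_b) if a == b)


pages = [f"page{i}.html" for i in range(200)]  # hypothetical page names

# Two runs using the same fixed order clash at every step.
print(same_step_clashes(pages, pages))  # 200

# Two independently shuffled runs rarely want the same page at the same step.
run1 = shuffled(pages, random.Random())
run2 = shuffled(pages, random.Random())
print(same_step_clashes(run1, run2))  # varies from run to run, usually very small
```

In this model the expected number of clashes between two independent shuffles is about one, whatever the number of pages.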

2018-12-07: IMG Test Cases

It's a day off for me today, so of course what I do before breakfast is add a couple of tricky unit test cases for IMG!

[Image: solar PV grid tie roof mounted power system on slate roof on outbuilding shed garage 1 DHD]

Let joy be unconfined!

I am modifying IMG to be able to accept as src a (standard) thumbnail URL in my 'CMS' gallery.hd.org, at least for body images in the first instance.

This means that I need not copy, minify and check-in to the VCS (Version Control System, eg SVN) body images for EOU if they are already being hosted in the Gallery.

(This also means that the IMG tag remains valid HTML if not translated, even if it will be non-optimal in a number of ways.)

The appropriately scaled images will still be served from directly under /img/a/ rather than by the Gallery, and clicking the image will take the visitor through to the Gallery catalogue page.

Which means that I can drop almost any still raster image in, on whim!

2018-12-02: IMG Helps

The IMG tag is helping me to spruce up existing pages, eg in adding new images to them. Even if I never take AMP pages live, the mechanism is useful. It helps to only need src and class attributes. It is proving helpful to be able to manually set alt sometimes.

So, for example, On Greening Christmas had its rather poor lone image improved and a new one added, and that second image was also added to Low Carbon Family Holidays. I automatically get smaller lower-weight versions for the mobile pages, along with really-low-weight Save-Data versions, and a link back to the source image if too large to be used directly, so possibly containing other information of interest to the visitor. What's not to like?

2018-12-01: Atom Feeds

I freed up a little space on the CRP (Critical Rendering Path) for the home pages (desktop and mobile) and so inserted a header link to the basic Atom site feed. My Firefox "Brief" plugin picks that up and shows a feed button in the URL bar, and I am hoping that other browsers give similar signals.

Soon I shall drop both the RSS/Atom and G+ 'social media' buttons for each page, and so keeping the feed on the home page in this way may be useful.