Earth Notes: On Website Technicals (2018-04)

Updated 2024-02-25.

Tech updates: reading time, jpegtran to jpegultrascan, Primitive, SVG, Save-Data, Sitebulb.

This month involved a lot of work making the site experience faster for those on slow links and small devices, and preventing automatic injection of ads into 'lite' pages was a good part of that effort. Also improving internal linking seems to have been useful, given recovery in search traffic a few months later...

2018-04-29: Internal Linking

One potential problem that Sitebulb highlighted was that some of the articles are only linked to through the sitemap. I haven't been diligent enough with my manual "See also" sidebar links to show visitors the other (related) content that I have of possible interest to them.

More generally, this shows up as pages with low numbers of inbound internal links at the left-hand end of Sitebulb's Links pane "Unique Followed Internal Links/URLs by Percentile" graphs.

Much of my content is only of interest to me, I suspect, so nothing of value was lost! Technically all articles can indeed be found through both the XML and HTML sitemaps. So search engines (SEs) can find everything. And I don't mind people dipping in for one page from a search. Glad to be of service.

But I can do better.

So, experimentally, I am now auto-injecting a link to a 'similar' article in those full-fat pages already large enough to bear a table of contents. At the moment I am judging similarity by tags, but length and readability are candidates too. A link is not injected if apparently already present somewhere on the page. And a link to self is off the cards!

I pondered quick ways to do this while publishing each page, helpful to the user, and helpful to me to promote appropriate articles. The current solution is to see if some of the newest articles are a good fit, then some that are not currently popular. Those are filtered as described above. The first surviving match is used.
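The selection order described above can be sketched roughly as follows. This is only an illustration of the idea, not the site's actual build code: the Article fields, the pick_similar name, and the newest_n cut-off are all hypothetical, and "similar" is reduced here to sharing at least one tag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Article:
    # Hypothetical record; the real build system's data model is not shown.
    path: str
    tags: frozenset
    published: str  # ISO date, so string order matches date order
    hits: int       # some popularity measure, lower means less popular


def pick_similar(current, articles, linked_paths, newest_n=3):
    """Return the first acceptable 'similar' article, or None."""
    # Try a few of the newest articles first, then the least popular ones.
    newest = sorted(articles, key=lambda a: a.published, reverse=True)[:newest_n]
    unpopular = sorted(articles, key=lambda a: a.hits)
    for candidate in newest + unpopular:
        if candidate.path == current.path:
            continue  # never link to self
        if candidate.path in linked_paths:
            continue  # already linked somewhere on the page
        if not (candidate.tags & current.tags):
            continue  # no shared tag, so not judged similar
        return candidate  # first surviving match wins
    return None
```

Because the first survivor is used, the ordering of the candidate list is what decides which articles get promoted.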

I believe that this is low-key and useful, but I'll stop if not.

These embedded links will change relatively slowly, since the pages are fully static. They can only change when the page is rebuilt for some reason.

If the links to currently less popular pages make them more popular, then this whole system may oscillate slowly!

2018-04-21: Sitebulb

I have spent a while trying out the trial version of the Sitebulb desktop website crawler.

It has certainly turned up a few errors (some subtle) in EOU. There are many 'hints' that it provides that I respectfully disagree with.

Even though I am on the free trial, Sitebulb has been very responsive to bug reports from me (some bogus, on reflection)!

During my short time using Sitebulb, I upgraded (smoothly!) from 2.0.0 through 2.0.1 to 2.0.2. The installer did useful things such as offering to trash the download file once the install finished.

For someone providing services to a number of clients, and who can take a professional view on the suggestions made by Sitebulb, I think that this is an excellent tool for their tool bag.

Product: Sitebulb Website Crawler 2.0.2

Website auditing tool for SEO consultants and agencies, for Mac and Windows.

Brand: Sitebulb
MPN: 2.0.2

Review summary

14-day free trial
Potentially a very useful tool for a consultant, or for an owner of multiple sites generating revenue. Sitebulb desktop website crawler performs a thorough crawl and cross-check of many aspects of a site's content and behaviour. Tested on x64 Mac OS X laptop. Support is good.
Rating: 4/5
Published: 2018-04-21
Updated: 2021-12-02

See my 5.4.0 review.

2018-04-17: Save-Data and Lite HTML

I still have no easy way of measuring how much the Save-Data hint is used with this site. Where it is used, image weight drops.

It occurs to me that it may be good to (temporarily, 302) redirect to the mobile page equivalent any main-page HTML request made with Save-Data. At the cost of a round-trip time and a little HTTP overhead on the first such page, this should reduce the weight of the HTTP and HTML, and many subsequent images even further. And further pages viewed (on the lite site) should be lightened too. (This would also require adding a suitable Vary header for such pages, for cache consistency.)

To allow for a visitor with Save-Data on, who still wishes to see the full site HTML, simply with lighter images, then the redirect could be inhibited if the referrer is from the EOU lite or full site. For example, if they are redirected to a 'lite' page, and then explicitly click on the 'full' link, they would not be bounced back. Once clicking from one full page to another they would not be bounced back either.
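As a way of pinning down the redirect rules so far, here is a minimal sketch of the decision in Python. It is not how the site is configured (that would be Apache rewrite rules); the host names, the function name, and the mapping of a full path onto the same path on the lite host are all assumptions made for illustration.

```python
from urllib.parse import urlsplit

# Hypothetical host names standing in for the full and lite sites.
FULL_HOST = "www.example.org"
LITE_HOST = "m.example.org"


def save_data_response(path, headers):
    """Return (status, response_headers) for a main-site HTML request.

    headers is a plain dict using the exact names 'Save-Data' and 'Referer'.
    """
    # The response depends on both request headers, so both go in Vary.
    vary = "Save-Data, Referer"
    tokens = headers.get("Save-Data", "").lower().replace(",", " ").split()
    wants_lite = "on" in tokens
    referer_host = urlsplit(headers.get("Referer", "")).hostname
    from_own_site = referer_host in (FULL_HOST, LITE_HOST)
    if wants_lite and not from_own_site:
        # Temporary redirect to the assumed lite equivalent of the page.
        return 302, {"Location": f"https://{LITE_HOST}{path}", "Vary": vary}
    # Arrived from the lite or full site (or no Save-Data): serve as asked.
    return 200, {"Vary": vary}
```

A visitor with Save-Data on who clicks the 'full' link from a lite page sends a Referer on the lite host, so they get the 200 branch and are not bounced back.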

Alternatively, as for the 'L' lo-fi versions of image files where available, the lite page content could be directly substituted for the full, with no redirect. Simpler, saves some time and bandwidth, and only Save-Data needs to be in the Vary, not Referer also. The 'full' link would apparently be broken and may cause confusion at that point. There is a small amount of (non-critical) content that the user would not be able to get to without disabling Save-Data. But the user can turn off Save-Data if they want access to that. The user has control.

The total weight of the home page for the 'full' site is now about 60--70kB (from over 100kB at the start of the month). The lite home page weighs in at a little over 30kB. All of these numbers are without the Save-Data header invoked, so would fall further. It would be reasonable to hope for a 3-fold drop in weight total if a Save-Data visitor to the home page was fed 'lite' page content too.

There may need to be some special care taken with Google/Bing verification HTML files. In general, as for images, we might simply not attempt redirection unless the target 'lite' page exists.

This all would have to interact nicely with the .htmlgz pre-zipped pages (and .htmlbr for brotli in future). Some effort would be needed to avoid combinatorial explosion.

This is ugly and complex enough to warrant some live unit tests to ensure correct behaviour, and that it is maintained!

Also, the fact that most browsers treat Vary as a validator means that a client switching between Save-Data and not would effectively flush its cache each time, possibly increasing data usage.

2018-04-16: SVG Background Inline Image Fun

Inlined SVG got me thinking!

I have a file containing CSS to signal warning text with a backdrop of grey question marks:

.warning{color:#000;background-image:url(img/q.jpg)}

(The warning class has been applied to this section.)

That piece of CSS isn't large (52 bytes pre-compression), but pulls in more than 600 bytes of not very lovely JPEG file. And that implies another HTTP connection, at least the first time.

However, by using inline SVG the CSS is now 177 bytes (before compression), and no extra HTTP connection is needed. The 'image' is not cacheable, but it isn't used in many places either, so no great loss.

.warning{background-image:url("data:image/svg+xml,<svg xmlns='http://www.w3.org/2000/svg' height='256' width='256'><text y='99' font-size='99' fill='lightgray'>?</text></svg>")}

I needed to tweak one of the HTML files to keep its header small enough when using the new CSS. (So that some body content will arrive in the first compressed TCP frame.)

2017-04-21 update: for safety and conformance a few of the characters must be URL encoded. A full URL encoding is unnecessary, and would be fatter than base64. The main ones that cannot be avoided are <, > and #. (Embedded quotes that match the enclosing ones, non-ASCII characters, and URL-unsafe characters such as '%' also need encoding.) I also moved the y='99' and changed the colour to hex in hope of improved compression. Now 190 bytes before compression.

.warning{background-image:url("data:image/svg+xml,%3csvg xmlns='http://www.w3.org/2000/svg' height='256' width='256'%3e%3ctext font-size='99' y='99' fill='%23ccc'%3e?%3c/text%3e%3c/svg%3e")}
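The minimal encoding described above could be automated along these lines. This is a sketch under the stated rules only (encode <, >, #, %, the enclosing quote character, and anything outside printable ASCII, using lowercase hex as in the CSS above); the function name is made up for illustration.

```python
def svg_data_url(svg, quote='"'):
    """Build a data: URL for SVG text, percent-encoding as little as possible.

    quote is the quote character that will enclose the URL in the CSS.
    """
    must_encode = set("<>#%") | {quote}
    out = []
    for ch in svg:
        if ch in must_encode or not (" " <= ch <= "~"):
            # Percent-encode each UTF-8 byte of the character, lowercase hex.
            out.append("".join("%%%02x" % b for b in ch.encode("utf-8")))
        else:
            out.append(ch)
    return "data:image/svg+xml," + "".join(out)


svg = ("<svg xmlns='http://www.w3.org/2000/svg' height='256' width='256'>"
       "<text font-size='99' y='99' fill='#ccc'>?</text></svg>")
print('.warning{background-image:url("%s")}' % svg_data_url(svg))
```

Using single quotes inside the SVG and double quotes around the URL means none of the attribute quotes need encoding.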

2018-04-15: Dir Lite: SuppressDescription

In Apache's mods-available/autoindex.conf config file I have added SuppressDescription to the IndexOptions line. I don't think that the descriptions were ever populated with anything useful, so they just wasted space. They also forced search engines to uselessly explore the extra implied state space to find that out.

I am doing this while deciding if/how to ban all access to the site with query strings, given that the main site content is entirely static.

I am partly provoked by clumsy hacking attempts and by log entries such as:

... [15/Apr/2018:14:18:52 +0000] "GET /out/hourly/button/intico1-48.png?rand=1523801932121 HTTP/1.1" 200 1628 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"
... [15/Apr/2018:14:19:08 +0000] "GET /out/hourly/button/intico1-48.png?rand=1523801948119 HTTP/1.1" 200 1628 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"
... [15/Apr/2018:14:19:24 +0000] "GET /out/hourly/button/intico1-48.png?rand=1523801964131 HTTP/1.1" 200 1628 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"
... [15/Apr/2018:14:19:39 +0000] "GET /out/hourly/button/intico1-48.png?rand=1523801979136 HTTP/1.1" 200 1628 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"

The perp here has artificially added a 'cache-buster' query parameter, presumably because they can't work out how to properly check with an If-Modified-XXX request. This image only gets updated once every ten minutes!

The problem with an outright ban is legitimate referrals such as:

... [15/Apr/2018:14:19:05 +0000] "GET /low-carbon-investing.html?utm_source=feedburner&utm_medium=twitter&utm_campaign=Feed%3A+EarthNotesBasicFeed+%28Earth+Notes+Basic+Feed%29 HTTP/1.1" 200 14081 "-" "Java/1.7.0_151"

Maybe redirects with empty/removed query strings for HTML docs, and rejections for most everything else, would work.
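To make that policy concrete, here is a small Python sketch of the decision. It only models the logic; it is not server configuration, and the choice of 403 for rejections and of treating paths ending in .html or / as HTML docs are assumptions for illustration.

```python
from urllib.parse import urlsplit


def query_policy(request_target):
    """Return (status, path) for a request target such as '/a.html?x=1'."""
    parts = urlsplit(request_target)
    if not parts.query:
        return 200, parts.path  # no query string: serve as normal
    if parts.path.endswith(".html") or parts.path.endswith("/"):
        return 301, parts.path  # HTML doc: redirect to the bare URL
    return 403, None            # anything else with a query: reject
```

Under this sketch the feed referral above would be redirected to the plain page, while the cache-busting image fetches would be refused.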

Even more simple, just strip query parameters (this is for Apache 2.2):

RewriteCond %{QUERY_STRING} .
RewriteRule ^/(.*) /$1? [L,R=301]

For consistency I should then probably add IgnoreClient to the IndexOptions config line also, so that clients are not invited to add query parameters at all. Thus, for the EOU site (not across all sites as above) I have added:

<IfModule mod_autoindex.c>
<Directory />
    IndexOptions +IgnoreClient
</Directory>
</IfModule>

2018-04-13: primitive

I am contemplating making inline 'lite' hero images again, eg for the front page column headers. That would allow even fewer separate HTTP fetches for images. And those column header heroes are rarely used in other situations, so any loss of cacheing is minimal.

I am drawn again to the idea of inline SVG placeholders built with primitive.js and optimised with svgo.

There's an online demo for primitive.js. With just eight triangles in SVG this is possible:

8-triangle 'primitive' rendering of pumpkin.

See the original JPEG. The output was passed through svgo to get to 735 bytes, and gzips to 398 bytes.

For SVG that is going to be inlined, the xmlns apparently may be omitted for further byte savings. The svgo removeXMLNS plugin may do this.

Note that since the primitive generation is randomised, a different (equivalent) rendering may be produced each run, which may not play well with long Cache-Control max-age and byte-range fetching... On the other hand on a new run a smaller output could be retained each time.

The 'quality' slider to control output size is the number of primitives drawn.

I prefer the look of pure triangle renderings. (Ellipses and rectangles are also available.)

For images with any inherent complexity, the rendering with (say) 8--16 triangles, corresponding to an on-the-wire compressed size of under 1kB, is often poor. In other words it's difficult to beat my current 'L' compression for JPEG and PNG and have anything recognisable. But it is definitely possible to make something moderately pretty and small enough to inline.

Inline SVG apparently does not interact well with responsive design, ie constructs such as max-width set to 100%. Also I don't know how to do the equivalent of an "alt" tag. I could possibly get round both of those using an img tag with a base64-encoded data URL, at the cost of some size bloat. Maybe for small hero images that should never need to shrink, these issues are not critical.

As an interesting aside, html-minifier strips quotes from the SVG attribute values as if HTML5 rather than XML or XHTML. That does not seem to break anything, and saves some bytes, but is it safe? The html-minifier docs mention that "SVG tags are automatically recognized, and when they are minified, both case-sensitivity and closing-slashes are preserved, regardless of the minification settings used for the rest of the file," but do not mention attribute values. I note that my Opera Mini, Safari, Firefox and Chrome browsers all display the minified SVG fine in the lite version of this page.

At the moment I cannot easily install a primitive.js to use from the CLI.

2018-04-12: Ads off Lite Again!

I turned off most ads for mobile/lite a few months ago given the page weight, and their apparent ineffectiveness. I turned them on again experimentally 2018-03-26. Though the ads gained a reasonable number of impressions, and looked quite reasonable to me on my Opera Mini, I earned a grand total of 20p in that time, while bumping up typical page weight from ~30kB to over 10x that!

ToC? Tick!

Separately, I've also flagged up harder-to-read tech and research texts with a tag in the Contents (ToC) line. Only for 'full' pages though.

2018-04-10: jpegultrascan

I spun the wheel on the jpegultrascan.pl front-end for jpegtran: a "JPEG lossless recompressor that tries all scan possibilities to minimize size."

It's systematic and portable: a single Perl script that uses jpegtran. It is also very slow! (As are many of the "squeeze out the last drops" tools in this space, such as zopflipng.)

See the log from running jpegultrascan against all existing hero images. They had already been trimmed with jpegtran, eg using custom scan scripts. The log only includes images for which there was any size reduction. In total, 100 out of 250 JPEG hero images.

As with jpegrescan, looking at common simple patterns in the scans it generated suggested some static ones to try manually. This simply has the components re-ordered to (I hope) reduce any green Martian effect:

0;
2;
1;

There are just a few compressions above 3%, requiring some very bizarre scans:

ASHP-fan.l79593.640x80.jpg.scans:Change: -3.604914%
fan-sq.l403381.800x200.jpg.scans:Change: -5.404713%
water-closet-trough-bad-arrangement-AJHD.l146741.800x200.jpg.scans:Change: -3.035947%

For example, for the fan:

0: 0 0 0 0;
1: 0 0 0 0;
2: 0 0 0 0;
0: 1 4 0 0;
1: 1 4 0 0;
0: 5 5 0 0;
0: 6 6 0 0;
0: 7 7 0 0;
0: 8 8 0 0;
0: 9 12 0 0;
0: 13 13 0 0;
0: 14 14 0 0;
0: 15 16 0 0;
0: 17 17 0 0;
0: 18 19 0 0;
0: 20 20 0 0;
0: 21 22 0 0;
0: 23 24 0 0;
0: 25 31 0 0;
0: 32 33 0 0;
0: 34 38 0 0;
0: 39 63 0 0;
1: 5 63 0 0;
2: 1 8 0 0;
2: 9 63 0 0;

2018-04-08: jpegrescan

I have done some very preliminary testing with jpegrescan on my Mac, which tries different scan patterns to maximise compression. There may be up to ~2% further compression available, but some scan patterns may result in green Martians. So I don't think that I want to actually use jpegrescan as-is.

(In any case, I had difficulty finding a pre-packaged version that worked on the RPi. Nothing apparently available via apt-get, and the npm package broken somehow...)

Looking at some of the scan patterns that jpegrescan generates, and existing patterns that I have, suggests some simple alternatives. This synthesis has one less pass than "semi-progressive" and seems useful:

0 1 2: 0 0 0 0;
2: 1 63 0 0;
1: 1 63 0 0;
0: 1 63 0 0;

That sends all the DC components first, then the (less important) chroma components' AC, then finally the luma AC.
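When hand-writing scan scripts like this it is easy to leave a coefficient range out. The following Python sketch checks that every component gets all 64 coefficients (0 is DC, 1 to 63 are AC). It is a hypothetical helper, not part of jpegtran, and it only understands the two entry forms used on this page: a bare component list such as "0;" meaning the whole component, and "components: Ss Se Ah Al;". It ignores the successive-approximation fields, which are all zero here.

```python
def scan_coverage(script):
    """Map component index -> set of coefficient indices sent by the script."""
    covered = {}
    for entry in script.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        comps, _, params = entry.partition(":")
        if params.strip():
            ss, se, _ah, _al = (int(p) for p in params.split())
        else:
            ss, se = 0, 63  # bare entry: whole component in one scan
        for comp in (int(c) for c in comps.split()):
            covered.setdefault(comp, set()).update(range(ss, se + 1))
    return covered


def missing_coefficients(script, components=(0, 1, 2)):
    """Return {component: [missing indices]} for any incomplete component."""
    covered = scan_coverage(script)
    full = set(range(64))
    result = {}
    for comp in components:
        gap = full - covered.get(comp, set())
        if gap:
            result[comp] = sorted(gap)
    return result


SCRIPT = """
0 1 2: 0 0 0 0;
2: 1 63 0 0;
1: 1 63 0 0;
0: 1 63 0 0;
"""
print(missing_coefficients(SCRIPT))
```

An empty result means that each of the three components has its DC and all of its AC coefficients covered.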

Sample output from running script/lossless_JPEG_compress *.jpg*:

INFO: file     4216 shrunk to     4119 (semi-semi-progressive) hero/OpenTRV-Green-Challenge-entry-outtake.l91208.800x200.jpgL
INFO: file     6551 shrunk to     6509 (semi-semi-progressive) hero/SS-MPPT-15L.l176004.800x200.jpgL
INFO: file     5075 shrunk to     5015 (semi-semi-progressive) hero/TV-replacement.l100518.800x200.jpgL
INFO: file    15179 shrunk to    15165 (semi-semi-progressive) hero/ZWD14581W.l148865.800x200.jpg

Visually the results look acceptable, at least for a few samples. So I probably haven't missed anything critical out.

% file img/autogen/hero/*.jpg* | egrep progressive | wc -l
      71
% file img/autogen/hero/*.jpg* | egrep baseline | wc -l
     147

% file img/autogen/hero/*.jpgL | egrep baseline | wc -l
      83
% file img/autogen/hero/*.jpgL | egrep progressive | wc -l
      26

Having added a couple more scans files, tested to be useful, the worst-case difference compared to jpegrescan is under 2%.

I am tempted to run jpegrescan unconditionally for images that will be inlined in the 'lite' pages to maximise any space savings. Green Martians won't be an issue since these images are small enough that intermediate states should never (or very rarely) be visible. Saving bytes in the HTML is good though.

One odd feature while processing 'grayscale' (Y-only) JPEG images... On the Mac they are seen as single channel by mediainfo and jpegtran. Trying to run a 3-channel scan script on such an image in fact causes jpegtran to error and abort. On the RPi side the same images are seen and processed as if YUV 3-channel. Odd.

2018-04-07: jpegtran

I downloaded jpegtran (for macOS brew install jpeg, for RPi Raspbian sudo apt-get install libjpeg-progs). This performs lossless JPEG transformations.

I created a little JPEG post-processor that tries progressive, non-progressive, and two scan scripts suggested by Cloudinary on each generated hero JPEG. The smallest is retained, if any is smaller than the original. (Most are not.)
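The "try several, keep the smallest" logic can be sketched in Python as below. This is not the actual post-processor: the function names are hypothetical, and the use of -copy none (which drops metadata) and -optimize is an assumption. The jpegtran switches themselves (-copy, -optimize, -progressive, -scans, -outfile) are standard ones.

```python
def jpegtran_command(src, dest, progressive=False, scans_file=None):
    """Build an argument list for one lossless jpegtran attempt."""
    cmd = ["jpegtran", "-copy", "none", "-optimize"]
    if scans_file:
        cmd += ["-scans", scans_file]   # custom scan script
    elif progressive:
        cmd.append("-progressive")      # default progressive scans
    cmd += ["-outfile", dest, src]
    return cmd


def pick_smallest(original, variants):
    """Return (label, data) for the smallest version.

    The original is kept unless some variant is strictly smaller.
    variants maps a label such as 'progressive' to that variant's bytes.
    """
    best_label, best = "original", original
    for label, data in variants.items():
        if len(data) < len(best):
            best_label, best = label, data
    return best_label, best
```

Each command would be run (for example via subprocess.run) to produce one candidate file, and the bytes of each candidate passed to pick_smallest.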

The script that I use to do this can be applied as a batch to all the .jpg* autogenerated hero images quite quickly and easily. So I can try out new ideas with relatively low pain. Though an error could trash everything and require a slow rebuild!

Typical savings are small. Where available they appear to be typically single digit percent, though there are examples over 10%. (This, however, is similar to the advantage of zopfli over gzip, and thus applying zopflipng at a similar PNG postprocessing stage.)

The results seem to support the rule of thumb of ~10kB or above being better compressed with progressive. There are plenty of exceptions though.

I am now by and large sending hero images smaller than 10kB to mobiles. Thus by happy accident I am still usually saving CPU time and battery for them, though being driven entirely by file size at this final stage.

2018-04-05: Reduced Weight 'Simple' Hero Images

When I auto-generate hero images I already impose maxima for size in bytes (much lower for 'lite' than 'full'), bits per pixel (again, lower for 'lite'), and JPEG quality and PNG bits per colour channel.

That generally results in a fairly compact but reasonable-looking image to keep page weight down. Hero images are mainly decoration rather than information, so they should not dominate page weight.

But I noticed that some 'simple' images, eg JPEGs with a lot of sky, were still bigger than they needed to be. Various image analyses were claiming that I could do better within the JPEG format.

So I added a new cap on size. First I have Imagemagick create a version of the hero image without being given a specific 'quality' value, so it will try to replicate that of the original. I don't allow the final hero image to be larger than that version. The aim is to avoid generating a hero image with an artificially high 'quality' and thus size, with no visible benefit.

This new cap knocks 10kB/30% off some of the 800x200 hero images on the 'full' site, without immediately discernible artefacts. The analyses available within WebPageTest, built-in and via Cloudinary, now indicate that the hero images are compressed to within a few percent of maximum.

(I was going to do something more clever with capping image similarity to below a 'perfect' version of the hero image, but that did not work well. Maybe I'll revisit...)

However, this extra compression is probably sailing close to the wind. I may crank this back if my heroes begin to look too shabby.

Note that this new cap nominally applies to hero images for the 'lite' site as well. But they are already constrained enough for this not to make any difference in practice.

(There may be a tiny amount of extra compression to be had flipping between progressive and 'normal' JPEG rendering losslessly, eg with jpegtran. Later...)

2018-04-03: Reading Time Flagged Up In Contents

Some Web pages show a notion of "reading time" at the start. I have mixed feelings about this, in that it can feel a bit patronising for a start. But it can also provide a heads-up before accidentally launching into a major read. Given the variability of article length on this site, that seems like a helpful hint.

Interestingly, when trying to find out how others computed reading time, I found that most do it by words. But I note that there's a huge spread of assumed words-per-minute (WPM) reading speeds. (From ~80WPM to ~200WPM seems mainstream.) Some of that may depend on the material and some on the audience. Also, the complexity of the text will matter. And indeed how accurate the word count is!

Taking all that into consideration I have initially plumped for a middle-of-the-road 150WPM presumed reading speed for my text and audience.

I also hit on a good place to show the estimated reading time: in the (unexpanded) table of contents. That location seems semantically appropriate. It also does not take up any extra space on the page, given the current layout.

I don't show read time where it would be less than a couple of minutes. Nor where there is generated text inserted into the page, since the estimate would likely be wrong. Nor indeed where no table of contents is shown.
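A word-count estimate of this kind is only a few lines of code. The sketch below assumes the 150WPM figure above, counts whitespace-separated tokens as words, rounds to the nearest minute, and returns nothing below a two-minute floor. The function name and the exact rounding are illustrative choices, not necessarily what the site does.

```python
import re


def reading_time_minutes(text, wpm=150, min_minutes=2):
    """Estimate reading time in whole minutes, or None if too short to show."""
    words = len(re.findall(r"\S+", text))  # crude whitespace word count
    minutes = round(words / wpm)
    return minutes if minutes >= min_minutes else None
```

For example, a 600-word article gives 4 minutes, while a 150-word note gives None and so would show no estimate.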

After some faffing around I've decided not to include this on 'lite' pages, even though it's only about a dozen bytes. 'Lite' should in general be stripped of redundant information, and this is indeed redundant.