Archival cascades: a practical way to not break URLs

26 May 2021

Photo of cascading waterfalls — Cascades are beautiful things.

This week, I reduced Small Technology Foundation’s DigitalOcean hosting costs from ~$90/month to $5/month by moving our canonical source code repositories, as well as a few other servers, to the complimentary hosting that the Eclips.is initiative by Greenhost and Open Technology Fund provides to our not-for-profit.

As part of the move, I had to decide what to do with the three servers that were still running various parts of the Ind.ie web site:

The latest version of the Ind.ie web site, with a notice that we were now Small Technology Foundation (this was a Hugo site).
The site we had from 2013-2017 (this was a server-side-generated site using a custom engine I’d written in Node.js).
The labs site, which linked to a number of projects and held a few technical blog posts (this was a server-side-rendered site with a custom Express server I’d written in Node.js and which was deployed using dokku).

Memories

We haven’t been known as Ind.ie for the past three years now, but that doesn’t mean we can just switch those servers off and be done with it. I’m hugely proud of our journey over the past eight years, most of all of the fact that, as hard as it has been, we are still going and will keep going for the foreseeable future. The Ind.ie site is an archive of the learning, iterating, and growing (in knowledge, understanding, and tools; not in size) that we’ve done through five of those years. Including…

The ethical design manifesto
The Ind.ie Summit and all of the videos from it (which I went through one-by-one to update and fix the video links on… my goodness we all look so young!)
Technical blog posts from the labs section, not to mention a photo of Oskar in a lab coat and red bow tie.¹

The plan

I knew I wanted to collate the three servers into one and use Site.js to host static versions of them.

The latest version of the Ind.ie web site was the easiest to deal with. Site.js has native support for Hugo so all I had to do was to create a .hugo directory in my new site and copy it there.

So my archived Ind.ie site looked like this:

ind.ie/
  ╰ .hugo
       ╰ (contents of the latest ind.ie site)

Next, I needed to serve the statically-generated contents of the site from 2013-2017. Now you’re probably thinking: “but Site.js is already serving the Hugo site and you need to serve that static content from the same namespace. How do you do that?”

Well, there are two ways I could have done it. Since Site.js simply serves any content in the root of your site as static content, I could have just copied the content there. But Site.js also has a feature specifically for this use case called archival cascades.

Archival cascades

If you have static archives of previous versions of your site, you can have Site.js automatically serve them for you.

Just put them into folders named .archive-1, .archive-2, etc.

If a path cannot be found in your current site, Site.js will search for it first in .archive-2 and, if it cannot find it there either, in .archive-1.

Paths in your current site will override those in .archive-2 and those in .archive-2 will, similarly, override those in .archive-1.

This way, you create a cascade of archives where you can serve static snapshots of older content.

Using the archival cascade, old links will never die but if you do replace them with newer content in newer versions, those will take precedence.
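
To make the lookup order concrete, here is a minimal shell sketch of how a request path could be resolved against the current site and its archives. It is only a model of the precedence rules described above, not how Site.js implements them. The resolve_path function is hypothetical, and it assumes the current site's files sit directly in the site root, which is a simplification (a Hugo site, for example, lives in .hugo instead).

```shell
# resolve_path SITE_ROOT REQUEST_PATH
# Prints the file that wins for REQUEST_PATH and returns 0,
# or prints nothing and returns 1 if no version of the site has it.
resolve_path() {
  site_root="$1"
  request_path="${2#/}"   # drop the leading slash, if any

  # 1. The current site always takes precedence.
  if [ -f "$site_root/$request_path" ]; then
    printf '%s\n' "$site_root/$request_path"
    return 0
  fi

  # 2. Then the archives, highest number (newest) first:
  #    .archive-2 is searched before .archive-1, and so on.
  for archive in $(ls -a "$site_root" 2>/dev/null \
      | grep -E '^\.archive-[0-9]+$' \
      | sort -t- -k2,2nr); do
    if [ -f "$site_root/$archive/$request_path" ]; then
      printf '%s\n' "$site_root/$archive/$request_path"
      return 0
    fi
  done

  # 3. Nothing matched anywhere in the cascade.
  return 1
}
```

Calling resolve_path ind.ie /some/old/page.html prints the path of the newest copy of that page, which is the behaviour that keeps old links alive while still letting newer content override them.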

So, after this step, my archive site structure looked like this:

ind.ie/
  ├ .hugo
  │    ╰ (contents of the latest ind.ie site)
  ╰ .archive-1
       ╰ (contents of the ind.ie site from 2013-2017)

Taking a static snapshot using wget

Since the labs site was server-side rendered, I needed to get a static snapshot of it.² The easiest way I could find of doing that was to use the handy wget command.

I simply ran the server on my local development machine (on port 3000) and ran the following command to save a static snapshot of it:

wget --recursive --domains localhost --no-parent --page-requisites http://localhost:3000
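
For reference, here is the same command wrapped in a small shell function, with each flag annotated. The snapshot_site name and its host and port parameters are purely illustrative; the flags are the ones used above.

```shell
# snapshot_site HOST PORT
# Saves a static copy of the site served at http://HOST:PORT.
#
#   --recursive        follow links and download the pages they point to
#   --domains HOST     only follow links that stay on this host
#   --no-parent        never climb above the starting directory
#   --page-requisites  also fetch what each page needs to render
#                      (images, stylesheets, scripts)
snapshot_site() {
  host="$1"
  port="$2"
  wget --recursive --domains "$host" --no-parent --page-requisites \
    "http://$host:$port"
}
```

Running snapshot_site localhost 3000 is equivalent to the command above. Note that wget only recurses to a limited depth by default, so a deeply nested site may also need its --level option.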

Once I had the static snapshot, I just added it to the archival cascade. So my final site structure looked like this:

ind.ie/
  ├ .hugo
  │    ╰ (contents of the latest ind.ie site)
  ├ .archive-1
  │    ╰ (contents of the ind.ie site from 2013-2017)
  ╰ .archive-2
       ╰ /labs
           ╰ (contents of the ind.ie labs site)
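
Adding a snapshot to the cascade amounts to copying it into the right folder. Here is a sketch of that step, assuming wget saved the snapshot into a directory named after the host and port (localhost:3000, which is what it typically does on Linux and macOS). The add_to_cascade function and its arguments are hypothetical.

```shell
# add_to_cascade SNAPSHOT_DIR SITE_ROOT ARCHIVE_NUMBER [SUBPATH]
# Copies the contents of SNAPSHOT_DIR into
# SITE_ROOT/.archive-ARCHIVE_NUMBER/SUBPATH, creating folders as needed.
add_to_cascade() {
  snapshot_dir="$1"
  site_root="$2"
  archive_number="$3"
  subpath="${4:-}"        # optional, e.g. "labs"

  target="$site_root/.archive-$archive_number/$subpath"

  # The trailing "/." copies the directory's contents (dotfiles included)
  # rather than the directory itself.
  mkdir -p "$target" && cp -R "$snapshot_dir/." "$target/"
}
```

With the structure above, that would be add_to_cascade "localhost:3000" ind.ie 2 labs.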

Then, I set up a server running Site.js, pointed the ind.ie domain to it, and synced the site over.

Don’t break URLs

It’s all too easy to just turn a server off and forget about it, but the web doesn’t forget. What you leave behind will be a bunch of dead links. It’s not always practical to keep everything running forever, but if you want to try not to break links, archival cascades and the 404 to 302 support in Site.js should help.
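
One way to check that nothing broke after a move like this is to request a list of known old paths and report any that no longer resolve. Here is a minimal sketch, assuming curl is installed and that you have a text file with one path per line (gathered from old server logs or a sitemap, say). The check_urls function and the file it reads are hypothetical.

```shell
# check_urls BASE_URL PATHS_FILE
# Requests BASE_URL + each path listed in PATHS_FILE (one per line),
# following redirects, and prints "STATUS PATH" for every path whose
# final response is not a 2xx. Returns 0 only if every path resolved.
check_urls() {
  base_url="$1"
  paths_file="$2"
  broken=0

  while IFS= read -r url_path; do
    # Skip blank lines.
    [ -n "$url_path" ] || continue

    # --location follows redirects, so a redirect that ends in a 200
    # counts as a working link.
    http_status=$(curl --silent --location --output /dev/null \
      --write-out '%{http_code}' "$base_url$url_path")

    case "$http_status" in
      2*) ;;
      *)
        printf '%s %s\n' "$http_status" "$url_path"
        broken=$((broken + 1))
        ;;
    esac
  done < "$paths_file"

  [ "$broken" -eq 0 ]
}
```

For example, check_urls https://ind.ie old-paths.txt would list every old path that no longer leads anywhere.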

¹ Some of the links to articles on Labs might be broken as they refer to a forum that no longer exists. This was hosted by a commercial third party and, sadly, we weren’t able to get a usable static export of our data at the time.
² As an alternative, if I had wanted to, I could have kept the server running somewhere and used the native 404 to 302 support in Site.js to keep serving the old site.