What if data was code?
Update: This is now a thing and it’s not called WhatDB but JavaScript Database (JSDB). Read more about it in my Introducing JSDB post.
This is a little idea I’m throwing around for WhatDB, the transparent in-memory JavaScript database for local server data that I’m working on for Site.js, the Small Web construction set:
What if I don’t persist the database as a serialised JSON structure but as the series of changes that created it instead?
What do I mean by that? Take a look at this quick proof-of-concept:
Example
write.js
const fs = require('fs')
const writeStream = fs.createWriteStream('data.js')
writeStream.write('const data = []\n')
writeStream.write('module.exports = data\n')
for (let i = 0; i < 1000; i++) {
  writeStream.write(`data.push(${i})\n`)
}
writeStream.close()
read.js
const data = require('./data.js')
console.log(data)
First, run write.js, then run read.js. You should see the 1,000-item array logged to the console.
Why?
Even though WhatDB is for small amounts of data (since the Small Web discourages storage of privileged data on servers), having to execute a full JSON serialisation and write to disk on every write does not sit right with me.
Stringification is a bitch
Don’t get me wrong, it is definitely fast enough for smaller data sets. But even with the 100,000 records of my test dataset, I didn’t want to block the event loop in Node.js for the 40-50ms it takes to stringify the database on every write.
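(If you want a feel for that cost yourself, here’s roughly how you might measure it. The records array below is a made-up stand-in for my test dataset, not the real thing.)

// Rough sketch: measure how long a full JSON.stringify blocks for.
// The shape and size of `records` are made up for illustration.
const records = Array.from({ length: 100_000 }, (_, i) => ({
  id: i,
  name: `person-${i}`,
  age: 20 + (i % 50)
}))

const start = process.hrtime.bigint()
const json = JSON.stringify(records)
const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6

console.log(`Stringified ${json.length} characters in ${elapsedMs.toFixed(1)}ms`)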
When streaming isn’t enough (to lower CPU usage)
So I implemented streaming JSON serialisation using the excellent json-stream-stringify and that reduced the loop delay to the sub-ms region. But I was still maxing out CPU.
On a 1,000,000 record database (again, overkill for Small Web), that meant 100%+ CPU for the 40 seconds or so that serialisation was taking place. Actually writing the serialised data to disk, atomically, was trivial in comparison (a couple of seconds).
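For reference, the streaming approach is roughly this shape. (A sketch, not the actual WhatDB code; note, too, that the import form of json-stream-stringify depends on which version you’re using.)

// Sketch of streaming serialisation with json-stream-stringify.
// Not the actual WhatDB implementation; the import below assumes a
// recent version of the library (older versions use a default export).
const fs = require('fs')
const { JsonStreamStringify } = require('json-stream-stringify')

function persist (data, filePath) {
  return new Promise((resolve, reject) => {
    new JsonStreamStringify(data)
      .pipe(fs.createWriteStream(filePath))
      .on('finish', resolve)
      .on('error', reject)
  })
}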
So at this point I have something that works for its intended use case, and I could just call it a day and move on, but I can’t. Even if it’s within acceptable parameters, it feels wasteful. (Is there such a thing as a code conscience? Who knows…)
Anyway, so today I started thinking out loud on the fediverse about how I could perform delta (partial) edits on the serialised data string instead of serialising the whole thing every time. That led to us throwing around ideas spanning CRDTs, piece tables, and even Kappa architectures before I came to realise that all of that would be overkill and instead…
A simple idea for a simple database
What if I just stored all the instructions that are needed to create each table/object graph instead of a serialised copy of the data itself?
In fact, what if the data itself was a Node module?
This is the point where some folks will stare at me with horror in their eyes. The sacrilege… the heresy! What will they do, I wonder, when they learn that I’ve done far, far darker things…¹
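To make the idea a bit more concrete, here’s a minimal sketch of what such a persistence layer could look like. (The AppendOnlyTable name and API are hypothetical, for illustration only; they’re not WhatDB’s actual interface.)

// Minimal sketch of persisting data as a replayable Node module.
// Hypothetical API for illustration; not WhatDB's actual interface.
const fs = require('fs')
const path = require('path')

class AppendOnlyTable {
  constructor (filePath) {
    this.filePath = path.resolve(filePath)
    const exists = fs.existsSync(this.filePath)
    // Replay the log (if any) to rebuild the in-memory object graph.
    this.data = exists ? require(this.filePath) : []
    // Every subsequent change is appended to the same file.
    this.log = fs.createWriteStream(this.filePath, { flags: 'a' })
    if (!exists) {
      this.log.write('const data = []\nmodule.exports = data\n')
    }
  }

  push (value) {
    this.data.push(value)
    // Persist the change as the JavaScript statement that caused it.
    this.log.write(`data.push(${JSON.stringify(value)})\n`)
  }
}

// Usage: every write appends one line of JavaScript to the log.
const people = new AppendOnlyTable('./people.js')
people.push({ name: 'Aral' })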
Properties of such a system
- Might be slow during initial object graph creation. This is fine given that the initial object graph is created at server launch alongside other synchronous tasks. Performance isn’t a concern at that point.
- Will take up more disk space than serialisation. That’s ok, given that disk space isn’t the bottleneck in an in-memory Node.js database; Node’s memory limitations are.
- Fast reads. This is a property of having an in-memory data store and is not affected by this decision.
- Fast writes. Since the file format is basically an append-only log, streaming writes are quick and easy on the event loop and system resources.
- Require snapshots? Since we’re losing the atomic writes due to the appends, it would make sense to have periodic snapshots.
- Require compaction? We could compact on initial load (on server start²) to ensure that deleted data is actually deleted. Compaction would basically result in a single-line JSON.parse statement on the full serialised table (there’s a rough sketch of this after the list).
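Here’s roughly what that compact-on-load step could look like. (Again, a hypothetical sketch, not WhatDB’s actual code.)

// Sketch of compaction on load: replay the log, then rewrite the file
// as a single JSON.parse statement. Hypothetical; not WhatDB's actual code.
const fs = require('fs')
const path = require('path')

function compact (filePath) {
  const resolvedPath = path.resolve(filePath)
  // Replaying the module yields the current state of the table
  // (deleted items simply no longer appear in it).
  const data = require(resolvedPath)
  // Rewrite the whole log as one statement that recreates that state.
  const json = JSON.stringify(JSON.stringify(data))
  fs.writeFileSync(resolvedPath, `const data = JSON.parse(${json})\nmodule.exports = data\n`)
}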
Can’t wait to get my hands dirty and experiment with this in WhatDB.
Thanks
Thanks to everyone on the fediverse (and that other proprietary platform) for being my rubber duck and for providing me with great resources and food for thought. Also, I just learned that what I’m doing has a name: object prevalence.
1. This is not the first time I’ve played with a similar concept. Back in the Flash days (yes, I’m that old), I made a data format called SWX that basically allowed you to load data wrapped in the native SWF format. Thank goodness this time I don’t have to clean-room reverse-engineer SWF bytecode. If you’re into history, you can read about SWX in the chapter R. Jon MacDonald wrote on it for our The Essential Guide to Open Source Flash Development book.
2. Or every nth restart, or on restart after n days, etc.