Why I wrote 152 extra lines of code just to do the same thing (and why I’d do it again today)

24 Oct 2020

An old bound dot-matrix printout of computer code. — Who else remembers printing out code on a dot matrix printer? Ah, those were the days… (Image courtesy Arnold Reinhold.)

At the end of the week, I added the following regular expressions to JavaScript Database (JSDB), to sanitise queries:

// Disallow list.
this.query = this.query.replace(/[;\\\+\`\{\}\$]/g, '')

// Allow list.
let sieve = this.query
  .replace(/valueOf\..+?\.toLowerCase()\.(startsWith|endsWith|includes|startsWithCaseInsensitive|endsWithCaseInsensitive|includesCaseInsensitive)\(.+?\)/g, '')
  .replace(/valueOf\..+?\.(startsWith|endsWith|includes|startsWithCaseInsensitive|endsWithCaseInsensitive|includesCaseInsensitive)\(.+?\)/g, '')
  .replace(/valueOf\.[^\.]+?\s?(===|!==|<|>|<=|>=)\s?([\d_\.]+\s?|'.+?'|".+?"|false|true)/g, '')
  .replace(/\|\|/g, '')
  .replace(/\&\&/g,'')
  .replace(/['"\(\)\s]/g, '')

They’re pretty straightforward.

(As far as regular expressions go, that is.)

The first expression strips out characters that can be used to carry out arbitrary code execution via injection attacks and the rest are used in a sieve to remove all the input we expect to be there.

We then check to see if anything is left behind in the sieve. If the sieve is not empty, we treat it as a possible attack and return an empty result set, as shown below.

if (sieve !== '') {
  // Sieve not empty, reject.
  this.#cachedResult = []
} else {
  // Run the query.
  this.#cachedResult = this.data.filter(valueOf => {
    try {
      return eval(this.query)
    } catch (error) {
      // Eval failed. Possible injection attack, return false.
      return false
    }
  })
}

I’ve simplified the comments in the code snippets but, otherwise, they are as I wrote them a couple of days ago.

JSDB, now with 20% more code!

Now, that a look at what that code looks like as of yesterday (and keep in mind that it does exactly the same thing).

The first thing you might notice is that it is now a 300-line file, including comments.

Wow, what happened?

Two things:

I am now composing my regular expressions from constants instead of using regular expression literals. e.g.,

static join() { return Array.from(arguments).join('') }
static anyCharacter = '.'
static oneOrMore    = '+'
static nonGreedy    = '?'

static get oneOrMoreCharactersNonGreedy () {
  return this.join(
    this.anyCharacter,
    this.oneOrMore,
    this.nonGreedy
  )
}

I now get the values for the allow list directly from the source (from QueryOperators.js which is the same code that is used by the query generation methods).

class QueryOperators {
  static STRICT_EQUALS = '==='
  static STRICT_DOES_NOT_EQUAL = '!=='

  static RELATIONAL_OPERATORS = {
    is: this.STRICT_EQUALS,
    isEqualTo: this.STRICT_EQUALS,
    equals: this.STRICT_EQUALS,

    isNot: this.STRICT_DOES_NOT_EQUAL,
    doesNotEqual: this.STRICT_DOES_NOT_EQUAL,

    isGreaterThan: '>',
    isGreaterThanOrEqualTo: '>=',
    isLessThan: '<',
    isLessThanOrEqualTo: '<='
  }

  static FUNCTIONAL_OPERATORS = [
    'startsWith',
    'endsWith',
    'includes',
    'startsWithCaseInsensitive',
    'endsWithCaseInsensitive',
    'includesCaseInsensitive'
  ]

  static get uniqueListOfRelationalOperators () {
  return Array.from(new Set(Object.values(QueryOperators.RELATIONAL_OPERATORS)))
  }
}

// …

class RE {
  // …
  static anyFunctionalOperator = this.anyOf(...QueryOperators.FUNCTIONAL_OPERATORS)
  static anyRelationalOperator = this.anyOf(...QueryOperators.uniqueListOfRelationalOperators)
}

// …

class Allowed {
  // …
  // Statements in the form valueOf.<property> === <value>, etc.
  // /valueOf\.[^\.]+?\s?(===|!==|>|>=|<|<=)\s?([\d\._]+\s?|'.+?'|".+?"|false|true)/g
  static relationalOperations = globalRegExp(
    RE.literal.valueOfDot,
    RE.oneOrMoreCharactersNonGreedyThatAreNotDots,
    RE.zeroOrMoreWhitespace,
    RE.anyRelationalOperator,
    RE.zeroOrMoreWhitespace,
    RE.anyValidRelationalOperationValue
  )
}

The effects, subsequently, are three-fold:

Two positives

The intent and clarity of the code is improved.
The maintainability of the code is improved and it is now safer as I can update the query generation code without worrying that I am going to forget to update the sanitisation code or that I might make typos while copying values over manually into regular expression literals.

And one negative

I’ve added 152 lines of extra code to the project.¹

So do the first two positive effects outweigh the cost of the third, negative one? Especially on a tool that had 755 lines of code to begin with? (The additional lines of code amount to a 20% increase of the codebase).

Definitely.

Fewer lines of code doesn’t always mean better code.

While less code generally means an easier to maintain and safer application, this axiom does not exist in a vacuum. In over three decades of programming, I’ve come to respect that clarity of intent is the most important factor when writing code.²

Is what I wrote what I had actually meant to write?

The following is confusing, especially to someone who isn’t versed in regular expressions:

/valueOf\.[^\.]+?\s?(===|!==|<|>|<=|>=)\s?([\d_\.]+\s?|'.+?'|".+?"|false|true)/g

Whereas the code below, that composes the regular expression from constants, can be read by almost anyone – even a non-programmer – to get the gist of what’s happening:

static relationalOperations = globalRegExp(
  RE.literal.valueOfDot,
  RE.oneOrMoreCharactersNonGreedyThatAreNotDots,
  RE.zeroOrMoreWhitespace,
  RE.anyRelationalOperator,
  RE.zeroOrMoreWhitespace,
  RE.anyValidRelationalOperationValue
)

Perhaps even more importantly, if you look at where the constants are mapped to their regular expression counterparts, you can immediately see whether the regular expression I wrote actually expresses what I thought I was writing because I use the following declarative form:

const thisIsWhatIWantToArchive = 'This is how I think I archive it'

So, now, anyone who knows regular expressions can read through the code and immediately see if there is a mistmatch between what I think I’m asking the computer to do and what I’m actually asking it to do. Furthermore, it is also easy to spot errors in the former for those cases where what I’m trying to do itself is not the right thing.

class RegularExpressionLiterals {
  static dot = '\\.'
  // …
}

class RE {
  // Concatenates into a string.
  static join () {
    return Array.from(arguments).join('')
  }

  // Creates negated set (e.g., [^xy]).
  static negatedSetOf () {
    return `[^${Array.from(arguments).join('')}]`
  }

  static literal = RegularExpressionLiterals
  static nonGreedy    = '?'
  static oneOrMore    = '+'

  // [^\\.]+?
  static get oneOrMoreCharactersNonGreedyThatAreNotDots () {
    return this.join(
      this.negatedSetOf(
        this.literal.dot
      ),
      this.oneOrMore,
      this.nonGreedy
    )
  }
}

In conclusion

If you’ve made it this far, I’d really appreciate it if you could take a quick look at the JSDB Query Sanitiser (and maybe even the much simpler String sanitiser in the JSDF parser) and see if you can spot anything I’ve missed. If you do notice anything amiss, I’d appreciate it if you could open an issue and let me know.

What? Did I write this blog post just to get more eyes on the security code for my new database?

Tsk, tsk…

Maybe ;)

PS. I do hope you enjoy using JSDB. I’m currently working on integrating it into Site.js for use in Small Web sites and apps.

Like this? Fund us!

Small Technology Foundation is a tiny, independent not-for-profit.

We exist in part thanks to patronage by people like you. If you share our vision and want to support our work, please become a patron or donate to us today and help us continue to exist.

The statistics are compiled using scc. ↩︎
I am still happy to see that the whole of JSDB is just ~900 lines of code (not counting tests, and 1,443 lines counting tests with 100% code coverage). ↩︎