Advent of Code 2020: Day 19 reflection

Further adventures in parsing

Advent of Code 2021 starts tonight! But before then, I have one more reflection from last year’s challenge to share.

This one is about the day 19 puzzle, which, to be perfectly honest, I did not particularly enjoy.

From the puzzle description:

The rules for valid messages (the top part of your puzzle input) are numbered and build upon each other. For example:
0: 1 2
1: "a"
2: 1 3 | 3 1
3: "b"
Some rules, like 3: "b", simply match a single character (in this case, b).

The remaining rules list the sub-rules that must be followed; for example, the rule 0: 1 2 means that to match rule 0, the text being checked must match rule 1, and the text after the part that matched rule 1 must then match rule 2.

Some of the rules have multiple lists of sub-rules separated by a pipe (|). This means that at least one list of sub-rules must match. (The ones that match might be different each time the rule is encountered.) For example, the rule 2: 1 3 | 3 1 means that to match rule 2, the text being checked must match rule 1 followed by rule 3 or it must match rule 3 followed by rule 1.

Fortunately, there are no loops in the rules, so the list of possible matches will be finite.

[…]

The received messages (the bottom part of your puzzle input) need to be checked against the rules so you can determine which are valid and which are corrupted.

[…]

Your goal is to determine the number of messages that completely match rule 0.

(from https://adventofcode.com/2020/day/19)

Dear reader, the "a" and "b" from the example were not just an example. In my puzzle input file, there were in fact only "a" and "b" character rules, followed by an unwieldy number of “messages” that looked like lots of ababbbab and aabaaaba and baaabbaa.

I thought this was a bit too much.

My original approach

In my first attempt at this puzzle, I thought, “I know, I’ll use regular expressions.”

The input rules format seemed half way to regular expression syntax anyway, so I thought, why not translate rule 0 into one giant regular expression and run all the messages through it?

So the example rule set would turn into something like this:

rule 1: /a/
rule 3: /b/
rule 2: /(ab|ba)/
rule 0: /a(ab|ba)/

This strategy worked just fine for part 1 of the puzzle, but looking back, I will concede that it was unnecessarily complicated.

(It would probably have been better even to build a set of all possible valid messages, and use that to query whether each message in the input file was present in the set of valid messages.)

But when it came to part 2…

As you look over the list of messages, you realize your matching rules aren’t quite right. To fix them, completely replace rules 8: 42 and 11: 42 31 with the following:
8: 42 | 42 8
11: 42 31 | 42 11 31
This small change has a big impact: now, the rules do contain loops, and the list of messages they could hypothetically match is infinite. You’ll need to determine how these changes affect which messages are valid.

Fortunately, many of the rules are unaffected by this change; it might help to start by looking at which rules always match the same set of values and how those rules (especially rules 42 and 31) are used by the new versions of rules 8 and 11.

(Remember, you only need to handle the rules you have; building a solution that could handle any hypothetical combination of rules would be significantly more difficult.)

(from https://adventofcode.com/2020/day/19)

The new rule 8 can still be expressed as a regular expression: it’s just “1 or more” copies of rule 42 (we can use the regex + symbol for this).

It’s rule 11 that throws a monkey wrench into the regular expression strategy. In words, we could describe this rule as “one or more matches of rule 42, followed by an equal number of matches of rule 31”. Sadly this “equal number of matches” can’t be expressed as a regular expression.

In fact, this is literally the first non-example of a regular language in the wikipedia article:

A simple example of a language that is not regular is the set of strings $ \{ a^nb^n \mid n ≥ 0 \} $. Intuitively, it cannot be recognized with a finite automaton, since a finite automaton has finite memory and it cannot remember the exact number of a’s.

(from https://en.wikipedia.org/wiki/Regular_language)

I still hacked together a solution out of my existing work for part 1, but it relied on a couple important details:

Neither rule 42 nor rule 31 (or any rules referenced by them) contained any reference to rule 8 or 11.
As it turned out, all of the possible matches for both rule 42 and rule 31 had the same length (8 characters).

The other crucial detail is that the actual rule 0 (at least in my puzzle input) was this: 0: 8 11.

Combining rules 0, 8, and 11 then, essentially what I was looking for was “two or more matches of rule 42, followed by some smaller (but non-zero) number of matches of rule 31”.

Putting all this together, it meant I could examine a message in chunks of 8 characters. If a message wasn’t a multiple of 8 characters, I could discard it right away; otherwise, I could see how many chunks (starting from the beginning) would match rule 42, and how many chunks (now starting from the end) would match rule 31. Let’s say the whole message was $ n $ chunks in length. If we had $ x $ chunks matching rule 42, and $ y $ chunks matching rule 31, then as long as there was some $ k $ that could satisfy all of

$$ \begin{gather*} 2 \le k \le x \\ 1 \le (n - k) \le y \\ k > n - k \\ \end{gather*} $$

this would indicate a valid match.

Like I said, I eventually got this to work, after a fair bit of trial and error.

But this all left me wondering, what’s the “right way” to do something like this, in the general case? Is there some existing parsing library that’s already solved this problem for us?

GLL parsing to the rescue

I recently came across the packrattle parsing library, which can do just this! (I’m sure there are others that would work too, but this was one in a language I felt reasonably confident that I could use.)

The underlying parsing algorithm is fairly magical to me, but the library is simple to use.

The core of my packrattle-based solution looks like this:

const packrattle = require('packrattle');

function parseRule(p, line) {
  var [id, rule] = line.split(': ');
  id = parseInt(id);
  if (rule[0] === '"') {
    p[id] = packrattle.string(rule[1]);
  } else {
    p[id] = packrattle.alt(...rule.split(' | ').map(
      (e) => packrattle.seq(...e.split(' ').map(
        (i) => {
          i = parseInt(i);
          return () => p[i];
        }))));
  }
}

This function can be used to build up an object p of rule parsers (keyed by the rule ID number), using these packrattle functions:

string() matches a fixed input string (in our case, either "a" or "b")
alt() matches any one of a list of alternatives
seq() matches a list of things in sequence

The rest of the solution reads from standard input, parsing rules until it reaches a blank line, and then switches to feeding input lines to the final p[0] parser, keeping a running count of valid matches.

This certainly works, but can we go a little meta? Why are we messing around with split() to parse the input rules? Why not construct a parser that itself returns parsers?

Behold:

const ruleParser = (function() {
  const ruleId = packrattle.regex(/\d+/).map(m => parseInt(m[0]));
  const character = packrattle.regex(/"(.)"/).map(m => m[1]);
  const ruleSeq = packrattle.repeatSeparated(ruleId, ' ');
  const ruleAlts = packrattle.repeatSeparated(ruleSeq, ' | ');
  const ruleDelim = packrattle.string(': ').drop();
  return packrattle.seq(ruleId, ruleDelim, packrattle.alt(character, ruleAlts));
})();

function parseRule(p, line) {
  var [id, rule] = ruleParser.run(line);
  if (typeof rule === 'string') {
    p[id] = packrattle.string(rule);
  } else {
    p[id] = packrattle.alt(...rule.map(
      ids => packrattle.seq(...ids.map(
        id => () => p[id]))));
  }
}

Is this any more readable that the last version? Not really, no.

But this was a fun exercise in using a parsing library!