Kenneth Jenkins

Advent of Code 2020: Introduction and Day 4 reflection

Or, why doesn’t Python have a split() method for lists?

This year’s Advent of Code is just around the corner! I had a lot of fun with these puzzles last year, so in honor of the occasion I thought I would share some of the things I learned over the course of the challenge.

For the unfamiliar, Advent of Code is an Advent calendar where a new programming puzzle unlocks each day of December (from the 1st through the 25th). Each puzzle has a part 1 and a part 2, where part 2 is usually a more complicated variation on the first part. Puzzles involve writing some code to compute an answer (usually based on some input file data).

I don’t intend to talk about every single puzzle, or even to give complete solutions to a few, but rather to share a few things I found interesting.

I used Python for all of last year’s puzzles. As the high-level structure of each puzzle was fairly similar (read input, compute something for part 1, compute something else for part 2), I got in the habit of starting each day from a skeleton file that looked something like this:

import sys

def parse(lines):
  pass

def part1(p):
  pass

def part2(p):
  pass

def main():
  parsed_input = parse(sys.stdin.readlines())
  print(part1(parsed_input))
  print(part2(parsed_input))

Usually, I would want to take the raw input and translate it into some other format that would be more convenient, so this is what the parse() method was for: maybe I needed to convert from strings to numbers, or maybe build a map of values indexed by (X, Y) coordinates.

When I got to Day 4, I stumbled a bit with the parse() method. Most of the input files so far involved a list of items, with one item per line. This puzzle’s input, however, groups of key:value pairs that could span multiple lines, with a blank line in between separate items.

This was the example input:

ecl:gry pid:860033327 eyr:2020 hcl:#fffffd
byr:1937 iyr:2017 cid:147 hgt:183cm

iyr:2013 ecl:amb cid:350 eyr:2023 pid:028048884
hcl:#cfa07d byr:1929

hcl:#ae17e1 iyr:2013
eyr:2024
ecl:brn pid:760753108 byr:1931
hgt:179cm

hcl:#cfa07d eyr:2025 pid:166559648
iyr:2011 ecl:brn hgt:59in

(These were supposed to represent passport data fields: ecl was “eye color”, pid was “passport ID number”, and so on.)

To make the actual puzzle solving code easier, I wanted to represent each passport as a dictionary, mapping keys (e.g. 'ecl', 'pid') to their values ('gry', '860033327').

My parse() method started with a list of input lines, but I needed to make sure to group these lines together up to the next blank line. I found myself wanting to use something like Python’s string split() method. This method takes a string and “splits” it at any whitespace, returning a new list of strings.

For example:

>>> 'Well, hello there'.split()
['Well,', 'hello', 'there']

I wanted to take my list of input lines and “split” it at every blank line; something like lines.split('\n'). But lines was a list, not a string, and I didn’t see any method like this in the Python standard library.

After some Googling, I found this Stack Overflow question wondering the same thing: is there a split() for lists?

One of the answers pointed out that we can use itertools.groupby() for this, something like this:

def listsplit(l, delim):
    return [list(group) for is_delim, group in
            itertools.groupby(l, key=lambda e: e == delim) if not is_delim]

It took me a minute to understand how this worked, so I thought I’d try to explain it here.

What does itertools.groupby() do?

The Python documentation has this to say:

itertools.groupby(iterable, key=None)

Make an iterator that returns consecutive keys and groups from the iterable. The key is a function computing a key value for each element. If not specified or is None, key defaults to an identity function and returns the element unchanged. Generally, the iterable needs to already be sorted on the same key function.

The operation of groupby() is similar to the uniq filter in Unix. It generates a break or new group every time the value of the key function changes (which is why it is usually necessary to have sorted the data using the same key function). That behavior differs from SQL’s GROUP BY which aggregates common elements regardless of their input order.

The returned group is itself an iterator that shares the underlying iterable with groupby(). Because the source is shared, when the groupby() object is advanced, the previous group is no longer visible. So, if that data is needed later, it should be stored as a list.

I found this a little hard to follow, especially because for our use case we don’t need or want to sort the input.

If we ignore the key parameter for now, the groupby() method essentially takes a sequence of elements and looks for consecutive elements that are the same. It returns an iterator of these individual groups.

A diagram might help. Say we start with a list with some repeated elements. These could be strings, numbers, or what have you, so I’ll use labeled boxes to represent the elements:

diagram of input list

Calling groupby() with this list would group the elements like this:

groups from input list

(Here I’ve used angle brackets to represent an iterator that will return the enclosed items.)

Note that even though B appears twice, because they aren’t consecutive in the input, they are part of separate groups in the output.

However, this isn’t quite an accurate picture. The groupby() method doesn’t just return an iterator of each group of elements. It actually returns an iterator of tuples, where each tuple contains the grouped element followed by the group iterator itself. So I guess the output diagram should really look something like this:

example groupby() output

That’s a lot of brackets….

But let’s come back to that key function. This function lets us group elements in different ways. Let’s say we had a function that returned 1 for a C element and 0 for any other element. If we called groupby() with this key function, it would apply this function to each element and then group according to the results. We could represent that something like this:

groupby() example with key function

Now the first B is grouped with all of the A’s, because the key function gave the same result for all of these elements. (And note that the first element of each tuple is now the 0 or 1 from the key function.)

Back to listsplit()

For the list-splitting case, the key insight (ha!) is that we can use a key function that returns True if an element matches our delimiter, and False otherwise. This way groupby() will return all consecutive non-delimiter elements in one chunk.

Unfortunately, all of our delimiters get returned as an output group too. But, we can filter out these group by checking the key function output values.

Finally, we also need to capture all of the elements from the group iterators, which we can do by making a new list out of each iterator.

Putting it all together, this brings us to the listsplit() method from before:

annoated listsplit() definition

And then using this method, the passport parsing routine is a snap:

def parse(lines):
    return [parse_passport(' '.join(p)) for p in listsplit(lines, '\n')]

def parse_passport(p):
    return dict(kv.split(':') for kv in p.split())

Now we have a list of dictionaries that are easy to query for individual data fields.

Addendum

The real moral of this story might be that it’s important not to get too focused on one approach.

As I was writing up this reflection, I realized that I don’t really even need a listsplit() method at all.

If I hadn’t had readlines() in my solution skeleton already, I might have seen that we can go straight to the input chunks I wanted by reading the entire file at once and using split('\n\n') to divide the input at blank lines.

I could replace the whole parse() method with something like this:

def parse(input):
    return [dict(kv.split(':') for kv in p.split()) for p in input.split('\n\n')]