On deleting code

29 November 2016

This blog post is my journey on creating, debugging and deleting a unit test.

A frequent choice every programmer faces is whether a piece of code belongs in a program. I was working on a hobby project recently where I faced this choice. I had written a number of unit tests and after some reflection deleted one of the test to improve my code quality. At first read this sounds counter-intuitive. Don’t unit tests improve code quality? Generally, yes they do. But good unit tests need to be repeatable, maintainable and quick to run.

In my case I had created a Trie and some unit tests to validate it. In my test code I created a small Trie which contained 11 entries. However, I wanted to test against a larger set. So, I decided to use the dictionary file in /usr/share/dict/words.

I created the following unit test:

def test_dictionary(self):
    t = trie.Trie()
    count = 0
    with open('/usr/share/dict/words') as f:
        for line in f:
            t.insert(line)
            count += 1
    assert(len(t.get_all_words())) == count

This test failed. The len(t.get_all_words()) value was 234,371 but the value of count was 235,886.

I assumed something must be wrong with my code. So, I compared the words returned by t.get_all_words() against the words in the file. I added some print statements to my test code and expected to see 1515 words. However, there was no output. This surprised me!

Maybe my code was correct after all. I decided to look at the file to see if there were empty lines or duplicate entries. Time to use some command line tools! I used the following commands to get a count of the unique words in the file:

$ sort -f -b -u /usr/share/dict/words > file
$ wc -l file
$ 234371

Oh! The numbers matched, meaning there must be duplicate entries. Taking a quick look at the file I saw the first two entries were ‘a’ and ‘A’. In my Trie I convert everything to lower case, so they would be the same entry.

I added a unit test to cover the duplicates case and another to cover leading/trailing spaces.

After adding the new test cases I decided to remove the unit test which used the /usr/share/dict/words file. Why? Because:

I would need to hardcode the value 234,371 into the test.
The location of the /usr/share/dict/words file is hardcoded. This test would fail on a Window’s machine because that file does not exist. In my case this is not a problem because I’m the only one that will use this program. But this would be a bad pattern if I were working with other people.
The unit test takes a few seconds to run. It is a bit annoying waiting for it to run, even if it is only for a few seconds.
The new unit tests cover the cases uncovered by the /usr/share/dict/words file test.

I was a little attached to the test because it led me down a path that improved my test case scenarios. It was also nice to see my Trie work with 234,371 entries. However, it did not meet the requirements for a good unit test. It did not run quickly and it was not repeatable across platforms.

But I think this raises the question: is there a case for testing the Trie with a large data set? Yes, I think so, but not at the unit test level. If this was a real project then I think this could be covered in a volume or verification testing phase.

I won’t claim my code is elegant, but it is better without that test.