Thoughts on Mutation testing in Python

Writing code is fun.

Writing tests can be less fun, but is a great way to have code that is more likely to work as desired.

Using a coverage tool will show you what percentage of your program is executed when you run your tests, but getting 100% code coverage does not mean your code is 100% tested.

For instance, testing a function:

def product(a, b):
  return a * b

If your tests are not carefully written, you may end up with tests that pass when they really should not. For instance:

>>> product(2, 2)
4

This execution shows only that it works for that choice of values. But how many sets of values do we need to test in order to be satisfied that a function does what is expected?

I happened across a document about Mutation Testing some time ago. It was actually new to me (and to a couple of people I respect), but it was actually quite interesting. The idea behind it is that you can change bits of the program, re-run your tests, and if your tests still succeed, then your tests are not sufficient.

Mutation Testing, then, is testing your tests.

For instance, mutating the * in the previous function to + results in a passing test, even though the code has changed (and, to us at least, is clearly different). Thus, we need to improve our test(s).

In theory, we should be able to mutate each statement in our program in all the ways it could be mutated, and if any of these mutants are not killed by our tests, then our tests are insufficient.

In practice, this is untenable. Some mutants will be equivalent (for instance, a * b is the same as b * a), but more so, there will be many, many mutants created for all but the simplest program.


Aside: doctest

Python has a fantastic library called doctest. This enables you to write tests as part of the function definition. I’ll use that here, as it makes writing the tests simple.

def product(a, b):
  """Multiply two numbers together.

  >>> product(2, 3)
  6
  """
  return a * b

They are really useful for documentation that is an accurate representation of what the code does. They don’t replace unit tests (indeed, you’ll want your unit tests to run your doctests), but they are good fun.

This investigation will use doctests, as then the tests for a module are always self-contained.


Selection of mutations

With a suitably complicated program, the number of possible mutations will be very, very large. We will, for now, consider only applying one mutation per mutant, doing otherwise would result in huge numbers of possible mutants.

There is quite a bit of academic research into Mutation Testing (indeed, if I were an academic, I’d probably write a paper or two on it), and one part that is relatively well documented is the discussion of mutation selection. That is, how can we reduce the number of mutant programs (and make our tests run in a satisfactory time), without missing mutations that could show how our tests may be improved.

Offtut et al (1996) discusses how we can reduce the number of mutation operators, and come up with the conclusion that with just five operators, sufficient tests can be determined.

Using the classifications from Mothra, they are:

  • ABS: ABSolute value insertion. Force arithmetic expressions to take on (-, 0, +) values.
  • AOR: Arithmetic Operator Replacement. Replace every arithmetic operator with every syntactically legal other arithmetic operator.
  • LCR: Logical Connectior Replacement. Replace every logical connector (and, or) with several kinds of logical connectors.
  • ROR: Relational Operator Replacement. Replace every relational operator with every other relational operator.
  • UOI: Unary Operator Insertion. Insert unary operators in front of expressions. In the case of python, the unary + operator will not be added, as that usually results in an equivalent mutant. I’m actually also going to include swapping unary operators as part of this operation.

These mutation operations make sense in python. For instance, in our toy example, we’d see the following (showing only the return statement from our product function):

# original
return a * b

# ABS
return -1
return 0
return 1
return a * -1
return a * 0
return a * 1
return -1 * b
return 0 * b
return 1 * b

# AOR
return a + b
return a - b
return a / b
return a // b
return a ** b
return a % b

# UOI
return -a * b
return a * -b
return -(a * b)
return not a * b
return a * not b
return not (a * b)
return ~a * b
return a * ~b
return ~(a * b)

We can see from our toy example that there are no LCR or ROR mutations possible, since there are no logical or relational operations or connectors. However, in the case of python, we could see code like:

# original
return a < b and a > 0

# LCR
return a < b or a > 0
return a < b and not a > 0
return a < b or not a > 0

# ROR
return a > b and a > 0
return a >= b and a > 0
return a <= b and a > 0
return a != b and a > 0
return a == b and a > 0

return a < b and a < 0
return a < b and a <= 0
return a < b and a == 0
return a < b and a >= 0
return a < b and a != 0

Limiting ourself to just these five mutation operations is great: it simplifies our mutation code immensely.

Within the context of python, I actually think there should also be some other mutations. A common mistake in programmer code is to omit a call to super(), or missing the explicit self argument in a method definition. For now, we’ll stick to the five above though.

I’m guessing the extreme age of Mothra means it wasn’t used in the case of object-oriented programming languages, even more so with multiple inheritance!

Okay, we are done with the generic stuff, time to get specific.


Aside: ast

Another great module that is included in python is ast. This allows you to build an Abstract Syntax Tree from (among other things) a string containing python code. Along with astor, which allows you to rewrite it as python code, after performing our mutation operation(s).

import ast
tree = ast.parse(open(filename).read())
ast.fix_missing_locations(tree)

The ast.fix_missing_locations stuff fixes up any line numbers, which we may use for reporting later.


Mutating the AST.

Bingo, now we have an abstract syntax tree containing our module. ast also contains classes for walking this tree, which we can subclass to do interesting things. For instance, to collect all statements that should be mutated, we can do something like:

SWAPS = {
  ast.Add: [ast.Mult, ast.Sub, ast.Div, ast.Mod, ast.Pow, ast.FloorDiv],
  # etc, etc
}

class Collector(ast.NodeVisitor):
  def __init__(self, *args, **kwargs):
    self.mutable_nodes = []

  def __visit_binary_operation(self, node):
    # Visit our left child node.
    self.visit(node.left)
    # Create all possible mutant nodes for this, according to AOR
    self.mutable_nodes.extend([
      (node, node.__class__(
        left=node.left, op=op(), right=node.right
      )) for op in SWAPS.get(node.op.__class__, [])
    ])
    # Visit our right child node.
    self.visit(node.right)

  # These three node types are handled identically.
  visit_BinOp = __visit_binary_operation
  visit_BoolOp = __visit_binary_operation
  visit_Compare = __visit_binary_operation

  def visit_UnaryOp(self, node):
    # Create all possible mutant nodes, according to UOI.
    self.mutable_nodes.extend([
      (node, node.__class__(
        op=op(), operand=node.operand)
      ) for op in SWAPS.get(node.op.__class__)
    ])
    # Also, create a node without the operator.
    self.mutable_nodes.append((node, node.operand))
    self.visit(node.operand)

Our mutator code is actually much simpler. Since we keep a reference to the nodes we want to mutate (and the new node it should be replaced with), we can just swap them each time a given node is visited:

class Mutator(ast.NodeTransformer):
  def __init__(self, original, replacment):
    self.original = original
    self.replacment = replacment

  def __swap(self, node):
    # Handle child nodes.
    self.generic_visit(node)

    # Don't swap out the node if it wasn't our target node.
    if node not in [self.original, self.replacement]:
        return node

    # Swap the node back if we are being visited again.
    if node == self.replacement:
        return self.original

    # Otherwise, swap the node out for the replacement.
    return self.replacement

  # We can just use this same function for a whole stack of visits.
  visit_BinOp = __swap
  visit_BoolOp = __swap
  visit_Compare = __swap
  visit_UnaryOp = __swap

Note that calling mutator.visit(tree) on a mutated tree will revert the mutation.

To use these, assuming we have a special function test that runs the tests we want:

tree = ast.parse(open(filename).read())
ast.fix_missing_locations(tree)

results = test(tree)
if results.failed:
    raise Exception("Unable to run tests without mutations.")

# Collect all of the possible mutants.
collector = Collector()
collector.visit(tree)

survivors = []

for (node, replacement) in collector.mutable_nodes:
    mutator = Mutator(node, replacement)
    # Apply our mutation
    mutator.visit(tree)
    ast.fix_missing_locations(tree)

    try:
        results = test(tree)
    except Exception:
        # Looks like this mutant was DOA.
        continue

    if not results.failed:
        survivors.append((node, replacement, results))

    # Revert our mutation
    mutator.visit(tree)

This is a bit of a simplification, but it’s actually pretty close to working code. We use multiple processes to run it in parallel (and also have a timeout based on the initial test run time, assuming we should not take more than twice as long), and compile the tree into a module to test it.

You may see the current version at pymutant. Keep in mind that this is little more than a proof of concept at this stage.


Testing our testing

So, let’s look at some toy examples, and see what the outcomes are.

def product(a, b):
    """
    >>> product(2, 2)
    4
    >>> product(2, 3)
    6
    """
    return a * b

def addition(a, b):
    """
    >>> addition(1, 0)
    1
    >>> addition(1, 1)
    2
    """
    return a + b

def negate(a):
    """
    >>> negate(1)
    -1
    >>> negate(-1)
    1
    """
    return -a

Running in a debug mode, we can view the mutants, and the results:

TestResults(failed=2, attempted=6) (a * b) -> (not (a * b))
TestResults(failed=1, attempted=6) (a * b) -> (a ** b)
TestResults(failed=2, attempted=6) (a * b) -> (a // b)
TestResults(failed=2, attempted=6) (a * b) -> 1
TestResults(failed=2, attempted=6) (a * b) -> (a % b)
TestResults(failed=2, attempted=6) (a * b) -> 0
TestResults(failed=2, attempted=6) (a * b) -> (- (a * b))
TestResults(failed=2, attempted=6) (a * b) -> (~ (a * b))
TestResults(failed=2, attempted=6) (a * b) -> True
TestResults(failed=2, attempted=6) (a * b) -> (-1)
TestResults(failed=2, attempted=6) (a * b) -> False
TestResults(failed=2, attempted=6) (a * b) -> (a / b)
TestResults(failed=1, attempted=6) (a * b) -> (a + b)
TestResults(failed=2, attempted=6) return (a * b) -> pass
TestResults(failed=2, attempted=6) (a * b) -> None
TestResults(failed=2, attempted=6) (a * b) -> (a - b)
TestResults(failed=2, attempted=6) (a + b) -> (not (a + b))
TestResults(failed=2, attempted=6) (a + b) -> (a % b)
TestResults(failed=1, attempted=6) (a + b) -> (a - b)
TestResults(failed=1, attempted=6) (a + b) -> (a ** b)
TestResults(failed=2, attempted=6) (a + b) -> (~ (a + b))
TestResults(failed=2, attempted=6) (a + b) -> (-1)
TestResults(failed=1, attempted=6) (a + b) -> 1
TestResults(failed=2, attempted=6) (a + b) -> (- (a + b))
TestResults(failed=2, attempted=6) (a + b) -> (a // b)
TestResults(failed=2, attempted=6) (a + b) -> None
TestResults(failed=2, attempted=6) (a + b) -> 0
TestResults(failed=1, attempted=6) (a + b) -> True
TestResults(failed=2, attempted=6) (a + b) -> False
TestResults(failed=2, attempted=6) return (a + b) -> pass
TestResults(failed=2, attempted=6) (a + b) -> (a / b)
TestResults(failed=2, attempted=6) (a + b) -> (a * b)
TestResults(failed=2, attempted=6) (- a) -> 0
TestResults(failed=2, attempted=6) (- a) -> (- (- a))
TestResults(failed=2, attempted=6) return (- a) -> pass
TestResults(failed=2, attempted=6) (- a) -> (not (- a))
TestResults(failed=1, attempted=6) (- a) -> True
TestResults(failed=2, attempted=6) (- a) -> False
TestResults(failed=1, attempted=6) (- a) -> (-1)
TestResults(failed=2, attempted=6) (- a) -> None
TestResults(failed=1, attempted=6) (- a) -> 1
TestResults(failed=2, attempted=6) (- a) -> (~ (- a))

===============
Mutation Report
===============

* Generated 42 mutants, and tested in 0.107773065567 seconds.

* 0 of these mutants were unable to execute correctly.

* 0 of these mutants were killed for taking too long to execute.

* Tests killed of 42 of the remaining mutants, leaving 0 survivors.

* Your innoculation rate is 100%.

Well, that’s all nice, but there’s more we can think about than this. What about tests that are “useless”? Tests that never fail, for instance?

Unfortunately, doctest.testmod() only returns the count of failures and attempts (and looking into that module, I’m not sure that which tests passed/failed is actually stored). It would be really nice to be able to capture this, but perhaps that is a task for a unittest-based approach.

What about a slightly more complex example?

 1 def aggregate(items):
 2     """
 3     Aggregate a list of 2-tuples, which refer to start/finish values.
 4 
 5     Returns a list with overlaps merged.
 6 
 7     >>> aggregate([])
 8     []
 9     >>> aggregate([(1, 3)])
10     [(1, 3)]
11     >>> aggregate([(1, 3), (2, 6)])
12     [(1, 6)]
13     >>> aggregate([(1, 3), (4, 6)])
14     [(1, 3), (4, 6)]
15     >>> aggregate([(3, 4), (1, 9)])
16     [(1, 9)]
17     """
18 
19     # Sort our items first, by the first value in the tuple. This means we can
20     # iterate through them later.
21     sorted_items = sorted(items)
22 
23     i = 0
24     while i < len(sorted_items) - 1:
25         current = sorted_items[i]
26         next = sorted_items[i + 1]
27 
28         if current[1] >= next[1]:
29             # Skip over the next item totally.
30             sorted_items.remove(next)
31             continue
32 
33         if current[1] >= next[0]:
34             # Merge the two items.
35             sorted_items[i:i+2] = ((current[0], next[1]),)
36             continue
37 
38         i += 1
39 
40     return sorted_items

And when we test with mutants (no debug mode this time, as we have 169 mutants):

===============
Mutation Report
===============

* Generated 169 mutants, and tested in 0.905333995819 seconds.

* 0 of these mutants were unable to execute correctly.

* 61 of these mutants were killed for taking too long to execute.

* Tests killed of 99 of the remaining mutants, leaving 9 survivors.

* Your innoculation rate is 94%.

Survivor Report
===============

0 at line 23, changed to False
(i < (len(sorted_items) - 1)) at line 24, changed to (- (i < (len(sorted_items) - 1)))
(i + 1) at line 26, changed to (i - 1)
(i + 1) at line 26, changed to (- (i + 1))
(i + 1) at line 26, changed to 1
(i + 1) at line 26, changed to (-1)
(i + 1) at line 26, changed to True
(current[1] >= next[1]) at line 28, changed to (- (current[1] >= next[1]))
(current[1] >= next[1]) at line 28, changed to (current[1] > next[1])

Timeout Report
==============

0 at line 23, changed to (-1)
...

I’ve omitted the remainder timeout report.

But, this does show us that our tests are incomplete, and perhaps what we should be doing to fix this.

In particular, note the group of mutants at line 26 that all survived: indicating that this particular line of code is not being tested well at all.

Perhaps the biggest takeaway (and an indicator of how Mutation Testing may be really useful) is the last listed mutant. It’s showing that this particular comparison is clearly not being tested for off-by-one errors.