Unicode Flags in Python

I can’t even remember how I got onto this idea.

I’ve been an avid user of django-countries, which is a really nice way to have a country field, without having to maintain your own database of countries.

One neat feature is that Chris includes flag icons for all countries. I have some code in my project that uses these to have the correct flag shown next to the country select box whenever you change the selection.

However, these flags are small, and cannot resize easily (and require a request to fetch the image). Then it occurred to me that newer versions of iOS/OS X (and probably other platforms) now support Emoji/Unicode flags.

A bit of research turned up Unicode’s encoding of national flags is just crazy enough to work, which discusses how the system works: basically, the two-letter ISO 3166-1 alpha-2 code is used, but instead of the plain letters “AU”, each letter is replaced by its Regional Indicator Symbol. When two valid symbols appear one after another, they are combined into a single glyph.

We just need to add an offset to the integer value of each character in the code, and convert this back into a unicode character.

The offset can be calculated from knowing the start of the Regional Indicator Symbol range:

$ python3
>>> ord('🇦')
127462

To generate a flag glyph in python, you can use the following:

OFFSET = ord('🇦') - ord('A')

def flag(code):
    return chr(ord(code[0]) + OFFSET) + chr(ord(code[1]) + OFFSET)

This only works with Python 3, but we can expand it a bit to make it more robust (and have it work under Python 2 as well):

OFFSET = ord('🇦') - ord('A')

def flag(code):
    if not code:
        return u''
    points = [ord(x) + OFFSET for x in code.upper()]
    try:
        return chr(points[0]) + chr(points[1])
    except ValueError:
        return ('\\U%08x\\U%08x' % tuple(points)).decode('unicode-escape')

This relies on the fact that Python 2 will raise a ValueError when attempting to chr() a value greater than 255. Whilst we could use unichr(), this fails on systems that have python compiled without wide unicode support, which happened to be the case on my local system.
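
For illustration, here’s what calling it looks like in a Python 3 shell (how the glyphs render will depend on your terminal and fonts):

>>> flag('AU')
'🇦🇺'
>>> flag('nz')
'🇳🇿'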

Thoughts on Mutation testing in Python

Writing code is fun.

Writing tests can be less fun, but is a great way to have code that is more likely to work as desired.

Using a coverage tool will show you what percentage of your program is executed when you run your tests, but getting 100% code coverage does not mean your code is 100% tested.

For instance, testing a function:

def product(a, b):
  return a * b

If your tests are not carefully written, you may end up with tests that pass when they really should not. For instance:

>>> product(2, 2)
4

This execution shows only that it works for that choice of values. But how many sets of values do we need to test in order to be satisfied that a function does what is expected?

I happened across a document about Mutation Testing some time ago. The concept was new to me (and to a couple of people I respect), but it was quite interesting. The idea behind it is that you can change bits of the program, re-run your tests, and if your tests still succeed, then your tests are not sufficient.

Mutation Testing, then, is testing your tests.

For instance, mutating the * in the previous function to + results in a passing test, even though the code has changed (and, to us at least, is clearly different). Thus, we need to improve our test(s).

In theory, we should be able to mutate each statement in our program in all the ways it could be mutated, and if any of these mutants are not killed by our tests, then our tests are insufficient.

In practice, this is untenable. Some mutants will be equivalent (for instance, a * b is the same as b * a), but more significantly, there will be many, many mutants created for all but the simplest program.


Aside: doctest

Python has a fantastic library called doctest. This enables you to write tests as part of the function definition. I’ll use that here, as it makes writing the tests simple.

def product(a, b):
  """Multiply two numbers together.

  >>> product(2, 3)
  6
  """
  return a * b

They are really useful for documentation that is an accurate representation of what the code does. They don’t replace unit tests (indeed, you’ll want your unit tests to run your doctests), but they are good fun.

This investigation will use doctests, as then the tests for a module are always self-contained.
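
For reference, running a module’s doctests only requires a call to doctest.testmod() (or python -m doctest modulename.py from the command line). testmod() returns a TestResults(failed, attempted) tuple, which we’ll lean on later:

if __name__ == "__main__":
    import doctest
    results = doctest.testmod()
    print(results)  # e.g. TestResults(failed=0, attempted=1)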


Selection of mutations

With a suitably complicated program, the number of possible mutations will be very, very large. We will, for now, consider applying only one mutation per mutant: doing otherwise would result in a huge number of possible mutants.

There is quite a bit of academic research into Mutation Testing (indeed, if I were an academic, I’d probably write a paper or two on it), and one part that is relatively well documented is the discussion of mutation selection. That is, how can we reduce the number of mutant programs (and make our tests run in a satisfactory time), without missing mutations that could show how our tests may be improved.

Offutt et al. (1996) discuss how we can reduce the number of mutation operators, and conclude that with just five operators, sufficient tests can be determined.

Using the classifications from Mothra, they are:

  • ABS: ABSolute value insertion. Force arithmetic expressions to take on (-, 0, +) values.
  • AOR: Arithmetic Operator Replacement. Replace every arithmetic operator with every syntactically legal other arithmetic operator.
  • LCR: Logical Connector Replacement. Replace every logical connector (and, or) with several kinds of logical connectors.
  • ROR: Relational Operator Replacement. Replace every relational operator with every other relational operator.
  • UOI: Unary Operator Insertion. Insert unary operators in front of expressions. In the case of python, the unary + operator will not be added, as that usually results in an equivalent mutant. I’m actually also going to include swapping unary operators as part of this operation.
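
In Python AST terms, those five classes might map onto node swaps something like this (my own sketch, not from the Mothra paper; the Collector further down only spells out the arithmetic entry):

import ast

# Which node classes each operator class would swap a node for.
AOR = {ast.Add: [ast.Sub, ast.Mult, ast.Div, ast.FloorDiv, ast.Mod, ast.Pow]}
LCR = {ast.And: [ast.Or], ast.Or: [ast.And]}
ROR = {ast.Lt: [ast.Gt, ast.LtE, ast.GtE, ast.Eq, ast.NotEq]}
UOI = {ast.USub: [ast.UAdd, ast.Not, ast.Invert]}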

These mutation operations make sense in python. For instance, in our toy example, we’d see the following (showing only the return statement from our product function):

# original
return a * b

# ABS
return -1
return 0
return 1
return a * -1
return a * 0
return a * 1
return -1 * b
return 0 * b
return 1 * b

# AOR
return a + b
return a - b
return a / b
return a // b
return a ** b
return a % b

# UOI
return -a * b
return a * -b
return -(a * b)
return (not a) * b
return a * (not b)
return not (a * b)
return ~a * b
return a * ~b
return ~(a * b)

We can see from our toy example that there are no LCR or ROR mutations possible, since there are no logical or relational operations or connectors. However, in the case of python, we could see code like:

# original
return a < b and a > 0

# LCR
return a < b or a > 0
return a < b and not a > 0
return a < b or not a > 0

# ROR
return a > b and a > 0
return a >= b and a > 0
return a <= b and a > 0
return a != b and a > 0
return a == b and a > 0

return a < b and a < 0
return a < b and a <= 0
return a < b and a == 0
return a < b and a >= 0
return a < b and a != 0

Limiting ourselves to just these five mutation operations is great: it simplifies our mutation code immensely.

Within the context of python, I actually think there should also be some other mutations. A common mistake in python code is to omit a call to super(), or to miss the explicit self argument in a method definition. For now, we’ll stick to the five above though.

I’m guessing the extreme age of Mothra means it wasn’t applied to object-oriented programming languages, much less those with multiple inheritance!

Okay, we are done with the generic stuff, time to get specific.


Aside: ast

Another great module included in python is ast. This allows you to build an Abstract Syntax Tree from (among other things) a string containing python code. Paired with the third-party astor package, we can then rewrite the tree as python code after performing our mutation operation(s).

import ast
tree = ast.parse(open(filename).read())
ast.fix_missing_locations(tree)

The ast.fix_missing_locations stuff fixes up any line numbers, which we may use for reporting later.
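
As a quick illustration of the round trip (this assumes astor is installed; it isn’t part of the standard library):

import ast
import astor

source = "def product(a, b):\n    return a * b\n"
tree = ast.parse(source)

# The Return node is what we will later be mutating.
print(ast.dump(tree.body[0].body[0]))

# Turn the (possibly mutated) tree back into source code.
print(astor.to_source(tree))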


Mutating the AST.

Bingo, now we have an abstract syntax tree containing our module. ast also contains classes for walking this tree, which we can subclass to do interesting things. For instance, to collect all statements that should be mutated, we can do something like:

SWAPS = {
  ast.Add: [ast.Mult, ast.Sub, ast.Div, ast.Mod, ast.Pow, ast.FloorDiv],
  # etc, etc
}

class Collector(ast.NodeVisitor):
  def __init__(self, *args, **kwargs):
    self.mutable_nodes = []

  def __visit_binary_operation(self, node):
    # Visit our left child node.
    self.visit(node.left)
    # Create all possible mutant nodes for this, according to AOR
    self.mutable_nodes.extend([
      (node, node.__class__(
        left=node.left, op=op(), right=node.right
      )) for op in SWAPS.get(node.op.__class__, [])
    ])
    # Visit our right child node.
    self.visit(node.right)

  # These three node types are handled identically.
  visit_BinOp = __visit_binary_operation
  visit_BoolOp = __visit_binary_operation
  visit_Compare = __visit_binary_operation

  def visit_UnaryOp(self, node):
    # Create all possible mutant nodes, according to UOI.
    self.mutable_nodes.extend([
      (node, node.__class__(
        op=op(), operand=node.operand)
      ) for op in SWAPS.get(node.op.__class__, [])
    ])
    # Also, create a node without the operator.
    self.mutable_nodes.append((node, node.operand))
    self.visit(node.operand)

Our mutator code is actually much simpler. Since we keep a reference to the nodes we want to mutate (and the new node it should be replaced with), we can just swap them each time a given node is visited:

class Mutator(ast.NodeTransformer):
  def __init__(self, original, replacement):
    self.original = original
    self.replacement = replacement

  def __swap(self, node):
    # Handle child nodes.
    self.generic_visit(node)

    # Don't swap out the node if it wasn't our target node.
    if node not in [self.original, self.replacement]:
        return node

    # Swap the node back if we are being visited again.
    if node == self.replacement:
        return self.original

    # Otherwise, swap the node out for the replacement.
    return self.replacement

  # We can just use this same function for a whole stack of visits.
  visit_BinOp = __swap
  visit_BoolOp = __swap
  visit_Compare = __swap
  visit_UnaryOp = __swap

Note that calling mutator.visit(tree) on a mutated tree will revert the mutation.

To use these, assuming we have a special function test that runs the tests we want:

tree = ast.parse(open(filename).read())
ast.fix_missing_locations(tree)

results = test(tree)
if results.failed:
    raise Exception("Unable to run tests without mutations.")

# Collect all of the possible mutants.
collector = Collector()
collector.visit(tree)

survivors = []

for (node, replacement) in collector.mutable_nodes:
    mutator = Mutator(node, replacement)
    # Apply our mutation
    mutator.visit(tree)
    ast.fix_missing_locations(tree)

    try:
        results = test(tree)
    except Exception:
        # Looks like this mutant was DOA.
        continue

    if not results.failed:
        survivors.append((node, replacement, results))

    # Revert our mutation
    mutator.visit(tree)

This is a bit of a simplification, but it’s actually pretty close to working code. We use multiple processes to run it in parallel (and also have a timeout based on the initial test run time, assuming we should not take more than twice as long), and compile the tree into a module to test it.
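
As a rough sketch of that last step (the names and structure here are illustrative, not the actual code): compile() accepts an AST directly, so each mutated tree can be executed into a throwaway module and its doctests run against it.

import doctest
import types

def test(tree, filename='<mutant>'):
    # Compile the (possibly mutated) AST and execute it into a fresh module.
    code = compile(tree, filename, 'exec')
    module = types.ModuleType('mutant')
    exec(code, module.__dict__)
    # Run the module's doctests, returning a TestResults(failed, attempted) tuple.
    return doctest.testmod(module, verbose=False)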

You may see the current version at pymutant. Keep in mind that this is little more than a proof of concept at this stage.


Testing our testing

So, let’s look at some toy examples, and see what the outcomes are.

def product(a, b):
    """
    >>> product(2, 2)
    4
    >>> product(2, 3)
    6
    """
    return a * b

def addition(a, b):
    """
    >>> addition(1, 0)
    1
    >>> addition(1, 1)
    2
    """
    return a + b

def negate(a):
    """
    >>> negate(1)
    -1
    >>> negate(-1)
    1
    """
    return -a

Running in a debug mode, we can view the mutants, and the results:

TestResults(failed=2, attempted=6) (a * b) -> (not (a * b))
TestResults(failed=1, attempted=6) (a * b) -> (a ** b)
TestResults(failed=2, attempted=6) (a * b) -> (a // b)
TestResults(failed=2, attempted=6) (a * b) -> 1
TestResults(failed=2, attempted=6) (a * b) -> (a % b)
TestResults(failed=2, attempted=6) (a * b) -> 0
TestResults(failed=2, attempted=6) (a * b) -> (- (a * b))
TestResults(failed=2, attempted=6) (a * b) -> (~ (a * b))
TestResults(failed=2, attempted=6) (a * b) -> True
TestResults(failed=2, attempted=6) (a * b) -> (-1)
TestResults(failed=2, attempted=6) (a * b) -> False
TestResults(failed=2, attempted=6) (a * b) -> (a / b)
TestResults(failed=1, attempted=6) (a * b) -> (a + b)
TestResults(failed=2, attempted=6) return (a * b) -> pass
TestResults(failed=2, attempted=6) (a * b) -> None
TestResults(failed=2, attempted=6) (a * b) -> (a - b)
TestResults(failed=2, attempted=6) (a + b) -> (not (a + b))
TestResults(failed=2, attempted=6) (a + b) -> (a % b)
TestResults(failed=1, attempted=6) (a + b) -> (a - b)
TestResults(failed=1, attempted=6) (a + b) -> (a ** b)
TestResults(failed=2, attempted=6) (a + b) -> (~ (a + b))
TestResults(failed=2, attempted=6) (a + b) -> (-1)
TestResults(failed=1, attempted=6) (a + b) -> 1
TestResults(failed=2, attempted=6) (a + b) -> (- (a + b))
TestResults(failed=2, attempted=6) (a + b) -> (a // b)
TestResults(failed=2, attempted=6) (a + b) -> None
TestResults(failed=2, attempted=6) (a + b) -> 0
TestResults(failed=1, attempted=6) (a + b) -> True
TestResults(failed=2, attempted=6) (a + b) -> False
TestResults(failed=2, attempted=6) return (a + b) -> pass
TestResults(failed=2, attempted=6) (a + b) -> (a / b)
TestResults(failed=2, attempted=6) (a + b) -> (a * b)
TestResults(failed=2, attempted=6) (- a) -> 0
TestResults(failed=2, attempted=6) (- a) -> (- (- a))
TestResults(failed=2, attempted=6) return (- a) -> pass
TestResults(failed=2, attempted=6) (- a) -> (not (- a))
TestResults(failed=1, attempted=6) (- a) -> True
TestResults(failed=2, attempted=6) (- a) -> False
TestResults(failed=1, attempted=6) (- a) -> (-1)
TestResults(failed=2, attempted=6) (- a) -> None
TestResults(failed=1, attempted=6) (- a) -> 1
TestResults(failed=2, attempted=6) (- a) -> (~ (- a))

===============
Mutation Report
===============

* Generated 42 mutants, and tested in 0.107773065567 seconds.

* 0 of these mutants were unable to execute correctly.

* 0 of these mutants were killed for taking too long to execute.

* Tests killed of 42 of the remaining mutants, leaving 0 survivors.

* Your innoculation rate is 100%.

Well, that’s all nice, but there’s more we can think about than this. What about tests that are “useless”? Tests that never fail, for instance?

Unfortunately, doctest.testmod() only returns the count of failures and attempts (and looking into that module, I’m not sure that information about which individual tests passed or failed is actually stored). It would be really nice to be able to capture this, but perhaps that is a task for a unittest-based approach.

What about a slightly more complex example?

 1 def aggregate(items):
 2     """
 3     Aggregate a list of 2-tuples, which refer to start/finish values.
 4 
 5     Returns a list with overlaps merged.
 6 
 7     >>> aggregate([])
 8     []
 9     >>> aggregate([(1, 3)])
10     [(1, 3)]
11     >>> aggregate([(1, 3), (2, 6)])
12     [(1, 6)]
13     >>> aggregate([(1, 3), (4, 6)])
14     [(1, 3), (4, 6)]
15     >>> aggregate([(3, 4), (1, 9)])
16     [(1, 9)]
17     """
18 
19     # Sort our items first, by the first value in the tuple. This means we can
20     # iterate through them later.
21     sorted_items = sorted(items)
22 
23     i = 0
24     while i < len(sorted_items) - 1:
25         current = sorted_items[i]
26         next = sorted_items[i + 1]
27 
28         if current[1] >= next[1]:
29             # Skip over the next item totally.
30             sorted_items.remove(next)
31             continue
32 
33         if current[1] >= next[0]:
34             # Merge the two items.
35             sorted_items[i:i+2] = ((current[0], next[1]),)
36             continue
37 
38         i += 1
39 
40     return sorted_items

And when we test with mutants (no debug mode this time, as we have 169 mutants):

===============
Mutation Report
===============

* Generated 169 mutants, and tested in 0.905333995819 seconds.

* 0 of these mutants were unable to execute correctly.

* 61 of these mutants were killed for taking too long to execute.

* Tests killed of 99 of the remaining mutants, leaving 9 survivors.

* Your innoculation rate is 94%.

Survivor Report
===============

0 at line 23, changed to False
(i < (len(sorted_items) - 1)) at line 24, changed to (- (i < (len(sorted_items) - 1)))
(i + 1) at line 26, changed to (i - 1)
(i + 1) at line 26, changed to (- (i + 1))
(i + 1) at line 26, changed to 1
(i + 1) at line 26, changed to (-1)
(i + 1) at line 26, changed to True
(current[1] >= next[1]) at line 28, changed to (- (current[1] >= next[1]))
(current[1] >= next[1]) at line 28, changed to (current[1] > next[1])

Timeout Report
==============

0 at line 23, changed to (-1)
...

I’ve omitted the remainder of the timeout report.

But, this does show us that our tests are incomplete, and perhaps what we should be doing to fix this.

In particular, note the group of mutants at line 26 that all survived: indicating that this particular line of code is not being tested well at all.

Perhaps the biggest takeaway (and an indicator of how Mutation Testing may be really useful) is the last listed mutant. It’s showing that this particular comparison is clearly not being tested for off-by-one errors.

Postgres Domains and Triggers

It occurred to me tonight (after rewriting my Tax File Number generation script in JavaScript, so it works on Text Expander Touch), that it might be nice to do TFN validation in Postgres.

You can write a function that validates a TFN (this is not the first version I wrote; this one is a bit more capable in terms of the range of values it accepts):

CREATE OR REPLACE FUNCTION valid_tfn(TEXT)
RETURNS BOOLEAN AS $$

WITH weights AS (
  SELECT
    row_number() OVER (), weight
  FROM unnest('{1,4,3,7,5,8,6,9,10}'::integer[]) weight
),
digits AS (
  SELECT
    row_number() OVER (),
    digit
  FROM (
    SELECT
      unnest(digit)::integer as digit
    FROM regexp_matches($1, '^(\d)(\d)(\d)[- ]?(\d)(\d)(\d)[- ]?(\d)(\d)(\d)$') AS digit
  ) digits
)

SELECT
  COALESCE(sum(weight * digit) % 11 = 0, FALSE)
FROM weights INNER JOIN digits USING (row_number);

$$ LANGUAGE SQL IMMUTABLE;

Once you have this (which will incidentally limit values to 9 digits, with optional '-' or ' ' separators), you may create a DOMAIN that validates values:

CREATE DOMAIN tax_file_number AS TEXT
CONSTRAINT valid_tfn CHECK (valid_tfn(VALUE));

Now, we can test it:

# SELECT valid_tfn('123-456-789') ; --> FALSE
# SELECT valid_tfn('123 456 782') ; --> TRUE
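
For comparison, here is the same weighted checksum in plain Python (purely an illustration of the algorithm; the database uses the SQL function above):

WEIGHTS = [1, 4, 3, 7, 5, 8, 6, 9, 10]

def valid_tfn(tfn):
    digits = [int(c) for c in tfn if c.isdigit()]
    if len(digits) != 9:
        return False
    return sum(w * d for w, d in zip(WEIGHTS, digits)) % 11 == 0

assert not valid_tfn('123-456-789')
assert valid_tfn('123 456 782')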

However, we might want to convert our data into a canonical “format”: in this case, always store it as XXX-XXX-XXX. We can write a function that does this:

CREATE OR REPLACE FUNCTION normalise_tfn(text)
RETURNS tax_file_number AS $$

SELECT string_agg(block, '-'::text)::tax_file_number
FROM (
  SELECT unnest(value) AS block
  FROM regexp_matches($1, '(\d\d\d)', 'g') value
) value;

$$ LANGUAGE SQL IMMUTABLE;

But how can we ensure that the data we insert is always stored in this format?

Postgres triggers to the rescue. We can’t do it as part of a Domain (although we probably could do it as part of a scalar type, but that’s a whole other kettle of fish).

Let’s set up a table to store our tax declarations in:

CREATE TABLE tax_declaration
(
  tax_declaration_id SERIAL PRIMARY KEY,
  tfn tax_file_number,
  lodgement_date date
);

And now create a trigger function and trigger:

CREATE OR REPLACE FUNCTION rewrite_tfn() RETURNS TRIGGER AS $$

BEGIN
  NEW.tfn = normalise_tfn(NEW.tfn);
  RETURN NEW;
END;

$$ LANGUAGE plpgsql;

CREATE TRIGGER rewrite_tfn BEFORE INSERT OR UPDATE ON tax_declaration
FOR EACH ROW EXECUTE PROCEDURE rewrite_tfn();

Now, we can insert some data:

INSERT INTO tax_declaration (tfn, lodgement_date)
VALUES ('123456782', now()::date);

And now we can query it:

SELECT * FROM tax_declaration;
 tax_declaration_id |     tfn     | lodgement_date
--------------------+-------------+----------------
                  1 | 123-456-782 | 2015-10-13
(1 row)

Django second AutoField

Sometimes, your ORM just seems to be out to get you.

For instance, I’ve been investigating a technique for the most important data structure in a system to be essentially immutable.

That is, instead of updating an existing instance of the object, we always create a new instance.

This requires a handful of things to be useful (and useful for querying).

  • We probably want to have a self-relation so we can see which object supersedes another. A series of objects that supersede one another is called a lifecycle.
  • We want to have a timestamp on each object, so we can view a snapshot at a given time: that is, which phase of the lifecycle was active at that point.
  • We should have a column that is unique per lifecycle: this makes querying all objects of a lifecycle much simpler (although we can use a recursive query for that).
  • There must be a facility to prevent multiple heads on a lifecycle: that is, at most one phase of a lifecycle may be non-superseded.
  • The lifecycle phases needn’t be in the same order, or really have any differentiating features (like status). In practice they may, but for the purposes of this, they are just “what it was like at that time”.

I’m not sure these ideas will ever get into a released product, but the work behind them was fun (and all my private work).

The basic model structure might look something like:

class Phase(models.Model):
    phase_id = models.AutoField(primary_key=True)
    lifecycle_id = models.AutoField(primary_key=False, editable=False)

    superseded_by = models.OneToOneField('self',
        related_name='supersedes',
        null=True, blank=True, editable=False
    )
    timestamp = models.DateTimeField(auto_now_add=True)

    # Any other fields you might want...

    objects = PhaseQuerySet.as_manager()

So, that looks nice and simple.

Our second AutoField will have a sequence generated for it, and the database will give us a unique value from a sequence when we try to create a row in the database without providing this column in the query.

However, there is one problem: Django will not let us have a second AutoField in a model. And, even if it did, there would still be some problems. For instance, every time we attempt to create a new instance, no AutoField values are sent to the database, which breaks our ability to keep the lifecycle_id between phases.

So, we will need a custom field. Luckily, all we really need is the SERIAL database type: that creates the sequence for us automatically.

class SerialField(models.Field):
    def db_type(self, connection):
        return 'serial'

So now, using that field type instead, we can write a bit more of our model:

class Phase(models.Model):
    phase_id = models.AutoField(primary_key=True)
    lifecycle_id = SerialField(editable=False)
    superseded_by = models.OneToOneField('self', ...)
    timestamp = models.DateTimeField(auto_now_add=True)

    def save(self, **kwargs):
        self.pk = None
        super(Phase, self).save(**kwargs)

This now ensures each time we save our object, a new instance is created. The lifecycle_id will stay the same.

Still not totally done though. We currently aren’t handling a newly created lifecycle (which should be handled by the associated postgres sequence), nor are we marking the previous instance as superseded.

It’s possible, using some black magic, to get the default value for a database column, and, in this case, execute a query with that default to get the next value. However, that’s pretty horrid: not to mention it also runs an extra two queries.

Similarly, we want to get the phase_id of the newly created instance, and set that as the superseded_by of the old instance. This would require yet another query, after the INSERT, but also has the sinister side-effect of making us unable to apply the not-superseded-by-per-lifecycle requirement.

As an aside, we can investigate storing the self-relation on the other end - this would enable us to just do:

    def save(self, **kwargs):
        self.supersedes = self.pk
        self.pk = None
        super(Phase, self).save(**kwargs)

However, this turns out to be less useful when querying: we are much more likely to be interested in phases that are not superseded, as they are the “current” phase of each lifecycle. Although we could still query for those, it would mean running sub-queries for each row.

Our two issues: setting the lifecycle, and storing the superseding data, can be done with one Postgres BEFORE INSERT trigger function:

CREATE FUNCTION lifecycle_and_supersedes()
RETURNS TRIGGER AS $$

  BEGIN
    IF NEW.lifecycle_id IS NULL THEN
      NEW.lifecycle_id = nextval('app_phase_lifecycle_id_seq'::regclass);
    ELSE
      NEW.phase_id = nextval('app_phase_phase_id_seq'::regclass);
      UPDATE app_phase
        SET superseded_by_id = NEW.phase_id
        WHERE lifecycle_id = NEW.lifecycle_id
        AND superseded_by_id IS NULL;
    END IF;
    RETURN NEW;
  END;

$$ LANGUAGE plpgsql VOLATILE;

CREATE TRIGGER lifecycle_and_supersedes
  BEFORE INSERT ON app_phase
  FOR EACH ROW
  EXECUTE PROCEDURE lifecycle_and_supersedes();

So, now all we need to do is prevent multiple-headed lifecycles. We can do this using a UNIQUE INDEX:

CREATE UNIQUE INDEX prevent_hydra_lifecycles
ON app_phase (lifecycle_id)
WHERE superseded_by_id IS NULL;

Wow, that was simple.

So, we have most of the db-level code written. How do we use our model? We can write some nice queryset methods to make getting the various bits easier:

class PhaseQuerySet(models.query.QuerySet):
    def current(self):
        return self.filter(superseded_by=None)

    def superseded(self):
        return self.exclude(superseded_by=None)

    def initial(self):
        return self.filter(supersedes=None)

    def snapshot_at(self, timestamp):
        return self.filter(timestamp__lte=timestamp).order_by('lifecycle_id', '-timestamp').distinct('lifecycle_id')

The queries generated by the ORM for these should be pretty good: we could look at sticking an index on the lifecycle_id column.

There is one more thing to say on the lifecycle: we can add a model method to fetch the complete lifecycle for a given phase, too:

    def lifecycle(self):
        return self.model.objects.filter(lifecycle_id=self.lifecycle_id)

(That was why I used the lifecycle_id as the column).
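
Putting it together, usage might look something like this (a sketch only; it assumes the trigger and queryset methods above are in place):

phase = Phase.objects.create()     # begins a new lifecycle
phase.save()                       # creates a new phase in the same lifecycle

current = Phase.objects.current()  # at most one non-superseded phase per lifecycle
history = phase.lifecycle()        # every phase sharing this phase's lifecycle_id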


Whilst building this prototype, I came across a couple of things that were also interesting. The first was a mechanism to get the default value for a column:

from django.db import connection

def database_default(table, column):
    cursor = connection.cursor()
    QUERY = """SELECT d.adsrc AS default_value
               FROM   pg_catalog.pg_attribute a
               LEFT   JOIN pg_catalog.pg_attrdef d ON (a.attrelid, a.attnum)
                                                   = (d.adrelid,  d.adnum)
               WHERE  NOT a.attisdropped   -- no dropped (dead) columns
               AND    a.attnum > 0         -- no system columns
               AND    a.attrelid = %s::regclass
               AND    a.attname = %s"""
    cursor.execute(QUERY, [table, column])
    cursor.execute('SELECT {}'.format(*cursor.fetchone()))
    return cursor.fetchone()[0]

You can probably see why I didn’t want to use this. Other than the aforementioned two extra queries, it’s executing a query with data that comes back from the database. It may be possible to inject a default value into a table that causes it to do Very Bad Things™. We could sanitise it, perhaps ensure it matches a regular expression:

NEXTVAL = re.compile(r"^nextval\('(?P<sequence>[a-zA-Z_0-9]+)'::regclass\)$")
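
A hypothetical sketch of how that check might be used: take the raw default expression fetched by the catalog query above, and only execute it if it is a plain nextval() call (the names here are illustrative):

def next_from_default(default_value):
    # Only execute the default if it looks like a nextval() call on a sequence.
    match = NEXTVAL.match(default_value)
    if match is None:
        raise ValueError('Refusing to execute unexpected default: %r' % default_value)
    cursor = connection.cursor()
    cursor.execute('SELECT nextval(%s)', [match.group('sequence')])
    return cursor.fetchone()[0]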

However, the trigger-based approach is nicer in every way.

The other thing I discovered, and this one is really nice, is a way to create an exclusion constraint that only applies if a column is NULL. For instance, ensure that no two classes for a given student overlap, but only if they are not superseded (or deleted).

ALTER TABLE "student_enrolments"
ADD CONSTRAINT "prevent_overlaps"
EXCLUDE USING gist(period WITH &&, student_id WITH =)
WHERE (
  superseded_by_id IS NULL
  AND
  status <> 'deleted'
);

Django Proxy Model Relations

I’ve got lots of code I’d do a different way if I were to start over, but often, we have to live with what we have.

One situation I would seriously reconsider is the structure I use for storing data related to how I interact with external systems. I have an Application object, and I create instances of this for each external system I interact with. Each new Application gets a UUID, and is created as part of a migration. Code in the system uses this UUID to determine if something is for that system.

But that’s not the worst of it.

I also have an AppConfig object, and other related objects that store a relation to an Application. This was fine initially, but as my code got more complex, I hit upon the idea of using Django’s Proxy models, and using the related Application to determine the subclass. So, I have AppConfig subclasses for a range of systems. This is nice: we can even ensure that we only get the right instances (using a lookup to the application to get the discriminator, which I’d probably do a different way next time).

However, we also have other bits of information that we need to store, which have a relation to this AppConfig object.

And here is where we run into problems. Eventually, I had the need to subclass these other objects, and deal with them. That gives a similar benefit to the above for fetching filtered lists of objects; however, when we try to follow the relations between these, something annoying happens.

Instead of getting the subclass of AppConfig that we probably want to use (because the business logic hangs off that), we instead get the actual AppConfig instances. So, in order to get the subclass, we have to fetch the object again, or swizzle the __class__. And going back the other way would have the same problem.

Python is a dynamic language, so we should be able to do better.

In theory, all we have to do is replace the attributes on the two classes with ones that will do what we want them to do. In practice, we need to muck around a little bit more to make sure it all works out right.

It would be nice to be able to decorate the declaration of the overridden field, but that’s not valid python syntax:

>>> class Foo(object):
...   @override
...   bar = object()
  File "<stdin>", line 3
    bar = object()
    ^
SyntaxError: invalid syntax

So, we’ll have to do one of two things: alter the class after it has been defined, or leverage the metaclass magic Django already does.

class Foo(models.Model):
    bar = models.ForeignKey('bar.Bar')


class FooProxy(Foo):
    bar = ProxyForeignKey('bar.BarProxy')  # Note the proxy class

    class Meta:
      proxy = True

However, we can’t just use the contribute_to_class(cls, name) method as-is, as the Proxy model attributes get dealt with before the parent model. So, we need to register a signal, and get the framework to run our code after the class has been prepared:

class ProxyField(object):
    def __init__(self, field):
        self.field = field

    def contribute_to_class(self, model, name):
        @receiver(models.signals.class_prepared, sender=model, weak=False)
        def late_bind(sender, *args, **kwargs):
          override_model_field(model, name, self.field)


class ProxyForeignKey(ProxyField):
    def __init__(self, *args, **kwargs):
        super(ProxyForeignKey, self).__init__(ForeignKey(*args, **kwargs))

Then, it’s a matter of working out what needs to happen to override_model_field.

It turns out: not much. Until we start thinking about edge cases, anyway:

def override_model_field(model, name, field):
    original_field = model._meta.get_field(name)

    model.add_to_class(name, field)
    if field.rel:
      field.rel.to.add_to_class(
        field.related_name,
        ForeignRelatedObjectsDescriptor(field.related)
      )

There is a little more to it than that:

  • We need to use the passed-in related_name if one was provided in the new field definition, else we want to use what the original field’s related name was. However, if not explicitly set, then neither field will actually have a related_name attribute.
  • We cannot allow an override of a foreign key to a non-proxy model: that would hijack the original model’s related queryset.
  • Similarly, we can only allow a single proxy to override-relate to another proxy: any subsequent override-relations would likewise hijack the related object queryset.
  • For non-related fields, we can only allow an override if the field is compatible. What that means I’m not completely sure just yet, but for now, we will only allow the same field class (or a subclass). Things that would require a db change would be verboten.

So, we need to guard our override_model_field somewhat:

def override_model_field(model, name, field):
    original_field = model._meta.get_field(name)

    if not isinstance(field, original_field.__class__):
        raise TypeError('...')

    # Must do these checks before the `add_to_class`, otherwise it breaks tests.
    if field.rel:
        if not field.rel.to._meta.proxy:
            raise TypeError('...')

        related_name = getattr(field, 'related_name', original_field.related.get_accessor_name())
        related_model = getattr(field.rel.to, related_name).related.model

        # Do we already have an overridden relation to this model?
        if related_model._meta.proxy:
            raise TypeError('...')

    model.add_to_class(name, field)

    if field.rel:
        field.rel.to.add_to_class(
          related_name,
          ForeignRelatedObjectsDescriptor(field.related)
        )

There is an installable app that includes tests: django-proxy-overrides.

Bowled over by Postgres

There’s a nice article at Bowled Over by SQL Window Functions by a chap called Dwain Camps. It’s written from the perspective of T-SQL, which has some differences to Postgres’s DDL and querying. I’ve reworked his stuff into what feels nice for me from a Postgressy perspective.

I’ll recap a couple of things he mentions, but you’ll probably want to head there and read that first.

  • Strike: all pins knocked down on the first ball of a frame. Scores 10 (from this frame), plus the total of whatever the next two balls knock down.
  • Spare: all pins knocked down on the second ball of a frame. Scores 10 (from this frame), plus how ever many pins you knock down on the next ball.
  • Open: at least one pin remains standing at the end of the frame. Score is how ever many pins you knocked down.

By convention, only the running tally is shown on the scoresheet. I’ve kept the frame score as the score for this frame only, and the total will contain the running total.
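
To make those rules concrete before diving into SQL, here is the same scoring logic as a small Python function (purely an illustration; the rest of the post does this in the database):

def game_score(frames):
    """Score a complete game. `frames` is ten tuples of per-ball pin counts,
    with None for balls that were never bowled."""
    balls = [b for frame in frames for b in frame if b is not None]
    total = 0
    i = 0
    for frame in range(10):
        if balls[i] == 10:                    # strike: 10 plus the next two balls
            total += 10 + balls[i + 1] + balls[i + 2]
            i += 1
        elif balls[i] + balls[i + 1] == 10:   # spare: 10 plus the next ball
            total += 10 + balls[i + 2]
            i += 2
        else:                                 # open frame
            total += balls[i] + balls[i + 1]
            i += 2
    return total

# The perfect game from later in the post:
assert game_score([(10, None, None)] * 9 + [(10, 10, 10)]) == 300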


The first thing I do a bit differently is to use Postgres’ DOMAIN structures, which enables us to remove some of the check constraints, and simplify some others:

CREATE SCHEMA bowling;

CREATE DOMAIN bowling.frame_number AS integer
  CHECK ('[1,10]'::int4range @> VALUE)
  NOT NULL;

CREATE DOMAIN bowling.ball AS integer
  CHECK ('[0,10]'::int4range @> VALUE);

So, now we have two integer domains: the frame number may be between 1 and 10, and the ball pin count may be null, or between 0 and 10.

We’ll start by recreating the table structure: initially without constraints:

CREATE TABLE bowling.frame
(
  game_id INTEGER NOT NULL,
  player_id INTEGER NOT NULL,
  frame bowling.frame_number NOT NULL,
  ball1 bowling.ball NOT NULL,
  ball2 bowling.ball NULL,
  ball3 bowling.ball NULL,
  PRIMARY KEY (game_id, player_id, frame)
);

Not much different to the original, other than using those fresh new domain types.

The other approach I’ve used is to use more constraints, but make them simpler. I’m also relying on the fact that X + NULL => NULL, which means we can leave off a heap of the constraint clauses.

We’ll start by preventing the (non-final) frames from exceeding the number of available pins. In the case of the final frame, we still only allow a spare unless we have a strike already.

ALTER TABLE bowling.frame
ADD CONSTRAINT max_spare_unless_frame_10_strike CHECK
(
  ball1 + ball2 <= 10 OR (frame = 10 AND ball1 = 10)
);

This is as simple as it can be. Because ball 2 may be null but ball 1 may not, and each ball must be no greater than 10, this is enough to encapsulate the requirement. There is one slightly incorrect circumstance: a value of (10, 0) would be valid, which is strictly incorrect (ball 2 was never bowled). In the case of the calculations it’s all correct, but if a 0 was bowled immediately after a strike, it may be possible to insert that as ball 2, which would be misleading.

ALTER TABLE bowling.frame
ADD CONSTRAINT ball_2_never_bowled_after_strike CHECK
(
  ball2 IS NULL OR ball1 < 10 OR frame = 10
);

We can now prevent ball 3 from being set unless we are on frame 10.

ALTER TABLE bowling.frame
ADD CONSTRAINT ball_3_only_in_frame_10 CHECK
(
  ball3 IS NULL OR frame = 10
);

A follow-up to the previous constraint: we only get to bowl ball 3 if we have bowled ball 2, and scored a strike or spare already. Note, we may have a strike on the first ball, which means we might have more than 10.

ALTER TABLE bowling.frame
ADD CONSTRAINT ball3_only_if_eligible CHECK
(
  ball3 IS NULL OR (ball2 IS NOT NULL AND ball1 + ball2 >= 10)
);

Finally, we have some specific allowable conditions for that last ball. We already know that we must have scored a strike or spare with the first two balls, but we need to know how many pins are available to us.

If we scored a spare or two strikes, then we may have any number up to 10 to score now. Otherwise, we have however many pins were left by ball 2 (which means ball 1 must have been a strike).

ALTER TABLE bowling.frame
ADD CONSTRAINT ball3_max_spare_or_strike CHECK
(
  ball2 + ball3 <= 10
  OR
  ball1 + ball2 = 20
  OR
  ball1 + ball2 = 10
);

I find those constraints much easier to read than the original ones.


I’ve written a view, that uses a couple of Common Table Expressions (CTEs), as well as the window functions Dwain discussed.

CREATE OR REPLACE VIEW bowling.frame_score AS (
  WITH pin_counts AS (
    SELECT
      game_id,
      player_id,
      frame,
      ball1, ball2, ball3,
      -- Get the first ball from the next frame.
      -- Used by strike and spare.
      LEAD(ball1, 1) OVER (
        PARTITION BY game_id, player_id
        ORDER BY frame
      ) AS next_ball_1,
      -- Get the second ball from the next frame.
      -- Used by strike.
      LEAD(ball2, 1) OVER (
        PARTITION BY game_id, player_id
        ORDER BY frame
      ) AS next_ball_2,
      -- Get the first ball from the next next frame.
      -- Used by double-strike.
      LEAD(ball1, 2) OVER (
        PARTITION BY game_id, player_id
        ORDER BY frame
      ) AS next_next_ball_1
    FROM bowling.frame
  ),
  frame_counts AS (
    SELECT
      game_id,
      player_id,
      frame,
      ball1, ball2, ball3,
      CASE
      -- We will start with frame 10: when we have a strike
      -- or spare, we get all three balls.
      WHEN frame = 10 AND ball1 + ball2 >= 10 THEN
        ball1 + ball2 + ball3
      -- On a strike, we get the next two balls. This could
      -- be from the next frame, or include the first ball
      -- of the frame after that. Note that in frame 9, we will
      -- also look at the second ball from frame 10, rather than
      -- looking for a non-existent frame 11.
      WHEN ball1 = 10 THEN
        ball1 + next_ball_1 + (
          CASE WHEN next_ball_1 = 10 AND frame < 9 THEN
            next_next_ball_1
          ELSE
            next_ball_2
          END
        )
      -- In the case of a spare, grab the next ball. We already
      -- handled a spare on frame 10 above.
      WHEN ball1 + ball2 = 10 THEN
        ball1 + ball2 + next_ball_1
      -- Otherwise, it's just the two balls we bowled in this frame.
      ELSE
        ball1 + ball2
      END AS score
    FROM pin_counts
  )

  -- We have everything we need in the previous CTE, except that
  -- shows us the frame score, rather than the running tally.
  -- We need to do that in another window function here, but
  -- only calculate a value when the frame's wave function has
  -- collapsed (ie, it's score is known).
  SELECT
    frame_counts.*,
    CASE WHEN score IS NOT NULL THEN
      SUM(score) OVER (
        PARTITION BY game_id, player_id
        ORDER BY frame
        ROWS UNBOUNDED PRECEDING
      )
    ELSE NULL END AS total
  FROM frame_counts
);

Again, I think this is a simpler query, and easier to read. But, I guess I wrote it.

We can insert the same data as used there, and look at our results:

-- Game 1
INSERT INTO bowling.frame VALUES
  (1, 1, 1, 7, 2, NULL),
  (1, 1, 2, 3, 7, NULL),
  (1, 1, 3, 6, 4, NULL),
  (1, 1, 4, 10, NULL, NULL),
  (1, 1, 5, 10, NULL, NULL),
  (1, 1, 6, 10, NULL, NULL),
  (1, 1, 7, 9, 1, NULL),
  (1, 1, 8, 10, NULL, NULL),
  (1, 1, 9, 8, 1, NULL),
  (1, 1, 10, 6, 3, NULL);

-- Game 2
INSERT INTO bowling.frame VALUES
  (2, 1, 1, 10, NULL, NULL),
  (2, 1, 2, 3, 7, NULL),
  (2, 1, 3, 10, NULL, NULL),
  (2, 1, 4, 6, 4, NULL),
  (2, 1, 5, 10, NULL, NULL),
  (2, 1, 6, 9, 1, NULL),
  (2, 1, 7, 10, NULL, NULL),
  (2, 1, 8, 8, 2, NULL),
  (2, 1, 9, 10, NULL, NULL),
  (2, 1, 10, 7, 3, 10);

-- Game 3
INSERT INTO bowling.frame VALUES
  (3, 1, 1, 10, NULL, NULL),
  (3, 1, 2, 10, NULL, NULL),
  (3, 1, 3, 10, NULL, NULL),
  (3, 1, 4, 10, NULL, NULL),
  (3, 1, 5, 10, NULL, NULL),
  (3, 1, 6, 10, NULL, NULL),
  (3, 1, 7, 10, NULL, NULL),
  (3, 1, 8, 10, NULL, NULL),
  (3, 1, 9, 10, NULL, NULL),
  (3, 1, 10, 10, 10, 10);

$ SELECT * FROM frame_score;

 game_id | player_id | frame | ball1 | ball2  | ball3  | score | total
---------+-----------+-------+-------+--------+--------+-------+-------
       1 |         1 |     1 |     7 |      2 | <NULL> |     9 |     9
       1 |         1 |     2 |     3 |      7 | <NULL> |    16 |    25
       1 |         1 |     3 |     6 |      4 | <NULL> |    20 |    45
       1 |         1 |     4 |    10 | <NULL> | <NULL> |    30 |    75
       1 |         1 |     5 |    10 | <NULL> | <NULL> |    29 |   104
       1 |         1 |     6 |    10 | <NULL> | <NULL> |    20 |   124
       1 |         1 |     7 |     9 |      1 | <NULL> |    20 |   144
       1 |         1 |     8 |    10 | <NULL> | <NULL> |    19 |   163
       1 |         1 |     9 |     8 |      1 | <NULL> |     9 |   172
       1 |         1 |    10 |     6 |      3 | <NULL> |     9 |   181
       2 |         1 |     1 |    10 | <NULL> | <NULL> |    20 |    20
       2 |         1 |     2 |     3 |      7 | <NULL> |    20 |    40
       2 |         1 |     3 |    10 | <NULL> | <NULL> |    20 |    60
       2 |         1 |     4 |     6 |      4 | <NULL> |    20 |    80
       2 |         1 |     5 |    10 | <NULL> | <NULL> |    20 |   100
       2 |         1 |     6 |     9 |      1 | <NULL> |    20 |   120
       2 |         1 |     7 |    10 | <NULL> | <NULL> |    20 |   140
       2 |         1 |     8 |     8 |      2 | <NULL> |    20 |   160
       2 |         1 |     9 |    10 | <NULL> | <NULL> |    20 |   180
       2 |         1 |    10 |     7 |      3 |     10 |    20 |   200
       3 |         1 |     1 |    10 | <NULL> | <NULL> |    30 |    30
       3 |         1 |     2 |    10 | <NULL> | <NULL> |    30 |    60
       3 |         1 |     3 |    10 | <NULL> | <NULL> |    30 |    90
       3 |         1 |     4 |    10 | <NULL> | <NULL> |    30 |   120
       3 |         1 |     5 |    10 | <NULL> | <NULL> |    30 |   150
       3 |         1 |     6 |    10 | <NULL> | <NULL> |    30 |   180
       3 |         1 |     7 |    10 | <NULL> | <NULL> |    30 |   210
       3 |         1 |     8 |    10 | <NULL> | <NULL> |    30 |   240
       3 |         1 |     9 |    10 | <NULL> | <NULL> |    30 |   270
       3 |         1 |    10 |    10 |     10 |     10 |    30 |   300
(30 rows)

Building the average scores for a player is likewise similar. Because I’m using a VIEW, I can just reference that.

SELECT
  player_id,
  AVG(total) as average
FROM frame_score
WHERE frame=10
GROUP BY player_id;
 player_id |       average
-----------+----------------------
         1 | 227.0000000000000000
(1 row)

I’m fairly sure I’ve rewritten the constraints correctly, but may have missed some. Here are some test inserts showing the constraints rejecting invalid data:

$ INSERT INTO bowling.frame VALUES(1, 2, 0, 9, NULL, NULL);
ERROR:  value for domain frame_number violates check constraint "frame_number_check"
Time: 0.405 ms
$ INSERT INTO bowling.frame VALUES(1, 2, 1, 11, NULL, NULL);
ERROR:  value for domain ball violates check constraint "ball_check"
Time: 0.215 ms
$ INSERT INTO bowling.frame VALUES(1, 2, 1, -1, NULL, NULL);
ERROR:  value for domain ball violates check constraint "ball_check"
Time: 0.218 ms
$ INSERT INTO bowling.frame VALUES(1, 2, 1, 8, 3, NULL);
ERROR:  new row for relation "frame" violates check constraint "max_spare_unless_frame_10_strike"
DETAIL:  Failing row contains (1, 2, 1, 8, 3, null).
Time: 0.332 ms
$ INSERT INTO bowling.frame VALUES(1, 2, 1, 8, 1, 1);
ERROR:  new row for relation "frame" violates check constraint "ball3_only_if_eligible"
DETAIL:  Failing row contains (1, 2, 1, 8, 1, 1).
Time: 0.392 ms
$ INSERT INTO bowling.frame VALUES(1, 2, 1, 8, 2, 1);
ERROR:  new row for relation "frame" violates check constraint "ball_3_only_in_frame_10"
DETAIL:  Failing row contains (1, 2, 1, 8, 2, 1).
Time: 0.327 ms
$ INSERT INTO bowling.frame VALUES(1, 2, 10, 8, 3, 1);
ERROR:  new row for relation "frame" violates check constraint "max_spare_unless_frame_10_strike"
DETAIL:  Failing row contains (1, 2, 10, 8, 3, 1).
Time: 0.340 ms
$ INSERT INTO bowling.frame VALUES(1, 2, 10, 8, 2, 11);
ERROR:  value for domain ball violates check constraint "ball_check"
Time: 0.200 ms
$ INSERT INTO bowling.frame VALUES(1, 2, 10, 10, NULL, 10);
ERROR:  new row for relation "frame" violates check constraint "ball3_only_if_eligible"
DETAIL:  Failing row contains (1, 2, 10, 10, null, 10).
Time: 0.316 ms
$ INSERT INTO bowling.frame VALUES(1, 2, 5, 10, 0, NULL);
ERROR:  new row for relation "frame" violates check constraint "ball_2_never_bowled_after_strike"
DETAIL:  Failing row contains (1, 2, 5, 10, 0, null).
Time: 0.307 ms
$ INSERT INTO bowling.frame VALUES(1, 2, 10, 10, 2, 9);
ERROR:  new row for relation "frame" violates check constraint "ball3_max_spare_or_strike"
DETAIL:  Failing row contains (1, 2, 10, 10, 2, 9).
Time: 0.323 ms

Finally, I rewrote the pretty printer. It’s not quite perfect (I don’t like how I get the plus signs at the newline character), but it is workable:

WITH symbols AS (
  SELECT
    game_id, player_id, frame,
    CASE WHEN ball1 = 10 THEN 'X' ELSE ball1::text END as ball1,
    CASE WHEN ball2 IS NULL THEN ' '
         WHEN ball1 + ball2 = 10 THEN '/'
         WHEN ball1 = 10 AND ball2 = 10 THEN 'X'
         ELSE ball2::text
         END as ball2,
    CASE WHEN ball3 IS NULL THEN ' '
    WHEN ball3 = 10 THEN 'X'
    WHEN ball3 + ball2 = 10 THEN '/'
    ELSE ball3::text
    END as ball3,
    lpad(total::text, 5, ' ') as total
  FROM
    frame_score
  ORDER BY game_id, player_id, frame
), grouped_data AS (
  SELECT
    game_id,
    player_id,
    array_agg(ball1) ball1,
    array_agg(ball2) ball2,
    array_agg(ball3) ball3,
    array_agg(total) total
  FROM
    symbols
  GROUP BY
    game_id, player_id
)
SELECT
  game_id,
  player_id,
  ball1[1] || ' | ' || ball2[1] || ' ' || chr(10) || total[1] AS "1",
  ball1[2] || ' | ' || ball2[2] || ' ' || chr(10) || total[2] AS "2",
  ball1[3] || ' | ' || ball2[3] || ' ' || chr(10) || total[3] AS "3",
  ball1[4] || ' | ' || ball2[4] || ' ' || chr(10) || total[4] AS "4",
  ball1[5] || ' | ' || ball2[5] || ' ' || chr(10) || total[5] AS "5",
  ball1[6] || ' | ' || ball2[6] || ' ' || chr(10) || total[6] AS "6",
  ball1[7] || ' | ' || ball2[7] || ' ' || chr(10) || total[7] AS "7",
  ball1[8] || ' | ' || ball2[8] || ' ' || chr(10) || total[8] AS "8",
  ball1[9] || ' | ' || ball2[9] || ' ' || chr(10) || total[9] AS "9",
  ball1[10] || ' | ' || ball2[10] || ' | ' || ball3[10] || ' ' || chr(10) || lpad(total[10], 9, ' ') AS "10"
FROM grouped_data;
 game_id | player_id |   1    |   2    |   3    |   4    |   5    |   6    |   7    |   8    |   9    |     10
---------+-----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+------------
       1 |         1 | 7 | 2 +| 3 | / +| 6 | / +| X |   +| X |   +| X |   +| 9 | / +| X |   +| 8 | 1 +| 6 | 3 |   +
         |           |     9  |    25  |    45  |    75  |   104  |   124  |   144  |   163  |   172  |       181
       2 |         1 | X |   +| 3 | / +| X |   +| 6 | / +| X |   +| 9 | / +| X |   +| 8 | / +| X |   +| 7 | / | X +
         |           |    20  |    40  |    60  |    80  |   100  |   120  |   140  |   160  |   180  |       200
       3 |         1 | X |   +| X |   +| X |   +| X |   +| X |   +| X |   +| X |   +| X |   +| X |   +| X | X | X +
         |           |    30  |    60  |    90  |   120  |   150  |   180  |   210  |   240  |   270  |       300
(3 rows)

That will do for now. Corrections welcome!

Humax 4tune

I recently bought a Humax 4tune (hms1000t). I’m rather liking it so far: 4 tuners, and a reasonable UI.

I would like to be able to get it to do a bit more, so I’ve started investigating it. It shows the following ports as being open:

Port Scan has started…

Port Scanning host: 10.0.1.30

	 Open TCP Port: 	21     		ftp
	 Open TCP Port: 	139    		netbios-ssn
	 Open TCP Port: 	445    		microsoft-ds
	 Open TCP Port: 	9000   		cslistener
	 Open TCP Port: 	9251
	 Open TCP Port: 	9252
	 Open TCP Port: 	36666
	 Open TCP Port: 	36667
	 Open TCP Port: 	48629
	 Open TCP Port: 	50001
	 Open TCP Port: 	55200
	 Open TCP Port: 	56789
	 Open TCP Port: 	56790
	 Open TCP Port: 	58816
	 Open TCP Port: 	60001

Port Scan has completed…

Obviously, I’ve turned on the FTP and Samba services.

Port 9000 appears to be an RTSP service:

$ curl http://<ip-address>:9000
RTSP

Ports 9251 and 9252 respond with:

{"errorCode": 404, "errorDescription":"Specified WID not found"}

The latter requires HTTPS, but uses a certificate for the domain name htvdev.humaxtvportal.com. The humaxtvportal.com domain is registered.

Port 36666 responds with:

RTSP/1.0 200 OK
CSeq: (null)
Apple-Jack-Status: connected; type=analog

Port 50001 responds with an XML document:

<?xml version="1.0" encoding="utf-8"?>
<root xmlns="urn:schemas-upnp-org:device-1-0" xmlns:dlna="urn:schemas-dlna-org:device-1-0" xmlns:hmx="urn:schemas-humax-co-kr:metadata-1-0"  >
  <specVersion>
    <major>1</major>
    <minor>0</minor>
  </specVersion>
  <device>
    <deviceType>urn:schemas-upnp-org:device:MediaServer:1</deviceType>
    <hmx:X_HMXCAP>TSR,MULTIVALUE,RCUTYPE_LIME,HX_N_ALL_P</hmx:X_HMXCAP>
    <dlna:X_DLNACAP xmlns:dlna="urn:schemas-dlna-org:device-1-0"></dlna:X_DLNACAP>
    <dlna:X_DLNADOC xmlns:dlna="urn:schemas-dlna-org:device-1-0">DMS-1.50</dlna:X_DLNADOC>
    <friendlyName>Lounge 10.0.1.30</friendlyName>
    <manufacturer>HUMAX</manufacturer>
    <manufacturerURL>www.humaxdigital.com</manufacturerURL>
    <modelDescription>HUMAX Set-Top Box OCTO2.0 Platform</modelDescription>
    <modelName>HMS1000T</modelName>
    <modelNumber>undefined</modelNumber>
    <modelURL>http://www.humaxdigital.com</modelURL>
    <serialNumber>None</serialNumber>
    <UDN>uuid:B38D5042-B86E-44F0-A758-F4F8BD6E3053</UDN>
    <icube:X_WOON_PAIRING xmlns:icube="urn:schemas-icube-co-kr:WoonDevice-1-0">hmx_base_08eb74e20b66</icube:X_WOON_PAIRING>
    <iconList>
      <icon>
        <mimetype>image/jpeg</mimetype>
        <width>48</width>
        <height>48</height>
        <depth>24</depth>
        <url>/icon/dms_48.jpg</url>
      </icon>
      <icon>
        <mimetype>image/png</mimetype>
        <width>48</width>
        <height>48</height>
        <depth>24</depth>
        <url>/icon/dms_48.png</url>
      </icon>
      <icon>
        <mimetype>image/jpeg</mimetype>
        <width>120</width>
        <height>120</height>
        <depth>24</depth>
        <url>/icon/dms_120.jpg</url>
      </icon>
      <icon>
        <mimetype>image/png</mimetype>
        <width>120</width>
        <height>120</height>
        <depth>24</depth>
        <url>/icon/dms_120.png</url>
      </icon>
    </iconList>
    <serviceList>
      <service>
        <serviceType>urn:schemas-upnp-org:service:ConnectionManager:1</serviceType>
        <serviceId>urn:upnp-org:serviceId:ConnectionManager</serviceId>
        <SCPDURL>ConnectionManager/scpd.xml</SCPDURL>
        <controlURL>ConnectionManager/control</controlURL>
        <eventSubURL>ConnectionManager/event</eventSubURL>
      </service>
      <service>
        <serviceType>urn:schemas-upnp-org:service:ContentDirectory:1</serviceType>
        <serviceId>urn:upnp-org:serviceId:ContentDirectory</serviceId>
        <SCPDURL>ContentDirectory/scpd.xml</SCPDURL>
        <controlURL>ContentDirectory/control</controlURL>
        <eventSubURL>ContentDirectory/event</eventSubURL>
      </service>
    </serviceList>
  </device>
</root>

This appears to be DLNA-related.

Likewise with port 55200:

<?xml version="1.0" encoding="utf-8"?>
<root xmlns="urn:schemas-upnp-org:device-1-0">
  <specVersion>
    <major>1</major>
    <minor>0</minor>
  </specVersion>
  <device>
    <dlna:X_DLNADOC xmlns:dlna="urn:schemas-dlna-org:device-1-0">DMR-1.50</dlna:X_DLNADOC>
    <deviceType>urn:schemas-upnp-org:device:MediaRenderer:1</deviceType>
    <friendlyName>Lounge</friendlyName>
    <manufacturer>HUMAX</manufacturer>
    <manufacturerURL>http://www.humaxdigital.com</manufacturerURL>
    <modelDescription>HUMAX Set-Top Box OCTO2.0 Platform</modelDescription>
    <modelName>HMS1000T</modelName>
    <modelNumber>undefined</modelNumber>
    <modelURL>http://www.humaxdigital.com</modelURL>
    <serialNumber>[redacted]</serialNumber>
    <UDN>uuid:B38D5042-B86E-44F0-A758-F4F8BD6E3053</UDN>
    <serviceList>
      <service>
        <serviceType>urn:schemas-upnp-org:service:AVTransport:1</serviceType>
        <serviceId>urn:upnp-org:serviceId:AVTransport</serviceId>
        <SCPDURL>AVTransport/scpd.xml</SCPDURL>
        <controlURL>AVTransport/control</controlURL>
        <eventSubURL>AVTransport/event</eventSubURL>
      </service>
      <service>
        <serviceType>urn:schemas-upnp-org:service:ConnectionManager:1</serviceType>
        <serviceId>urn:upnp-org:serviceId:ConnectionManager</serviceId>
        <SCPDURL>ConnectionManager/scpd.xml</SCPDURL>
        <controlURL>ConnectionManager/control</controlURL>
        <eventSubURL>ConnectionManager/event</eventSubURL>
      </service>
      <service>
        <serviceType>urn:schemas-upnp-org:service:RenderingControl:1</serviceType>
        <serviceId>urn:upnp-org:serviceId:RenderingControl</serviceId>
        <SCPDURL>RenderingControl/scpd.xml</SCPDURL>
        <controlURL>RenderingControl/control</controlURL>
        <eventSubURL>RenderingControl/event</eventSubURL>
      </service>
    </serviceList>
  </device>
</root>

Port 56789 returns a 404: I’m guessing we can figure out a path that might work.

Port 56790 is a “DIAL SERVER”:

<?xml version="1.0" encoding="utf-8"?>
<root xmlns="urn:schemas-upnp-org:device-1-0" xmlns:r="urn:restful-tv-org:schemas:upnp-dd">
  <specVersion>
    <major>1</major>
    <minor>0</minor>
  </specVersion>
  <device>
    <deviceType>urn:schemas-upnp-org:device:tvdevice:1</deviceType>
    <friendlyName>HMS1000T - DIAL SERVER:10.0.1.30</friendlyName>
    <manufacturer>HUMAX</manufacturer>
    <manufacturerURL>http://www.humaxdigital.com</manufacturerURL>
    <modelDescription>HUMAX Set-Top Box OCTO2.0 Platform</modelDescription>
    <modelName>HMS1000T</modelName>
    <modelNumber>undefined</modelNumber>
    <modelURL>http://www.humaxdigital.com</modelURL>
    <serialNumber>[redacted]</serialNumber>
    <UDN>uuid:B38D5042-B86E-44F0-A758-F4F8BD6E3053</UDN>
    <serviceList>
      <service>
        <serviceType>urn:dial-multiscreen-org:service:dial:1</serviceType>
        <serviceId>urn:dial-multiscreen-org:serviceId:dialControl</serviceId>
        <SCPDURL>dial/scpd.xml</SCPDURL>
        <controlURL/>
        <eventSubURL/>
      </service>
    </serviceList>
  </device>
</root>

Port 58816 rejects connections, but port 60001 is an interesting one:

$ telnet [address] 60001
Trying [address]...
Connected to [address].
Escape character is '^]'.
root:> help
* help [root:]
** registered process list
    corsair                     homma                       prism                       obama
    sama                        ipepg
** helpall : display all command set
root:> helpall
* help all
[corsair]
  [mem]
     printmem         :  print memory usage to outputfile
     memconf          :  configuration of printmem
     memcheck         :  check memory corruption - if corrupted, log in corrupted.log
     currsec          :  print the current second from the boot
  [RLIB]
     print_rsc        :  print Resource Info
  [OAPI]
     oapi_output_vid  :  oapi_output_video debug api call
     oapi_output_hdm  :  oapi_output_hdmi debug api call
[homma]
  [mem]
     printmem         :  print memory usage to outputfile
     memconf          :  configuration of printmem
     memcheck         :  check memory corruption - if corrupted, log in corrupted.log
     currsec          :  print the current second from the boot
  [RLIB]
     print_rsc        :  print Resource Info
[prism]
  [mem]
     printmem         :  print memory usage to outputfile
     memconf          :  configuration of printmem
     memcheck         :  check memory corruption - if corrupted, log in corrupted.log
     currsec          :  print the current second from the boot
  [ONDK]
     hello            :  help..
     vkcmd            :  vkcmd [mem/sem/task/timer]
     timerlist        :  timerlist
     gwmlist          :  gwmlist
  [OAPI]
     oapi_output_hdm  :  oapi_output_hdmi debug api call
[obama]
  [mem]
     printmem         :  print memory usage to outputfile
     memconf          :  configuration of printmem
     memcheck         :  check memory corruption - if corrupted, log in corrupted.log
     currsec          :  print the current second from the boot
  [pal_hdmi]
     hdcp_enable      :  HDCP on/off
     status           :  Print HDMI status.
     capa             :  Print HDMI TV capability.
     hdcp_event       :  Send HDCP event to appkit
     output           :  Enable/Disable HDMI output.
     avinfo           :  Set AVI infoframe.
  [OM]
     om_output_hdmi   :  om_output_hdmi debug api call
     om_output_video  :  om_output_video debug api call
  [PAL]
     pal_scl          :  pal scaler debug api call
     pal_scart        :  pal scart debug api call
     pal_out          :  pal out debug api call
     pal_video        :  pal video debug api call
     pal_audio        :  pal audio debug api call
  [RLIB]
     print_rsc        :  print Resource Info
  [SVC]
     svc_av           :  tv service /AV debug api call
     svc_out          :  svc out debug api call
[sama]
  [mem]
     printmem         :  print memory usage to outputfile
     memconf          :  configuration of printmem
     memcheck         :  check memory corruption - if corrupted, log in corrupted.log
     currsec          :  print the current second from the boot
  [RLIB]
     print_rsc        :  print Resource Info
  [SamaCmd]
     print_systime    :  print system time
     print_rpc        :  print SAMA rpc
     print_slot       :  print SAMA slot status
     print_sch        :  print SAMA Schedules
     print_timer      :  print SAMA Timer
     make_sch         :  make SAMA Schedule
     del_sch          :  delete SAMA Schedule
     make_ransch      :  make SAMA Schedule (TBR)
     del_sch_all      :  delete ALL SAMA Schedule
[ipepg]
  [mem]
     printmem         :  print memory usage to outputfile
     memconf          :  configuration of printmem
     memcheck         :  check memory corruption - if corrupted, log in corrupted.log
     currsec          :  print the current second from the boot
  [RLIB]
     print_rsc        :  print Resource Info
root:>

Oh my. More to come, I suspect…

Postgres Tree Shootout part 2: Adjacency List using CTEs

This is the second post in an ongoing series dealing with storing Hierarchical or Tree data structures in Postgres. You should read the Introduction if you haven’t already.

This post contains the queries that illustrate how an Adjacency List model can be used to represent a Hierarchical set of data, including data definitions, and the various operations that have been defined in the aforementioned introduction.

I’ve discussed Adjacency Lists in the past, but I’ll quickly recap why I think they are good.

  • They are conceptually simple to understand
  • They enforce referential integrity
  • They can be modelled with most ORMs without any extra infrastructure
  • Many of the operations are non-complex
  • Recursive queries allow us to perform the complex queries in reasonable time

To help build suspense (but more because I haven’t yet come up with a way to generate a nice reproducible, yet complex tree), this post may discuss the complexity of the queries, but will not contain any measurements.

Initial tree

Before each operation, our data will look like this (where parents point to children):

[Tree diagram: two trees built from nodes 1–16, matching the INSERT statement below; nodes 1 and 10 are the root nodes.]

We will assume reversion back to this structure after each operation: we could do this using a TRUNCATE followed by an INSERT; or we could run each operation in a transaction and roll it back.
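
The transaction-based reversion is just a matter of wrapping whichever operation we are examining, as in this sketch:

BEGIN;
-- ...the operation being examined...
ROLLBACK;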

There may be a post which shows the effects of each of the queries below in a graphical form.

Table structure

Adjacency Lists are dead simple. Each node simply contains a reference to its parent node.

CREATE TABLE nodes (
  node_id SERIAL PRIMARY KEY,
  parent_id INTEGER REFERENCES nodes(node_id)
);

We can insert our data using a single statement:

INSERT INTO nodes VALUES
  (1, NULL),
  (2, 1),
  (3, 1),
  (4, 2),
  (5, 2),
  (6, 3),
  (7, 3),
  (8, 4),
  (9, 8),
  (10, NULL),
  (11, 10),
  (12, 11),
  (13, 11),
  (14, 12),
  (15, 12),
  (16, 12);

Insertions

Inserting a single leaf node is simple. To insert a single node as a child of node 13, for example:

INSERT INTO nodes (parent_id) VALUES (13);

Inserting a new root node is slightly more complicated. In most ORMs, we would probably do it in two queries: one to create the new node, and a second to update the other root nodes to point to this one. We’ll see that below.

We can do slightly better in raw SQL: UPDATE our table with the result of an INSERT ... RETURNING that occurs inside a Common Table Expression (CTE, or WITH query).

WITH parent AS (
  INSERT INTO nodes(parent_id)
  VALUES (NULL)
  RETURNING node_id
)
UPDATE nodes
SET parent_id = parent.node_id
FROM parent
WHERE parent_id IS NULL;

We can use the same pattern to insert an intermediate node in our tree. For instance, inserting a new node between nodes 11 and 12:

WITH created_node AS (
  INSERT INTO nodes(parent_id)
  VALUES (11)
  RETURNING node_id
)
UPDATE nodes
SET parent_id = created_node.node_id
FROM created_node
WHERE nodes.node_id = 12;

And our last insert: adding a child node that gets its siblings as children. For instance, adding a new node under node 12, which gets all of node 12’s children as its children.

WITH created_node AS (
  INSERT INTO nodes(parent_id)
  VALUES (12)
  RETURNING node_id
)
UPDATE nodes
SET parent_id = created_node.node_id
FROM created_node
WHERE nodes.parent_id = 12;

All of these queries should perform relatively well: the CTE is very simple (it is no different to the single leaf insert), and the UPDATE is likewise fairly simple: it works out which existing nodes need to be updated, and then updates those rows with the value pulled from the CTE.

This theoretically is only marginally more complex than just a simple UPDATE foo SET bar=baz WHERE quux IS NULL style query.

If we were using an ORM, we might need to do this in two queries: something like this (in Django):

# Insert new root node: all other root nodes now have this as a parent
new_node = Node.objects.create()
Node.objects.filter(parent=None).exclude(pk=new_node.pk).update(parent=new_node)
# Could possibly do it in one statement, although note that the freshly created
# node matches the filter too, so it would end up as its own parent unless excluded:
Node.objects.filter(parent=None).update(parent=Node.objects.create().pk)

# Insert new node, with a single child as its child (and that child's previous
# parent as its parent)
new_node = Node.objects.create(parent=old_node.parent)
old_node.parent = new_node
old_node.save()

# Insert new node, with children of that node's parent now children of the node.
new_node = parent_node.children.create()
parent_node.children.exclude(pk=new_node.pk).update(parent=new_node)
# Again, may be able to do it in one statement (same caveat: the new node is
# already in parent_node.children by the time the update runs):
parent_node.children.update(parent=parent_node.children.create().pk)

Note the required exclusion of the newly created node: we don’t have to do this in the CTE versions, as that doesn’t “exist” at the time the other part of the query runs.

Removals

Removing a single leaf node is no different than removing a row from a normal table. Removing node 9, for instance:

DELETE FROM nodes WHERE node_id = 9;

Because the information about parent-child relationships is stored in the child, we do not need to do anything else to maintain the tree.

To remove a single root node (in this case, node 1), and promote all children to root nodes themselves, we can do two queries:

UPDATE nodes SET parent_id = NULL WHERE parent_id = 1;
DELETE FROM nodes WHERE node_id = 1;

It may be possible to do this in a single query, similar to the CTE queries above, but I’m not sure of the benefit.

WITH deleted AS (
  DELETE FROM nodes
  WHERE node_id = 1
)
UPDATE nodes SET parent_id = NULL WHERE parent_id = 1;

The same pattern can be used for removing a node, and attaching its children to its parent. Here, we will remove node 2, and attach its children (4 and 5) as children of its parent, node 1:

UPDATE nodes
SET parent_id = (SELECT parent_id FROM nodes WHERE node_id = 2)
WHERE parent_id = 2;

DELETE from nodes WHERE node_id = 2;

This is a place where using a CTE might make things clearer - especially if we have the node-to-be-deleted’s id, but not its parent:

WITH deleted AS (
  DELETE FROM nodes
  WHERE node_id = 2
  RETURNING node_id, parent_id
)
UPDATE nodes
SET parent_id = deleted.parent_id
FROM deleted
WHERE nodes.parent_id = deleted.node_id;

Righto, now we are up to the traditionally “hard” things for an Adjacency List to perform: removing a (sub)tree of arbitrary depth.

We’ll need to create a recursive CTE, and delete according to that. Let’s just select from that initially, so we can see what the CTE data will look like:

WITH RECURSIVE tree AS (
  SELECT node_id, ARRAY[]::integer[] AS ancestors
  FROM nodes WHERE parent_id IS NULL

  UNION ALL

  SELECT nodes.node_id, tree.ancestors || nodes.parent_id
  FROM nodes, tree
  WHERE nodes.parent_id = tree.node_id
)
SELECT * FROM tree;
 node_id | ancestors
---------+------------
       1 | {}
      10 | {}
       2 | {1}
       3 | {1}
      11 | {10}
       4 | {1,2}
       5 | {1,2}
       6 | {1,3}
       7 | {1,3}
      12 | {10,11}
      13 | {10,11}
       8 | {1,2,4}
      14 | {10,11,12}
      15 | {10,11,12}
      16 | {10,11,12}
       9 | {1,2,4,8}
(16 rows)

Coolio. So, on to our operations. Let’s remove the subtree starting with node 2. I’ll hide the CTE, since it will be the same for quite a few of these operations:

WITH RECURSIVE tree AS (...)
DELETE FROM nodes
WHERE node_id IN (
  SELECT node_id FROM tree WHERE 2 = ANY(tree.ancestors)
) OR node_id = 2;

The query is identical for a full tree (node 1 and descendants):

WITH RECURSIVE tree AS (...)
DELETE FROM nodes
WHERE node_id IN (
  SELECT node_id FROM tree WHERE 1 = ANY(tree.ancestors)
) OR node_id = 1;

And it’s nearly identical for just the descendants of a given node. Here, for all of node 2’s descendants, but not that node itself:

WITH RECURSIVE tree AS (...)
DELETE FROM nodes
WHERE node_id IN (
  SELECT node_id FROM tree WHERE 2 = ANY(tree.ancestors)
);

Moves

Because the relationship is stored purely on the child element, moving around trees and subtrees is very easy. We can start with moving the subtree starting at node 3 to underneath node 4:

UPDATE nodes
SET parent_id = 4 WHERE node_id = 3;

Nothing surprising there. Similarly, the query is identical for moving a leaf to a different parent, a root node into a tree, and turning a subtree into a full tree (making that node a root node).

UPDATE nodes SET parent_id = 6 WHERE node_id = 5;
UPDATE nodes SET parent_id = 8 WHERE node_id = 10;
UPDATE nodes SET parent_id = NULL WHERE node_id = 2;

The final move, moving all of a node’s children to a different node, is almost as simple:

UPDATE nodes SET parent_id = 5 WHERE parent_id = 12;

This seems to be a situation where Adjacency Lists are really good. None of these queries are any more complex than the simplest UPDATE you could think of.

Fetches

Using the same CTE, we can perform our fetches. We may need to extend it to deal with depths, but since the ancestors column contains the ancestors starting with the root node, we can use the length of (and positions within) that array to work out depth. Let’s see how it goes.

Descendants

Fetching all descendants of a given node just means we want to see if the node occurs at all in each row’s ancestors. To get all of node 10’s descendants:

WITH RECURSIVE tree AS (
  SELECT node_id, ARRAY[]::integer[] AS ancestors
  FROM nodes WHERE parent_id IS NULL

  UNION ALL

  SELECT nodes.node_id, tree.ancestors || nodes.parent_id
  FROM nodes, tree
  WHERE nodes.parent_id = tree.node_id
)
SELECT node_id FROM tree WHERE 10 = ANY(tree.ancestors);

However, we could improve this by starting with just the node we care about, or more specifically, its children:

WITH RECURSIVE tree AS (
  SELECT node_id, ARRAY[10]::integer[] AS ancestors FROM nodes WHERE parent_id = 10

  UNION ALL

  SELECT nodes.node_id, tree.ancestors || nodes.parent_id
  FROM nodes, tree
  WHERE nodes.parent_id = tree.node_id
)
SELECT node_id FROM tree;

Obviously, this is far less generic, but it is also significantly less complex. For starters, it only builds up the part of the tree we care about, and then just returns the node ids, rather than building up the whole tree, and then discarding the parts that are not required.

The same code can be used for determining the number of descendants, but with a COUNT(node_id) in the final query.
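
For instance, counting node 10’s descendants with the rooted form is just:

WITH RECURSIVE tree AS (
  SELECT node_id, ARRAY[10]::integer[] AS ancestors FROM nodes WHERE parent_id = 10

  UNION ALL

  SELECT nodes.node_id, tree.ancestors || nodes.parent_id
  FROM nodes, tree
  WHERE nodes.parent_id = tree.node_id
)
SELECT COUNT(node_id) FROM tree;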

To get our depth-limited query, we can approach it from two directions. Using the rooted form from above, to get the subtree to depth 2:

WITH RECURSIVE tree AS (
  SELECT node_id, ARRAY[10]::integer[] AS ancestors FROM nodes WHERE parent_id = 10

  UNION ALL

  SELECT nodes.node_id, tree.ancestors || nodes.parent_id
  FROM nodes, tree
  WHERE nodes.parent_id = tree.node_id
  AND cardinality(tree.ancestors) < 2
)
SELECT node_id FROM tree;

To do the same in the more generic form, we have to look at how close the desired node is to the end of the ancestors array:

WITH RECURSIVE tree AS (
  SELECT node_id, ARRAY[]::integer[] AS ancestors
  FROM nodes WHERE parent_id IS NULL

  UNION ALL

  SELECT nodes.node_id, tree.ancestors || nodes.parent_id
  FROM nodes, tree
  WHERE nodes.parent_id = tree.node_id
)
SELECT node_id
FROM tree
WHERE ARRAY_POSITION(ancestors, 10) > ARRAY_LENGTH(ancestors, 1) - 2;

(Note that this is a bug fix on the original version of this post).

Ancestors

Fetching ancestors from our generic CTE is a bit simpler, because that data is already part of the query:

WITH RECURSIVE tree AS (
  SELECT node_id, ARRAY[]::integer[] AS ancestors
  FROM nodes WHERE parent_id IS NULL

  UNION ALL

  SELECT nodes.node_id, tree.ancestors || nodes.parent_id
  FROM nodes, tree
  WHERE nodes.parent_id = tree.node_id
) SELECT unnest(ancestors) FROM tree WHERE node_id = 15;

To do the equivalent of the hand-built CTE, we would need to start with the node, and build back the other way, following parent_id. At first glance that looks like it would get stuck doing infinite recursion, but a root node’s parent_id is NULL, so the recursion has nowhere further to go once we reach the root.
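
A minimal sketch of that, fetching the ancestors of node 15:

WITH RECURSIVE walk_up AS (
  SELECT node_id, parent_id FROM nodes WHERE node_id = 15

  UNION ALL

  SELECT nodes.node_id, nodes.parent_id
  FROM nodes, walk_up
  WHERE nodes.node_id = walk_up.parent_id
)
SELECT node_id FROM walk_up WHERE node_id != 15;

Like the rooted descendants query, this only ever touches the rows along the path we care about, rather than building the whole tree and discarding most of it.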

The count query is an interesting one: we can just remove the need to unnest, and take the cardinality:

WITH RECURSIVE tree AS (
  SELECT node_id, ARRAY[]::integer[] AS ancestors
  FROM nodes WHERE parent_id IS NULL

  UNION ALL

  SELECT nodes.node_id, tree.ancestors || nodes.parent_id
  FROM nodes, tree
  WHERE nodes.parent_id = tree.node_id
) SELECT cardinality(ancestors) FROM tree WHERE node_id = 15;

The depth query is a little trickier. We want to know the ancestors of node 15, up to a depth of 2. If we build the ancestors array in the reverse order (nearest ancestor first), we can just unnest and LIMIT.

WITH RECURSIVE tree AS (
  SELECT node_id, ARRAY[]::integer[] AS ancestors
  FROM nodes WHERE parent_id IS NULL

  UNION ALL

  SELECT nodes.node_id, nodes.parent_id || tree.ancestors
  FROM nodes, tree
  WHERE nodes.parent_id = tree.node_id
) SELECT unnest(ancestors) FROM tree WHERE node_id = 15 LIMIT 2;

We can do this because a node only has one parent: so limiting the number of ancestors (when sorted nearest ancestor first) is the same as limiting the depth.

Special queries

Fetching all leaf nodes is just a matter of excluding those nodes that appear as another node’s parent:

SELECT node_id FROM nodes
WHERE node_id NOT IN (
  SELECT parent_id FROM nodes WHERE parent_id IS NOT NULL
);

Trick for new players: if you leave off the WHERE clause in that sub-query, you won’t get any matches!
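
The reason: without the WHERE clause, the sub-query’s result includes the NULL parent_id values of the root nodes, and NOT IN against a set containing NULL can never be true. A quick illustration:

SELECT 1 WHERE 1 NOT IN (2, NULL);
-- returns no rows: 1 <> NULL is unknown, so the NOT IN is not true.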

Fetching the number of leaf nodes is trivial.
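
It’s just the leaf node query wrapped in a COUNT:

SELECT COUNT(node_id) FROM nodes
WHERE node_id NOT IN (
  SELECT parent_id FROM nodes WHERE parent_id IS NOT NULL
);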

Fetching root nodes (or the number of) is simpler than leaf nodes:

SELECT node_id FROM nodes
WHERE parent_id IS NULL;

Fetching non-leaf nodes and non-root nodes is just a matter of negating the two queries above:

SELECT node_id FROM nodes WHERE node_id IN (
  SELECT parent_id FROM nodes WHERE parent_id IS NOT NULL
);

SELECT node_id FROM nodes WHERE parent_id IS NOT NULL;

And fetching the non-leaf, non-root nodes just combines these two queries:

SELECT node_id FROM nodes WHERE node_id IN (
  SELECT parent_id FROM nodes WHERE parent_id IS NOT NULL
) AND parent_id IS NOT NULL;

As an aside: there is also the inverse of this: nodes which are isolated (root AND leaf nodes):

SELECT node_id FROM nodes
WHERE parent_id IS NULL
AND node_id NOT IN (
  SELECT parent_id FROM nodes WHERE parent_id IS NOT NULL
);

Well, those are our operations. Most of them are really simple. For anything that requires us to fetch or delete a whole subtree, we needed to resort to recursive CTEs, and for some of the other operations, using a CTE makes the query simpler (and easier to understand).

Next, we will look at an alternative to the CTE operations, using a recursive view. From there, we should be able to look at a trigger-based approach that materializes our tree (node, ancestors) data in a table, and keeps it up to date. That’s, as I hinted, getting close to a Materialized Path approach, but keeps the conceptual simplicity of the Adjacency List, and hopefully prevents possible issues relating to referential integrity.

Postgres Tree Shootout part 1: Introduction.

I’ve written before about using Adjacency Lists, and even done some performance tests on querying them. Whilst reading a post today, it occurred to me that it might be worthwhile to do a comparison of the various methods of storing hierarchical data in Postgres, and the costs of the same operations on each of those.

This post is just an introduction, with an outline of what I plan to do to run these tests. Please feel free to suggest things that I have missed, or that might be an oversight at my end.


Tree Models

There are four methods of storing the relationships that might form a tree. This analysis will be limited to actual trees, rather than graph structures (no cycles). I plan to detail the data structures in a series of posts, one per method. In each case, where there are multiple ways to store the data, I will attempt to examine each of these.

Adjacency Lists, being the simplest to understand (and the ones I have spent more time on recently), will be discussed first.

Path Enumerations will be next, with a comparison of storing the data using the ltree extension, and using an ARRAY column.

Following this, I’ll make an attempt at using the Closure Table model: where each ancestor-descendant relationship is stored, rather than just the parent-child relationship.

Finally, I’ll have a crack at the Nested Set model. I’m not solidly behind this model for the types of data I’ve had to deal with, but it is a valid mechanism for storing and retrieving this data. Besides, it will be an interesting exercise to implement.

My plan to handle all of these is that all tree manipulation should be “automatic”, that is, adding a node (or removing one, or whatever) should not require explicit updating of the various metadata. This should all be handled by trigger functions on the tables themselves. Whether this turns out to be reasonable we shall see.
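
As a very rough sketch only (the function name is a placeholder, and the body depends entirely on the model in question), I mean something of this shape:

CREATE FUNCTION maintain_tree() RETURNS TRIGGER AS $$
BEGIN
  -- model-specific bookkeeping goes here: paths, closure rows,
  -- nested-set boundaries, or whatever metadata the model requires.
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER maintain_tree
AFTER INSERT OR UPDATE OR DELETE ON nodes
FOR EACH ROW EXECUTE PROCEDURE maintain_tree();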


Operations

I plan to perform the same set of operations (on the same data, rather than randomly generated data) in all models, and compare the complexity and run-time of the various queries. I’m hoping to cover all of the operations that might be performed on a tree structure, so please add any more to the comments.

The data stored in the table will contain more than one tree: this means we can perform operations which add/remove root nodes/whole trees.

Insertions

  • Insert a single leaf node
  • Insert a single root node (all existing root nodes will then point to this)
  • Insert a single node partway through the tree, with the “replaced” node becoming a child of this (and this one keeps its children)
  • Insert a single node partway through the tree, with this node’s parent’s existing children all becoming children of the new node.

Removals

  • Remove a single leaf node
  • Remove a single root node (all children of this are promoted to root nodes)
  • Remove a single node partway through the tree: all children of this node then have their grand-parent as their parent.
  • Remove a subtree (a single non-root node and its descendants)
  • Remove a whole tree (a single root node and all descendants)
  • Remove all descendants of a specific node (but not the node itself)

Moves

  • Move a subtree from one parent to another
  • Move a single leaf node to a different parent
  • Move a root node into a tree
  • Make a subtree into a tree (turn a node into a root node).
  • Move all children of a node to a different parent

Fetches

  • Fetch all descendants of a given node
  • Fetch the number of descendants of a given node
  • Fetch descendants of a given node to a given depth
  • Fetch the number of descendants of a given node to a given depth
  • Fetch all ancestors of a given node
  • Fetch the number of ancestors of a given node
  • Fetch ancestors of a given node to a given depth
  • Fetch the number of ancestors of a given node to a given depth

    I don’t think this makes any sense.

  • Fetch all leaf nodes
  • Fetch the number of leaf nodes
  • Fetch all root nodes
  • Fetch the number of root nodes
  • Fetch all non-leaf nodes
  • Fetch the number of non-leaf nodes
  • Fetch all non-root nodes
  • Fetch the number of non-root nodes
  • Fetch all non-root, non-leaf nodes
  • Fetch the number of non-root, non-leaf nodes

Aggregating ranges in Python

This is a bit of a follow-up to Aggregating Ranges in Postgres.

Since we don’t have a nice range type in Python, we will just use a tuple that contains a lower and upper bound. We will assume that this is a canonical form: the lower bound is inclusive, the upper bound is non-inclusive. We will also assume (for simplicity) that the first item in the tuple is always less than the second.

Given a list of these range-tuples, we want to do the two things we saw in the previous post.

  • Aggregate them into the simplest possible array of range-tuples, using a union operation.
  • Determine the parts that are missing.

We will use it in a similar way:

>>> data = [(2, 3), (3, 4), (11, 12), (5, 12), (4, 5)]
>>> aggregate_union(data)
(2, 12)
>>> missing_ranges(data)
[(None, 2), (12, None)]

Luckily, None sorts before any integer in Python 2, so we will just be able to use the normal sort.

def aggregate_union(data):
    if not data:
        return None

    sorted_data = sorted(data)
    result = sorted_data[0]

    for lower, upper in sorted_data[1:]:
        # If we ever find result[1] is None, we know it covers any
        # other possible values, so we can stop at that point.
        if result[1] is None:
            break

        if lower > result[1]:
            return None  # Or raise an exception, if you want.

        result = (result[0], max(upper, result[1]))

    return result

The alternative, missing_ranges(data), takes cues from the SQL version too.

def missing_ranges(data):
    if not data:
        return [(None, None)]

    result = []
    # We do a little fancy stuff here: append an extra item that
    # mimics what we were able to use lead for, but in a different
    # way so we can use [i + 1] later.
    sorted_data = sorted(data) + [(None, None)]

    if sorted_data[0][0] is not None:
        # Note: the upper bound here is not quite correct!
        result.append((None, sorted_data[0][1]))

    for i, (lower, upper) in enumerate(sorted_data):
        # Grab the next item in our sorted list. Normally, we would check
        # to see if we are at the end, but we will rely on the break later
        # on: we can always stop processing when the upper bound of our
        # next item is `None`
        next = sorted_data[i + 1]

        if upper < next[0] or (next[0] is None and upper is not None):
            result.append((upper, next[0]))

        # Now, exit before we ever get to a bounds error on getting the next item.
        if next[1] is None:
            break

    return result

However, there is a problem here that is nicely solved by the Range types within Postgres:

>>> missing_ranges(data)
[(None, 3), (12, None)]

We need to subtract from the first object’s upper bound. Which is easy with integer tuple-ranges (basically, any discrete range type), but not so much with continuous ranges.