On Fences and Functions

I grew up on a farm.

We had fences on the farm.

Whilst the jobs associated with fences and fencing are less than fun, the fences themselves are extremely important. They keep the livestock in the correct location. When you have a damaged or incomplete fence, even if it is only damaged in a small way, it can cost significant amounts of money, even human lives. This can vary between keeping Rams from a flock of Ewes that you don’t want them to mate with (because you need to know which Ram mated with which Ewes in order to track progeny), to livestock escaping onto a public road and causing accidents.

Fences are a good thing.


My first career was as a Design and Technology Teacher.

We use fences in woodwork. They are attachments to fixed power tools, such as drill presses and circular saws. They allow us to work safely and to get accurate, easily repeatable results. For instance. we can use a fence to cut sheets of MDF to exactly the same width, ensuring the bookcase we are making is square. Without a fence, it can still be done, but it will certainly be much harder.

Fences are a good thing.


I’d heard people describe Postgres’s CTEs (Common Table Expressions) as an “optimisation fence”. Given my previous uses of the word “fence”, I assumed that this was widely regarded as a good thing.

However, after spending some time writing really complex queries (that are most easily described using a CTE), I happened to read PostgreSQL’s CTEs are optimisation fences. It had (throughout my work within Postgres) become plain to me that each term in a CTE is materialised (if it is referenced at all), before any filtering that might occur later would allow it to be filtered earlier. Postgres is pretty good about pushing these changes down into a sub-query, but it can mean that a CTE performs worse, as it might have to do more work. However, this article points this out in some detail, and it occurred to me that perhaps some people see fences (in general) as an obstacle. Perhaps fencing something in has negative connotations?

I’m not sure that that’s exactly what the author meant (I wonder if it was sarcasm, perhaps), but it did get me thinking about how different backgrounds could result in opposite interpretations of the same terms.


I do want to veer back a bit into a technical manner, and discuss how I have been overcoming the fact that it’s not possible to push the filtering back up the stack in a CTE.

Largely, the issue exists in my code because I have lots of complex queries (as I just mentioned) that are far easier to write, reason about and debug when written using a CTE. I would like to write them as a VIEW, and then stick Django models in front of them, and I’d be able to query them using the ORM, and just have the view as the db_table of the model. It would be really nice.

But this doesn’t work, because some of the queries require aggregating data across models of which there are millions of rows, and some of the database tables are less than optimal. For instance, I have several tables that store an effective_from field, and in the case of superseding, the same set of other fields (person, for instance) means we can know which one applies on a given date. However, to query this, we end up writing a more complex query (instead of being able to do a date <@ daterange query, if the valid period was stored in the table). I’ve learned from this in newer models, but some stuff is too deeply ingrained to be able to be changed just yet.

So, I have a VIEW that turns this into a data that actually contains dateranges, and I can query against that. But, if I use this in a CTE, then it can materialise the whole lot, which can be slow. So, I needed to come up with a way to filter the data earlier.

Functions.

I’ve been writing SQL functions that take parameters, and then filter as early as possible. This then means that it’s a real possibility that we can get <100ms queries for stuff that is really, really complicated (and joins a couple of dozen or more tables in really funky ways). It does mean I can’t query using the Django ORM, but that’s okay: the data I’m getting back doesn’t necessarily map onto a model anyway, and we need to use it as a dict.

More recently, I’ve extended this so that the function (with the relevant parameters, extracted out of the queryset WHERE clauses) can be used as the db_table for a Model. It’s still somewhat hacky, but is very interesting, nonetheless.

Long Live Adjacency Lists

I recently wrote about the excellent book SQL Antipatterns, and in it briefly discussed the tree structures. I’ve been thinking about trees in Postgres a fair bit lately, and a discussion on #django gave me further incentive to revisit this topic.

The book discusses four methods of storing a tree in a database.

Adjacency Lists, apart from the inability to grab a full or partial tree easily, are the simplest to understand. The child object stores a reference to it’s parent. Because this is a foreign key, then it always maintains referential integrity. Fetching a parent is simple, as is fetching all children, or siblings. It’s only when you need to fetch an arbitrary depth that things become problematic, unless you use a recursive query. More on that later.

Postgres has an extension called ltree, which provides an implementation of a Path Enumeration, but one thing that really bothers me about this type of structure is the lack of referential integrity. In practice, I’m not sure what having this ltree structure would give you over simply storing the keys in an ARRAY type. Indeed, if Postgres ever gets Foreign Key constraints for ARRAY elements (which there is a patch floating around for), this becomes even more compelling. It also seems to me that restructuring a tree becomes a bit more challenging in a Path Enumeration than an Adjacency List.

Nested Sets are also interesting, and maintain FK integrity, but require potentially rewriting lots of data when any change is made to the tree. They aren’t that appealing to me: perhaps I fail to see any big advantages of this structure.

Finally, Closure Tables are perhaps the most interesting. This stores all ancestor-descendant relationships, rather than just parent-child, which again requires more work when adding or removing. Again, Referential Integrity is preserved, but it seems like there is lots of work to maintain them.

From all of these, there are some significant advantages, in my mind, to using a simple Adjacency List.

  1. Adding a new row never requires you to alter any other rows in the database.
  2. Moving a subtree to a different location only requires a change to one now in the database.
  3. It’s never possible to end up with Referential Integrity errors: the database will prevent you from deleting a parent row whilst it still has children (or, you may set it to CASCADE or SET NULL the children automatically).
  4. It’s conceptually very simple. Everyone understands the parent-child relationship (and all of the relationships that follow, like grand-parents). It’s a similar mental model to how we think about our own families, except we don’t have exactly one parent.

There is really only two things that are hard to do:

  1. Given a node, select all descendants of that node.
  2. Given a node, select all ancestors of that node.

But, as we shall see shortly, it is possible to do these in Postgres using some nice recursive features.

There is another advantage to using an Adjacency List, this time from the perspective of Django. We can do it without needing to install a new package, or subclass or mix-in a new Model:

class Node(models.Model):
    node_id = models.AutoField(primary_key=True)
    parent = models.ForeignKey('self', null=True, blank=True, related_name='children')

That’s it.

Now, using Postgres, it’s possible to build a recursive VIEW that contains the whole tree:

CREATE RECURSIVE VIEW tree (node_id, ancestors) AS (
    SELECT node_id, '{}'::integer[]
    FROM nodes WHERE parent_id IS NULL
  UNION ALL
    SELECT n.node_id, t.ancestors || n.parent_id
    FROM nodes n, tree t
    WHERE n.parent_id = t.node_id
);

We can then query this (replacing %s with the parent node id):

SELECT node_id
FROM nodes INNER JOIN tree USING (node_id)
WHERE %s = ANY(ancestors);

Or, if you want to select for multiple parents:

SELECT node_id
FROM nodes INNER JOIN tree USING (node_id)
WHERE [%s, %s] && ancestors;

This actually performs relatively well, and, if it doesn’t do well enough, we could create a MATERIALIZED VIEW based on the recursive view, and query that instead (refreshing it whenever we need to, perhaps using a trigger).

CREATE MATERIALIZED VIEW tree_m AS (SELECT * FROM tree);

CREATE FUNCTION refresh_tree_m() RETURNS trigger AS $$
  BEGIN
  REFRESH MATERIALIZED VIEW tree_m;
  END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trig_refresh_tree_m AFTER TRUNCATE OR INSERT OR UPDATE OR DELETE
ON nodes FOR EACH STATEMENT
EXECUTE PROCEDURE refresh_tree_m();

This view is still not perfect though. We can improve it to allow us to limit depth of ancestry:

CREATE RECURSIVE VIEW tree (node_id, ancestors, depth) AS (
    SELECT node_id, '{}'::integer[], 0
    FROM nodes WHERE parent_id IS NULL
  UNION ALL
    SELECT n.node_id, t.ancestors || n.parent_id, t.depth + 1
    FROM nodes n, tree t
    WHERE n.parent_id = t.node_id
);

SELECT node_id FROM nodes INNER JOIN tree USING (node_id)
WHERE %s = ANY(ancestors) AND depth < %s;

This is pretty good now, but if we have cycles in our tree (yes, this makes it technically no longer a tree, but a graph, of which a tree is a restricted kind), this query will run forever. There’s a pretty neat trick to prevent cycles:

CREATE RECURSIVE VIEW tree (node_id, ancestors, depth, cycle) AS (
    SELECT node_id, '{}'::integer[], 0, FALSE
    FROM nodes WHERE parent_id IS NULL
  UNION ALL
    SELECT
      n.node_id, t.ancestors || n.parent_id, t.depth + 1,
      n.parent_id = ANY(t.ancestors)
    FROM nodes n, tree t
    WHERE n.parent_id = t.node_id
    AND NOT t.cycle
);

You don’t need to use the cycle column outside of the view.

The query used for the view can be repurposed into a Common Table Expression, which is basically a way of defining a view that only exists for the query we are executing (but will itself only be executed once, even if it’s referred to lots of times):

WITH RECURSIVE tree (node_id, ancestors, depth, cycle) AS (
    SELECT node_id, '{}'::integer[], 0, FALSE
    FROM nodes WHERE parent_id IS NULL
  UNION ALL
    SELECT
      n.node_id, t.ancestors || n.parent_id, t.depth + 1,
      n.parent_id = ANY(t.ancestors)
    FROM nodes n, tree t
    WHERE n.parent_id = t.node_id
    AND NOT t.cycle
) SELECT n.* FROM nodes n INNER JOIN tree USING (node_id)
WHERE %s = ANY(ancestors);

You can see that this syntax basically defines the view before running the real query.


Looking at it from the perspective of Django, we would like to be able to spell a query something like:

Node.objects.filter(parent__recursive=node)
Node.objects.filter(parent__recursive__in=nodes)
Node.objects.filter(children__recursive__contains=node)

The problem we have with using the CTE immediately above is that we don’t have access to the full query at the time we are dealing with the filter. We could define the view prior to running the query (perhaps in a migration), but this means it’s more than just a simple field: although with the new migrations framework, we could make it so that makemigrations automatically adds a migration operation to create the recursive view.

The other solution is to still use a recursive CTE, but use it as a subquery. I’m still investigating if this will have poor performance characteristics.

Here is an implementation of doing just that:

from django.db import models

SQL = """
WITH RECURSIVE "tree" ("{pk}", "related", "cycle") AS (
    SELECT "{pk}", ARRAY[]::integer[], FALSE
    FROM "{table}" WHERE "{fk}" IS NULL
  UNION ALL
    SELECT a."{pk}", b."related" || a."{fk}", a."{fk}" = ANY(b."related")
    FROM "tree" b, "{table}" a
    WHERE a."{fk}" = b."{pk}" AND NOT b."cycle"
) {query}
"""


class RecursiveRelation(models.ForeignKey):
    def __init__(self, *args, **kwargs):
        super(RecursiveRelation, self).__init__('self', *args, **kwargs)

    def get_lookup_constraint(self, constraint_class, alias, targets, sources, lookups,
                              raw_value):
        if lookups[0] == 'recursive':
            # With a recursive query, we want to build up a subquery that creates
            # the simplest possible tree we can deal with.
            data = {
                'fk': self.get_attname(),
                'pk': self.related_fields[0][1].get_attname(),
                'table': self.model._meta.db_table
            }
            if lookups[-1] == 'in':
                if targets[0] == self:
                    raw_value = ForeignKeyRecursiveInLookup(raw_value, **data)
                else:
                    raw_value = ForeignKeyRecursiveReverseInLookup(raw_value, **data)
            else:
                if targets[0] == self:
                    raw_value = ForeignKeyRecursiveLookup(raw_value, **data)
                else:
                    raw_value = ForeignKeyRecursiveReverseLookup(raw_value, **data)

            # Rewrite some variables so we get correct behaviour.

            # This makes the query based on the original table, not the joined version,
            # which was skipping a level of relation. It still joins the table, however,
            # which can't be great for performance
            alias = self.model._meta.db_table
            # This sets the correct lookup type, removing the recursive bit.
            lookups = lookups[1:] or ['exact']

        return super(RecursiveRelation, self).get_lookup_constraint(
            constraint_class, alias, targets, sources, lookups, raw_value
        )


class ForeignKeyRecursiveLookup(object):
    query = 'SELECT "{pk}" FROM "tree" WHERE %s = ANY("related")'

    def __init__(self, value, **kwargs):
        self.value = value
        self.data = kwargs

    def get_compiler(self, *args, **kwargs):
        return self

    def as_subquery_condition(self, alias, columns, qn):
        sql = SQL.format(
            query=self.query.format(**self.data),
            **self.data
        )
        return '%s.%s IN (%s)' % (qn(alias), qn(self.data['pk']), sql), [self.value]


class ForeignKeyRecursiveInLookup(ForeignKeyRecursiveLookup):
    query = 'SELECT "{pk}" FROM "tree" WHERE %s && "related"'


class ForeignKeyRecursiveReverseLookup(ForeignKeyRecursiveLookup):
    query = 'SELECT unnest("related") FROM "tree" WHERE "{pk}" = %s'


class ForeignKeyRecursiveReverseInLookup(ForeignKeyRecursiveLookup):
    query = 'SELECT unnest("related") FROM "tree" WHERE "{pk}" IN %s'

If we were to use an existing view (created using a migration), then the structure would be largely the same: simply the SQL constant would be simpler:

SQL = 'SELECT {pk} FROM "{table}_{fk}_tree" WHERE {where}'

But then we would need some sort of name mangling for the view: I’ve suggested <tablename>_<fk-name>-tree.

I went into this exercise thinking it would be simple: just write a Lookup (or Transform), but it seems that Foreign Keys in django have a fair bit of special casing. There’s also a bit of lax code around the names of lookups: I may polish it up at some stage.

For now, though, you use it as:

class Node(models.Model):
    node_id = models.AutoField(primary_key=True)
    parent = RecursiveRelation(null=True, blank=True, related_name='children')