Long Live Adjacency Lists
-
Comments:
- here.
I recently wrote about the excellent book SQL Antipatterns, and in it briefly discussed the tree structures. I’ve been thinking about trees in Postgres a fair bit lately, and a discussion on #django gave me further incentive to revisit this topic.
The book discusses four methods of storing a tree in a database.
Adjacency Lists, apart from the inability to grab a full or partial tree easily, are the simplest to understand. The child object stores a reference to it’s parent. Because this is a foreign key, then it always maintains referential integrity. Fetching a parent is simple, as is fetching all children, or siblings. It’s only when you need to fetch an arbitrary depth that things become problematic, unless you use a recursive query. More on that later.
Postgres has an extension called ltree, which provides an implementation of a Path Enumeration, but one thing that really bothers me about this type of structure is the lack of referential integrity. In practice, I’m not sure what having this ltree
structure would give you over simply storing the keys in an ARRAY
type. Indeed, if Postgres ever gets Foreign Key constraints for ARRAY elements (which there is a patch floating around for), this becomes even more compelling. It also seems to me that restructuring a tree becomes a bit more challenging in a Path Enumeration than an Adjacency List.
Nested Sets are also interesting, and maintain FK integrity, but require potentially rewriting lots of data when any change is made to the tree. They aren’t that appealing to me: perhaps I fail to see any big advantages of this structure.
Finally, Closure Tables are perhaps the most interesting. This stores all ancestor-descendant relationships, rather than just parent-child, which again requires more work when adding or removing. Again, Referential Integrity is preserved, but it seems like there is lots of work to maintain them.
From all of these, there are some significant advantages, in my mind, to using a simple Adjacency List.
- Adding a new row never requires you to alter any other rows in the database.
- Moving a subtree to a different location only requires a change to one now in the database.
- It’s never possible to end up with Referential Integrity errors: the database will prevent you from deleting a parent row whilst it still has children (or, you may set it to
CASCADE
orSET NULL
the children automatically). - It’s conceptually very simple. Everyone understands the parent-child relationship (and all of the relationships that follow, like grand-parents). It’s a similar mental model to how we think about our own families, except we don’t have exactly one parent.
There is really only two things that are hard to do:
- Given a node, select all descendants of that node.
- Given a node, select all ancestors of that node.
But, as we shall see shortly, it is possible to do these in Postgres using some nice recursive features.
There is another advantage to using an Adjacency List, this time from the perspective of Django. We can do it without needing to install a new package, or subclass or mix-in a new Model:
class Node(models.Model):
node_id = models.AutoField(primary_key=True)
parent = models.ForeignKey('self', null=True, blank=True, related_name='children')
That’s it.
Now, using Postgres, it’s possible to build a recursive VIEW
that contains the whole tree:
CREATE RECURSIVE VIEW tree (node_id, ancestors) AS (
SELECT node_id, '{}'::integer[]
FROM nodes WHERE parent_id IS NULL
UNION ALL
SELECT n.node_id, t.ancestors || n.parent_id
FROM nodes n, tree t
WHERE n.parent_id = t.node_id
);
We can then query this (replacing %s
with the parent node id):
SELECT node_id
FROM nodes INNER JOIN tree USING (node_id)
WHERE %s = ANY(ancestors);
Or, if you want to select for multiple parents:
SELECT node_id
FROM nodes INNER JOIN tree USING (node_id)
WHERE [%s, %s] && ancestors;
This actually performs relatively well, and, if it doesn’t do well enough, we could create a MATERIALIZED VIEW
based on the recursive view, and query that instead (refreshing it whenever we need to, perhaps using a trigger).
CREATE MATERIALIZED VIEW tree_m AS (SELECT * FROM tree);
CREATE FUNCTION refresh_tree_m() RETURNS trigger AS $$
BEGIN
REFRESH MATERIALIZED VIEW tree_m;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER trig_refresh_tree_m AFTER TRUNCATE OR INSERT OR UPDATE OR DELETE
ON nodes FOR EACH STATEMENT
EXECUTE PROCEDURE refresh_tree_m();
This view is still not perfect though. We can improve it to allow us to limit depth of ancestry:
CREATE RECURSIVE VIEW tree (node_id, ancestors, depth) AS (
SELECT node_id, '{}'::integer[], 0
FROM nodes WHERE parent_id IS NULL
UNION ALL
SELECT n.node_id, t.ancestors || n.parent_id, t.depth + 1
FROM nodes n, tree t
WHERE n.parent_id = t.node_id
);
SELECT node_id FROM nodes INNER JOIN tree USING (node_id)
WHERE %s = ANY(ancestors) AND depth < %s;
This is pretty good now, but if we have cycles in our tree (yes, this makes it technically no longer a tree, but a graph, of which a tree is a restricted kind), this query will run forever. There’s a pretty neat trick to prevent cycles:
CREATE RECURSIVE VIEW tree (node_id, ancestors, depth, cycle) AS (
SELECT node_id, '{}'::integer[], 0, FALSE
FROM nodes WHERE parent_id IS NULL
UNION ALL
SELECT
n.node_id, t.ancestors || n.parent_id, t.depth + 1,
n.parent_id = ANY(t.ancestors)
FROM nodes n, tree t
WHERE n.parent_id = t.node_id
AND NOT t.cycle
);
You don’t need to use the cycle
column outside of the view.
The query used for the view can be repurposed into a Common Table Expression, which is basically a way of defining a view that only exists for the query we are executing (but will itself only be executed once, even if it’s referred to lots of times):
WITH RECURSIVE tree (node_id, ancestors, depth, cycle) AS (
SELECT node_id, '{}'::integer[], 0, FALSE
FROM nodes WHERE parent_id IS NULL
UNION ALL
SELECT
n.node_id, t.ancestors || n.parent_id, t.depth + 1,
n.parent_id = ANY(t.ancestors)
FROM nodes n, tree t
WHERE n.parent_id = t.node_id
AND NOT t.cycle
) SELECT n.* FROM nodes n INNER JOIN tree USING (node_id)
WHERE %s = ANY(ancestors);
You can see that this syntax basically defines the view before running the real query.
Looking at it from the perspective of Django, we would like to be able to spell a query something like:
Node.objects.filter(parent__recursive=node)
Node.objects.filter(parent__recursive__in=nodes)
Node.objects.filter(children__recursive__contains=node)
The problem we have with using the CTE immediately above is that we don’t have access to the full query at the time we are dealing with the filter. We could define the view prior to running the query (perhaps in a migration), but this means it’s more than just a simple field: although with the new migrations framework, we could make it so that makemigrations
automatically adds a migration operation to create the recursive view.
The other solution is to still use a recursive CTE, but use it as a subquery. I’m still investigating if this will have poor performance characteristics.
Here is an implementation of doing just that:
from django.db import models
SQL = """
WITH RECURSIVE "tree" ("{pk}", "related", "cycle") AS (
SELECT "{pk}", ARRAY[]::integer[], FALSE
FROM "{table}" WHERE "{fk}" IS NULL
UNION ALL
SELECT a."{pk}", b."related" || a."{fk}", a."{fk}" = ANY(b."related")
FROM "tree" b, "{table}" a
WHERE a."{fk}" = b."{pk}" AND NOT b."cycle"
) {query}
"""
class RecursiveRelation(models.ForeignKey):
def __init__(self, *args, **kwargs):
super(RecursiveRelation, self).__init__('self', *args, **kwargs)
def get_lookup_constraint(self, constraint_class, alias, targets, sources, lookups,
raw_value):
if lookups[0] == 'recursive':
# With a recursive query, we want to build up a subquery that creates
# the simplest possible tree we can deal with.
data = {
'fk': self.get_attname(),
'pk': self.related_fields[0][1].get_attname(),
'table': self.model._meta.db_table
}
if lookups[-1] == 'in':
if targets[0] == self:
raw_value = ForeignKeyRecursiveInLookup(raw_value, **data)
else:
raw_value = ForeignKeyRecursiveReverseInLookup(raw_value, **data)
else:
if targets[0] == self:
raw_value = ForeignKeyRecursiveLookup(raw_value, **data)
else:
raw_value = ForeignKeyRecursiveReverseLookup(raw_value, **data)
# Rewrite some variables so we get correct behaviour.
# This makes the query based on the original table, not the joined version,
# which was skipping a level of relation. It still joins the table, however,
# which can't be great for performance
alias = self.model._meta.db_table
# This sets the correct lookup type, removing the recursive bit.
lookups = lookups[1:] or ['exact']
return super(RecursiveRelation, self).get_lookup_constraint(
constraint_class, alias, targets, sources, lookups, raw_value
)
class ForeignKeyRecursiveLookup(object):
query = 'SELECT "{pk}" FROM "tree" WHERE %s = ANY("related")'
def __init__(self, value, **kwargs):
self.value = value
self.data = kwargs
def get_compiler(self, *args, **kwargs):
return self
def as_subquery_condition(self, alias, columns, qn):
sql = SQL.format(
query=self.query.format(**self.data),
**self.data
)
return '%s.%s IN (%s)' % (qn(alias), qn(self.data['pk']), sql), [self.value]
class ForeignKeyRecursiveInLookup(ForeignKeyRecursiveLookup):
query = 'SELECT "{pk}" FROM "tree" WHERE %s && "related"'
class ForeignKeyRecursiveReverseLookup(ForeignKeyRecursiveLookup):
query = 'SELECT unnest("related") FROM "tree" WHERE "{pk}" = %s'
class ForeignKeyRecursiveReverseInLookup(ForeignKeyRecursiveLookup):
query = 'SELECT unnest("related") FROM "tree" WHERE "{pk}" IN %s'
If we were to use an existing view (created using a migration), then the structure would be largely the same: simply the SQL
constant would be simpler:
SQL = 'SELECT {pk} FROM "{table}_{fk}_tree" WHERE {where}'
But then we would need some sort of name mangling for the view: I’ve suggested <tablename>_<fk-name>-tree
.
I went into this exercise thinking it would be simple: just write a Lookup
(or Transform
), but it seems that Foreign Keys in django have a fair bit of special casing. There’s also a bit of lax code around the names of lookups: I may polish it up at some stage.
For now, though, you use it as:
class Node(models.Model):
node_id = models.AutoField(primary_key=True)
parent = RecursiveRelation(null=True, blank=True, related_name='children')