Adjacency Lists in Django with Postgres

Today, I’m going to walk through modelling a tree in Django, using an Adjacency List, and a Postgres View that dynamically creates the materialised path of ancestors for each node.

With this, we will be able to query the tree for a range of operations using the Django ORM.

We will start with our model:

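from django.db import models

class Node(models.Model):
    node_id = models.AutoField(primary_key=True)
    # on_delete is mandatory from Django 2.0 onwards; CASCADE was the earlier default.
    parent = models.ForeignKey('tree.node', related_name='children', null=True, blank=True, on_delete=models.CASCADE)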

    class Meta:
        app_label = 'tree'

We will also build an unmanaged model that will be backed by our view.

from django.contrib.postgres.fields import ArrayField

class Tree(models.Model):
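    # The rows come from a view, so Django should never cascade deletes into it.
    root = models.ForeignKey(Node, related_name='+', on_delete=models.DO_NOTHING)
    node = models.OneToOneField(Node, related_name='tree_node', primary_key=True, on_delete=models.DO_NOTHING)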
    ancestors = ArrayField(base_field=models.IntegerField())

    class Meta:
        app_label = 'tree'
        managed = False

You’ll notice I’ve included a root relation. This could be obtained by using ancestors[0] if ancestors else node_id, but that’s a bit cumbersome.

So, on to the View:

CREATE RECURSIVE VIEW tree_tree(root_id, node_id, ancestors) AS

SELECT node_id, node_id, ARRAY[]::INTEGER[]
FROM tree_node WHERE parent_id IS NULL

UNION ALL

SELECT tree.root_id, node.node_id, tree.ancestors || node.parent_id
FROM tree_node node INNER JOIN tree_tree tree ON (node.parent_id = tree.node_id)

I’ve written this view before, so I won’t go into any detail.

We can create a tree. Normally I wouldn’t specify the primary key, but since we want to talk about those values shortly, I will. It also means you can delete them, and recreate with this code, and not worry about the sequence values.

from tree.models import Node

Node.objects.bulk_create([
  Node(pk=1),
  Node(pk=2, parent_id=1),
  Node(pk=3, parent_id=1),
  Node(pk=4, parent_id=2),
  Node(pk=5, parent_id=2),
  Node(pk=6, parent_id=3),
  Node(pk=7, parent_id=3),
  Node(pk=8, parent_id=4),
  Node(pk=9, parent_id=8),
  Node(pk=10),
  Node(pk=11, parent_id=10),
  Node(pk=12, parent_id=11),
  Node(pk=13, parent_id=11),
  Node(pk=14, parent_id=12),
  Node(pk=15, parent_id=12),
  Node(pk=16, parent_id=12),
])
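
To make the view's output concrete, here is a rough plain-Python sketch of the rows tree_tree should return for the nodes just created. The PARENTS dict and the tree_rows function are names made up for this illustration. It walks up the parent links instead of recursing down from the roots as the view does, but the resulting rows should be the same.

```python
# Hypothetical stand-in for the tree_node table: node_id -> parent_id.
PARENTS = {
    1: None, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3, 8: 4, 9: 8,
    10: None, 11: 10, 12: 11, 13: 11, 14: 12, 15: 12, 16: 12,
}


def tree_rows(parents):
    """Return {node_id: (root_id, ancestors)}, ancestors ordered root-first."""
    rows = {}
    for node_id in parents:
        ancestors = []
        parent_id = parents[node_id]
        # Walk up to the root, collecting ids nearest-first.
        while parent_id is not None:
            ancestors.append(parent_id)
            parent_id = parents[parent_id]
        ancestors.reverse()
        # Same rule as the root column: the first ancestor, or the node itself.
        root_id = ancestors[0] if ancestors else node_id
        rows[node_id] = (root_id, ancestors)
    return rows


rows = tree_rows(PARENTS)
print(rows[9])   # (1, [1, 2, 4, 8])
print(rows[15])  # (10, [10, 11, 12])
```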

Okay, let’s start looking at how we might perform some operations on it.

We’ve already seen how to create a node, either root or leaf nodes. No worries there.

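What about inserting an intermediate node, say between 11 and 12?

# The sequence was not advanced by the explicit primary keys above, so give
# the new node an explicit pk too, then move only node 12 beneath it.
node = Node.objects.create(pk=17, parent_id=11)
Node.objects.filter(pk=12, parent_id=11).update(parent=node)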
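
The effect on the parent links of re-parenting a single child can be sketched in plain Python, with a dict standing in for the table. insert_between is a hypothetical helper and 17 is just an unused id picked for the example; none of this is part of the models above.

```python
def insert_between(parents, parent_id, child_id, new_id):
    """Insert new_id between parent_id and child_id in a node -> parent map."""
    if parents.get(child_id) != parent_id:
        raise ValueError(f"{child_id} is not a child of {parent_id}")
    if new_id in parents:
        raise ValueError(f"{new_id} already exists")
    updated = dict(parents)      # leave the caller's mapping untouched
    updated[new_id] = parent_id  # step 1: create the new node under the parent
    updated[child_id] = new_id   # step 2: move the child under the new node
    return updated


parents = {10: None, 11: 10, 12: 11, 13: 11, 14: 12}
updated = insert_between(parents, parent_id=11, child_id=12, new_id=17)
print(updated[17], updated[12], updated[13])  # 11 17 11
```

The two steps mirror the two statements: one insert, one update.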

I’m not sure if it is possible to do it in a single statement.

Okay, let’s jump to some tree-based statements. We’ll start by finding a sub-tree.

Node.objects.filter(tree_node__ancestors__contains=[2])

Oh, that’s pretty nice. It’s not necessarily sorted, but it will do for now.
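
The contains lookup on an ArrayField maps to the Postgres @> operator, so the filter keeps every node whose ancestors array includes 2. Here is a small sketch of that test against hand-written rows. The ANCESTORS dict is illustrative data copied from the first tree, and subtree is a made-up helper.

```python
# Ancestors for the first tree, as the view would report them.
ANCESTORS = {
    1: [], 2: [1], 3: [1], 4: [1, 2], 5: [1, 2],
    6: [1, 3], 7: [1, 3], 8: [1, 2, 4], 9: [1, 2, 4, 8],
}


def subtree(ancestors_by_node, node_id):
    """Descendants of node_id: every node listing it among its ancestors."""
    return sorted(n for n, anc in ancestors_by_node.items() if node_id in anc)


print(subtree(ANCESTORS, 2))  # [4, 5, 8, 9]
```

Node 2 itself is not in the result. If the sub-tree should include its own root, something like Q(pk=2) | Q(tree_node__ancestors__contains=[2]) ought to cover it.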

We can also query directly for a root:

Node.objects.filter(tree_node__root=10)

We could spell that one as tree_node__ancestors__0=10, but I think this is more explicit. Also, that one will not include the root node itself.
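
The difference between the two spellings shows up when they are applied to the rows for the second tree. This is a sketch with hand-written data; ROWS is an assumed copy of what the view would return, not something fetched from the database.

```python
# (root_id, ancestors) rows for the second tree, as the view would report them.
ROWS = {
    10: (10, []), 11: (10, [10]), 12: (10, [10, 11]), 13: (10, [10, 11]),
    14: (10, [10, 11, 12]), 15: (10, [10, 11, 12]), 16: (10, [10, 11, 12]),
}

# Filtering on the root column keeps the root node itself.
by_root = sorted(n for n, (root, _) in ROWS.items() if root == 10)
# Filtering on the first ancestor drops it, because a root has no ancestors.
by_first_ancestor = sorted(n for n, (_, anc) in ROWS.items() if anc[:1] == [10])

print(by_root)            # [10, 11, 12, 13, 14, 15, 16]
print(by_first_ancestor)  # [11, 12, 13, 14, 15, 16]
```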

Deletions are also simple: if we can build a queryset, we can delete it. Deleting a full tree, for example, is just a matter of appending .delete() to the root query above.

Fetching a node’s ancestors is a little trickier: because we only have an array of node ids, it takes two queries.

Node.objects.filter(pk__in=Node.objects.get(pk=15).tree_node.ancestors)

The count of ancestors doesn’t require the second query:

len(Node.objects.get(pk=15).tree_node.ancestors)

Getting ancestors to a given depth is also simple, although it still requires two queries:

Node.objects.filter(pk__in=Node.objects.get(pk=15).tree_node.ancestors[-2:])
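
One thing to watch: a pk__in filter makes no promise about the order of the rows it returns, while the ancestors array is ordered from the root down. A minimal sketch of restoring that order in Python follows. FakeNode is a stand-in for Node instances and order_by_ancestry is a hypothetical helper.

```python
from collections import namedtuple

# Stand-in for Node instances; only the pk matters here.
FakeNode = namedtuple("FakeNode", "pk")


def order_by_ancestry(nodes, ancestors):
    """Sort fetched nodes root-first, following the ancestors array."""
    position = {pk: index for index, pk in enumerate(ancestors)}
    return sorted(nodes, key=lambda node: position[node.pk])


ancestors = [10, 11, 12]                # tree_node.ancestors for node 15
nearest_two = ancestors[-2:]            # the slice used in the query
fetched = [FakeNode(12), FakeNode(11)]  # rows may come back in any order
print(nearest_two)                                            # [11, 12]
print([n.pk for n in order_by_ancestry(fetched, ancestors)])  # [11, 12]
```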

This is a fairly simple way to enable relatively performance-aware queries of tree data. There are still places where it’s not perfect, and in reality, you’d probably look at building up queryset or model methods for wrapping common operations.

Performance testing Adjacency List recursive queries

Yesterday, I wrote up some ideas about doing recursive queries on Adjacency Lists using Postgres. Today, I wrote up some code that allows me to run some tests on larger data sets. It’s worth noting that this is still somewhat “toy” data, but I did see comparable results with a real query.

Firstly, our data structure:

CREATE TABLE node (
  node_id SERIAL PRIMARY KEY,
  parent_id INTEGER,
  FOREIGN KEY (parent_id) REFERENCES node(node_id)
);

Now, we want to be able to populate it with test data. This function will populate any number of records, with a 10% chance that any given record will be a root (have no parent). Otherwise, its parent is selected at random from all existing rows. This means that earlier rows have a much higher chance of being a parent, and the first row is overwhelmingly likely to have the most descendants (there is a 90% chance that row 2 has it as a parent, and any descendants of row 2 are then also descendants of row 1…)

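CREATE OR REPLACE FUNCTION populate_nodes(count integer) RETURNS void AS $$
BEGIN
  -- Loop from 1 so that exactly "count" rows are inserted.
  FOR i IN 1..count LOOP
    IF ((SELECT count(*) FROM node) = 0) OR (random() < 0.1) THEN
      -- Root node: no parent.
      INSERT INTO node (parent_id) VALUES (NULL);
    ELSE
      -- Pick a random existing row as the parent. floor() keeps the offset
      -- strictly below the row count, so a row is always selected.
      INSERT INTO node (parent_id)
        SELECT node_id FROM node
        OFFSET floor(random() * (SELECT count(*) FROM node))::integer LIMIT 1;
    END IF;
  END LOOP;
END;
$$ LANGUAGE plpgsql;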

-- Let's stick 10k records in there
SELECT populate_nodes(10000);
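
To get a feel for the shape of the data this produces, the same process can be simulated in plain Python. This is only a sketch: populate_parents and descendant_counts are made-up names, the seed is arbitrary, and Python's random module stands in for the Postgres random() function, so the exact numbers will not match the database.

```python
import random


def populate_parents(count, root_chance=0.1, seed=0):
    """Return a list of parent ids for nodes 1..count (None marks a root)."""
    rng = random.Random(seed)
    parents = []
    for node_id in range(1, count + 1):
        if not parents or rng.random() < root_chance:
            parents.append(None)
        else:
            # Pick uniformly among the nodes created so far (ids 1..node_id-1).
            parents.append(rng.randrange(1, node_id))
    return parents


def descendant_counts(parents):
    """Count descendants per node id; parents always precede their children."""
    counts = [0] * (len(parents) + 1)  # index 0 unused
    # Walk from the newest node back, pushing each subtree size up to its parent.
    for node_id in range(len(parents), 0, -1):
        parent_id = parents[node_id - 1]
        if parent_id is not None:
            counts[parent_id] += counts[node_id] + 1
    return counts


parents = populate_parents(10000)
counts = descendant_counts(parents)
print(sum(p is None for p in parents), "roots")
print("node 1 has", counts[1], "descendants")
```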

For now, we want to find all descendants of node 1.

I can think of eleven ways we could write this query:

  1. INNER JOIN with a RECURSIVE VIEW
  2. Implicit CROSS JOIN with RECURSIVE VIEW, filtered using a WHERE clause
  3. Sub-query with a RECURSIVE VIEW
  4. INNER JOIN with a MATERIALIZED VIEW based on the RECURSIVE VIEW
  5. Implicit CROSS JOIN (filtered) with MATERIALIZED VIEW based on RECURSIVE VIEW
  6. Sub-query with MATERIALIZED VIEW based on RECURSIVE VIEW
  7. RECURSIVE CTE, using an INNER JOIN
  8. RECURSIVE CTE, using an implicit CROSS JOIN (filtered)
  9. INNER JOIN with RECURSIVE CTE
  10. Implicit CROSS JOIN (filtered) with RECURSIVE CTE
  11. Subquery that is a RECURSIVE CTE

(Whilst some of these seem similar, we’ll see below how they differ).

In all cases, the actual query used for the tree will be:

SELECT node_id, ARRAY[]::integer[], FALSE FROM node WHERE parent_id IS NULL
UNION ALL
SELECT n.node_id, t.ancestors || n.parent_id, n.parent_id = ANY(t.ancestors)
FROM node n, node_tree t WHERE n.parent_id = t.node_id AND NOT cycle
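
For anyone unfamiliar with how a recursive query is evaluated, here is a rough Python rendering of the query above. It starts from the roots and repeatedly joins the node rows against the rows produced in the previous pass, until a pass produces nothing new. The function name and the dict standing in for the node table are assumptions made for this sketch, and the nested loops are there for clarity, not speed.

```python
def node_tree(parents):
    """Evaluate the recursive query over a {node_id: parent_id} mapping.

    Returns (node_id, ancestors, cycle) tuples, like the query's columns.
    """
    # Non-recursive term: every root, with no ancestors.
    working = [(n, [], False) for n, p in parents.items() if p is None]
    result = list(working)
    # Recursive term: join node rows against the previous iteration only.
    while working:
        next_rows = []
        for node_id, ancestors, cycle in working:
            if cycle:
                continue  # "AND NOT cycle": do not expand past a detected loop
            for child_id, parent_id in parents.items():
                if parent_id == node_id:
                    next_rows.append(
                        (child_id, ancestors + [parent_id], parent_id in ancestors)
                    )
        result.extend(next_rows)
        working = next_rows
    return result


rows = node_tree({1: None, 2: 1, 3: 1, 4: 2})
print(sorted(rows))
# [(1, [], False), (2, [1], False), (3, [1], False), (4, [1, 2], False)]
```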

The MATERIALIZED VIEW versions will almost certainly be fastest. It is worth noting, though, that in a very write-heavy environment (where you still need 100% up-to-date data), there would be an extra cost from running REFRESH MATERIALIZED VIEW.

As for which other ones will be fast (or fast enough), I would expect the CTE and VIEW versions to be roughly equivalent, as they appear to do the same amount of work. I’m not sure if the last three will perform as well as the others, as it seems that a “root” CTE would perform better than one later down the track.

So, let’s get underway. I wasn’t able to easily use the benchmark function I wanted to use, so I repeated each query five times and took the average.
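
The repeat-and-average step could be scripted along these lines. This is a sketch under a couple of assumptions: it measures wall-clock time from the client side, whereas the timings in this post come from EXPLAIN ANALYZE, and it uses the sample standard deviation, which may not be the variant behind the figures quoted here. time_runs and summarise are hypothetical helpers.

```python
import statistics
import time


def time_runs(func, repeats=5):
    """Call func `repeats` times and return each wall-clock duration in ms."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append((time.perf_counter() - start) * 1000.0)
    return durations


def summarise(durations):
    """Return (mean, sample standard deviation) of the durations."""
    return statistics.mean(durations), statistics.stdev(durations)


mean, stddev = summarise([1.0, 2.0, 3.0, 4.0, 5.0])
print(f"Average Time: {mean:.1f}ms (stddev {stddev:.2f})")
# Average Time: 3.0ms (stddev 1.58)
```

In practice func would be a small callable that executes one of the queries through a database cursor.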

We need our views:

CREATE RECURSIVE VIEW node_tree (node_id, ancestors, cycle) AS (
  SELECT node_id, ARRAY[]::integer[], FALSE FROM node WHERE parent_id IS NULL
  UNION ALL
  SELECT n.node_id, t.ancestors || n.parent_id, n.parent_id = ANY(t.ancestors)
  FROM node n, node_tree t WHERE n.parent_id = t.node_id AND NOT cycle
);

CREATE MATERIALIZED VIEW node_tree_mat AS (SELECT * FROM node_tree);

Results

So, some results. I’ll show the query, and then the timing (from EXPLAIN ANALYZE).

#1

SELECT * FROM node INNER JOIN node_tree USING (node_id) WHERE 1 = ANY(ancestors);

Average Time: 54.9ms (stddev 1.99)

#2

SELECT * FROM node n, node_tree t WHERE n.node_id = t.node_id AND 1 = ANY(ancestors);

Average Time: 57.2ms (stddev 2.98)

#3

SELECT * FROM node WHERE node_id IN
  (SELECT node_id FROM node_tree WHERE 1 = ANY(ancestors));

Average Time: 58.5ms (stddev 3.67)

#4

SELECT * FROM node INNER JOIN node_tree_mat USING (node_id) WHERE 1 = ANY(ancestors);

Average Time: 12.2ms (stddev 0.80)

#5

SELECT * FROM node n, node_tree_mat t WHERE n.node_id = t.node_id AND 1 = ANY(ancestors);

Average Time: 11.7ms (stddev 0.90)

This is the fastest query, but not significantly more so than #4.

#6

SELECT * FROM node WHERE node_id IN
  (SELECT node_id FROM node_tree_mat WHERE 1 = ANY(ancestors));

Average Time: 24.0ms (stddev 0.41)

Interestingly, this is much slower than using a JOIN.

#7

WITH RECURSIVE node_tree_cte(node_id, ancestors, cycle) AS (
    SELECT node_id, ARRAY[]::integer[], FALSE FROM node WHERE parent_id IS NULL
  UNION ALL
    SELECT n.node_id, t.ancestors || n.parent_id, n.parent_id = ANY(t.ancestors)
    FROM node n, node_tree t WHERE n.parent_id = t.node_id AND NOT cycle
) SELECT n.* FROM node n INNER JOIN node_tree_cte t USING (node_id)
WHERE 1 = ANY(t.ancestors);

Average Time: 97.0ms (stddev 1.87)

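Immediately, we see a performance hit compared with using a VIEW. Unexpected. One thing to note, though: the recursive term above joins against node_tree (the view) rather than node_tree_cte, so this CTE never references itself, and the view's recursion is evaluated inside it. Queries #8 to #11 are written the same way. These timings therefore measure a CTE wrapped around the view; swapping node_tree for node_tree_cte in the recursive term gives a self-referencing CTE, which was not timed here.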

#8

WITH RECURSIVE node_tree_cte(node_id, ancestors, cycle) AS (
    SELECT node_id, ARRAY[]::integer[], FALSE FROM node WHERE parent_id IS NULL
  UNION ALL
    SELECT n.node_id, t.ancestors || n.parent_id, n.parent_id = ANY(t.ancestors)
    FROM node n, node_tree t WHERE n.parent_id = t.node_id AND NOT cycle
) SELECT n.* FROM node n, node_tree_cte t
WHERE n.node_id = t.node_id AND 1 = ANY(t.ancestors);

Average Time: 96.1ms (stddev 1.16)

#9

SELECT * FROM node INNER JOIN (
  WITH RECURSIVE node_tree_cte(node_id, ancestors, cycle) AS (
    SELECT node_id, ARRAY[]::integer[], FALSE FROM node WHERE parent_id IS NULL
    UNION ALL
    SELECT n.node_id, t.ancestors || n.parent_id, n.parent_id = ANY(t.ancestors)
    FROM node n, node_tree t WHERE n.parent_id = t.node_id AND NOT cycle
  ) SELECT node_id FROM node_tree_cte WHERE 1 = ANY(ancestors)
) node_tree_cte USING (node_id);

Average Time: 114.3ms (stddev 4.64)

This is the slowest (but again, only just, and not significantly more than #10).

#10

SELECT * FROM node, (
  WITH RECURSIVE node_tree_cte(node_id, ancestors, cycle) AS (
    SELECT node_id, ARRAY[]::integer[], FALSE FROM node WHERE parent_id IS NULL
    UNION ALL
    SELECT n.node_id, t.ancestors || n.parent_id, n.parent_id = ANY(t.ancestors)
    FROM node n, node_tree t WHERE n.parent_id = t.node_id AND NOT cycle
  ) SELECT node_id FROM node_tree_cte WHERE 1 = ANY(ancestors)
) node_tree_cte WHERE node.node_id = node_tree_cte.node_id;

Average Time: 110.0ms (stddev 1.05)

#11

SELECT * FROM node WHERE node_id IN (
  WITH RECURSIVE node_tree_cte(node_id, ancestors, cycle) AS (
    SELECT node_id, ARRAY[]::integer[], FALSE FROM node WHERE parent_id IS NULL
    UNION ALL
    SELECT n.node_id, t.ancestors || n.parent_id, n.parent_id = ANY(t.ancestors)
    FROM node n, node_tree t WHERE n.parent_id = t.node_id AND NOT cycle
  ) SELECT node_id FROM node_tree_cte WHERE 1 = ANY(ancestors)
);

Average Time: 96.1ms (stddev 5.50)

Discussion

So, it appears that the Common Table Expression queries, as written, are nearly twice as slow as using a RECURSIVE VIEW. I didn’t expect that at all, as I thought they were equivalent. Because each CTE's recursive term reads from the node_tree view rather than from the CTE itself, the gap may reflect that extra work rather than a cost inherent to CTEs. Unsurprisingly, MATERIALIZED VIEW is much faster.

This has some implications for the stuff I was working on: I was using a query of the #11 form (a sub-query that is a WITH RECURSIVE statement), which, as it turns out, is just as fast as doing a “root” CTE. However, it’s still far slower than doing a JOIN with a VIEW.

The problem I have now is that there is no way to have the addition of a Field to a Model in Django cause an extra migration operation to be added. One solution would be to manually add a RunSQL operation, but that is messy. I’ll also have to investigate the cost of REFRESH MATERIALIZED VIEW.