Tree data as a nested list redux

Some time ago, I wrote about using python to aggregate data that is stored with a Materialized Path into a Nested List structure.

But we should be able to do that same aggregation using Postgres, and from an Adjacency List structure.

Let’s start with a table definition:

CREATE TABLE location (
  node_id SERIAL PRIMARY KEY,
  name TEXT,
  parent_id INTEGER REFERENCES location(node_id)
);

And some data:

INSERT INTO location (node_id, name, parent_id) VALUES
  (1, 'Australia', NULL),
  (2, 'South Australia', 1),
  (3, 'Victoria', 1),
  (4, 'South-East', 2),
  (5, 'Western Districts', 3),
  (6, 'New Zealand', NULL),
  (7, 'Barossa Valley', 2),
  (8, 'Riverland', 2),
  (9, 'South Island', 6),
  (10, 'North Island', 6),
  (11, 'Eastern Bay of Plenty', 10);

To begin with, we need to get all of the items, and their depth:

WITH RECURSIVE location_with_level AS (
  SELECT *,
         0 AS lvl
    FROM location
   WHERE parent_id IS NULL

  UNION ALL

  SELECT child.*,
         parent.lvl + 1
    FROM location child
    JOIN location_with_level parent ON parent.node_id = child.parent_id
)
SELECT * FROM location_with_level;
 node_id │         name          │ parent_id │ lvl
─────────┼───────────────────────┼───────────┼─────
       1 │ Australia             │    <NULL> │   0
       6 │ New Zealand           │    <NULL> │   0
       2 │ South Australia       │         1 │   1
       3 │ Victoria              │         1 │   1
       9 │ South Island          │         6 │   1
      10 │ North Island          │         6 │   1
       4 │ South-East            │         2 │   2
       5 │ Western Districts     │         3 │   2
       7 │ Barossa Valley        │         2 │   2
       8 │ Riverland             │         2 │   2
      11 │ Eastern Bay of Plenty │        10 │   2
(11 rows)

Because of the way recursive queries work, we need to find the deepest node(s), and start there:

WITH RECURSIVE location_with_level AS (
  SELECT *,
         0 AS lvl
    FROM location
   WHERE parent_id IS NULL

  UNION ALL

  SELECT child.*,
         parent.lvl + 1
    FROM location child
    JOIN location_with_level parent ON parent.node_id = child.parent_id
),
maxlvl AS (
  SELECT max(lvl) maxlvl FROM location_with_level
)

SELECT * FROM maxlvl;

We then need to build up the tree (this clause is the next one in our CTE chain, I’ve omitted the first two for clarity):

c_tree AS (
  SELECT location_with_level.*,
         NULL::JSONB children
    FROM location_with_level, maxlvl
   WHERE lvl = maxlvl

   UNION

   (
     SELECT (branch_parent).*,
            jsonb_agg(branch_child)
       FROM (
         SELECT branch_parent,
                to_jsonb(branch_child) - 'lvl' - 'parent_id' - 'node_id' AS branch_child
           FROM location_with_level branch_parent
           JOIN c_tree branch_child ON branch_child.parent_id = branch_parent.node_id
       ) branch
       GROUP BY branch.branch_parent

       UNION

       SELECT c.*,
              NULL::JSONB
       FROM location_with_level c
       WHERE NOT EXISTS (SELECT 1
                           FROM location_with_level hypothetical_child
                          WHERE hypothetical_child.parent_id = c.node_id)
   )
)

The first part of this query gets all of the deepest leaf nodes.

This is then combined with another recursive subquery, that creates branches. This relies on the fact it’s possible use the “type”, and have records as columns in a query. The second part of this subquery finds all remaining leaf nodes, and combines them in. This second subquery will keep executing until it doesn’t find any new rows, which will happen when all root nodes have been processed.

We can see from the results of this last clause that we just need to limit this to root nodes:

 node_id │         name          │ parent_id │ lvl │                                                   children
─────────┼───────────────────────┼───────────┼─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────
       4 │ South-East            │         2 │   2 │ <NULL>
       5 │ Western Districts     │         3 │   2 │ <NULL>
       7 │ Barossa Valley        │         2 │   2 │ <NULL>
       8 │ Riverland             │         2 │   2 │ <NULL>
      11 │ Eastern Bay of Plenty │        10 │   2 │ <NULL>
       3 │ Victoria              │         1 │   1 │ [{"name": "Western Districts", "children": null}]
      10 │ North Island          │         6 │   1 │ [{"name": "Eastern Bay of Plenty", "children": null}]
       9 │ South Island          │         6 │   1 │ <NULL>
       2 │ South Australia       │         1 │   1 │ [{"name": "Riverland", "children": null}, {"name": "Barossa Valley", "children": null}, {"name": "South-East…
         │                       │           │     │…", "children": null}]
       6 │ New Zealand           │    <NULL> │   0 │ [{"name": "South Island", "children": null}, {"name": "North Island", "children": [{"name": "Eastern Bay of …
         │                       │           │     │…Plenty", "children": null}]}]
       1 │ Australia             │    <NULL> │   0 │ [{"name": "South Australia", "children": [{"name": "Riverland", "children": null}, {"name": "Barossa Valley"…
         │                       │           │     │…, "children": null}, {"name": "South-East", "children": null}]}, {"name": "Victoria", "children": [{"name": …
         │                       │           │     │…"Western Districts", "children": null}]}]
(11 rows)

So our final query, using the new jsonb_pretty function:

WITH RECURSIVE location_with_level AS (
  SELECT *,
         0 AS lvl
    FROM location
   WHERE parent_id IS NULL

  UNION ALL

  SELECT child.*,
         parent.lvl + 1
    FROM location child
    JOIN location_with_level parent ON parent.node_id = child.parent_id
),
maxlvl AS (
  SELECT max(lvl) maxlvl FROM location_with_level
),
c_tree AS (
  SELECT location_with_level.*,
         NULL::JSONB children
    FROM location_with_level, maxlvl
   WHERE lvl = maxlvl

   UNION

   (
     SELECT (branch_parent).*,
            jsonb_agg(branch_child)
       FROM (
         SELECT branch_parent,
                to_jsonb(branch_child) - 'lvl' - 'parent_id' - 'node_id' AS branch_child
           FROM location_with_level branch_parent
           JOIN c_tree branch_child ON branch_child.parent_id = branch_parent.node_id
       ) branch
       GROUP BY branch.branch_parent

       UNION

       SELECT c.*,
              NULL::JSONB
       FROM location_with_level c
       WHERE NOT EXISTS (SELECT 1
                           FROM location_with_level hypothetical_child
                          WHERE hypothetical_child.parent_id = c.node_id)
   )
)

SELECT jsonb_pretty(
         array_to_json(
           array_agg(
             row_to_json(c_tree)::JSONB - 'lvl' - 'parent_id' - 'node_id'
           )
         )::JSONB
       ) AS tree
  FROM c_tree
  WHERE lvl=0;

And our results:

                           tree
 ──────────────────────────────────────────────────────────
  [
      {
          "name": "New Zealand",
          "children": [
              {
                  "name": "South Island",
                  "children": null
              },
              {
                  "name": "North Island",
                  "children": [
                      {
                          "name": "Eastern Bay of Plenty",
                          "children": null
                      }
                  ]
              }
          ]
      },
      {
          "name": "Australia",
          "children": [
              {
                  "name": "South Australia",
                  "children": [
                      {
                          "name": "Riverland",
                          "children": null
                      },
                      {
                          "name": "Barossa Valley",
                          "children": null
                      },
                      {
                          "name": "South-East",
                          "children": null
                      }
                  ]
              },
              {
                  "name": "Victoria",
                  "children": [
                      {
                          "name": "Western Districts",
                          "children": null
                      }
                  ]
              }
          ]
      }
  ]
 (1 row)

Oh, that is rather neat.

This query is mostly cribbed from a fantastic Stack Overflow answer by David Guillot.

Django Trees via Closure View

After writing up a method of using a Postgres View that generates a materialised path within the context of a Django model, I came across some queries of my data that were getting rather troublesome to write. It occurred to me that having a closure table would be useful. Specifically, I needed all of the descendants of a given set of nodes.

I couldn’t find an existing Postgres extension that will manage the closure table, and didn’t feel like writing my own implemention using triggers just yet. However, it occurred to me that I could use a similar trick to the recursive materialised path view. Thus, we have a Closure View.

We will start with the Django models:

class Node(models.Model):
    node_id = models.AutoField(primary_key=True)
    parent = models.ForeignKey('tree.Node', related_name='children', null=True, blank=True)

    descendants = models.ManyToManyField('tree.Node', related_name='ancestors', through='tree.Closure')

    class Meta:
        app_label = 'tree'


class Closure(models.Model):
    path = ArrayField(base_field=models.IntegerField(), primary_key=True)
    ancestor = models.ForeignKey('tree.Node', related_name='+')
    descendant = models.ForeignKey('tree.Node', related_name='+')
    depth = models.IntegerField()

    class Meta:
        app_label = 'tree'
        managed = False

You may notice I have a path column. I’m using this for the primary key, and it may turn out to be useful later.

Let’s have a look at the View:

CREATE RECURSIVE VIEW tree_closure(path, ancestor_id, descendant_id, depth) AS

SELECT ARRAY[node_id], node_id, node_id, 0 FROM tree_node

UNION ALL

SELECT parent_id || path, parent_id, descendant_id, depth + 1
FROM tree_node INNER JOIN tree_closure ON (ancestor_id = node_id)
WHERE parent_id IS NOT NULL;

This uses a recursive query. The first part builds the self-reference relations, and the second part uses the RECURSIVE function to collect child nodes for each node already in the table (or added in previous iterations of this part of the view).

Now, because we are using the in-built Django Many to Many features, we have some nice queries ready to go:

  • node.ancestors.all() : All ancestors of a given Node instance.
  • node.descendants.all() : All descendants of a given Node instance.
  • Node.objects.filter(ancestors=queryset) : All descendants of all nodes in a queryset.
  • Node.objects.filter(descendants=queryset) : All ancestors of all nodes in a queryset.

Of particular note are the bottom two: these are rather cumbersome to write in older versions of Django.

Adjacency Lists in Django with Postgres

Today, I’m going to walk through modelling a tree in Django, using an Adjacency List, and a Postgres View that dynamically creates the materialised path of ancestors for each node.

With this, we will be able to query the tree for a range of operations using the Django ORM.

We will start with our model:

class Node(models.Model):
    node_id = models.AutoField(primary_key=True)
    parent = models.ForeignKey('tree.node', related_name='children', null=True, blank=True)

    class Meta:
        app_label = 'tree'

We will also build an unmanaged model that will be backed by our view.

from django.contrib.postgres.fields import ArrayField

class Tree(models.Model):
    root = models.ForeignKey(Node, related_name='+')
    node = models.OneToOneField(Node, related_name='tree_node', primary_key=True)
    ancestors = ArrayField(base_field=models.IntegerField())

    class Meta:
        app_label = 'tree'
        managed = False

You’ll notice I’ve included a root relation. This could be obtained by using ancestors[0] if ancestors else node_id, but that’s a bit cumbersome.

So, on to the View:

CREATE RECURSIVE VIEW tree_tree(root_id, node_id, ancestors) AS

SELECT node_id, node_id, ARRAY[]::INTEGER[]
FROM tree_node WHERE parent_id IS NULL

UNION ALL

SELECT tree.root_id, node.node_id, tree.ancestors || node.parent_id
FROM tree_node node INNER JOIN tree_tree tree ON (node.parent_id = tree.node_id)

I’ve written this view before, so I won’t go into any detail.

We can create a tree. Normally I wouldn’t specify the primary key, but since we want to talk about those values shortly, I will. It also means you can delete them, and recreate with this code, and not worry about the sequence values.

from tree.models import Node

Node.objects.bulk_create([
  Node(pk=1),
  Node(pk=2, parent_id=1),
  Node(pk=3, parent_id=1),
  Node(pk=4, parent_id=2),
  Node(pk=5, parent_id=2),
  Node(pk=6, parent_id=3),
  Node(pk=7, parent_id=3),
  Node(pk=8, parent_id=4),
  Node(pk=9, parent_id=8),
  Node(pk=10),
  Node(pk=11, parent_id=10),
  Node(pk=12, parent_id=11),
  Node(pk=13, parent_id=11),
  Node(pk=14, parent_id=12),
  Node(pk=15, parent_id=12),
  Node(pk=16, parent_id=12),
])

Okay, let’s start looking at how we might perform some operations on it.

We’ve already seen how to create a node, either root or leaf nodes. No worries there.

What about inserting an intermediate node, say between 11 and 12?

node = Node.objects.create(parent_id=11)
node.parent.children.exclude(pk=node.pk).update(parent=node)

I’m not sure if it is possible to do it in a single statement.

Okay, let’s jump to some tree-based statements. We’ll start by finding a sub-tree.

Node.objects.filter(tree_node__ancestors__contains=[2])

Oh, that’s pretty nice. It’s not necessarily sorted, but it will do for now.

We can also query directly for a root:

Node.objects.filter(tree_node__root=10)

We could spell that one as tree_node__ancestors__0=10, but I think this is more explicit. Also, that one will not include the root node itself.

Deletions are also simple: if we can build a queryset, we can delete it. Thus, deleting a full tree could be done by following any queryset by a .delete()

Fetching a node’s ancestors is a little trickier: because we only have an array of node ids; thus it does two queries.

Node.objects.filter(pk__in=Node.objects.get(pk=15).tree_node.ancestors)

The count of ancestors doesn’t require the second query:

len(Node.objects.get(pk=15).tree_node.ancestors)

Getting ancestors to a given depth is also simple, although it still requires two queries:

Node.objects.filter(pk__in=Node.objects.get(pk=15).tree_node.ancestors[-2:])

This is a fairly simple way to enable relatively performance-aware queries of tree data. There are still places where it’s not perfect, and in reality, you’d probably look at building up queryset or model methods for wrapping common operations.

Postgres Tree Shootout part 3: Adjacency List using Views

It’s been a while, but I’ve finally gotten off my arsefound some time to revisit this series. As promised last time, I’m going to rewrite the queries from the Adjacency List “solutions” using a View. Indeed, there will be two versions of the view - one which is a MATERIALIZED VIEW. There will also be discussion of when the two different types of view might be best to use.

One of the reasons this post took so long to write was that I was sidetracked by writing an SVG generator that would allow for graphically seeing what the different operations discussed in this series look like in terms of an actual tree. That didn’t eventuate.

We will start by defining what our tree view will actually look like. You’ll notice is it rather like the CTE that we saw in the previous post.

CREATE TABLE nodes (
  node_id SERIAL PRIMARY KEY,
  parent_id INTEGER REFERENCES nodes(node_id)
);

CREATE RECURSIVE VIEW tree (node_id, ancestors) AS (
  SELECT node_id, ARRAY[]::integer[] AS ancestors
  FROM nodes WHERE parent_id IS NULL

  UNION ALL

  SELECT nodes.node_id, tree.ancestors || nodes.parent_id
  FROM nodes, tree
  WHERE nodes.parent_id = tree.node_id
);

INSERT INTO nodes VALUES
  (1, NULL),
  (2, 1),
  (3, 1),
  (4, 2),
  (5, 2),
  (6, 3),
  (7, 3),
  (8, 4),
  (9, 8),
  (10, NULL),
  (11, 10),
  (12, 11),
  (13, 11),
  (14, 12),
  (15, 12),
  (16, 12);

Insertions

All of the insertions do not require access to the tree view, since the beauty of an Adjacency List model is that you only ever need to operate on the immediate parent-child.

Removals

Similarly, we will skip over the simple operations: those don’t require access to any more of the tree than just the parent-child relationship. It’s not until we need to remove a subtree that it becomes interesting.

DELETE FROM nodes
WHERE node_id IN (
  SELECT node_id FROM tree WHERE 2 = ANY(ancestors)
) OR node_id = 2;

If you are paying attention, you will notice that this is virtually identical to the CTE version, except that we no longer need to redeclare the CTE each time. The full tree deletion is the same, as is removing all decscendants:

DELETE FROM nodes
WHERE node_id IN (
  SELECT node_id FROM tree WHERE 2 = ANY(ancestors)
);

Moves

Again, the operations that don’t require the actual tree are unchanged: this is where the Adjacency List really shines.

Fetches

Since we are starting with the “full” tree, we should be able to use it for all of the queries. It is possible that these queries (unlike those we have seen before) may be slightly slower than the CTE version (specifically, those where the CTE is customised for that operation).

Descendants

Let’s get all of node 10’s descendants:

SELECT node_id FROM tree WHERE 10 = ANY(ancestors);

This query is far less complicated than the CTE version, as expected. However, when dealing with very large datasets, it performs far worse. I have a data set with 221000 nodes, in 1001 different trees. Performing this query takes around 5 seconds, but the customised CTE version takes about 750ms.

Turning this view into a materialised view:

CREATE MATERIALIZED VIEW tree_mat AS
SELECT node_id, ancestors FROM tree;

and then querying that turns this into around 75ms.

To limit the query to nodes to a given depth requires slightly more work.

SELECT node_id, ancestors FROM tree
WHERE ARRAY_POSITION(ancestors, 10) < ARRAY_LENGTH(ancestors, 1) - 2;

Ancestors

Fetching ancestors of a node is again trivial:

SELECT unnest(ancestors) FROM tree WHERE node_id = 15;

And the count of ancestors:

SELECT ARRAY_LENGTH(ancestors, 1) FROM tree WHERE node_id=15;

Getting a set of ancestors to a given depth is actually a little tricky: because we can’t just reverse the end that we add the parent node to the ancestors array, we can’t use that trick. We’ll have to enumerate the rows, and then extract those we care about. You can’t use OFFSET with a variable, otherwise that would be a nice trick.

WITH ancestors AS (
  SELECT unnest(ancestors) AS node_id
  FROM tree
  WHERE node_id = 15
), enumerated AS (
  SELECT
    row_number() OVER () AS row,
    count(*) OVER () AS ancestor_count,
    node_id
  FROM ancestors
)
SELECT node_id
FROM enumerated
WHERE "row" > ancestor_count - 2;

Ugh. That’s way worse than the CTE version.

Special queries

None of the special queries access the tree in any way, so can be omitted for now.

Discussion

So how does using a view stack up to the ad-hoc CTE queries?

Mostly pretty well. In the case where you have only small data sets, then the view that builds up the complete tree each time is not that much of a problem (I ran some tests with tens of thousands of items, and it still performed relatively well). When it moves up to hundreds of thousands, then the ad-hoc CTE versions can greatly outperform the full tree view.

However, using a materialised view changes everything. It now becomes just as fast as querying a table: indeed, that’s just what it is. You could have triggers based on changes to the nodes table causing a REFRESH MATERIALIZED VIEW, but it is worth keeping in mind that this will take some time: in my case, a full refresh of 221000 rows took upwards of 4.5 seconds.

Using a materialised view gets us most of the way to (and leads nicely into the next method), storing a materialised path. The similarity of the names here should be a trigger, but now I’m just making foreshadowing jokes.

Tree data as a nested list

One of the nice things about Adjacency Lists as a method of storing tree structures is that there is not much redundancy: you only store a reference to the parent, and that’s it.

It does mean that getting that data in a nested object is a bit complicated. I’ve written before about getting data out of a database: I’ll revisit that again I’m sure, but for now, I’m going to deal with data that has the following shape: that is, has been built up into a Materialized Path:

[
  {
    "node": 1,
    "ancestors": [],
    "label": "Australia"
  },
  {
    "node": 2,
    "ancestors": [1],
    "label": "South Australia"
  },
  {
    "node": 3,
    "ancestors": [1],
    "label": "Victoria"
  },
  {
    "node": 4,
    "ancestors": [1, 2],
    "label": "South-East"
  },
  {
    "node": 5,
    "ancestors": [1, 3],
    "label": "Western Districts"
  },
  {
    "node": 6,
    "ancestors": [],
    "label": "New Zealand"
  },
  {
    "node": 7,
    "ancestors": [1, 2],
    "label": "Barossa Valley"
  },
  {
    "node": 8,
    "ancestors": [1, 2],
    "label": "Riverland"
  }
]

From here, we want to build up something that looks like:

  • Australia
    • South Australia
      • Barossa Valley
      • Riverland
      • South East
    • Victoria
      • Western Districts
  • New Zealand

Or, a nested python data structure:

[
  ('Australia', [
    ('South Australia', [
      ('Barossa Valley', []),
      ('Riverland', []),
      ('South-East', [])
    ]),
    ('Victoria', [
      ('Western Districts', [])
    ])
  ]),
  ('New Zealand', [])
]

You’ll see that each node is a 2-tuple, and each set of siblings is a list. Even a node with no children still gets an empty list.

We can build up this data structure in two steps: based on the fact that a dict, as key-value pairs, matches a 2-tuple. That is, we will start by creating:

{
  1: {
    2: {
      4: {},
      7: {},
      8: {},
    },
    3: {
      5: {},
    }
  },
  6: {},
}

You might be reaching for python’s defaultdict class at this point, but there is a slightly nicer way:

class Tree(dict):
    def __missing__(self, key):
        value = self[key] = type(self)()
        return value

(Note: This class, and the seed of the idea, came from this answer on StackOverflow).

We can also create a recursive method on this class that creates a node and all of it’s ancestors:

    def insert(self, key, ancestors):
        if ancestors:
            self[ancestors[0]].insert(key, ancestors[1:])
        else:
          self[key]
>>> tree = Tree()
>>> for node in data:
...     tree.insert(node['node'], node['ancestors'])
>>> print tree
{1: {2: {8: {}, 4: {}, 7: {}}, 3: {5: {}}}, 6: {}}

Looking good.

Let’s make another method that allows us to actually insert the labels (and apply a sort, if we want):

    def label(self, label_dict, sort_key=lambda x: x[0]):
        return sorted([
          (label_dict.get(key), value.label(label_dict, sort_key))
          for key, value in self.items()
        ], key=sort_key)

We also need to build up the simple key-value store to pass as label_dict, but that’s pretty easy.

Let’s look at the full code: first the complete class:

class Tree(dict):
    """Simple Tree data structure

    Stores data in the form:

    {
        "a": {
            "b": {},
            "c": {},
        },
        "d": {
            "e": {},
        },
    }

    And can be nested to any depth.
    """

    def __missing__(self, key):
        value = self[key] = type(self)()
        return value

    def insert(self, node, ancestors):
        """Insert the supplied node, creating all ancestors as required.

        This expects a list (possibly empty) containing the ancestors,
        and a value for the node.
        """
        if not ancestors:
            self[node]
        else:
            self[ancestors[0]].insert(node, ancestors[1:])

    def label(self, labels, sort_key=lambda x: x[0]):
        """Return a nested 2-tuple with just the supplied labels.

        Realistically, the labels could be any type of object.
        """
        return sorted([
            (
                labels.get(key),
                value.label(labels, sort_key)
            ) for key, value in self.items()
        ], key=sort_key)

Now, using it:

>>> tree = Tree()
>>> labels = {}
>>>
>>> for node in data:
>>>     tree.insert(node['node'], node['ancestors'])
>>>     labels[node['node']] = node['label']
>>>
>>> from pprint import pprint
>>> pprint(tree.label(labels))

[('Australia',
  [('South Australia',
    [('Barossa Valley', []), ('Riverland', []), ('South-East', [])]),
   ('Victoria', [('Western Districts', [])])]),
 ('New Zealand', [])]

Awesome. Now use your template rendering of choice to turn this into a nicely formatted list.

Postgres Tree Shootout part 2: Adjacency List using CTEs

This is the second post in an ongoing series dealing with storing Hierarchical or Tree data structures in Postgres. You should read the Introduction if you haven’t already.

This post contains the queries that illustrate how an Adjacency List model can be used to represent a Hierarchical set of data, including data definitions, and the various operations that have been defined in the aforementioned introduction.

I’ve discussed Adjacency Lists in the past, but I’ll quickly recap why I think they are good.

  • They are conceptually simple to understand
  • They enforce referential integrity
  • They can be modelled with most ORMs without any extra infrastructure
  • Many of the operations are non-complex
  • Recursive queries allow us to perform the complex queries in reasonable time

To help build suspense (but more because I haven’t yet come up with a way to generate a nice reproducible, yet complex tree), this post may discuss the complexity of the queries, but will not contain any measurements.

Initial tree

Before each operation, our data will look like this (where parents point to children):

2014-11-26 11:27ZCanvas 1Layer 112345678910111213141516

We will assume reversion back to this structure after each operation: we could do this using a TRUNCATE followed by an INSERT; or we could run the operation in a transaction and rollback.

There may be a post which shows the effects of each of the queries below in a graphical form.

Table structure

Adjacency Lists are dead simple. Each node simply contains a reference to it’s parent node.

CREATE TABLE nodes (
  node_id SERIAL PRIMARY KEY,
  parent_id INTEGER REFERENCES nodes(node_id)
);

We can insert our data using a single statement:

INSERT INTO nodes VALUES
  (1, NULL),
  (2, 1),
  (3, 1),
  (4, 2),
  (5, 2),
  (6, 3),
  (7, 3),
  (8, 4),
  (9, 8),
  (10, NULL),
  (11, 10),
  (12, 11),
  (13, 11),
  (14, 12),
  (15, 12),
  (16, 12);

Insertions

Inserting a single leaf node is simple. To insert a single node as a child of node 13, for example:

INSERT INTO nodes (parent_id) VALUES (13);

Inserting a new root node is slightly more complicated. In most ORMs, we would probably do it in two queries: one to create the new node, and a second to update the other root nodes to point to this one. We’ll see that below.

We can do slightly better in raw SQL: UPDATE our table with the result of an INSERT ... RETURNING that occurs inside a Common Table Expression (CTE, or WITH query).

WITH parent AS (
  INSERT INTO nodes(parent_id)
  VALUES (NULL)
  RETURNING node_id
)
UPDATE nodes
SET parent_id = parent.node_id
FROM parent
WHERE parent_id IS NULL;

We can use the same pattern to insert an intermediate node in our tree. For instance, inserting a new node between nodes 11 and 12:

WITH created_node AS (
  INSERT INTO nodes(parent_id)
  VALUES (11)
  RETURNING node_id
)
UPDATE nodes
SET parent_id = created_node.node_id
FROM created_node
WHERE nodes.node_id = 12;

And our last insert, adding a child node, that gets it’s siblings as children. For instance, adding a new node under node 12, which gets all of node 12’s children as it’s children.

WITH created_node AS (
  INSERT INTO nodes(parent_id)
  VALUES (12)
  RETURNING node_id
)
UPDATE nodes
SET parent_id = created_node.node_id
FROM created_node
WHERE nodes.parent_id = 12;

All of these queries should perform relatively well: the CTE will be very simple (as it is no different to the single leaf insert), and the UPDATE should likewise be fairly simple: it needs to filter out which existing nodes do not need to be updated; and then it needs to update the remainder of the rows with the value pulled from the CTE.

This theoretically is only marginally more complex than just a simple UPDATE foo SET bar=baz WHERE quux IS NULL style query.

If we were using an ORM, we might need to do this in two queries: something like this (in Django):

# Insert new root node: all other root nodes now have this as a parent
new_node = Node.objects.create()
Node.objects.filter(parent=None).exclude(pk=new_node.pk).update(parent=new_node)
# Could possibly do as:
Node.objects.filter(parent=None).update(parent=Node.objects.create().pk)

# Insert new node, with a single child as it's child (and that child's previous
# parent as it's parent)
new_node = Node.objects.create(parent=old_node.parent)
old_node.parent = new_node
old_node.save()

# Insert new node, with children of that node's parent now children of the node.
new_node = parent_node.children.create()
parent_node.children.exclude(pk=new_node.pk).update(parent=new_node)
# Again, may be able to do:
parent_node.children.update(parent=parent_node.children.create().pk)

Note the required exclusion of the newly created node: we don’t have to do this in the CTE versions, as that doesn’t “exist” at the time the other part of the query runs.

Removals

Removing a single leaf node is no different than removing a row from a normal table. Removing node 9, for instance:

DELETE FROM nodes WHERE node_id = 9;

Because the information about parent-child relationships is stored in the child, we do not need to do anything else to maintain the tree.

To remove a single root node (in this case, node 1), and promote all children to root nodes themselves, we can do two queries:

UPDATE nodes SET parent_id = NULL WHERE parent_id = 1;
DELETE FROM nodes WHERE node_id = 1;

It may be possible to do this in a single query, similar to the CTE queries above, but I’m not sure of the benefit.

WITH deleted AS (
  DELETE FROM nodes
  WHERE node_id = 1
)
UPDATE nodes SET parent_id = NULL WHERE parent_id = 1;

The same pattern can be used for removing a node, and attaching it’s children to it’s parent. Here, we will remove node 2, and attach it’s children (4 and 5) as children of it’s parent, node 1:

UPDATE nodes
SET parent_id = (SELECT parent_id FROM nodes WHERE node_id = 2)
WHERE parent_id = 2;

DELETE from nodes WHERE node_id = 2;

This is a place where using a CTE might make things clearer - especially if we have the node-to-be-deleted’s id, but not it’s parent:

WITH deleted AS (
  DELETE FROM nodes
  WHERE node_id = 2
  RETURNING node_id, parent_id
)
UPDATE nodes
SET parent_id = deleted.parent_id
FROM deleted
WHERE nodes.parent_id = deleted.node_id;

Righto, now we are up to the traditionally “hard” things for an Adjacency List to perform. Dealing with removing an arbitrary depth of (sub)tree.

We’ll need to create a recursive CTE, and delete according to that. Let’s just select from that initially, so we can see what the CTE data will look like:

WITH RECURSIVE tree AS (
  SELECT node_id, ARRAY[]::integer[] AS ancestors
  FROM nodes WHERE parent_id IS NULL

  UNION ALL

  SELECT nodes.node_id, tree.ancestors || nodes.parent_id
  FROM nodes, tree
  WHERE nodes.parent_id = tree.node_id
)
SELECT * FROM tree;
 node_id | ancestors
---------+------------
       1 | {}
      10 | {}
       2 | {1}
       3 | {1}
      11 | {10}
       4 | {1,2}
       5 | {1,2}
       6 | {1,3}
       7 | {1,3}
      12 | {10,11}
      13 | {10,11}
       8 | {1,2,4}
      14 | {10,11,12}
      15 | {10,11,12}
      16 | {10,11,12}
       9 | {1,2,4,8}
(16 rows)

Coolio. So, on to our operations. Let’s remove the subtree starting with node 2. I’ll hide the CTE, since it will be the same for quite a few of these operations:

WITH RECURSIVE tree AS (...)
DELETE FROM nodes
WHERE node_id IN (
  SELECT node_id FROM tree WHERE 2 = ANY(tree.ancestors)
) OR node_id = 2;

The query is identical for a full tree (node 1 and descendants):

WITH RECURSIVE tree AS (...)
DELETE FROM nodes
WHERE node_id IN (
  SELECT node_id FROM tree WHERE 1 = ANY(tree.ancestors)
) OR node_id = 1;

And it’s nearly identical for just the descendants of a given node. Here, for all of node 2’s descendants, but not that node itself:

WITH RECURSIVE tree AS (...)
DELETE FROM nodes
WHERE node_id IN (
  SELECT node_id FROM tree WHERE 2 = ANY(tree.ancestors)
);

Moves

Because the relationship is stored purely on the child element, moving around trees and subtrees is very easy. We can start with moving subtree starting with 3 to underneath node 4:

UPDATE nodes
SET parent_id = 4 WHERE node_id = 3;

Nothing surprising there. Similarly, the query is identical for moving a leaf to a different parent, a root node into a tree, and turning a subtree into a full tree (making that node a root node).

UPDATE nodes SET parent_id = 6 WHERE node_id = 5;
UPDATE nodes SET parent_id = 8 WHERE node_id = 10;
UPDATE nodes SET parent_id = NULL WHERE node_id = 2;

The final move: all of node’s children to a different node is almost as simple:

UPDATE nodes SET parent_id = 5 WHERE parent_id = 12;

This seems to be a situation where Adjacency Lists are really good. None of these queries are any more complex than the simplest UPDATE you could think of.

Fetches

Using the same CTE, we can perform our fetches. We may need to extend it to deal with depths, but since the ancestors column contains ancestors starting with the root node, we could count stuff in there. Let’s see how it goes.

Descendants

Fetching all descendants of a given node just means we want to see if the node occurs at all in each row’s ancestors. To get all of node 10’s descendants:

WITH RECURSIVE tree AS (
  SELECT node_id, ARRAY[]::integer[] AS ancestors
  FROM nodes WHERE parent_id IS NULL

  UNION ALL

  SELECT nodes.node_id, tree.ancestors || nodes.parent_id
  FROM nodes, tree
  WHERE nodes.parent_id = tree.node_id
)
SELECT node_id FROM tree WHERE 10 = ANY(tree.ancestors);

However, we could improve this by starting with just the node we care about, or more specifically, it’s children:

WITH RECURSIVE tree AS (
  SELECT node_id, ARRAY[10]::integer[] AS ancestors FROM nodes WHERE parent_id = 10

  UNION ALL

  SELECT nodes.node_id, tree.ancestors || nodes.parent_id
  FROM nodes, tree
  WHERE nodes.parent_id = tree.node_id
)
SELECT node_id FROM tree;

Obviously, this is far less generic, but it is also significantly less complex. For starters, it only builds up the part of the tree we care about, and then just returns the node ids, rather than building up the whole tree, and then discarding the parts that are not required.

The same code can be used for determining the number of descendants, but with a COUNT(node_id) in the final query.

To get our depth-limited query, we can approach from two directions. To get the subtree to depth 2 from above:

WITH RECURSIVE tree AS (
  SELECT node_id, ARRAY[10]::integer[] AS ancestors FROM nodes WHERE parent_id = 10

  UNION ALL

  SELECT nodes.node_id, tree.ancestors || nodes.parent_id
  FROM nodes, tree
  WHERE nodes.parent_id = tree.node_id
  AND cardinality(tree.ancestors) < 2
)
SELECT node_id FROM tree;

To do the same in the more generic form have to look at how close the desired node is to the end of the ancestors array:

WITH RECURSIVE tree AS (
  SELECT node_id, ARRAY[]::integer[] AS ancestors
  FROM nodes WHERE parent_id IS NULL

  UNION ALL

  SELECT nodes.node_id, tree.ancestors || nodes.parent_id
  FROM nodes, tree
  WHERE nodes.parent_id = tree.node_id
)
SELECT node_id
FROM tree
WHERE ARRAY_POSITION(ancestors, 10) < ARRAY_LENGTH(ancestors, 1) - 2;

(Note that this is a bug fix on the original version of this post).

Ancestors

Fetching ancestors from our generic CTE is a bit simpler, because that data is already part of the query:

WITH RECURSIVE tree AS (
  SELECT node_id, ARRAY[]::integer[] AS ancestors
  FROM nodes WHERE parent_id IS NULL

  UNION ALL

  SELECT nodes.node_id, tree.ancestors || nodes.parent_id
  FROM nodes, tree
  WHERE nodes.parent_id = tree.node_id
) SELECT unnest(ancestors) FROM tree WHERE node_id = 15;

To do the equivalent of the hand-built CTE, we would need to start with the node, and build back the other way. It’s getting late here, so I can’t think of a way to do this right now that doesn’t get stuck doing infinite recursion.

The count query is an interesting one: we can just remove the need to unnest, and take the cardinality:

WITH RECURSIVE tree AS (
  SELECT node_id, ARRAY[]::integer[] AS ancestors
  FROM nodes WHERE parent_id IS NULL

  UNION ALL

  SELECT nodes.node_id, tree.ancestors || nodes.parent_id
  FROM nodes, tree
  WHERE nodes.parent_id = tree.node_id
) SELECT cardinality(ancestors) FROM tree WHERE node_id = 15;

The depth query is a little trickier. We want to know the ancestors of node 15, up to a depth of 2. If our ancestors array was in the reverse order, we should be able to unnest and limit.

WITH RECURSIVE tree AS (
  SELECT node_id, ARRAY[]::integer[] AS ancestors
  FROM nodes WHERE parent_id IS NULL

  UNION ALL

  SELECT nodes.node_id, nodes.parent_id || tree.ancestors
  FROM nodes, tree
  WHERE nodes.parent_id = tree.node_id
) SELECT unnest(ancestors) FROM tree WHERE node_id = 15 LIMIT 2;

We can do this because a node only has one parent: so limiting the number of ancestors (when sorted nearest ancestor first) is the same as limiting the depth.

Special queries

Fetching all leaf nodes is just a matter of excluding those that have a relationship to another node as it’s parent:

SELECT node_id FROM nodes
WHERE node_id NOT IN (
  SELECT parent_id FROM nodes WHERE parent_id IS NOT NULL
);

Trick for new players: if you leave off the WHERE clause in that sub-query, you won’t get any matches!

Fetching the number of leaf nodes is trivial.

Fetching root nodes (or the number of) is simpler than leaf nodes:

SELECT node_id FROM nodes
WHERE parent_id IS NULL;

Fetching non-leaf nodes, and non-root nodes is just negations of the two queries above:

SELECT node_id FROM nodes WHERE node_id IN (
  SELECT parent_id FROM nodes WHERE parent_id IS NOT NULL
);

SELECT node_id FROM nodes WHERE parent_id IS NOT NULL;

And the non-leaf, non-root nodes just combines these two queries:

SELECT node_id FROM nodes WHERE node_id IN (
  SELECT parent_id FROM nodes WHERE parent_id IS NOT NULL
) AND parent_id IS NOT NULL;

As an aside: there is also the inverse of this: nodes which are isolated (root AND leaf nodes):

SELECT node_id FROM nodes
WHERE parent_id IS NULL
AND node_id NOT IN (
  SELECT parent_id FROM nodes WHERE parent_id IS NOT NULL
);

Well, that’s our operations. Most of them are really simple. For anything that requires us to fetch or delete a whole subtree, we needed to revert to recursive CTEs, and for some of the other operations, using a CTE makes it simpler (and easier to understand).

Next, we will look at an alternative to the CTE operations, using a recursive view. From there, we should be able to look at a trigger-based approach that materializes our tree (node, ancestors) data in a table, and keeps it up to date. That’s, as I hinted, getting close to a Materialized Path approach, but keeps the conceptual simplicity of the Adjacency List, and hopefully prevents possible issues relating to referential integrity.

Postgres Tree Shootout part 1: Introduction.

I’ve written before about using Adjacency Lists, and even done some performance tests on querying them. Whilst reading a post today, it occurred to me that it might be worthwhile to do a comparison of the various methods of storing hierarchical data in Postgres, and the costs of the same operations on each of those.

This post is just an introduction, with an outline of what I plan to do to run these tests. Please feel free to suggest things that I have missed, or that might be an oversight at my end.


Tree Models

There are four methods of storing the relationships that might form a tree. This analysis will be limited to actual tree, rather than graph structures (no cycles). I plan to detail the data structures in a series of posts, one per method. In each case, where there are multiple ways to store the data, I will attempt to examine each of these.

Adjacency Lists, being the simplest to understand (and the ones I have spent more time on recently), will be discussed first.

Path Enumerations will be next, with a comparison of storing the data using the ltree extension, and using an ARRAY column.

Following this, I’ll make an attempt at using the Closure Table model: where each ancestor-descendant relationship is stored, rather than just the parent-child relationship.

Finally, I’ll have a crack at the Nested Set model. I’m not solidly behind this model for the types of data I’ve had to deal with, but it is a valid mechanism for storing and retrieving this data. Besides, it will be an interesting exercise to implement.

My plan to handle all of these is that all tree manipulation should be “automatic”, that is, adding a node (or removing one, or whatever) should not require explicit updating of the various metadata. This should all be handled by trigger functions on the tables themselves. Whether this turns out to be reasonable we shall see.


Operations

I plan to perform the same set of operations (on the same data, rather than randomly generated data) in all models, and compare the complexity and run-time of the various queries. I’m hoping to cover all of the operations that might be performed on a tree structure, so please add any more to the comments.

The data stored in the table will contain more than one tree: this means we can perform operations which add/remove root nodes/whole trees.

Insertions

  • Insert a single leaf node
  • Insert a single root node (all existing root nodes will then point to this)
  • Insert a single node partway through the tree, with the “replaced” node becoming a child of this (and this one keeps it’s children)
  • Insert a single node partway through the tree, with this node’s parent’s existing children all becoming children of the new node.

Removals

  • Remove a single leaf node
  • Remove a single root node (all children of this are promoted to root nodes)
  • Remove a single node partway through the tree: all children of this node then have their grand-parent as their parent.
  • Remove a subtree (a single non-root node and it’s descendants)
  • Remove a whole tree (a single root node and all descendants)
  • Remove all descendants of a specific node (but not the node itself)

Moves

  • Move a subtree from one parent to another
  • Move a single leaf node to a different parent
  • Move a root node into a tree
  • Make a subtree into a tree (turn a node into a root node).
  • Move all children of a node to a different parent

Fetches

  • Fetch all descendants of a given node
  • Fetch the number of descendants of a given node
  • Fetch descendants of a given node to a given depth
  • Fetch the number of descendants of a given node to a given depth
  • Fetch all ancestors of a given node
  • Fetch the number of ancestors of a given node
  • Fetch ancestors of a given node to a given depth
  • Fetch the number of ancestors of a given node to a given depth

    I don’t think this makes any sense.

  • Fetch all leaf nodes
  • Fetch the number of leaf nodes
  • Fetch all root nodes
  • Fetch the number of root nodes
  • Fetch all non-leaf nodes
  • Fetch the number of non-leaf nodes
  • Fetch all non-root nodes
  • Fetch the number of non-root nodes
  • Fetch all non-root, non-leaf nodes
  • Fetch the number of non-root, non-leaf nodes

Performance testing Adjancency List recursive queries

Yesterday, I wrote up some ideas about doing recursive queries on Adjacency Lists using Postgres. Today, I wrote up some code that allows me to run some tests on larger data sets. It’s worth noting that this is still somewhat “toy” data, but I did see comparable results with a real query.

Firstly, our data structure:

CREATE TABLE node (
  node_id SERIAL PRIMARY KEY,
  parent_id INTEGER,
  FOREIGN KEY (parent_id) REFERENCES node(node_id)
);

Now, we want to be able to populate it with test data. This function will allow you to populate any number of records, with a 10% chance that any given record will be a root (have no parent). If it has a parent, it will be randomly selected from all existing rows. This means that earlier rows have a much higher chance of being a parent, and the first row is overwhemingly likely to have the most descendants (as it has a 90% chance that row 2 will have it as a parent, and therefore any descendants of that will also be descendants of row 1…)

CREATE OR REPLACE FUNCTION populate_nodes(count integer) RETURNS void AS $$
BEGIN
  FOR i IN 2..count LOOP
    IF ((SELECT count(*) FROM node) = 0) or (random() < 0.1) THEN
      INSERT INTO node (parent_id) SELECT NULL;
    ELSE
      INSERT INTO node (parent_id) SELECT node_id FROM node OFFSET random() * (SELECT count(*) FROM node) LIMIT 1;
    END IF;
  END LOOP;
END;
$$ LANGUAGE plpgsql;

-- Let's stick 10k records in there
SELECT populate_nodes(10000);

For now, we want to find all descendants of node 1.

I can think of eleven ways we could write this query:

  1. INNER JOIN with a RECURSIVE VIEW
  2. Implicit CROSS JOIN with RECURSIVE VIEW, filtered using a WHERE clause
  3. Sub-query with a RECURSIVE VIEW
  4. INNER JOIN with a MATERIALIZED VIEW based on the RECURSIVE VIEW
  5. Implicit CROSS JOIN (filtered) with MATERIALIZED VIEW based on RECURSIVE VIEW
  6. Sub-query with MATERIALIZED VIEW based on RECURSIVE VIEW
  7. RECURSIVE CTE, using an INNER JOIN
  8. RECURSIVE CTE, using an implicit CROSS JOIN (filtered)
  9. INNER JOIN with RECURSIVE CTE
  10. Implicit CROSS JOIN (filtered) with RECURSIVE CTE
  11. Subquery that is a RECURSIVE CTE

(Whilst some of these seem similar, we’ll see below how they differ).

In all cases, the actual query used for the tree will be:

SELECT node_id, ARRAY[]::integer[], FALSE FROM node WHERE parent_id IS NULL
UNION ALL
SELECT n.node_id, t.ancestors || n.parent_id, n.parent_id = ANY(t.ancestors)
FROM node n, node_tree t WHERE n.parent_id = t.node_id AND NOT cycle

It is extremely likely that the MATERIALIZED VIEW versions will be fastest: it is worth noting that in a very write-heavy environment (where you still need 100% up-to-date data), there would be an extra cost with the REFRESH MATERIALIZED VIEW.

As for which other ones will be fast (or fast enough), I would expect the CTE and VIEW versions to be roughly equivalent, as they appear to do the same amount of work. I’m not sure if the last three will perform as well as the others, as it seems that a “root” CTE would perform better than one later down the track.

So, let’s get underway. I wasn’t able to easily use the benchmark function I wanted to use, so I repeated each query five times and took the average.

We need our views:

CREATE RECURSIVE VIEW node_tree (node_id, ancestors, cycle) AS (
  SELECT node_id, ARRAY[]::integer[], FALSE FROM node WHERE parent_id IS NULL
  UNION ALL
  SELECT n.node_id, t.ancestors || n.parent_id, n.parent_id = ANY(t.ancestors)
  FROM node n, node_tree t WHERE n.parent_id = t.node_id AND NOT cycle
);

CREATE MATERIALIZED VIEW node_tree_mat AS (SELECT * FROM node_tree);

Results

So, some results. I’ll show the query, and then the timing (from EXPLAIN ANALYZE).

#1

SELECT * FROM node INNER JOIN node_tree USING (node_id) WHERE 1 = ANY(ancestors);

Average Time: 54.9ms (stddev 1.99)

#2

SELECT * FROM node n, node_tree t WHERE n.node_id = t.node_id AND 1 = ANY(ancestors);

Average Time: 57.2ms (stddev 2.98)

#3

SELECT * FROM node WHERE node_id IN
  (SELECT node_id FROM node_tree WHERE 1 = ANY(ancestors));

Average Time: 58.5ms (stddev 3.67)

#4

SELECT * FROM node INNER JOIN node_tree_mat USING (node_id) WHERE 1 = ANY(ancestors);

Average Time: 12.2ms (stddev 0.80)

#5

SELECT * FROM node n, node_tree_mat t WHERE n.node_id = t.node_id AND 1 = ANY(ancestors);

Average Time: 11.7ms (stddev 0.90)

This is the fastest query, but not significantly more so than #4.

#6

SELECT * FROM node WHERE node_id IN
  (SELECT node_id FROM node_tree_mat WHERE 1 = ANY(ancestors));

Average Time: 24.0ms (stddev 0.41)

Interestingly, this is much slower than using a JOIN.

#7

WITH RECURSIVE node_tree_cte(node_id, ancestors, cycle) AS (
    SELECT node_id, ARRAY[]::integer[], FALSE FROM node WHERE parent_id IS NULL
  UNION ALL
    SELECT n.node_id, t.ancestors || n.parent_id, n.parent_id = ANY(t.ancestors)
    FROM node n, node_tree t WHERE n.parent_id = t.node_id AND NOT cycle
) SELECT n.* FROM node n INNER JOIN node_tree_cte t USING (node_id)
WHERE 1 = ANY(t.ancestors);

Average Time: 97.0ms (stddev 1.87)

Immediately, we see that using a CTE has a performance hit over using a VIEW. Unexpected.

#8

WITH RECURSIVE node_tree_cte(node_id, ancestors, cycle) AS (
    SELECT node_id, ARRAY[]::integer[], FALSE FROM node WHERE parent_id IS NULL
  UNION ALL
    SELECT n.node_id, t.ancestors || n.parent_id, n.parent_id = ANY(t.ancestors)
    FROM node n, node_tree t WHERE n.parent_id = t.node_id AND NOT cycle
) SELECT n.* FROM node n, node_tree_cte t
WHERE n.node_id = t.node_id AND 1 = ANY(t.ancestors);

Average Time: 96.1ms (stddev 1.16)

#9

SELECT * FROM node INNER JOIN (
  WITH RECURSIVE node_tree_cte(node_id, ancestors, cycle) AS (
    SELECT node_id, ARRAY[]::integer[], FALSE FROM node WHERE parent_id IS NULL
    UNION ALL
    SELECT n.node_id, t.ancestors || n.parent_id, n.parent_id = ANY(t.ancestors)
    FROM node n, node_tree t WHERE n.parent_id = t.node_id AND NOT cycle
  ) SELECT node_id FROM node_tree_cte WHERE 1 = ANY(ancestors)
) node_tree_cte USING (node_id);

Average Time: 114.3ms (stddev 4.64)

This is the slowest (but again, only just, and not significantly more than #10).

#10

SELECT * FROM node, (
  WITH RECURSIVE node_tree_cte(node_id, ancestors, cycle) AS (
    SELECT node_id, ARRAY[]::integer[], FALSE FROM node WHERE parent_id IS NULL
    UNION ALL
    SELECT n.node_id, t.ancestors || n.parent_id, n.parent_id = ANY(t.ancestors)
    FROM node n, node_tree t WHERE n.parent_id = t.node_id AND NOT cycle
  ) SELECT node_id FROM node_tree_cte WHERE 1 = ANY(ancestors)
) node_tree_cte WHERE node.node_id = node_tree_cte.node_id;

Average Time: 110.0ms (stddev 1.05)

#11

SELECT * FROM node WHERE node_id IN (
  WITH RECURSIVE node_tree_cte(node_id, ancestors, cycle) AS (
    SELECT node_id, ARRAY[]::integer[], FALSE FROM node WHERE parent_id IS NULL
    UNION ALL
    SELECT n.node_id, t.ancestors || n.parent_id, n.parent_id = ANY(t.ancestors)
    FROM node n, node_tree t WHERE n.parent_id = t.node_id AND NOT cycle
  ) SELECT node_id FROM node_tree_cte WHERE 1 = ANY(ancestors)
);

Average Time: 96.1ms (stddev 5.50)

Discussion

So, it appears that Common Table Expressions are nearly twice as slow as using a RECURSIVE VIEW. I didn’t expect that at all, as I thought they were equivalent. Unsurprisingly, MATERIALIZED VIEW is much faster.

This has some implications for the stuff I was working on: I was using a query of the #11 form (a sub-query that is a WITH RECURSIVE statement), which, as it turns out is just as fast as doing a “root” CTE. However, it’s still far slower than doing a JOIN with a VIEW.

The problem I have now is that there is no way to have the addition of a Field to a Model in django to cause an extra migration operation to be added. One solution would be to manually add a RunSQL operation, but that is messy. I’ll also have to investigate costs of REFRESH MATERIALIZED VIEW.

Long Live Adjacency Lists

I recently wrote about the excellent book SQL Antipatterns, and in it briefly discussed the tree structures. I’ve been thinking about trees in Postgres a fair bit lately, and a discussion on #django gave me further incentive to revisit this topic.

The book discusses four methods of storing a tree in a database.

Adjacency Lists, apart from the inability to grab a full or partial tree easily, are the simplest to understand. The child object stores a reference to it’s parent. Because this is a foreign key, then it always maintains referential integrity. Fetching a parent is simple, as is fetching all children, or siblings. It’s only when you need to fetch an arbitrary depth that things become problematic, unless you use a recursive query. More on that later.

Postgres has an extension called ltree, which provides an implementation of a Path Enumeration, but one thing that really bothers me about this type of structure is the lack of referential integrity. In practice, I’m not sure what having this ltree structure would give you over simply storing the keys in an ARRAY type. Indeed, if Postgres ever gets Foreign Key constraints for ARRAY elements (which there is a patch floating around for), this becomes even more compelling. It also seems to me that restructuring a tree becomes a bit more challenging in a Path Enumeration than an Adjacency List.

Nested Sets are also interesting, and maintain FK integrity, but require potentially rewriting lots of data when any change is made to the tree. They aren’t that appealing to me: perhaps I fail to see any big advantages of this structure.

Finally, Closure Tables are perhaps the most interesting. This stores all ancestor-descendant relationships, rather than just parent-child, which again requires more work when adding or removing. Again, Referential Integrity is preserved, but it seems like there is lots of work to maintain them.

From all of these, there are some significant advantages, in my mind, to using a simple Adjacency List.

  1. Adding a new row never requires you to alter any other rows in the database.
  2. Moving a subtree to a different location only requires a change to one now in the database.
  3. It’s never possible to end up with Referential Integrity errors: the database will prevent you from deleting a parent row whilst it still has children (or, you may set it to CASCADE or SET NULL the children automatically).
  4. It’s conceptually very simple. Everyone understands the parent-child relationship (and all of the relationships that follow, like grand-parents). It’s a similar mental model to how we think about our own families, except we don’t have exactly one parent.

There is really only two things that are hard to do:

  1. Given a node, select all descendants of that node.
  2. Given a node, select all ancestors of that node.

But, as we shall see shortly, it is possible to do these in Postgres using some nice recursive features.

There is another advantage to using an Adjacency List, this time from the perspective of Django. We can do it without needing to install a new package, or subclass or mix-in a new Model:

class Node(models.Model):
    node_id = models.AutoField(primary_key=True)
    parent = models.ForeignKey('self', null=True, blank=True, related_name='children')

That’s it.

Now, using Postgres, it’s possible to build a recursive VIEW that contains the whole tree:

CREATE RECURSIVE VIEW tree (node_id, ancestors) AS (
    SELECT node_id, '{}'::integer[]
    FROM nodes WHERE parent_id IS NULL
  UNION ALL
    SELECT n.node_id, t.ancestors || n.parent_id
    FROM nodes n, tree t
    WHERE n.parent_id = t.node_id
);

We can then query this (replacing %s with the parent node id):

SELECT node_id
FROM nodes INNER JOIN tree USING (node_id)
WHERE %s = ANY(ancestors);

Or, if you want to select for multiple parents:

SELECT node_id
FROM nodes INNER JOIN tree USING (node_id)
WHERE [%s, %s] && ancestors;

This actually performs relatively well, and, if it doesn’t do well enough, we could create a MATERIALIZED VIEW based on the recursive view, and query that instead (refreshing it whenever we need to, perhaps using a trigger).

CREATE MATERIALIZED VIEW tree_m AS (SELECT * FROM tree);

CREATE FUNCTION refresh_tree_m() RETURNS trigger AS $$
  BEGIN
  REFRESH MATERIALIZED VIEW tree_m;
  END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trig_refresh_tree_m AFTER TRUNCATE OR INSERT OR UPDATE OR DELETE
ON nodes FOR EACH STATEMENT
EXECUTE PROCEDURE refresh_tree_m();

This view is still not perfect though. We can improve it to allow us to limit depth of ancestry:

CREATE RECURSIVE VIEW tree (node_id, ancestors, depth) AS (
    SELECT node_id, '{}'::integer[], 0
    FROM nodes WHERE parent_id IS NULL
  UNION ALL
    SELECT n.node_id, t.ancestors || n.parent_id, t.depth + 1
    FROM nodes n, tree t
    WHERE n.parent_id = t.node_id
);

SELECT node_id FROM nodes INNER JOIN tree USING (node_id)
WHERE %s = ANY(ancestors) AND depth < %s;

This is pretty good now, but if we have cycles in our tree (yes, this makes it technically no longer a tree, but a graph, of which a tree is a restricted kind), this query will run forever. There’s a pretty neat trick to prevent cycles:

CREATE RECURSIVE VIEW tree (node_id, ancestors, depth, cycle) AS (
    SELECT node_id, '{}'::integer[], 0, FALSE
    FROM nodes WHERE parent_id IS NULL
  UNION ALL
    SELECT
      n.node_id, t.ancestors || n.parent_id, t.depth + 1,
      n.parent_id = ANY(t.ancestors)
    FROM nodes n, tree t
    WHERE n.parent_id = t.node_id
    AND NOT t.cycle
);

You don’t need to use the cycle column outside of the view.

The query used for the view can be repurposed into a Common Table Expression, which is basically a way of defining a view that only exists for the query we are executing (but will itself only be executed once, even if it’s referred to lots of times):

WITH RECURSIVE tree (node_id, ancestors, depth, cycle) AS (
    SELECT node_id, '{}'::integer[], 0, FALSE
    FROM nodes WHERE parent_id IS NULL
  UNION ALL
    SELECT
      n.node_id, t.ancestors || n.parent_id, t.depth + 1,
      n.parent_id = ANY(t.ancestors)
    FROM nodes n, tree t
    WHERE n.parent_id = t.node_id
    AND NOT t.cycle
) SELECT n.* FROM nodes n INNER JOIN tree USING (node_id)
WHERE %s = ANY(ancestors);

You can see that this syntax basically defines the view before running the real query.


Looking at it from the perspective of Django, we would like to be able to spell a query something like:

Node.objects.filter(parent__recursive=node)
Node.objects.filter(parent__recursive__in=nodes)
Node.objects.filter(children__recursive__contains=node)

The problem we have with using the CTE immediately above is that we don’t have access to the full query at the time we are dealing with the filter. We could define the view prior to running the query (perhaps in a migration), but this means it’s more than just a simple field: although with the new migrations framework, we could make it so that makemigrations automatically adds a migration operation to create the recursive view.

The other solution is to still use a recursive CTE, but use it as a subquery. I’m still investigating if this will have poor performance characteristics.

Here is an implementation of doing just that:

from django.db import models

SQL = """
WITH RECURSIVE "tree" ("{pk}", "related", "cycle") AS (
    SELECT "{pk}", ARRAY[]::integer[], FALSE
    FROM "{table}" WHERE "{fk}" IS NULL
  UNION ALL
    SELECT a."{pk}", b."related" || a."{fk}", a."{fk}" = ANY(b."related")
    FROM "tree" b, "{table}" a
    WHERE a."{fk}" = b."{pk}" AND NOT b."cycle"
) {query}
"""


class RecursiveRelation(models.ForeignKey):
    def __init__(self, *args, **kwargs):
        super(RecursiveRelation, self).__init__('self', *args, **kwargs)

    def get_lookup_constraint(self, constraint_class, alias, targets, sources, lookups,
                              raw_value):
        if lookups[0] == 'recursive':
            # With a recursive query, we want to build up a subquery that creates
            # the simplest possible tree we can deal with.
            data = {
                'fk': self.get_attname(),
                'pk': self.related_fields[0][1].get_attname(),
                'table': self.model._meta.db_table
            }
            if lookups[-1] == 'in':
                if targets[0] == self:
                    raw_value = ForeignKeyRecursiveInLookup(raw_value, **data)
                else:
                    raw_value = ForeignKeyRecursiveReverseInLookup(raw_value, **data)
            else:
                if targets[0] == self:
                    raw_value = ForeignKeyRecursiveLookup(raw_value, **data)
                else:
                    raw_value = ForeignKeyRecursiveReverseLookup(raw_value, **data)

            # Rewrite some variables so we get correct behaviour.

            # This makes the query based on the original table, not the joined version,
            # which was skipping a level of relation. It still joins the table, however,
            # which can't be great for performance
            alias = self.model._meta.db_table
            # This sets the correct lookup type, removing the recursive bit.
            lookups = lookups[1:] or ['exact']

        return super(RecursiveRelation, self).get_lookup_constraint(
            constraint_class, alias, targets, sources, lookups, raw_value
        )


class ForeignKeyRecursiveLookup(object):
    query = 'SELECT "{pk}" FROM "tree" WHERE %s = ANY("related")'

    def __init__(self, value, **kwargs):
        self.value = value
        self.data = kwargs

    def get_compiler(self, *args, **kwargs):
        return self

    def as_subquery_condition(self, alias, columns, qn):
        sql = SQL.format(
            query=self.query.format(**self.data),
            **self.data
        )
        return '%s.%s IN (%s)' % (qn(alias), qn(self.data['pk']), sql), [self.value]


class ForeignKeyRecursiveInLookup(ForeignKeyRecursiveLookup):
    query = 'SELECT "{pk}" FROM "tree" WHERE %s && "related"'


class ForeignKeyRecursiveReverseLookup(ForeignKeyRecursiveLookup):
    query = 'SELECT unnest("related") FROM "tree" WHERE "{pk}" = %s'


class ForeignKeyRecursiveReverseInLookup(ForeignKeyRecursiveLookup):
    query = 'SELECT unnest("related") FROM "tree" WHERE "{pk}" IN %s'

If we were to use an existing view (created using a migration), then the structure would be largely the same: simply the SQL constant would be simpler:

SQL = 'SELECT {pk} FROM "{table}_{fk}_tree" WHERE {where}'

But then we would need some sort of name mangling for the view: I’ve suggested <tablename>_<fk-name>-tree.

I went into this exercise thinking it would be simple: just write a Lookup (or Transform), but it seems that Foreign Keys in django have a fair bit of special casing. There’s also a bit of lax code around the names of lookups: I may polish it up at some stage.

For now, though, you use it as:

class Node(models.Model):
    node_id = models.AutoField(primary_key=True)
    parent = RecursiveRelation(null=True, blank=True, related_name='children')