Postgres Tree Shootout part 2: Adjacency List using CTEs

2014-11-27 @ 00:06:43

here.

This is the second post in an ongoing series dealing with storing Hierarchical or Tree data structures in Postgres. You should read the Introduction if you haven’t already.

This post contains the queries that illustrate how an Adjacency List model can be used to represent a Hierarchical set of data, including data definitions, and the various operations that have been defined in the aforementioned introduction.

I’ve discussed Adjacency Lists in the past, but I’ll quickly recap why I think they are good.

They are conceptually simple to understand
They enforce referential integrity
They can be modelled with most ORMs without any extra infrastructure
Many of the operations are non-complex
Recursive queries allow us to perform the complex queries in reasonable time

To help build suspense (but more because I haven’t yet come up with a way to generate a nice reproducible, yet complex tree), this post may discuss the complexity of the queries, but will not contain any measurements.

Initial tree

Before each operation, our data will look like this (where parents point to children):

We will assume reversion back to this structure after each operation: we could do this using a TRUNCATE followed by an INSERT; or we could run the operation in a transaction and rollback.

There may be a post which shows the effects of each of the queries below in a graphical form.

Table structure

Adjacency Lists are dead simple. Each node simply contains a reference to it’s parent node.

CREATE TABLE nodes (
  node_id SERIAL PRIMARY KEY,
  parent_id INTEGER REFERENCES nodes(node_id)
);

We can insert our data using a single statement:

INSERT INTO nodes VALUES
  (1, NULL),
  (2, 1),
  (3, 1),
  (4, 2),
  (5, 2),
  (6, 3),
  (7, 3),
  (8, 4),
  (9, 8),
  (10, NULL),
  (11, 10),
  (12, 11),
  (13, 11),
  (14, 12),
  (15, 12),
  (16, 12);

Insertions

Inserting a single leaf node is simple. To insert a single node as a child of node 13, for example:

INSERT INTO nodes (parent_id) VALUES (13);

Inserting a new root node is slightly more complicated. In most ORMs, we would probably do it in two queries: one to create the new node, and a second to update the other root nodes to point to this one. We’ll see that below.

We can do slightly better in raw SQL: UPDATE our table with the result of an INSERT ... RETURNING that occurs inside a Common Table Expression (CTE, or WITH query).

WITH parent AS (
  INSERT INTO nodes(parent_id)
  VALUES (NULL)
  RETURNING node_id
)
UPDATE nodes
SET parent_id = parent.node_id
FROM parent
WHERE parent_id IS NULL;

We can use the same pattern to insert an intermediate node in our tree. For instance, inserting a new node between nodes 11 and 12:

WITH created_node AS (
  INSERT INTO nodes(parent_id)
  VALUES (11)
  RETURNING node_id
)
UPDATE nodes
SET parent_id = created_node.node_id
FROM created_node
WHERE nodes.node_id = 12;

And our last insert, adding a child node, that gets it’s siblings as children. For instance, adding a new node under node 12, which gets all of node 12’s children as it’s children.

WITH created_node AS (
  INSERT INTO nodes(parent_id)
  VALUES (12)
  RETURNING node_id
)
UPDATE nodes
SET parent_id = created_node.node_id
FROM created_node
WHERE nodes.parent_id = 12;

All of these queries should perform relatively well: the CTE will be very simple (as it is no different to the single leaf insert), and the UPDATE should likewise be fairly simple: it needs to filter out which existing nodes do not need to be updated; and then it needs to update the remainder of the rows with the value pulled from the CTE.

This theoretically is only marginally more complex than just a simple UPDATE foo SET bar=baz WHERE quux IS NULL style query.

If we were using an ORM, we might need to do this in two queries: something like this (in Django):

# Insert new root node: all other root nodes now have this as a parent
new_node = Node.objects.create()
Node.objects.filter(parent=None).exclude(pk=new_node.pk).update(parent=new_node)
# Could possibly do as:
Node.objects.filter(parent=None).update(parent=Node.objects.create().pk)

# Insert new node, with a single child as it's child (and that child's previous
# parent as it's parent)
new_node = Node.objects.create(parent=old_node.parent)
old_node.parent = new_node
old_node.save()

# Insert new node, with children of that node's parent now children of the node.
new_node = parent_node.children.create()
parent_node.children.exclude(pk=new_node.pk).update(parent=new_node)
# Again, may be able to do:
parent_node.children.update(parent=parent_node.children.create().pk)

Note the required exclusion of the newly created node: we don’t have to do this in the CTE versions, as that doesn’t “exist” at the time the other part of the query runs.

Removals

Removing a single leaf node is no different than removing a row from a normal table. Removing node 9, for instance:

DELETE FROM nodes WHERE node_id = 9;

Because the information about parent-child relationships is stored in the child, we do not need to do anything else to maintain the tree.

To remove a single root node (in this case, node 1), and promote all children to root nodes themselves, we can do two queries:

UPDATE nodes SET parent_id = NULL WHERE parent_id = 1;
DELETE FROM nodes WHERE node_id = 1;

It may be possible to do this in a single query, similar to the CTE queries above, but I’m not sure of the benefit.

WITH deleted AS (
  DELETE FROM nodes
  WHERE node_id = 1
)
UPDATE nodes SET parent_id = NULL WHERE parent_id = 1;

The same pattern can be used for removing a node, and attaching it’s children to it’s parent. Here, we will remove node 2, and attach it’s children (4 and 5) as children of it’s parent, node 1:

UPDATE nodes
SET parent_id = (SELECT parent_id FROM nodes WHERE node_id = 2)
WHERE parent_id = 2;

DELETE from nodes WHERE node_id = 2;

This is a place where using a CTE might make things clearer - especially if we have the node-to-be-deleted’s id, but not it’s parent:

WITH deleted AS (
  DELETE FROM nodes
  WHERE node_id = 2
  RETURNING node_id, parent_id
)
UPDATE nodes
SET parent_id = deleted.parent_id
FROM deleted
WHERE nodes.parent_id = deleted.node_id;

Righto, now we are up to the traditionally “hard” things for an Adjacency List to perform. Dealing with removing an arbitrary depth of (sub)tree.

We’ll need to create a recursive CTE, and delete according to that. Let’s just select from that initially, so we can see what the CTE data will look like:

WITH RECURSIVE tree AS (
  SELECT node_id, ARRAY[]::integer[] AS ancestors
  FROM nodes WHERE parent_id IS NULL

  UNION ALL

  SELECT nodes.node_id, tree.ancestors || nodes.parent_id
  FROM nodes, tree
  WHERE nodes.parent_id = tree.node_id
)
SELECT * FROM tree;

 node_id | ancestors
---------+------------
       1 | {}
      10 | {}
       2 | {1}
       3 | {1}
      11 | {10}
       4 | {1,2}
       5 | {1,2}
       6 | {1,3}
       7 | {1,3}
      12 | {10,11}
      13 | {10,11}
       8 | {1,2,4}
      14 | {10,11,12}
      15 | {10,11,12}
      16 | {10,11,12}
       9 | {1,2,4,8}
(16 rows)

Coolio. So, on to our operations. Let’s remove the subtree starting with node 2. I’ll hide the CTE, since it will be the same for quite a few of these operations:

WITH RECURSIVE tree AS (...)
DELETE FROM nodes
WHERE node_id IN (
  SELECT node_id FROM tree WHERE 2 = ANY(tree.ancestors)
) OR node_id = 2;

The query is identical for a full tree (node 1 and descendants):

WITH RECURSIVE tree AS (...)
DELETE FROM nodes
WHERE node_id IN (
  SELECT node_id FROM tree WHERE 1 = ANY(tree.ancestors)
) OR node_id = 1;

And it’s nearly identical for just the descendants of a given node. Here, for all of node 2’s descendants, but not that node itself:

WITH RECURSIVE tree AS (...)
DELETE FROM nodes
WHERE node_id IN (
  SELECT node_id FROM tree WHERE 2 = ANY(tree.ancestors)
);

Moves

Because the relationship is stored purely on the child element, moving around trees and subtrees is very easy. We can start with moving subtree starting with 3 to underneath node 4:

UPDATE nodes
SET parent_id = 4 WHERE node_id = 3;

Nothing surprising there. Similarly, the query is identical for moving a leaf to a different parent, a root node into a tree, and turning a subtree into a full tree (making that node a root node).

UPDATE nodes SET parent_id = 6 WHERE node_id = 5;
UPDATE nodes SET parent_id = 8 WHERE node_id = 10;
UPDATE nodes SET parent_id = NULL WHERE node_id = 2;

The final move: all of node’s children to a different node is almost as simple:

UPDATE nodes SET parent_id = 5 WHERE parent_id = 12;

This seems to be a situation where Adjacency Lists are really good. None of these queries are any more complex than the simplest UPDATE you could think of.

Fetches

Using the same CTE, we can perform our fetches. We may need to extend it to deal with depths, but since the ancestors column contains ancestors starting with the root node, we could count stuff in there. Let’s see how it goes.

Descendants

Fetching all descendants of a given node just means we want to see if the node occurs at all in each row’s ancestors. To get all of node 10’s descendants:

WITH RECURSIVE tree AS (
  SELECT node_id, ARRAY[]::integer[] AS ancestors
  FROM nodes WHERE parent_id IS NULL

  UNION ALL

  SELECT nodes.node_id, tree.ancestors || nodes.parent_id
  FROM nodes, tree
  WHERE nodes.parent_id = tree.node_id
)
SELECT node_id FROM tree WHERE 10 = ANY(tree.ancestors);

However, we could improve this by starting with just the node we care about, or more specifically, it’s children:

WITH RECURSIVE tree AS (
  SELECT node_id, ARRAY[10]::integer[] AS ancestors FROM nodes WHERE parent_id = 10

  UNION ALL

  SELECT nodes.node_id, tree.ancestors || nodes.parent_id
  FROM nodes, tree
  WHERE nodes.parent_id = tree.node_id
)
SELECT node_id FROM tree;

Obviously, this is far less generic, but it is also significantly less complex. For starters, it only builds up the part of the tree we care about, and then just returns the node ids, rather than building up the whole tree, and then discarding the parts that are not required.

The same code can be used for determining the number of descendants, but with a COUNT(node_id) in the final query.

To get our depth-limited query, we can approach from two directions. To get the subtree to depth 2 from above:

WITH RECURSIVE tree AS (
  SELECT node_id, ARRAY[10]::integer[] AS ancestors FROM nodes WHERE parent_id = 10

  UNION ALL

  SELECT nodes.node_id, tree.ancestors || nodes.parent_id
  FROM nodes, tree
  WHERE nodes.parent_id = tree.node_id
  AND cardinality(tree.ancestors) < 2
)
SELECT node_id FROM tree;

To do the same in the more generic form have to look at how close the desired node is to the end of the ancestors array:

WITH RECURSIVE tree AS (
  SELECT node_id, ARRAY[]::integer[] AS ancestors
  FROM nodes WHERE parent_id IS NULL

  UNION ALL

  SELECT nodes.node_id, tree.ancestors || nodes.parent_id
  FROM nodes, tree
  WHERE nodes.parent_id = tree.node_id
)
SELECT node_id
FROM tree
WHERE ARRAY_POSITION(ancestors, 10) < ARRAY_LENGTH(ancestors, 1) - 2;

(Note that this is a bug fix on the original version of this post).

Ancestors

Fetching ancestors from our generic CTE is a bit simpler, because that data is already part of the query:

WITH RECURSIVE tree AS (
  SELECT node_id, ARRAY[]::integer[] AS ancestors
  FROM nodes WHERE parent_id IS NULL

  UNION ALL

  SELECT nodes.node_id, tree.ancestors || nodes.parent_id
  FROM nodes, tree
  WHERE nodes.parent_id = tree.node_id
) SELECT unnest(ancestors) FROM tree WHERE node_id = 15;

To do the equivalent of the hand-built CTE, we would need to start with the node, and build back the other way. It’s getting late here, so I can’t think of a way to do this right now that doesn’t get stuck doing infinite recursion.

The count query is an interesting one: we can just remove the need to unnest, and take the cardinality:

WITH RECURSIVE tree AS (
  SELECT node_id, ARRAY[]::integer[] AS ancestors
  FROM nodes WHERE parent_id IS NULL

  UNION ALL

  SELECT nodes.node_id, tree.ancestors || nodes.parent_id
  FROM nodes, tree
  WHERE nodes.parent_id = tree.node_id
) SELECT cardinality(ancestors) FROM tree WHERE node_id = 15;

The depth query is a little trickier. We want to know the ancestors of node 15, up to a depth of 2. If our ancestors array was in the reverse order, we should be able to unnest and limit.

WITH RECURSIVE tree AS (
  SELECT node_id, ARRAY[]::integer[] AS ancestors
  FROM nodes WHERE parent_id IS NULL

  UNION ALL

  SELECT nodes.node_id, nodes.parent_id || tree.ancestors
  FROM nodes, tree
  WHERE nodes.parent_id = tree.node_id
) SELECT unnest(ancestors) FROM tree WHERE node_id = 15 LIMIT 2;

We can do this because a node only has one parent: so limiting the number of ancestors (when sorted nearest ancestor first) is the same as limiting the depth.

Special queries

Fetching all leaf nodes is just a matter of excluding those that have a relationship to another node as it’s parent:

SELECT node_id FROM nodes
WHERE node_id NOT IN (
  SELECT parent_id FROM nodes WHERE parent_id IS NOT NULL
);

Trick for new players: if you leave off the WHERE clause in that sub-query, you won’t get any matches!

Fetching the number of leaf nodes is trivial.

Fetching root nodes (or the number of) is simpler than leaf nodes:

SELECT node_id FROM nodes
WHERE parent_id IS NULL;

Fetching non-leaf nodes, and non-root nodes is just negations of the two queries above:

SELECT node_id FROM nodes WHERE node_id IN (
  SELECT parent_id FROM nodes WHERE parent_id IS NOT NULL
);

SELECT node_id FROM nodes WHERE parent_id IS NOT NULL;

And the non-leaf, non-root nodes just combines these two queries:

SELECT node_id FROM nodes WHERE node_id IN (
  SELECT parent_id FROM nodes WHERE parent_id IS NOT NULL
) AND parent_id IS NOT NULL;

As an aside: there is also the inverse of this: nodes which are isolated (root AND leaf nodes):

SELECT node_id FROM nodes
WHERE parent_id IS NULL
AND node_id NOT IN (
  SELECT parent_id FROM nodes WHERE parent_id IS NOT NULL
);

Well, that’s our operations. Most of them are really simple. For anything that requires us to fetch or delete a whole subtree, we needed to revert to recursive CTEs, and for some of the other operations, using a CTE makes it simpler (and easier to understand).

Next, we will look at an alternative to the CTE operations, using a recursive view. From there, we should be able to look at a trigger-based approach that materializes our tree (node, ancestors) data in a table, and keeps it up to date. That’s, as I hinted, getting close to a Materialized Path approach, but keeps the conceptual simplicity of the Adjacency List, and hopefully prevents possible issues relating to referential integrity.