Performance testing Adjancency List recursive queries

Yesterday, I wrote up some ideas about doing recursive queries on Adjacency Lists using Postgres. Today, I wrote up some code that allows me to run some tests on larger data sets. It’s worth noting that this is still somewhat “toy” data, but I did see comparable results with a real query.

Firstly, our data structure:

CREATE TABLE node (
  node_id SERIAL PRIMARY KEY,
  parent_id INTEGER,
  FOREIGN KEY (parent_id) REFERENCES node(node_id)
);

Now, we want to be able to populate it with test data. This function will allow you to populate any number of records, with a 10% chance that any given record will be a root (have no parent). If it has a parent, it will be randomly selected from all existing rows. This means that earlier rows have a much higher chance of being a parent, and the first row is overwhemingly likely to have the most descendants (as it has a 90% chance that row 2 will have it as a parent, and therefore any descendants of that will also be descendants of row 1…)

CREATE OR REPLACE FUNCTION populate_nodes(count integer) RETURNS void AS $$
BEGIN
  FOR i IN 2..count LOOP
    IF ((SELECT count(*) FROM node) = 0) or (random() < 0.1) THEN
      INSERT INTO node (parent_id) SELECT NULL;
    ELSE
      INSERT INTO node (parent_id) SELECT node_id FROM node OFFSET random() * (SELECT count(*) FROM node) LIMIT 1;
    END IF;
  END LOOP;
END;
$$ LANGUAGE plpgsql;

-- Let's stick 10k records in there
SELECT populate_nodes(10000);

For now, we want to find all descendants of node 1.

I can think of eleven ways we could write this query:

  1. INNER JOIN with a RECURSIVE VIEW
  2. Implicit CROSS JOIN with RECURSIVE VIEW, filtered using a WHERE clause
  3. Sub-query with a RECURSIVE VIEW
  4. INNER JOIN with a MATERIALIZED VIEW based on the RECURSIVE VIEW
  5. Implicit CROSS JOIN (filtered) with MATERIALIZED VIEW based on RECURSIVE VIEW
  6. Sub-query with MATERIALIZED VIEW based on RECURSIVE VIEW
  7. RECURSIVE CTE, using an INNER JOIN
  8. RECURSIVE CTE, using an implicit CROSS JOIN (filtered)
  9. INNER JOIN with RECURSIVE CTE
  10. Implicit CROSS JOIN (filtered) with RECURSIVE CTE
  11. Subquery that is a RECURSIVE CTE

(Whilst some of these seem similar, we’ll see below how they differ).

In all cases, the actual query used for the tree will be:

SELECT node_id, ARRAY[]::integer[], FALSE FROM node WHERE parent_id IS NULL
UNION ALL
SELECT n.node_id, t.ancestors || n.parent_id, n.parent_id = ANY(t.ancestors)
FROM node n, node_tree t WHERE n.parent_id = t.node_id AND NOT cycle

It is extremely likely that the MATERIALIZED VIEW versions will be fastest: it is worth noting that in a very write-heavy environment (where you still need 100% up-to-date data), there would be an extra cost with the REFRESH MATERIALIZED VIEW.

As for which other ones will be fast (or fast enough), I would expect the CTE and VIEW versions to be roughly equivalent, as they appear to do the same amount of work. I’m not sure if the last three will perform as well as the others, as it seems that a “root” CTE would perform better than one later down the track.

So, let’s get underway. I wasn’t able to easily use the benchmark function I wanted to use, so I repeated each query five times and took the average.

We need our views:

CREATE RECURSIVE VIEW node_tree (node_id, ancestors, cycle) AS (
  SELECT node_id, ARRAY[]::integer[], FALSE FROM node WHERE parent_id IS NULL
  UNION ALL
  SELECT n.node_id, t.ancestors || n.parent_id, n.parent_id = ANY(t.ancestors)
  FROM node n, node_tree t WHERE n.parent_id = t.node_id AND NOT cycle
);

CREATE MATERIALIZED VIEW node_tree_mat AS (SELECT * FROM node_tree);

Results

So, some results. I’ll show the query, and then the timing (from EXPLAIN ANALYZE).

#1

SELECT * FROM node INNER JOIN node_tree USING (node_id) WHERE 1 = ANY(ancestors);

Average Time: 54.9ms (stddev 1.99)

#2

SELECT * FROM node n, node_tree t WHERE n.node_id = t.node_id AND 1 = ANY(ancestors);

Average Time: 57.2ms (stddev 2.98)

#3

SELECT * FROM node WHERE node_id IN
  (SELECT node_id FROM node_tree WHERE 1 = ANY(ancestors));

Average Time: 58.5ms (stddev 3.67)

#4

SELECT * FROM node INNER JOIN node_tree_mat USING (node_id) WHERE 1 = ANY(ancestors);

Average Time: 12.2ms (stddev 0.80)

#5

SELECT * FROM node n, node_tree_mat t WHERE n.node_id = t.node_id AND 1 = ANY(ancestors);

Average Time: 11.7ms (stddev 0.90)

This is the fastest query, but not significantly more so than #4.

#6

SELECT * FROM node WHERE node_id IN
  (SELECT node_id FROM node_tree_mat WHERE 1 = ANY(ancestors));

Average Time: 24.0ms (stddev 0.41)

Interestingly, this is much slower than using a JOIN.

#7

WITH RECURSIVE node_tree_cte(node_id, ancestors, cycle) AS (
    SELECT node_id, ARRAY[]::integer[], FALSE FROM node WHERE parent_id IS NULL
  UNION ALL
    SELECT n.node_id, t.ancestors || n.parent_id, n.parent_id = ANY(t.ancestors)
    FROM node n, node_tree t WHERE n.parent_id = t.node_id AND NOT cycle
) SELECT n.* FROM node n INNER JOIN node_tree_cte t USING (node_id)
WHERE 1 = ANY(t.ancestors);

Average Time: 97.0ms (stddev 1.87)

Immediately, we see that using a CTE has a performance hit over using a VIEW. Unexpected.

#8

WITH RECURSIVE node_tree_cte(node_id, ancestors, cycle) AS (
    SELECT node_id, ARRAY[]::integer[], FALSE FROM node WHERE parent_id IS NULL
  UNION ALL
    SELECT n.node_id, t.ancestors || n.parent_id, n.parent_id = ANY(t.ancestors)
    FROM node n, node_tree t WHERE n.parent_id = t.node_id AND NOT cycle
) SELECT n.* FROM node n, node_tree_cte t
WHERE n.node_id = t.node_id AND 1 = ANY(t.ancestors);

Average Time: 96.1ms (stddev 1.16)

#9

SELECT * FROM node INNER JOIN (
  WITH RECURSIVE node_tree_cte(node_id, ancestors, cycle) AS (
    SELECT node_id, ARRAY[]::integer[], FALSE FROM node WHERE parent_id IS NULL
    UNION ALL
    SELECT n.node_id, t.ancestors || n.parent_id, n.parent_id = ANY(t.ancestors)
    FROM node n, node_tree t WHERE n.parent_id = t.node_id AND NOT cycle
  ) SELECT node_id FROM node_tree_cte WHERE 1 = ANY(ancestors)
) node_tree_cte USING (node_id);

Average Time: 114.3ms (stddev 4.64)

This is the slowest (but again, only just, and not significantly more than #10).

#10

SELECT * FROM node, (
  WITH RECURSIVE node_tree_cte(node_id, ancestors, cycle) AS (
    SELECT node_id, ARRAY[]::integer[], FALSE FROM node WHERE parent_id IS NULL
    UNION ALL
    SELECT n.node_id, t.ancestors || n.parent_id, n.parent_id = ANY(t.ancestors)
    FROM node n, node_tree t WHERE n.parent_id = t.node_id AND NOT cycle
  ) SELECT node_id FROM node_tree_cte WHERE 1 = ANY(ancestors)
) node_tree_cte WHERE node.node_id = node_tree_cte.node_id;

Average Time: 110.0ms (stddev 1.05)

#11

SELECT * FROM node WHERE node_id IN (
  WITH RECURSIVE node_tree_cte(node_id, ancestors, cycle) AS (
    SELECT node_id, ARRAY[]::integer[], FALSE FROM node WHERE parent_id IS NULL
    UNION ALL
    SELECT n.node_id, t.ancestors || n.parent_id, n.parent_id = ANY(t.ancestors)
    FROM node n, node_tree t WHERE n.parent_id = t.node_id AND NOT cycle
  ) SELECT node_id FROM node_tree_cte WHERE 1 = ANY(ancestors)
);

Average Time: 96.1ms (stddev 5.50)

Discussion

So, it appears that Common Table Expressions are nearly twice as slow as using a RECURSIVE VIEW. I didn’t expect that at all, as I thought they were equivalent. Unsurprisingly, MATERIALIZED VIEW is much faster.

This has some implications for the stuff I was working on: I was using a query of the #11 form (a sub-query that is a WITH RECURSIVE statement), which, as it turns out is just as fast as doing a “root” CTE. However, it’s still far slower than doing a JOIN with a VIEW.

The problem I have now is that there is no way to have the addition of a Field to a Model in django to cause an extra migration operation to be added. One solution would be to manually add a RunSQL operation, but that is messy. I’ll also have to investigate costs of REFRESH MATERIALIZED VIEW.