Graphs in Django and Postgres
-
Comments:
- here.
I have written a bunch of posts about dealing with trees in Postgres and Django, and Funkybob used some of this to start the package django-closure-view.
Today, someone was looking for similar functionality, but for a graph. Specifically, a Directed Acyclic Graph. Now, not every graph, or even every DAG is a tree, but every tree is a DAG.
So, the difference between a tree and a graph in this context is that a given node may have an arbitrary number of parents. But, and this is worth noting now, none of it’s parents may also be dependencies.
The first part of this tells us that we can no longer just use a simple self-relation in our model to store the relationship: because there could be multiple parents. Instead, we will need to have a many-to-many relation to store that.
from django.db import models
class Node(models.Model):
node_id = models.AutoField(primary_key=True)
name = models.TextField(unique=True)
parents = models.ManyToManyField(
'self',
related_name='children',
symmetrical=False,
)
We can put some meaningful data into this graph to make it a little more obvious if our queries are sane:
django, pytz, sqlparse, asgiref = Node.objects.bulk_create([
Node(name='django'),
Node(name='pytz'),
Node(name='sqlparse'),
Node(name='asgiref'),
])
django.parents.add(pytz, sqlparse, asgiref)
graph_demo, psycopg2 = Node.objects.bulk_create([
Node(name='graph_demo'),
Node(name='psycopg2')
])
graph_demo.parents.add(psycopg2, django)
Let’s have a bit of a look at some of the queries we might need to think about.
-- All root nodes
SELECT node_id, name
FROM graph_node
LEFT OUTER JOIN graph_node_parents ON (node_id = from_node_id)
WHERE to_node_id IS NULL;
As expected, this gives us back all packages that have no dependencies (parents):
node_id │ name
─────────┼──────────
6 │ psycopg2
2 │ pytz
4 │ asgiref
3 │ sqlparse
(4 rows)
And now, all packages which are not depended upon by any other packages (no parents):
SELECT node_id, name
FROM graph_node
LEFT OUTER JOIN graph_node_parents ON (node_id = to_node_id)
WHERE from_node_id IS NULL;
We should only have one package here: graph_demo
.
From each of these, we can build up a recursive query to get all descendants, or all ancestors of each root/leaf node.
WITH RECURSIVE ancestors AS (
SELECT node_id, '{}'::INTEGER[] AS ancestors
FROM graph_node
LEFT OUTER JOIN graph_node_parents ON (node_id = from_node_id)
WHERE to_node_id IS NULL
UNION
SELECT node.from_node_id,
ancestors.ancestors || ancestors.node_id
FROM ancestors
INNER JOIN graph_node_parents node
ON (ancestors.node_id = to_node_id)
) SELECT * FROM ancestors;
From here, we can annotate on the names to double check:
WITH RECURSIVE ancestors AS (
SELECT node_id, '{}'::INTEGER[] AS ancestors
FROM graph_node
LEFT OUTER JOIN graph_node_parents ON (node_id = from_node_id)
WHERE to_node_id IS NULL
UNION
SELECT node.from_node_id,
ancestors.ancestors || ancestors.node_id
FROM ancestors
INNER JOIN graph_node_parents node
ON (ancestors.node_id = to_node_id)
)
SELECT node_id,
node.name,
ancestors,
ARRAY(SELECT name
FROM unnest(ancestors) node_id
INNER JOIN graph_node USING (node_id)
) AS ancestor_names
FROM ancestors
INNER JOIN graph_node node USING (node_id);
So that has given us all ancestor chains: but what about if we just want the closure table: all ancestor/descendant pairs?
WITH RECURSIVE closure_table AS (
SELECT from_node_id AS descendant,
to_node_id AS ancestor
FROM graph_node_parents
UNION
SELECT descendant,
to_node_id AS ancestor
FROM closure_table
INNER JOIN graph_node_parents ON (from_node_id = ancestor)
)
SELECT * FROM closure_table
Okay, that was even easier than the previous query.
Once we have our closure table query, then we can look at preventing cycles.
CREATE OR REPLACE RECURSIVE VIEW
graph_closure_table (descendant, ancestor) AS (
SELECT from_node_id AS descendant,
to_node_id AS ancestor
FROM graph_node_parents
UNION
SELECT descendant,
to_node_id AS ancestor
FROM graph_closure_table
INNER JOIN graph_node_parents ON (from_node_id = ancestor)
);
And we can now use this in a function to prevent cycles
CREATE OR REPLACE FUNCTION prevent_cycles()
RETURNS TRIGGER AS $$
BEGIN
IF EXISTS(SELECT 1
FROM graph_closure_table
WHERE ancestor = NEW.from_node_id
AND descendant = NEW.to_node_id
) THEN
RAISE EXCEPTION 'cycle detected';
END IF;
RETURN NEW;
END;
$$ LANGUAGE plpgsql STRICT;
CREATE TRIGGER prevent_cycles
BEFORE UPDATE OR INSERT ON graph_node_parents
FOR EACH ROW EXECUTE PROCEDURE prevent_cycles();
And this will prevent us from being able to set an invalid dependency relationship: ie, one that would trigger a cycle:
>>> django.parents.add(graph_demo)
Traceback (most recent call last):
File "...django/db/backends/utils.py", line 84, in _execute
return self.cursor.execute(sql, params)
psycopg2.errors.RaiseException: cycle detected
CONTEXT: PL/pgSQL function prevent_cycles() line 9 at RAISE
It’s not totally ideal, but it does show how it protects against saving invalid relationships.
Interestingly, if we drop that constraint, we can still run the closure table query: it doesn’t give us an infinite loop, because the view uses a UNION
instead of a UNION ALL
: it’s going to drop any rows that are already in the output when it deals with each row - and since there are not an infinite number of combinations for a given set of dependencies, it will eventually return data.
So, where from here? I’m not sure. This was just something that I thought about while answering a question in IRC, and I felt like I needed to explore the idea.