Django Trees via Closure View

After writing up a method of using a Postgres View that generates a materialised path within the context of a Django model, I came across some queries of my data that were getting rather troublesome to write. It occurred to me that having a closure table would be useful. Specifically, I needed all of the descendants of a given set of nodes.

I couldn’t find an existing Postgres extension that will manage the closure table, and didn’t feel like writing my own implementation using triggers just yet. However, it occurred to me that I could use a similar trick to the recursive materialised path view. Thus, we have a Closure View.

We will start with the Django models:

from django.contrib.postgres.fields import ArrayField

class Node(models.Model):
    node_id = models.AutoField(primary_key=True)
    parent = models.ForeignKey('tree.Node', related_name='children', null=True, blank=True)

    descendants = models.ManyToManyField('tree.Node', related_name='ancestors', through='tree.Closure')

    class Meta:
        app_label = 'tree'


class Closure(models.Model):
    path = ArrayField(base_field=models.IntegerField(), primary_key=True)
    ancestor = models.ForeignKey('tree.Node', related_name='+')
    descendant = models.ForeignKey('tree.Node', related_name='+')
    depth = models.IntegerField()

    class Meta:
        app_label = 'tree'
        managed = False

You may notice I have a path column. I’m using this for the primary key, and it may turn out to be useful later.

Let’s have a look at the View:

CREATE RECURSIVE VIEW tree_closure(path, ancestor_id, descendant_id, depth) AS

SELECT ARRAY[node_id], node_id, node_id, 0 FROM tree_node

UNION ALL

SELECT parent_id || path, parent_id, descendant_id, depth + 1
FROM tree_node INNER JOIN tree_closure ON (ancestor_id = node_id)
WHERE parent_id IS NOT NULL;

This uses a recursive query. The first part builds the self-referencing relations (every node is its own ancestor, at depth zero), and the second part is the recursive term: for each relation collected so far, it adds the ancestor’s parent as an ancestor of the same descendant, one level deeper, repeating until it runs out of parents.

Now, because we are using the in-built Django Many to Many features, we have some nice queries ready to go:

  • node.ancestors.all() : All ancestors of a given Node instance.
  • node.descendants.all() : All descendants of a given Node instance.
  • Node.objects.filter(ancestors=queryset) : All descendants of all nodes in a queryset.
  • Node.objects.filter(descendants=queryset) : All ancestors of all nodes in a queryset.

Of particular note are the bottom two: these are rather cumbersome to write in older versions of Django.
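
Because the Closure model is a real (if unmanaged) model, we can also query it directly, which gives us depth-limited variants of those queries almost for free. A sketch (the two-level limit is just an example):

roots = Node.objects.filter(parent=None)

# Descendants of any root node, no more than two levels down.
nearby = Node.objects.filter(
    node_id__in=Closure.objects.filter(
        ancestor__in=roots,
        depth__lte=2,
    ).values('descendant')
)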

Adjacency Lists in Django with Postgres

Today, I’m going to walk through modelling a tree in Django, using an Adjacency List, and a Postgres View that dynamically creates the materialised path of ancestors for each node.

With this, we will be able to query the tree for a range of operations using the Django ORM.

We will start with our model:

class Node(models.Model):
    node_id = models.AutoField(primary_key=True)
    parent = models.ForeignKey('tree.Node', related_name='children', null=True, blank=True)

    class Meta:
        app_label = 'tree'

We will also build an unmanaged model that will be backed by our view.

from django.contrib.postgres.fields import ArrayField

class Tree(models.Model):
    root = models.ForeignKey(Node, related_name='+')
    node = models.OneToOneField(Node, related_name='tree_node', primary_key=True)
    ancestors = ArrayField(base_field=models.IntegerField())

    class Meta:
        app_label = 'tree'
        managed = False

You’ll notice I’ve included a root relation. This could be obtained by using ancestors[0] if ancestors else node_id, but that’s a bit cumbersome.

So, on to the View:

CREATE RECURSIVE VIEW tree_tree(root_id, node_id, ancestors) AS

SELECT node_id, node_id, ARRAY[]::INTEGER[]
FROM tree_node WHERE parent_id IS NULL

UNION ALL

SELECT tree.root_id, node.node_id, tree.ancestors || node.parent_id
FROM tree_node node INNER JOIN tree_tree tree ON (node.parent_id = tree.node_id);

I’ve written this view before, so I won’t go into any detail.

We can create a tree. Normally I wouldn’t specify the primary key, but since we want to talk about those values shortly, I will. It also means you can delete them, and recreate with this code, and not worry about the sequence values.

from tree.models import Node

Node.objects.bulk_create([
  Node(pk=1),
  Node(pk=2, parent_id=1),
  Node(pk=3, parent_id=1),
  Node(pk=4, parent_id=2),
  Node(pk=5, parent_id=2),
  Node(pk=6, parent_id=3),
  Node(pk=7, parent_id=3),
  Node(pk=8, parent_id=4),
  Node(pk=9, parent_id=8),
  Node(pk=10),
  Node(pk=11, parent_id=10),
  Node(pk=12, parent_id=11),
  Node(pk=13, parent_id=11),
  Node(pk=14, parent_id=12),
  Node(pk=15, parent_id=12),
  Node(pk=16, parent_id=12),
])

Okay, let’s start looking at how we might perform some operations on it.

We’ve already seen how to create nodes, whether root or leaf nodes. No worries there.

What about inserting an intermediate node, say between 11 and 12?

node = Node.objects.create(parent_id=11)
node.parent.children.exclude(pk=node.pk).update(parent=node)

I’m not sure if it is possible to do it in a single statement.

Okay, let’s jump to some tree-based statements. We’ll start by finding a sub-tree.

Node.objects.filter(tree_node__ancestors__contains=[2])

Oh, that’s pretty nice. It’s not necessarily sorted, but it will do for now.

We can also query directly for a root:

Node.objects.filter(tree_node__root=10)

We could spell that one as tree_node__ancestors__0=10, but I think this is more explicit. Also, that one will not include the root node itself.

Deletions are also simple: if we can build a queryset, we can delete it. Thus, deleting a full tree could be done by following any queryset by a .delete()
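
For instance, deleting the subtree rooted at node 2 might look something like this sketch (depending on the on_delete behaviour of the parent relation, deleting just the root may cascade to the rest anyway):

from django.db.models import Q

# Node 2 itself, plus everything with 2 in its ancestors array.
Node.objects.filter(
    Q(pk=2) | Q(tree_node__ancestors__contains=[2])
).delete()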

Fetching a node’s ancestors is a little trickier, because we only have an array of node ids, so it takes two queries.

Node.objects.filter(pk__in=Node.objects.get(pk=15).tree_node.ancestors)

The count of ancestors doesn’t require the second query:

len(Node.objects.get(pk=15).tree_node.ancestors)

Getting ancestors to a given depth is also simple, although it still requires two queries:

Node.objects.filter(pk__in=Node.objects.get(pk=15).tree_node.ancestors[-2:])

This is a fairly simple way to enable relatively performance-aware queries of tree data. There are still places where it’s not perfect, and in reality you’d probably look at building up queryset or model methods to wrap the common operations.
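
For example, the sub-tree and ancestor lookups above could be hidden behind a custom queryset; a sketch (the method names are my own, and as_manager requires Django 1.7 or later):

class NodeQuerySet(models.query.QuerySet):
    def descendants_of(self, node):
        # Everything strictly below the given node.
        return self.filter(tree_node__ancestors__contains=[node.pk])

    def ancestors_of(self, node):
        # Still two queries, but callers no longer need to know about the view.
        return self.filter(pk__in=node.tree_node.ancestors)

    def roots(self):
        return self.filter(parent=None)


# Attached to the model with: objects = NodeQuerySet.as_manager()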

slugify() for postgres (almost)

A recent discussion in #django suggested “what we need is a PG slugify function”.

The actual algorithm in Django for this is fairly simple, and easy to follow. Shouldn’t be too hard to write it in SQL.

Function slugify(value, allow_unicode=False).

  • Convert to ASCII if allow_unicode is false
  • Remove characters that aren’t alphanum, underscores, hyphens
  • Strip leading/trailing whitespace
  • Convert to lowercase
  • Convert spaces to hyphens
  • Remove repeated hyphens

(As an aside, the comment in the django function is slightly misleading: if you followed the algorithm there, you’d get a different result with respect to leading/trailing whitespace. I shall submit a PR).

We can write an SQL function that uses the Postgres unaccent extension to get pretty close:

CREATE OR REPLACE FUNCTION slugify("value" TEXT, "allow_unicode" BOOLEAN)
RETURNS TEXT AS $$

  WITH "normalized" AS (
    SELECT CASE
      WHEN "allow_unicode" THEN "value"
      ELSE unaccent("value")
    END AS "value"
  ),
  "remove_chars" AS (
    SELECT regexp_replace("value", E'[^\w\s-]', '', 'gi') AS "value"
    FROM "normalized"
  ),
  "lowercase" AS (
    SELECT lower("value") AS "value"
    FROM "remove_chars"
  ),
  "trimmed" AS (
    SELECT trim("value") AS "value"
    FROM "lowercase"
  ),
  "hyphenated" AS (
    SELECT regexp_replace("value", E'[-\s]+', '-', 'gi') AS "value"
    FROM "trimmed"
  )
  SELECT "value" FROM "hyphenated";

$$ LANGUAGE SQL STRICT IMMUTABLE;

I’ve used a CTE to get each step as a separate query: you can do it with just two levels if you don’t mind looking at nested function calls:

CREATE OR REPLACE FUNCTION slugify("value" TEXT, "allow_unicode" BOOLEAN)
RETURNS TEXT AS $$

  WITH "normalized" AS (
    SELECT CASE
      WHEN "allow_unicode" THEN "value"
      ELSE unaccent("value")
    END AS "value"
  )
  SELECT regexp_replace(
    trim(
      lower(
        regexp_replace(
          "value",
          E'[^\w\s-]',
          '',
          'gi'
        )
      )
    ),
    E'[-\s]+', '-', 'gi'
  ) FROM "normalized";

$$ LANGUAGE SQL STRICT IMMUTABLE;

To get the default value for the second argument, we can have an overloaded version with only a single argument:

CREATE OR REPLACE FUNCTION slugify(TEXT)
RETURNS TEXT AS 'SELECT slugify($1, false)' LANGUAGE SQL IMMUTABLE STRICT;
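
Once the function exists in the database, it can also be called from the ORM; a sketch using Django’s Func expression (available in Django 1.8 and later; the Post model and its title field are purely hypothetical):

from django.db.models import F, Func, TextField

# Annotate each post with a slug computed by the database function.
posts = Post.objects.annotate(
    db_slug=Func(F('title'), function='slugify', output_field=TextField()),
)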

Now for some tests. I’ve been using pgTAP lately, so here’s some tests using that:

BEGIN;

SELECT plan(7);

SELECT is(slugify('Hello, World!', false), 'hello-world');
SELECT is(slugify('Héllø, Wørld!', false), 'hello-world');
SELECT is(slugify('spam & eggs', false), 'spam-eggs');
SELECT is(slugify('spam & ıçüş', true), 'spam-ıçüş');
SELECT is(slugify('foo ıç bar', true), 'foo-ıç-bar');
SELECT is(slugify('    foo ıç bar', true), 'foo-ıç-bar');
SELECT is(slugify('你好', true), '你好');

SELECT * FROM finish();

ROLLBACK;

And we get one failing test:

=# SELECT is(slugify('你好', true), '你好');

          is
──────────────────────
 not ok 7            ↵
 # Failed test 7     ↵
 #         have:     ↵
 #         want: 你好
(1 row)

Time: 2.004 ms

It seems there is no way to get the equivalent to the python re.U flag on a postgres regular expression function, so that is as close as we can get.

Row Level Security in Postgres and Django

Postgres keeps introducing new things that pique my attention. One of the latest of these is Row Level Security, which essentially hides rows that a given database user is not permitted to view. There’s a bit more to it than that: a good writeup is at Postgres 9.5 feature highlight: Row-Level Security and Policies.

However, it’s worth noting that the way Django connects to a database uses a single database user. Indeed, if your users table is in your database, then you’ll need some way to connect to it to authenticate. I haven’t come up with a nice way to use the Postgres users for authentication within Django just yet.

I did have an idea about a workflow that may just work.

  • Single Postgres User is used for authentication (Login User).
  • Every Django user gets an associated Postgres User (Session User), that may not log in.
  • This Session User is automatically created using a Postgres trigger, whenever the Django users table is updated.
  • After authentication, a SET SESSION ROLE (or SET SESSION AUTHORIZATION) statement is used to change to the correct Session User for the remainder of the session.

Then, we can implement the Postgres Row Level Security policies as required.

Initially, I had thought that perhaps the Session Users would have the same level of access as the Login User, however tonight it occurred to me that perhaps this could replace the whole Django permissions concept.


We do have a few things we need to work out before this is a done deal.

  • The trigger function that performs the CREATE ROLE statement when the django users table is updated.
  • Some mechanism of handling GRANT and REVOKE statements.
  • Similarly, some mechanism for showing current permissions for the given user.
  • A middleware that sets the SESSION USER according to the django user.

The simplest part of this is the last one, so we will start there. We can (in the meantime) manually create the users and their permissions to see how well it all goes: no point building the rest if the basic approach doesn’t pan out.

from django.db import connection


class SetSessionAuthorization(object):
    def process_view(self, request, *args, **kwargs):
        if request.user.pk:
            connection.cursor().execute(
                'SET SESSION SESSION AUTHORIZATION "django:{}"'.format(request.user.pk)
            )

We need to add this to our project’s middleware.
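
It needs to sit after the authentication middleware, since it relies on request.user; something like this (the dotted path depends on where you put the class, and this assumes the old-style MIDDLEWARE_CLASSES setting):

# settings.py
MIDDLEWARE_CLASSES = (
    'django.contrib.sessions.middleware.SessionMiddleware',
    'django.contrib.auth.middleware.AuthenticationMiddleware',
    # ...
    'project.middleware.SetSessionAuthorization',
)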

You’ll see we are using roles of the form django:<id>, which need to be quoted. We use the user id rather than the username, because usernames may be changed.

We’ll want to create a user for each of the existing Django users: I currently have a single user in this database, with id 1. I also have an existing SUPERUSER with the name django. We need to use a superuser if we are using SET SESSION AUTHORIZATION, which seems to be the better of the two options; I haven’t found anything which really does a good job of explaining how it and SET SESSION ROLE differ.

CREATE USER "django:1" NOLOGIN;
GRANT "django:1" TO django;
GRANT ALL ON ALL TABLES IN SCHEMA public TO public;
GRANT ALL ON ALL SEQUENCES IN SCHEMA public TO public;

Note we have just for now enabled full access to all tables and sequences. This will remain until we find a good way to handle this.

We can start up a project using this, and see if it works. Unless I’ve missed something, it should.

Next, we will turn on row-level-security for the auth_user table, and see if it works.

ALTER TABLE auth_user ENABLE ROW LEVEL SECURITY;

Then try to view the list of users (even as a django superuser). You should see an empty list.

We’ll turn on the ability to see our own user object:

CREATE POLICY read_own_data ON auth_user FOR
SELECT USING ('django:' || id = current_user);

Phew, that was close. Now we can view our user.

However, we can’t update it. Let’s fix that:

CREATE POLICY update_own_user_data ON auth_user FOR
UPDATE USING ('django:' || id = current_user)
WITH CHECK ('django:' || id = current_user);

We should be able to do some magic there to prevent a user toggling their own superuser status.

Let’s investigate writing a trigger function that creates a new ROLE when we update the django user.

CREATE OR REPLACE FUNCTION create_shadow_role()
RETURNS TRIGGER AS $$

BEGIN

  EXECUTE 'CREATE USER "django:' || NEW.id || '" NOLOGIN';
  EXECUTE 'GRANT "django:' || NEW.id || '" TO django';

  RETURN NULL;

END;

$$
LANGUAGE plpgsql
SECURITY DEFINER
SET search_path =  public, pg_temp
VOLATILE;


CREATE TRIGGER create_shadow_role
  AFTER INSERT ON auth_user
  FOR EACH ROW
  EXECUTE PROCEDURE create_shadow_role();

Note we still can’t create users from the admin (due to the RLS restrictions that are there), so we need to resort to ./manage.py createsuperuser again.

Having done that, we should see that our new user gets a ROLE:

# \du
                                        List of roles
 Role name │                         Attributes                         │      Member of
───────────┼────────────────────────────────────────────────────────────┼─────────────────────
 django    │ Superuser                                                  │ {django:1,django:6}
 django:1  │ Cannot login                                               │ {}
 django:6  │ Cannot login                                               │ {}
 matt      │ Superuser, Create role, Create DB                          │ {}
 postgres  │ Superuser, Create role, Create DB, Replication, Bypass RLS │ {}

We should be able to write similar triggers for update. We can, for example, shadow the Django is_superuser attribute to the Postgres SUPERUSER attribute. I’m not sure if that’s a super idea or not.

But we can write a simple function that allows us to see if the current django user is a superuser:

CREATE FUNCTION is_superuser()
RETURNS BOOLEAN AS $$

SELECT is_superuser
FROM auth_user
WHERE 'django:' || id = current_user

$$ LANGUAGE SQL;

We can now use this to allow superuser access to all records:

CREATE POLICY superuser_user_select ON auth_user
FOR SELECT USING (is_superuser);

CREATE POLICY superuser_user_update ON auth_user
FOR UPDATE USING (is_superuser)
WITH CHECK (is_superuser);

That gives us a fair bit of functionality. We still don’t have any mechanism for viewing or setting permissions. Because of the way Django’s permissions work, we can’t quite use the same trick but on the auth_user_user_permissions table, because we’d need to also look at the auth_user_groups table and auth_group_permissions.

I’m still not sure if this is a good idea or not, but it is a fun thought process.

Django second AutoField

Sometimes, your ORM just seems to be out to get you.

For instance, I’ve been investigating a technique for the most important data structure in a system to be essentially immutable.

That is, instead of updating an existing instance of the object, we always create a new instance.

This requires a handful of things to be useful (and useful for querying).

  • We probably want to have a self-relation so we can see which object supersedes another. A series of objects that supersede one another is called a lifecycle.
  • We want to have a timestamp on each object, so we can view a snapshot at a given time: that is, which phase of the lifecycle was active at that point.
  • We should have a column that is unique per lifecycle: this makes querying for all objects of a lifecycle much simpler (although we could use a recursive query for that).
  • There must be a facility to prevent multiple heads on a lifecycle: that is, at most one phase of a lifecycle may be non-superseded.
  • The lifecycle phases needn’t be in the same order, or really have any differentiating features (like status). In practice they may, but for the purposes of this, they are just “what it was like at that time”.

I’m not sure these ideas will ever get into a released product, but the work behind them was fun (and all my private work).

The basic model structure might look something like:

class Phase(models.Model):
    phase_id = models.AutoField(primary_key=True)
    lifecycle_id = models.AutoField(primary_key=False, editable=False)

    superseded_by = models.OneToOneField('self',
        related_name='supersedes',
        null=True, blank=True, editable=False
    )
    timestamp = models.DateTimeField(auto_now_add=True)

    # Any other fields you might want...

    objects = PhaseQuerySet.as_manager()

So, that looks nice and simple.

Our second AutoField will have a sequence generated for it, and the database will give us a unique value from a sequence when we try to create a row in the database without providing this column in the query.

However, there is one problem: Django will not let us have a second AutoField in a model. And, even if it did, there would still be some problems. For instance, AutoField values are never sent to the database when creating an instance, which would break our ability to keep the lifecycle_id the same between phases.

So, we will need a custom field. Luckily, all we really need is the SERIAL database type: that creates the sequence for us automatically.

class SerialField(models.Field):
    def db_type(self, connection):
        return 'serial'

So now, using that field type instead, we can write a bit more of our model:

class Phase(models.Model):
    phase_id = models.AutoField(primary_key=True)
    lifecycle_id = SerialField(editable=False)
    superseded_by = models.OneToOneField('self', ...)
    timestamp = models.DateTimeField(auto_now_add=True)

    def save(self, **kwargs):
        self.pk = None
        super(Phase, self).save(**kwargs)

This now ensures each time we save our object, a new instance is created. The lifecycle_id will stay the same.

Still not totally done though. We currently aren’t handling a newly created lifecycle (which should be handled by the associated postgres sequence), nor are we marking the previous instance as superseded.

It’s possible, using some black magic, to get the default value for a database column, and, in this case, execute a query with that default to get the next value. However, that’s pretty horrid: not to mention it also runs an extra two queries.

Similarly, we want to get the phase_id of the newly created instance, and set that as the superseded_by of the old instance. This would require yet another query, after the INSERT, but also has the sinister side-effect of making us unable to apply the not-superseded-by-per-lifecycle requirement.

As an aside, we can investigate storing the self-relation on the other end - this would enable us to just do:

    def save(self, **kwargs):
        self.supersedes = self.pk
        self.pk = None
        super(Phase, self).save(**kwargs)

However, this turns out to be less useful when querying: we are much more likely to be interested in phases that are not superseded, as they are the “current” phase of each lifecycle. Although we could query, it would be running sub-queries for each row.

Our two issues, setting the lifecycle and storing the superseding data, can be dealt with by one Postgres BEFORE INSERT trigger function:

CREATE FUNCTION lifecycle_and_supersedes()
RETURNS TRIGGER AS $$

  BEGIN
    IF NEW.lifecycle_id IS NULL THEN
      -- Brand new lifecycle: grab the next value from its sequence.
      NEW.lifecycle_id = nextval('app_phase_lifecycle_id_seq'::regclass);
    ELSE
      -- New phase of an existing lifecycle: supersede the current head.
      NEW.phase_id = nextval('app_phase_phase_id_seq'::regclass);
      UPDATE app_phase
        SET superseded_by_id = NEW.phase_id
        WHERE lifecycle_id = NEW.lifecycle_id
        AND superseded_by_id IS NULL;
    END IF;
    RETURN NEW;
  END;

$$ LANGUAGE plpgsql VOLATILE;

CREATE TRIGGER lifecycle_and_supersedes
  BEFORE INSERT ON app_phase
  FOR EACH ROW
  EXECUTE PROCEDURE lifecycle_and_supersedes();

So, now all we need to do is prevent multiple-headed lifecycles. We can do this using a UNIQUE INDEX:

CREATE UNIQUE INDEX prevent_hydra_lifecycles
ON app_phase (lifecycle_id)
WHERE superseded_by_id IS NULL;

Wow, that was simple.

So, we have most of the db-level code written. How do we use our model? We can write some nice queryset methods to make getting the various bits easier:

class PhaseQuerySet(models.query.QuerySet):
    def current(self):
        return self.filter(superseded_by=None)

    def superseded(self):
        return self.exclude(superseded_by=None)

    def initial(self):
        return self.filter(supersedes=None)

    def snapshot_at(self, timestamp):
        return self.filter(timestamp__lte=timestamp).order_by('lifecycle_id', '-timestamp').distinct('lifecycle_id')

The queries generated by the ORM for these should be pretty good: we could look at sticking an index on the lifecycle_id column.
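
Since lifecycle_id is a custom field, the easiest way to get that index is probably a hand-written migration; a sketch using RunSQL (the table and index names are assumptions based on the trigger above):

from django.db import migrations


class Migration(migrations.Migration):

    dependencies = [
        # ('app', '000X_previous_migration'),
    ]

    operations = [
        migrations.RunSQL(
            'CREATE INDEX app_phase_lifecycle_id_idx ON app_phase (lifecycle_id)',
            reverse_sql='DROP INDEX app_phase_lifecycle_id_idx',
        ),
    ]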

There is one more thing to say on the lifecycle: we can add a model method to fetch the complete lifecycle for a given phase, too:

    def lifecycle(self):
        return self.model.objects.filter(lifecycle_id=self.lifecycle_id)

(That was why I used the lifecycle_id as the column).


Whilst building this prototype, I came across a couple of things that were also interesting. The first was a mechanism to get the default value for a column:

def database_default(table, column):
    cursor = connection.cursor()
    QUERY = """SELECT d.adsrc AS default_value
               FROM   pg_catalog.pg_attribute a
               LEFT   JOIN pg_catalog.pg_attrdef d ON (a.attrelid, a.attnum)
                                                   = (d.adrelid,  d.adnum)
               WHERE  NOT a.attisdropped   -- no dropped (dead) columns
               AND    a.attnum > 0         -- no system columns
               AND    a.attrelid = %s::regclass
               AND    a.attname = %s"""
    cursor.execute(QUERY, [table, column])
    cursor.execute('SELECT {}'.format(*cursor.fetchone()))
    return cursor.fetchone()[0]

You can probably see why I didn’t want to use this. Other than the aforementioned two extra queries, it’s executing a query with data that comes back from the database. It may be possible to inject a default value into a table that causes it to do Very Bad Things™. We could sanitise it, perhaps ensure it matches a regular expression:

NEXTVAL = re.compile(r"^nextval\('(?P<sequence>[a-zA-Z_0-9]+)'::regclass\)$")

However, the trigger-based approach is nicer in every way.

The other thing I discovered, and this one is really nice, is a way to create an exclusion constraint that only applies if a column is NULL. For instance, ensure that no two classes for a given student overlap, but only if they are not superseded (or deleted).

ALTER TABLE "student_enrolments"
ADD CONSTRAINT "prevent_overlaps"
EXCLUDE USING gist(period WITH &&, student_id WITH =)
WHERE (
  superseded_by_id IS NULL
  AND
  status <> 'deleted'
);

Django Proxy Model Relations

I’ve got lots of code I’d do a different way if I were to start over, but often, we have to live with what we have.

One situation I would seriously reconsider is the structure I use for storing data related to how I interact with external systems. I have an Application object, and I create instances of this for each external system I interact with. Each new Application gets a UUID, and is created as part of a migration. Code in the system uses this UUID to determine if something is for that system.

But that’s not the worst of it.

I also have an AppConfig object, and other related objects that store a relation to an Application. This was fine initially, but as my code got more complex, I hit upon the idea of using Django’s Proxy models, and using the related Application to determine the subclass. So, I have AppConfig subclasses for a range of systems. This is nice: we can even ensure that we only get the right instances (using a lookup to the application to get the discriminator, which I’d probably do a different way next time).

However, we also have other bits of information that we need to store, that has a relation to this AppConfig object.

And here is where we run into problems. Eventually, I had the need to subclass these other objects, and deal with them as subclasses. That gives a similar benefit to the above for fetching filtered lists of objects; however, when we try to follow the relations between these, something annoying happens.

Instead of getting the subclass of AppConfig, that we probably want to use because the business logic hangs off that, we instead get the actual AppConfig instances. So, in order to get the subclass, we have to fetch the object again, or swizzle the __class__. And, going back the other way would have the same problem.
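
That workaround looks something like the following sketch (widget_config and SystemXConfig are made-up names for one of the related objects and one of the AppConfig proxies):

# The related descriptor hands back the base class, not the proxy.
config = widget_config.app_config

# Option 1: fetch it again through the proxy's manager (an extra query).
config = SystemXConfig.objects.get(pk=config.pk)

# Option 2: swizzle the class in place (no extra query, but rather grubby).
config.__class__ = SystemXConfig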

Python is a dynamic language, so we should be able to do better.

In theory, all we have to do is replace the attributes on the two classes with ones that will do what we want them to do. In practice, we need to muck around a little bit more to make sure it all works out right.

It would be nice to be able to decorate the declaration of the overridden field, but that’s not valid python syntax:

>>> class Foo(object):
...   @override
...   bar = object()
  File "<stdin>", line 3
    bar = object()
      ^
SyntaxError: invalid syntax

So, we’ll have to do one of two things: alter the class after it has been defined, or leverage the metaclass magic Django already does.

class Foo(models.Model):
    bar = models.ForeignKey('bar.Bar')


class FooProxy(Foo):
    bar = ProxyForeignKey('bar.BarProxy')  # Note the proxy class

    class Meta:
      proxy = True

However, we can’t just use the contribute_to_class(cls, name) method as-is, as the Proxy model attributes get dealt with before the parent model. So, we need to register a signal, and get the framework to run our code after the class has been prepared:

class ProxyField(object):
    def __init__(self, field):
        self.field = field

    def contribute_to_class(self, model, name):
        @receiver(models.signals.class_prepared, sender=model, weak=False)
        def late_bind(sender, *args, **kwargs):
          override_model_field(model, name, self.field)


class ProxyForeignKey(ProxyField):
    def __init__(self, *args, **kwargs):
        super(ProxyForeignKey, self).__init__(ForeignKey(*args, **kwargs))

Then, it’s a matter of working out what needs to happen to override_model_field.

It turns out: not much. Until we start thinking about edge cases, anyway:

def override_model_field(model, name, field):
    original_field = model._meta.get_field(name)

    model.add_to_class(name, field)
    if field.rel:
      field.rel.to.add_to_class(
        field.related_name,
        ForeignRelatedObjectsDescriptor(field.related)
      )

There is a little more to it than that:

  • We need to use the passed-in related_name if one was provided in the new field definition, else we want to use what the original field’s related name was. However, if not explicitly set, then neither field will actually have a related_name attribute.
  • We cannot allow an override of a foreign key to a non-proxy model: that would hijack the original model’s related queryset.
  • Similarly, we can only allow a single proxy to override-relate to another proxy: any subsequent override-relations would likewise hijack the related object queryset.
  • For non-related fields, we can only allow an override if the field is compatible. What that means I’m not completely sure just yet, but for now, we will only allow the same field class (or a subclass). Things that would require a db change would be verboten.

So, we need to guard our override_model_field somewhat:

def override_model_field(model, name, field):
    original_field = model._meta.get_field(name)

    if not isinstance(field, original_field.__class__):
        raise TypeError('...')

    # Must do these checks before the `add_to_class`, otherwise it breaks tests.
    if field.rel:
        if not field.rel.to._meta.proxy:
            raise TypeError('...')

        related_name = getattr(field, 'related_name', original_field.related.get_accessor_name())
        related_model = getattr(field.rel.to, related_name).related.model

        # Do we already have an overridden relation to this model?
        if related_model._meta.proxy:
            raise TypeError('...')

    model.add_to_class(name, field)

    if field.rel:
        field.rel.to.add_to_class(
          related_name,
          ForeignRelatedObjectsDescriptor(field.related)
        )

There is an installable app that includes tests: django-proxy-overrides.

Read-only data from your database in Django

I had the need to create a column in some database tables that is completely controlled by the database, but whose value is sometimes needed by the Django object.

It should never be presented in a Form, and never, ever be written to by the Django infrastructure.

So, we need a way to fetch the data from the database, but ensure that, even if the value is changed and the object saved, it is never written back.

The detail of how this data is set in the database is irrelevant: it’s a column that gets its value from a sequence (and, incidentally, this sequence is shared across multiple tables).

But, we need a way to get this data.

A nice technique is to leverage two parts of Django: the QuerySet.extra(select={}) method to actually add this field to the query, and Manager.get_query_set() (renamed to get_queryset() in newer versions of Django) to make this apply to every fetch of the objects.

Our extra column will be called sequence_number:

class SequenceNumberManager(models.Manager):
    def get_query_set(self):
        return super(SequenceNumberManager, self).get_query_set().extra(select={
            'sequence_number': '"%s"."sequence_number"' % self.model._meta.db_table
        })

class Thing(models.Model):
    # Column definitions. Do not define sequence_number!

    objects = SequenceNumberManager()

That’s it.

Now, Thing.objects.all()[0].sequence_number will give you the value from the database. Because it’s not a Django field, it will never be written back to the database.

The SequenceNumberManager, as it stands, could be applied to multiple models, as it dynamically looks up the database table that should be used for the field. You need to include the table name, as otherwise, if you join to another table with the same column name, your database will (correctly) reject the query, as it is ambiguous to which table it refers.

I’ve done a similar thing, but using django-model-utils’ PassThroughManager in conjunction with a QuerySet subclass, since I’m a big fan of the approach described at Building a higher-level query API: the right way to use Django’s ORM, which I link to at least once a week in IRC.

I’d be interested to hear of alternative methods of achieving this, or if there are any possible drawbacks. I think you could do it with a custom field (that never wrote to the database), but I think that would actually be a whole lot more code. There is also the benefit in this case that you could have the .extra() call as a queryset method, so you only add it when you need it, although the performance gains there would be insignificant, I suspect.
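
That opt-in version might look something like this sketch, using a QuerySet subclass (the method name is mine; on older Django versions PassThroughManager fills the role of as_manager):

class SequenceNumberQuerySet(models.query.QuerySet):
    def with_sequence_number(self):
        # Only add the extra column when it is actually wanted.
        return self.extra(select={
            'sequence_number': '"%s"."sequence_number"' % self.model._meta.db_table
        })


# On the model: objects = SequenceNumberQuerySet.as_manager(), and then:
# Thing.objects.with_sequence_number()[0].sequence_number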

Using Postgres Composite Types in Django

Note: this post turned out to be far more complicated than I had hoped. I may write another one that deals with a less complicated type!

Postgres comes with a pretty large range of column types, and the ability to use these types in an ARRAY. There’s also JSON(B) and Hstore, which are useful for storing structured (but possibly varying) data. Additionally, there are also a range of, well, range types.

However, sometimes you actually want to store data in a strict column, but that isn’t a simple scalar type, or one of the standard range types. Postgres allows you to define your own composite types.

There is a command CREATE TYPE that can be used to create an arbitrary type. There are four forms: for now we will just look at Composite Types.

We will create a Composite type that represents the opening hours for a store, or more specifically, the default opening hours. For instance, a store may have the following default opening hours:

+------------+--------+---------+
|    Day     |  Open  |  Close  |
+------------+--------+---------+
|  Monday    |  9 am  |  5 pm   |
|  Tuesday   |  9 am  |  5 pm   |
|  Wednesday |  9 am  |  5 pm   |
|  Thursday  |  9 am  |  9 pm   |
|  Friday    |  9 am  |  5 pm   |
|  Saturday  | 10 am  |  5 pm   |
|  Sunday    | 11 am  |  5 pm   |
+------------+--------+---------+

During the Christmas season this store may be open longer (perhaps even 24 hours). There may also be differences at Easter time, or other public holidays, where the store is closed, or closes early.

It would be nice to be able to store the default opening hours for a store, and then, when creating a week, use these to create concrete (TIMESTAMP) values for each day, which could be overridden on any given day.

There are a few ways we could model this. Postgres has no timerange type, so that’s out. We could create a RANGE type, or we could use (start-time, finish-time). But what about when a store is open after midnight, or for 24 hours? Storing this data implicitly is a real pain, because you need to always check to see if the finish time is less than (or equal to) the start time whenever doing anything. Trust me, this is not the best approach.

An alternative I’ve been toying with is (start-time, interval). You could limit it so that the interval’s maximum is '1 day', but not (from what I can tell) when you define the type. Anyway, the syntax for creating this type is:

CREATE TYPE opening_hours AS (
  start time,
  length interval
);

As an aside, every table in the database also has an associated type (of the same name as the table).

Now, we have our type: we can use it in a table:

CREATE TABLE store (
  store_id SERIAL PRIMARY KEY,
  name TEXT
);

CREATE TABLE default_opening_hours (
  store_id INTEGER REFERENCES store (store_id),
  monday opening_hours,
  tuesday opening_hours,
  wednesday opening_hours,
  thursday opening_hours,
  friday opening_hours,
  saturday opening_hours,
  sunday opening_hours
);

An alternative way of storing this information might be to use an array of opening_hours, directly on the store model. We’ll use this one instead, as it’s a little neater (and means we will look at how to use opening_hours[] later too).

CREATE TABLE store (
  store_id SERIAL PRIMARY KEY,
  name TEXT,
  default_opening_hours opening_hours[7]
);

Now, we can put data in there:

INSERT INTO store (name, default_opening_hours) VALUES
(
  'John Martins',
  ARRAY[
    ('09:00', '08:00')::opening_hours,
    ('09:00', '08:00')::opening_hours,
    ('09:00', '08:00')::opening_hours,
    ('09:00', '12:00')::opening_hours,
    ('09:00', '08:00')::opening_hours,
    ('10:00', '07:00')::opening_hours,
    ('11:00', '06:00')::opening_hours
  ]
);

Note how we need to cast all of the values from record to opening_hours.


In practice, we would probably also want to have some kind of restriction ensuring that one day’s opening time plus its open length does not run past the opening time of the following day. I’m still not sure of the best way to do this in Postgres, but it is possible to do it in Django.


Speaking of Django, we want to be able to access this data type there. We can leverage a really nice feature of Psycopg2 to have these values automatically turned into a Python namedtuple. We do this by registering the type within Psycopg2, using the Django cursor.

from django.db import connection
from psycopg2.extras import register_composite

register_composite('opening_hours', connection.cursor().cursor)

But, this is only half of the pattern. We also need to register an adapter so that values going back the other way are also automatically cast into opening_hours.

from django.db import connection
from psycopg2.extras import register_composite
from psycopg2.extensions import register_adapter, adapt, AsIs

# Get a reference to the namedtuple class
OpeningHours = register_composite(
  'opening_hours',
  connection.cursor().cursor,
  globally=True
).type

def adapt_opening_hours(value):
  return AsIs("(%s, %s)::opening_hours" % (
    adapt(value.start).getquoted(),
    adapt(value.length).getquoted()
  ))

register_adapter(OpeningHours, adapt_opening_hours)

Now, we can fetch data from the database, and know that we will get OpeningHours instances, and, when passing an OpeningHours instance back to the database, know it will be converted into the correct type.
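
For example, a raw query returns the namedtuples directly (a sketch, assuming the registration above has run and the store row from earlier exists):

from django.db import connection

cursor = connection.cursor()
cursor.execute(
    'SELECT default_opening_hours FROM store WHERE name = %s',
    ['John Martins'],
)
hours = cursor.fetchone()[0]   # a list of OpeningHours namedtuples

hours[0].start    # datetime.time(9, 0)
hours[0].length   # a datetime.timedelta of 8 hours

Passing an OpeningHours instance as a query parameter goes through the adapter in exactly the same way.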

Obviously, in order to do this, the type must exist in the database. We did that manually in this case. In a real situation you would want to do that as a database migration. And that is where things get tricky. You can’t run the register_adapter function until the type exists in the database. I did come up with a relatively neat workaround for this when writing a framework for generic Composite fields, where the registration of the composite type attempts to execute, and if it fails, it stores the data for later registration, and then the actual migration operation fires off a signal, which is handled by a listener that actually performs the registration.

The final piece of the puzzle is the Django Field subclass, which is actually not that complicated. In essence, we are relying on Psycopg to handle the adaptation in both directions, so it can be a bare field (perhaps with a formfield method to get a custom form field). In practice, I wrote the generic CompositeField subclass, which uses some metaclass magic to handle the late registration:

from django.db.models import fields
from django.db import connection
from django.dispatch import receiver, Signal

from psycopg2.extras import register_composite
from psycopg2.extensions import register_adapter, adapt, AsIs
from psycopg2 import ProgrammingError


_missing_types = {}

class CompositeMeta(type):
    def __init__(cls, name, bases, clsdict):
        super(CompositeMeta, cls).__init__(name, bases, clsdict)
        cls.register_composite()

    def register_composite(cls):
        db_type = cls().db_type(connection)
        if db_type:
            try:
                cls.python_type = register_composite(
                    db_type,
                    connection.cursor().cursor,
                    globally=True
                ).type
            except ProgrammingError:
                _missing_types[db_type] = cls
            else:
                def adapt_composite(composite):
                    return AsIs("(%s)::%s" % (
                        ", ".join([
                            adapt(getattr(composite, field)).getquoted() for field in composite._fields
                        ]), db_type
                    ))

                register_adapter(cls.python_type, adapt_composite)


class CompositeField(fields.Field):
    """
    A handy base class for defining your own composite fields.

    It registers the composite type.
    """
    __metaclass__ = CompositeMeta


composite_type_created = Signal(providing_args=['db_type'])

@receiver(composite_type_created)
def register_composite_late(sender, db_type, **kwargs):
    _missing_types.pop(db_type).register_composite()

We also want to have a custom migration operation:

from django.db.migrations.operations.base import Operation

# Or wherever the code above is located.
from .fields.composite import composite_type_created


class CreateCompositeType(Operation):
    def __init__(self, name=None, fields=None):
        self.name = name
        self.fields = fields

    @property
    def reversible(self):
        return True

    def state_forwards(self, app_label, state):
        pass

    def database_forwards(self, app_label, schema_editor, from_state, to_state):
        schema_editor.execute('CREATE TYPE %s AS (%s)' % (
            self.name, ", ".join(["%s %s" % field for field in self.fields])
        ))
        composite_type_created.send(sender=self.__class__, db_type=self.name)

    def state_backwards(self, app_label, state):
        pass

    def database_backwards(self, app_label, schema_editor, from_state, to_state):
        schema_editor.execute('DROP TYPE %s' % self.name)

This is a bit manual, however. You need to create your own migration that creates the composite type, and then begin to use the field.

# migrations/XXXX_create_opening_hours.py

class Migration(migrations.Migration):
    dependencies = []

    operations = [
        CreateCompositeType(
            name='opening_hours',
            fields=[
                ('start', 'time'),
                ('length', 'interval')
            ],
        ),
    ]

The place this pattern falls down is that this migration must be manually created: we don’t have any way to automatically create the migration from the Field subclass, which just looks like:

class OpeningHoursField(CompositeField):

    def db_type(self, connection):
        return 'opening_hours'

    def formfield(self, **kwargs):
        defaults = {
            'form_class': OpeningHoursFormField
        }
        defaults.update(**kwargs)
        return super(OpeningHoursField, self).formfield(**defaults)

I think in the future I’ll attempt to use further metaclass magic to allow defining the fields of the Composite type. This could then be used to automatically create a form field (which is a subclass of forms.MultiValueField).

class OpeningHoursField(CompositeField):
    start = models.DateField()
    length = IntervalField()

    def db_type(self, connection):
        return 'opening_hours'

However, in the meantime, we can still get by. I’m not sure it’s going to be possible to inject extra operations into the migration based upon the field types anyway.

Finally, we can use this in a model:

class Store(models.Model):
    store_id = models.AutoField(primary_key=True)
    name = models.CharField(max_length=128)
    default_opening_hours = ArrayField(
        base_field=OpeningHoursField(null=True, blank=True),
        size=7
    )

I’ve used the ArrayField from django.contrib.postgres, purely for illustration purposes.
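
With all of that in place, using the field through the ORM is just ordinary Django code; a sketch (OpeningHours here is the namedtuple class produced by the registration, i.e. the field’s python_type, or the earlier register_composite call):

from datetime import time, timedelta

weekday = OpeningHours(start=time(9), length=timedelta(hours=8))

store = Store.objects.create(
    name='John Martins',
    default_opening_hours=[
        weekday, weekday, weekday,
        OpeningHours(start=time(9), length=timedelta(hours=12)),  # late night Thursday
        weekday,
        OpeningHours(start=time(10), length=timedelta(hours=7)),
        OpeningHours(start=time(11), length=timedelta(hours=6)),
    ],
)

store = Store.objects.get(pk=store.pk)
store.default_opening_hours[3]   # an OpeningHours namedtuple again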

The CompositeField and associated operation are part of my django-postgres project: once I have worked out some more kinks, I may submit a pull request to django.contrib.postgres, unless someone else beats me to it.

Oh, and a juicy little extra. Above I mentioned something about preventing overlaps. The logic I use in my form is:

import datetime

from django import forms
from django.utils.translation import string_concat, ugettext_lazy as _

import postgres.forms

from .fields import OpeningHoursFormField
from .models import Store


def finish(obj):
    "Given an OpeningHours value, get the finish time."
    date = datetime.date(1, 1, 1)
    return (datetime.datetime.combine(date, obj.start) + obj.length).time()


class StoreForm(forms.ModelForm):
    OVERLAPS_PREVIOUS = _('Open hours overlap previous day.')

    default_opening_hours = postgres.forms.SplitArrayField(
        base_field=OpeningHoursFormField(required=False),
        size=7,
    )

    class Meta:
        model = Store

    def clean_default_opening_hours(self):
        opening_hours = self.cleaned_data['default_opening_hours']
        field = self.fields['default_opening_hours']

        # Ensure consecutive days do not overlap.
        errors = []

        for i in range(7):
            today = opening_hours[i]
            if today.start is None or today.length is None:
                continue

            yesterday = opening_hours[(i + 6) % 7]

            if yesterday.start is None or yesterday.length is None:
                continue

            if finish(yesterday) <= yesterday.start:
                if today.start < finish(yesterday):
                    errors.append(forms.ValidationError(
                        string_concat(
                          field.error_messages['item_invalid'],
                          self.OVERLAPS_PREVIOUS
                        ),
                        code='item_invalid',
                        params={'nth': i}
                    ))

        if errors:
            raise forms.ValidationError(errors)

        return opening_hours

I’m currently not displaying the duration/length: I dynamically calculate it based on the entered start/finish pair, but that’s getting quite complicated.

Long Live Adjacency Lists

I recently wrote about the excellent book SQL Antipatterns, and in it briefly discussed the tree structures. I’ve been thinking about trees in Postgres a fair bit lately, and a discussion on #django gave me further incentive to revisit this topic.

The book discusses four methods of storing a tree in a database.

Adjacency Lists, apart from the inability to grab a full or partial tree easily, are the simplest to understand. The child object stores a reference to its parent. Because this is a foreign key, it always maintains referential integrity. Fetching a parent is simple, as is fetching all children, or siblings. It’s only when you need to fetch an arbitrary depth that things become problematic, unless you use a recursive query. More on that later.

Postgres has an extension called ltree, which provides an implementation of a Path Enumeration, but one thing that really bothers me about this type of structure is the lack of referential integrity. In practice, I’m not sure what having this ltree structure would give you over simply storing the keys in an ARRAY type. Indeed, if Postgres ever gets Foreign Key constraints for ARRAY elements (which there is a patch floating around for), this becomes even more compelling. It also seems to me that restructuring a tree becomes a bit more challenging in a Path Enumeration than an Adjacency List.

Nested Sets are also interesting, and maintain FK integrity, but require potentially rewriting lots of data when any change is made to the tree. They aren’t that appealing to me: perhaps I fail to see any big advantages of this structure.

Finally, Closure Tables are perhaps the most interesting. This stores all ancestor-descendant relationships, rather than just parent-child, which again requires more work when adding or removing. Again, Referential Integrity is preserved, but it seems like there is lots of work to maintain them.

From all of these, there are some significant advantages, in my mind, to using a simple Adjacency List.

  1. Adding a new row never requires you to alter any other rows in the database.
  2. Moving a subtree to a different location only requires a change to one row in the database.
  3. It’s never possible to end up with Referential Integrity errors: the database will prevent you from deleting a parent row whilst it still has children (or, you may set it to CASCADE or SET NULL the children automatically).
  4. It’s conceptually very simple. Everyone understands the parent-child relationship (and all of the relationships that follow, like grand-parents). It’s a similar mental model to how we think about our own families, except that each node has at most one parent.

There are really only two things that are hard to do:

  1. Given a node, select all descendants of that node.
  2. Given a node, select all ancestors of that node.

But, as we shall see shortly, it is possible to do these in Postgres using some nice recursive features.

There is another advantage to using an Adjacency List, this time from the perspective of Django. We can do it without needing to install a new package, or subclass or mix-in a new Model:

class Node(models.Model):
    node_id = models.AutoField(primary_key=True)
    parent = models.ForeignKey('self', null=True, blank=True, related_name='children')

That’s it.

Now, using Postgres, it’s possible to build a recursive VIEW that contains the whole tree:

CREATE RECURSIVE VIEW tree (node_id, ancestors) AS (
    SELECT node_id, '{}'::integer[]
    FROM nodes WHERE parent_id IS NULL
  UNION ALL
    SELECT n.node_id, t.ancestors || n.parent_id
    FROM nodes n, tree t
    WHERE n.parent_id = t.node_id
);

We can then query this (replacing %s with the parent node id):

SELECT node_id
FROM nodes INNER JOIN tree USING (node_id)
WHERE %s = ANY(ancestors);

Or, if you want to select for multiple parents:

SELECT node_id
FROM nodes INNER JOIN tree USING (node_id)
WHERE ARRAY[%s, %s] && ancestors;

This actually performs relatively well, and, if it doesn’t do well enough, we could create a MATERIALIZED VIEW based on the recursive view, and query that instead (refreshing it whenever we need to, perhaps using a trigger).

CREATE MATERIALIZED VIEW tree_m AS (SELECT * FROM tree);

CREATE FUNCTION refresh_tree_m() RETURNS trigger AS $$
  BEGIN
    REFRESH MATERIALIZED VIEW tree_m;
    RETURN NULL;
  END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trig_refresh_tree_m AFTER TRUNCATE OR INSERT OR UPDATE OR DELETE
ON nodes FOR EACH STATEMENT
EXECUTE PROCEDURE refresh_tree_m();

This view is still not perfect though. We can improve it to allow us to limit depth of ancestry:

CREATE RECURSIVE VIEW tree (node_id, ancestors, depth) AS (
    SELECT node_id, '{}'::integer[], 0
    FROM nodes WHERE parent_id IS NULL
  UNION ALL
    SELECT n.node_id, t.ancestors || n.parent_id, t.depth + 1
    FROM nodes n, tree t
    WHERE n.parent_id = t.node_id
);

SELECT node_id FROM nodes INNER JOIN tree USING (node_id)
WHERE %s = ANY(ancestors) AND depth < %s;

This is pretty good now, but if we have cycles in our tree (yes, this makes it technically no longer a tree, but a graph, of which a tree is a restricted kind), this query will run forever. There’s a pretty neat trick to prevent cycles:

CREATE RECURSIVE VIEW tree (node_id, ancestors, depth, cycle) AS (
    SELECT node_id, '{}'::integer[], 0, FALSE
    FROM nodes WHERE parent_id IS NULL
  UNION ALL
    SELECT
      n.node_id, t.ancestors || n.parent_id, t.depth + 1,
      n.parent_id = ANY(t.ancestors)
    FROM nodes n, tree t
    WHERE n.parent_id = t.node_id
    AND NOT t.cycle
);

You don’t need to use the cycle column outside of the view.

The query used for the view can be repurposed into a Common Table Expression, which is basically a way of defining a view that only exists for the query we are executing (but will itself only be executed once, even if it’s referred to lots of times):

WITH RECURSIVE tree (node_id, ancestors, depth, cycle) AS (
    SELECT node_id, '{}'::integer[], 0, FALSE
    FROM nodes WHERE parent_id IS NULL
  UNION ALL
    SELECT
      n.node_id, t.ancestors || n.parent_id, t.depth + 1,
      n.parent_id = ANY(t.ancestors)
    FROM nodes n, tree t
    WHERE n.parent_id = t.node_id
    AND NOT t.cycle
) SELECT n.* FROM nodes n INNER JOIN tree USING (node_id)
WHERE %s = ANY(ancestors);

You can see that this syntax basically defines the view before running the real query.


Looking at it from the perspective of Django, we would like to be able to spell a query something like:

Node.objects.filter(parent__recursive=node)
Node.objects.filter(parent__recursive__in=nodes)
Node.objects.filter(children__recursive__contains=node)

The problem we have with using the CTE immediately above is that we don’t have access to the full query at the time we are dealing with the filter. We could define the view prior to running the query (perhaps in a migration), but this means it’s more than just a simple field: although with the new migrations framework, we could make it so that makemigrations automatically adds a migration operation to create the recursive view.

The other solution is to still use a recursive CTE, but use it as a subquery. I’m still investigating if this will have poor performance characteristics.

Here is an implementation of doing just that:

from django.db import models

SQL = """
WITH RECURSIVE "tree" ("{pk}", "related", "cycle") AS (
    SELECT "{pk}", ARRAY[]::integer[], FALSE
    FROM "{table}" WHERE "{fk}" IS NULL
  UNION ALL
    SELECT a."{pk}", b."related" || a."{fk}", a."{fk}" = ANY(b."related")
    FROM "tree" b, "{table}" a
    WHERE a."{fk}" = b."{pk}" AND NOT b."cycle"
) {query}
"""


class RecursiveRelation(models.ForeignKey):
    def __init__(self, *args, **kwargs):
        super(RecursiveRelation, self).__init__('self', *args, **kwargs)

    def get_lookup_constraint(self, constraint_class, alias, targets, sources, lookups,
                              raw_value):
        if lookups[0] == 'recursive':
            # With a recursive query, we want to build up a subquery that creates
            # the simplest possible tree we can deal with.
            data = {
                'fk': self.get_attname(),
                'pk': self.related_fields[0][1].get_attname(),
                'table': self.model._meta.db_table
            }
            if lookups[-1] == 'in':
                if targets[0] == self:
                    raw_value = ForeignKeyRecursiveInLookup(raw_value, **data)
                else:
                    raw_value = ForeignKeyRecursiveReverseInLookup(raw_value, **data)
            else:
                if targets[0] == self:
                    raw_value = ForeignKeyRecursiveLookup(raw_value, **data)
                else:
                    raw_value = ForeignKeyRecursiveReverseLookup(raw_value, **data)

            # Rewrite some variables so we get correct behaviour.

            # This makes the query based on the original table, not the joined version,
            # which was skipping a level of relation. It still joins the table, however,
            # which can't be great for performance
            alias = self.model._meta.db_table
            # This sets the correct lookup type, removing the recursive bit.
            lookups = lookups[1:] or ['exact']

        return super(RecursiveRelation, self).get_lookup_constraint(
            constraint_class, alias, targets, sources, lookups, raw_value
        )


class ForeignKeyRecursiveLookup(object):
    query = 'SELECT "{pk}" FROM "tree" WHERE %s = ANY("related")'

    def __init__(self, value, **kwargs):
        self.value = value
        self.data = kwargs

    def get_compiler(self, *args, **kwargs):
        return self

    def as_subquery_condition(self, alias, columns, qn):
        sql = SQL.format(
            query=self.query.format(**self.data),
            **self.data
        )
        return '%s.%s IN (%s)' % (qn(alias), qn(self.data['pk']), sql), [self.value]


class ForeignKeyRecursiveInLookup(ForeignKeyRecursiveLookup):
    query = 'SELECT "{pk}" FROM "tree" WHERE %s && "related"'


class ForeignKeyRecursiveReverseLookup(ForeignKeyRecursiveLookup):
    query = 'SELECT unnest("related") FROM "tree" WHERE "{pk}" = %s'


class ForeignKeyRecursiveReverseInLookup(ForeignKeyRecursiveLookup):
    query = 'SELECT unnest("related") FROM "tree" WHERE "{pk}" IN %s'

If we were to use an existing view (created using a migration), then the structure would be largely the same: simply the SQL constant would be simpler:

SQL = 'SELECT {pk} FROM "{table}_{fk}_tree" WHERE {where}'

But then we would need some sort of name mangling for the view: I’ve suggested <tablename>_<fk-name>_tree.

I went into this exercise thinking it would be simple: just write a Lookup (or Transform), but it seems that Foreign Keys in django have a fair bit of special casing. There’s also a bit of lax code around the names of lookups: I may polish it up at some stage.

For now, though, you use it as:

class Node(models.Model):
    node_id = models.AutoField(primary_key=True)
    parent = RecursiveRelation(null=True, blank=True, related_name='children')

Postgres VIEW meet Django Model

Postgres VIEWs are a nice way to store a subset of a table in a way that can itself be queried, or perhaps slightly or radically changing the shape of your table. It has a fairly simple syntax:

CREATE VIEW "foo" AS
SELECT "bar", "baz", "qux"
FROM "corge"
WHERE "grault" IS NULL;

You may use any valid SELECT query as the source of a VIEW, including one that contains UNION or UNION ALL. You can use this form to create a view that takes two similarly formatted tables and combines them into one logical table. Note that for a UNION to work, the columns (and column types) must be identical between the two parts of the query. A UNION will do extra work to ensure all rows are unique: UNION ALL may perform better, especially if you know your rows will be unique (or you need duplicates).

By default, a Postgres VIEW is dynamic, and read-only. With the use of the CREATE MATERIALIZED VIEW form, it’s possible to have a cached copy stored on disk, which requires a REFRESH MATERIALIZED VIEW "viewname" in order to be updated.

It’s also possible to create a writeable VIEW, but I’m not going to discuss those now.
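
Since the view needs to exist before anything can query it, a migration is a natural home for it; a minimal sketch using RunSQL (assuming the corge table is created elsewhere):

from django.db import migrations


class Migration(migrations.Migration):

    dependencies = [
        # Whatever migration creates the "corge" table.
    ]

    operations = [
        migrations.RunSQL(
            sql='''CREATE VIEW "foo" AS
                   SELECT "bar", "baz", "qux"
                   FROM "corge"
                   WHERE "grault" IS NULL''',
            reverse_sql='DROP VIEW "foo"',
        ),
    ]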


There is a feature of Django that makes it really simple to use a VIEW as the read-only source of a Django Model subclass: managed = False.

Given the VIEW defined above, we can write a Model that will allow us to query it:

from django.db import models

class Foo(models.Model):
    bar = models.CharField()
    baz = models.CharField()
    qux = models.CharField()

    class Meta:
      managed = False

Psycopg2 also has the ability to automatically convert values as it fetches them, so you don’t even really need to set the fields as the correct type: but you will probably want to where possible, as an aid to code readability.

In my case, I was returning a two-dimensional ARRAY of TIMESTAMPTZ, but didn’t want to have to include the full code for an ArrayField. So, I just defined it as a CharField, and psycopg2 just gave me the type of object I actually wanted anyway.

There is one little catch, and the code above will not quite work.

Django requires a primary key, even though in this case it makes no sense. You could define any field as a primary key, include a relevant key field from the parent model, or even a dummy value that is the same on every row. Relying on the same psycopg2 trick as above, you could use <tablename>-<id> so as to ensure uniqueness, even though that is not normally a valid value for a Django AutoField.

You probably need to be a little careful here, as if you are doing comparisons, Django will test __class__ and pk for testing equality, so you could hurt yourself if you aren’t careful.

You may also want to prevent write access at the Django level. Overriding save() and delete() on the Model class would be a good start, as well as writing a custom Manager/QuerySet that does the same. You could raise an exception that makes sense, like NotImplementedError, or just leave it as a database error.
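
A sketch of what that guard might look like on the model above (raising NotImplementedError is one choice; a more specific exception may suit you better):

class Foo(models.Model):
    # ... fields as above ...

    class Meta:
        managed = False

    def save(self, *args, **kwargs):
        # The underlying VIEW is read-only: fail loudly and early.
        raise NotImplementedError('Foo is backed by a read-only view')

    def delete(self, *args, **kwargs):
        raise NotImplementedError('Foo is backed by a read-only view')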