Postgres VIEW from Django QuerySet

It’s already possible, given an existing Postgres (or other database) VIEW, to stick a Django Model in front of it, and have it fetch data from that instead of a table.

Creating the views can currently be done using raw SQL (and a RunSQL migration operation), or using some helpers to store the SQL in files for easy versioning.

It would be excellent if it was possible to use Django’s ORM to actually generate the VIEW, and even better if you could make the migration autodetector generate migrations.

But why would this be necessary? Surely, if you were able to create a QuerySet instance that contains the items in your view, that should be good enough?

Not quite, because currently using the ORM it is not possible to perform the following type of query:

SELECT foo.a,
       foo.b,
       bar.d
  FROM foo
  INNER JOIN (
    SELECT baz.a,
           ARRAY_AGG(baz.c) AS d
      FROM baz
     GROUP BY baz.a) bar ON (foo.a = bar.a)

That is, generating a join to a subquery is not possible in the ORM. In this case, you could probably get away with a correlated Subquery, however that would probably not perform as well as using a join in this case. This is because a subquery in a SELECT is evaluated once for each row, whereas a subquery join will be evaluated once.

So, we could use a VIEW for the subquery component:

CREATE OR REPLACE VIEW bar AS

SELECT baz.a,
       ARRAY_AGG(baz.c) AS d
  FROM baz
 GROUP BY baz.a;

And then stick a model in front of that, and join accordingly:

SELECT foo.a,
       foo.b,
       bar.d
  FROM foo
 INNER JOIN bar ON (foo.a = bar.a)

The Django model for the view would look something like:

class Bar(models.Model):
    a = models.OneToOneField(
        'foo.Foo',
        on_delete=models.DO_NOTHING,
        primary_key=True,
        related_name='bar'
    )
    d = django.contrib.postgres.fields.ArrayField(
        base_field=models.TextField()
    )

    class Meta:
        managed = False

The on_delete=models.DO_NOTHING is important: without it, a delete of a Foo instance would trigger an attempted delete of a Bar instance - which would cause a database error, because it’s coming from a VIEW instead of a TABLE.

Then, we’d be able to use:

queryset = Foo.objects.select_related('bar')

So, that’s the logic behind needing to be able to do a subquery, and it becomes even more compelling if you need that subquery/view to filter the objects, or perform some other expression/operation. So, how can we make Django emit code that will enable us to handle that?

There are two problems:

  • Turn a queryset into a VIEW.
  • Get the migration autodetector to trigger VIEW creation.

The other day I came across Create Table As Select in Django, and it made me realise that we can use basically the same logic for creating a view. So, we can create a migration operation that will perform this for us:

class CreateOrReplaceView(Operation):
    def __init__(self, view_name, queryset):
        self.view_name = view_name
        self.queryset = queryset

    def database_forwards(self, app_label, schema_editor, from_state, to_state):
        queryset = self.queryset
        compiler = queryset.query.get_compiler(using=schema_editor.connection.alias)
        sql, params = compiler.as_sql()
        sql = 'CREATE OR REPLACE VIEW {view} AS {sql}'.format(
            view=schema_editor.connection.ops.quote_name(self.view_name),
            sql=sql
        )
        schema_editor.execute(sql, params)

    def state_forwards(self, app_label, state):
        pass

We can then have this operation (which needs to be passed a queryset).

This doesn’t really solve how to define the queryset for the view, and have some mechanism for resolving changes made to that queryset (so we can generate a new migration if necessary). It also means we have a queryset written in our migration operation. We won’t be able to leave it like that: due to loading issues, you won’t be able to import model classes during the migration setup - and even if you could, you shouldn’t be accessing them during a migration anyway - you should use models from the ProjectState which is tied to where in the migration graph you currently are.

What would be excellent is if we could write something like:

class Bar(models.Model):
    a = models.OneToOneField(
        'foo.Foo',
        on_delete=models.DO_NOTHING,
        primary_key=True,
        related_name='bar',
    )
    d = django.contrib.postgres.fields.ArrayField(
        base_field=models.TextField()
    )

    class Meta:
        managed = False

    @property
    def view_queryset(self):
        return Baz.objects.values('a').annotate(d=ArrayAgg('c'))

And then, if we change our view definition:

@property
def view_queryset(self):
  return Baz.objects.values('a').filter(
      c__startswith='qux',
  ).annotate(
      d=ArrayAgg('c')
  )

… we would want a migration operation generated that includes the new queryset, or at least be able to know that it has changed. Ideally, we’d want to have a queryset attribute inside the Meta class in our model, which could itself be a property. However, that’s not possible without making changes to django itself.

In the meantime, we can borrow the pattern used by RunPython to have a callable that is passed some parameters during application of the migration, which returns the queryset. We can then have a migration file that looks somewhat like:

def view_queryset(apps, schema_editor):
    Baz = apps.get_model('foo', 'Baz')

    return Baz.objects.values('a').filter(
        c__startswith='qux'
    ).annotate(
        d=ArrayAgg('c')
    )


class Migration(migrations.Migration):
    dependencies = [
        ('foo', '0001_initial'),
    ]

    operations = [
        migrations.CreateModel(
            name='Bar',
            fields=[
                ('a', models.OneToOneField(...)),
                ('d', ArrayField(base_field=models.TextField(), ...)),
            ],
            options={
                'managed': False,
            }
        ),
        CreateOrReplaceView('Bar', view_queryset),
    ]

We still need to have the CreateModel statement so Django knows about our model, but the important bit in this file is the CreateOrReplaceView, which references the callable.

Now for the actual migration operation.

class CreateOrReplaceView(migrations.Operation):
    def __init__(self, model, queryset_factory):
        self.model = model
        self.queryset_factory = queryset_factory

    def database_forwards(self, app_label, schema_editor, from_state, to_state):
        model = from_state.apps.get_model(app_label, self.model)
        queryset = self.queryset_factory(from_state.apps, schema_editor)
        compiler = queryset.query.get_compiler(using=schema_editor.connection.alias)
        sql, params = compiler.as_sql()
        sql = 'CREATE OR REPLACE VIEW {view_name} AS {query}'.format(
            view_name=model._meta.db_table,
            query=sql,
        )
        schema_editor.execute(sql, params)

The backwards migration is not quite a solved problem yet: I do have a working solution that steps up the stack to determine what the current migration name is, and then finds the previous migration that contains one of these operations for this model, but that’s a bit nasty.


There’s no (clean) way to inject ourself into the migration autodetector and “notice” when we need to generate a new version of the view, however we can leverage the checks framework to notify the user when our view queryset is out of date compared to the latest migration.

from django.apps import apps
from django.core.checks import register

@register()
def check_view_definitions(app_configs, **kwargs):
    errors = []

    if app_configs is None:
        app_configs = apps.app_configs.values()

    for app_config in app_configs:
        errors.extend(_check_view_definitions(app_config))

    return errors

And then we need to implement _check_view_definitions:

def get_out_of_date_views(app_config):
    app_name = app_config.name

    view_models = [
        model
        # We need the real app_config, not the migration one.
        for model in apps.get_app_config(app_name.split('.')[-1]).get_models()
        if not model._meta.managed and hasattr(model, 'get_view_queryset')
    ]

    for model in view_models:
        latest = get_latest_queryset(model)
        current = model.get_view_queryset()

        if latest is None or current.query.sql_with_params() != latest.query.sql_with_params():
            yield MissingViewMigration(
                model,
                current,
                latest,
                Warning(W003.format(app_name=app_name, model_name=model._meta.model_name), id='sql_helpers.W003'),
            )


def _check_view_definitions(app_config):
    return [x.warning for x in get_out_of_date_views(app_config)]

The last puzzle piece there is get_latest_queryset, which is a bit more complicated:

def get_latest_queryset(model, before=None):
    from django.db.migrations.loader import MigrationLoader
    from django.db import connection

    migration_loader = MigrationLoader(None)
    migrations = dict(migration_loader.disk_migrations)
    migration_loader.build_graph()
    state = migration_loader.project_state()
    app_label = model._meta.app_label
    root_node = dict(migration_loader.graph.root_nodes())[app_label]
    # We want to skip any migrations in our reverse list until we have
    # hit a specific node: however, if that is not supplied, it means
    # we don't skip any.
    if not before:
        seen_before = True
    for node in migration_loader.graph.backwards_plan((app_label, root_node)):
        if node == before:
            seen_before = True
            continue
        if not seen_before:
            continue
        migration = migrations[node]
        for operation in migration.operations:
            if (
                isinstance(operation, CreateOrReplaceView) and
                operation.model.lower() == model._meta.model_name.lower()
            ):
                return operation.queryset_factory(state.apps, connection.schema_editor())

This also has code to allow us to pass in a node (before), which limits the search to migrations that occur before that node in the forwards migration plan.

Since we already have the bits in place, we could also have a management command that creates a stub migration (without the queryset factory, that’s a problem I haven’t yet solved). I’ve built this into my related “load SQL from files” app.


This is still a bit of a work in progress, but writing it down helped me clarify some concepts.

Postgres VIEW meet Django Model

Postgres VIEWs are a nice way to store a subset of a table in a way that can itself be queried, or perhaps slightly or radically changing the shape of your table. It has a fairly simple syntax:

CREATE VIEW "foo" AS
SELECT "bar", "baz", "qux"
FROM "corge"
WHERE "grault" IS NULL;

You may use any valid SELECT query as the source of a VIEW, including one that contains UNION or UNION ALL. You can use this form to create a view that takes two similarly formatted tables and combines them into one logical table. Note that for a UNION to work, the columns (and column types) must be identical between the two parts of the query. A UNION will do extra work to ensure all rows are unique: UNION ALL may perform better, especially if you know your rows will be unique (or you need duplicates).

By default, a Postgres VIEW is dynamic, and read-only. With the use of the CREATE MATERIALIZED VIEW form, it’s possible to have a cached copy stored on disk, which requires an UPDATE MATERIALIZED VIEW "viewname" in order to cause an update.

It’s also possible to create a writeable VIEW, but I’m not going to discuss those now.


There is a feature of Django that makes in really simple to use a VIEW as the read-only source of a Django Model subclass: managed = False.

Given the VIEW defined above, we can write a Model that will allow us to query it:

from django.db import models

class Foo(models.Model):
    bar = models.CharField()
    baz = models.CharField()
    qux = models.CharField()

    class Meta:
      managed = False

Psycopg2 also has the ability to automatically convert values as it fetches them, so you don’t even really need to set the fields as the correct type: but you will probably want to where possible, as an aid to code readability.

In my case, I was returning a two-dimensional ARRAY of TIMESTAMPTZ, but didn’t want to have to include the full code for an ArrayField. So, I just defined it as a CharField, and psycopg2 just gave me the type of object I actually wanted anyway.

There is one little catch, and the code above will not quite work.

Django requires a primary key, even though in this case it makes no sense. You could define any field as a primary key, include a relevant key field from the parent model, or even a dummy value that is the same on every row. Relying on the same psycopg2 trick as above, you could use <tablename>-<id> so as to ensure uniqueness, even though that is not normally a valid value for a Django AutoField.

You probably need to be a little careful here, as if you are doing comparisons, Django will test __class__ and pk for testing equality, so you could hurt yourself if you aren’t careful.

You may also want to prevent write access at the Django level. Overriding save() and delete() on the Model class would be a good start, as well as writing a custom Manager/QuerySet that does the same. You could raise an exception that makes sense, like NotImplemented, or just leave it as a database error.

Long Live Adjacency Lists

I recently wrote about the excellent book SQL Antipatterns, and in it briefly discussed the tree structures. I’ve been thinking about trees in Postgres a fair bit lately, and a discussion on #django gave me further incentive to revisit this topic.

The book discusses four methods of storing a tree in a database.

Adjacency Lists, apart from the inability to grab a full or partial tree easily, are the simplest to understand. The child object stores a reference to it’s parent. Because this is a foreign key, then it always maintains referential integrity. Fetching a parent is simple, as is fetching all children, or siblings. It’s only when you need to fetch an arbitrary depth that things become problematic, unless you use a recursive query. More on that later.

Postgres has an extension called ltree, which provides an implementation of a Path Enumeration, but one thing that really bothers me about this type of structure is the lack of referential integrity. In practice, I’m not sure what having this ltree structure would give you over simply storing the keys in an ARRAY type. Indeed, if Postgres ever gets Foreign Key constraints for ARRAY elements (which there is a patch floating around for), this becomes even more compelling. It also seems to me that restructuring a tree becomes a bit more challenging in a Path Enumeration than an Adjacency List.

Nested Sets are also interesting, and maintain FK integrity, but require potentially rewriting lots of data when any change is made to the tree. They aren’t that appealing to me: perhaps I fail to see any big advantages of this structure.

Finally, Closure Tables are perhaps the most interesting. This stores all ancestor-descendant relationships, rather than just parent-child, which again requires more work when adding or removing. Again, Referential Integrity is preserved, but it seems like there is lots of work to maintain them.

From all of these, there are some significant advantages, in my mind, to using a simple Adjacency List.

  1. Adding a new row never requires you to alter any other rows in the database.
  2. Moving a subtree to a different location only requires a change to one now in the database.
  3. It’s never possible to end up with Referential Integrity errors: the database will prevent you from deleting a parent row whilst it still has children (or, you may set it to CASCADE or SET NULL the children automatically).
  4. It’s conceptually very simple. Everyone understands the parent-child relationship (and all of the relationships that follow, like grand-parents). It’s a similar mental model to how we think about our own families, except we don’t have exactly one parent.

There is really only two things that are hard to do:

  1. Given a node, select all descendants of that node.
  2. Given a node, select all ancestors of that node.

But, as we shall see shortly, it is possible to do these in Postgres using some nice recursive features.

There is another advantage to using an Adjacency List, this time from the perspective of Django. We can do it without needing to install a new package, or subclass or mix-in a new Model:

class Node(models.Model):
    node_id = models.AutoField(primary_key=True)
    parent = models.ForeignKey('self', null=True, blank=True, related_name='children')

That’s it.

Now, using Postgres, it’s possible to build a recursive VIEW that contains the whole tree:

CREATE RECURSIVE VIEW tree (node_id, ancestors) AS (
    SELECT node_id, '{}'::integer[]
    FROM nodes WHERE parent_id IS NULL
  UNION ALL
    SELECT n.node_id, t.ancestors || n.parent_id
    FROM nodes n, tree t
    WHERE n.parent_id = t.node_id
);

We can then query this (replacing %s with the parent node id):

SELECT node_id
FROM nodes INNER JOIN tree USING (node_id)
WHERE %s = ANY(ancestors);

Or, if you want to select for multiple parents:

SELECT node_id
FROM nodes INNER JOIN tree USING (node_id)
WHERE [%s, %s] && ancestors;

This actually performs relatively well, and, if it doesn’t do well enough, we could create a MATERIALIZED VIEW based on the recursive view, and query that instead (refreshing it whenever we need to, perhaps using a trigger).

CREATE MATERIALIZED VIEW tree_m AS (SELECT * FROM tree);

CREATE FUNCTION refresh_tree_m() RETURNS trigger AS $$
  BEGIN
  REFRESH MATERIALIZED VIEW tree_m;
  END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trig_refresh_tree_m AFTER TRUNCATE OR INSERT OR UPDATE OR DELETE
ON nodes FOR EACH STATEMENT
EXECUTE PROCEDURE refresh_tree_m();

This view is still not perfect though. We can improve it to allow us to limit depth of ancestry:

CREATE RECURSIVE VIEW tree (node_id, ancestors, depth) AS (
    SELECT node_id, '{}'::integer[], 0
    FROM nodes WHERE parent_id IS NULL
  UNION ALL
    SELECT n.node_id, t.ancestors || n.parent_id, t.depth + 1
    FROM nodes n, tree t
    WHERE n.parent_id = t.node_id
);

SELECT node_id FROM nodes INNER JOIN tree USING (node_id)
WHERE %s = ANY(ancestors) AND depth < %s;

This is pretty good now, but if we have cycles in our tree (yes, this makes it technically no longer a tree, but a graph, of which a tree is a restricted kind), this query will run forever. There’s a pretty neat trick to prevent cycles:

CREATE RECURSIVE VIEW tree (node_id, ancestors, depth, cycle) AS (
    SELECT node_id, '{}'::integer[], 0, FALSE
    FROM nodes WHERE parent_id IS NULL
  UNION ALL
    SELECT
      n.node_id, t.ancestors || n.parent_id, t.depth + 1,
      n.parent_id = ANY(t.ancestors)
    FROM nodes n, tree t
    WHERE n.parent_id = t.node_id
    AND NOT t.cycle
);

You don’t need to use the cycle column outside of the view.

The query used for the view can be repurposed into a Common Table Expression, which is basically a way of defining a view that only exists for the query we are executing (but will itself only be executed once, even if it’s referred to lots of times):

WITH RECURSIVE tree (node_id, ancestors, depth, cycle) AS (
    SELECT node_id, '{}'::integer[], 0, FALSE
    FROM nodes WHERE parent_id IS NULL
  UNION ALL
    SELECT
      n.node_id, t.ancestors || n.parent_id, t.depth + 1,
      n.parent_id = ANY(t.ancestors)
    FROM nodes n, tree t
    WHERE n.parent_id = t.node_id
    AND NOT t.cycle
) SELECT n.* FROM nodes n INNER JOIN tree USING (node_id)
WHERE %s = ANY(ancestors);

You can see that this syntax basically defines the view before running the real query.


Looking at it from the perspective of Django, we would like to be able to spell a query something like:

Node.objects.filter(parent__recursive=node)
Node.objects.filter(parent__recursive__in=nodes)
Node.objects.filter(children__recursive__contains=node)

The problem we have with using the CTE immediately above is that we don’t have access to the full query at the time we are dealing with the filter. We could define the view prior to running the query (perhaps in a migration), but this means it’s more than just a simple field: although with the new migrations framework, we could make it so that makemigrations automatically adds a migration operation to create the recursive view.

The other solution is to still use a recursive CTE, but use it as a subquery. I’m still investigating if this will have poor performance characteristics.

Here is an implementation of doing just that:

from django.db import models

SQL = """
WITH RECURSIVE "tree" ("{pk}", "related", "cycle") AS (
    SELECT "{pk}", ARRAY[]::integer[], FALSE
    FROM "{table}" WHERE "{fk}" IS NULL
  UNION ALL
    SELECT a."{pk}", b."related" || a."{fk}", a."{fk}" = ANY(b."related")
    FROM "tree" b, "{table}" a
    WHERE a."{fk}" = b."{pk}" AND NOT b."cycle"
) {query}
"""


class RecursiveRelation(models.ForeignKey):
    def __init__(self, *args, **kwargs):
        super(RecursiveRelation, self).__init__('self', *args, **kwargs)

    def get_lookup_constraint(self, constraint_class, alias, targets, sources, lookups,
                              raw_value):
        if lookups[0] == 'recursive':
            # With a recursive query, we want to build up a subquery that creates
            # the simplest possible tree we can deal with.
            data = {
                'fk': self.get_attname(),
                'pk': self.related_fields[0][1].get_attname(),
                'table': self.model._meta.db_table
            }
            if lookups[-1] == 'in':
                if targets[0] == self:
                    raw_value = ForeignKeyRecursiveInLookup(raw_value, **data)
                else:
                    raw_value = ForeignKeyRecursiveReverseInLookup(raw_value, **data)
            else:
                if targets[0] == self:
                    raw_value = ForeignKeyRecursiveLookup(raw_value, **data)
                else:
                    raw_value = ForeignKeyRecursiveReverseLookup(raw_value, **data)

            # Rewrite some variables so we get correct behaviour.

            # This makes the query based on the original table, not the joined version,
            # which was skipping a level of relation. It still joins the table, however,
            # which can't be great for performance
            alias = self.model._meta.db_table
            # This sets the correct lookup type, removing the recursive bit.
            lookups = lookups[1:] or ['exact']

        return super(RecursiveRelation, self).get_lookup_constraint(
            constraint_class, alias, targets, sources, lookups, raw_value
        )


class ForeignKeyRecursiveLookup(object):
    query = 'SELECT "{pk}" FROM "tree" WHERE %s = ANY("related")'

    def __init__(self, value, **kwargs):
        self.value = value
        self.data = kwargs

    def get_compiler(self, *args, **kwargs):
        return self

    def as_subquery_condition(self, alias, columns, qn):
        sql = SQL.format(
            query=self.query.format(**self.data),
            **self.data
        )
        return '%s.%s IN (%s)' % (qn(alias), qn(self.data['pk']), sql), [self.value]


class ForeignKeyRecursiveInLookup(ForeignKeyRecursiveLookup):
    query = 'SELECT "{pk}" FROM "tree" WHERE %s && "related"'


class ForeignKeyRecursiveReverseLookup(ForeignKeyRecursiveLookup):
    query = 'SELECT unnest("related") FROM "tree" WHERE "{pk}" = %s'


class ForeignKeyRecursiveReverseInLookup(ForeignKeyRecursiveLookup):
    query = 'SELECT unnest("related") FROM "tree" WHERE "{pk}" IN %s'

If we were to use an existing view (created using a migration), then the structure would be largely the same: simply the SQL constant would be simpler:

SQL = 'SELECT {pk} FROM "{table}_{fk}_tree" WHERE {where}'

But then we would need some sort of name mangling for the view: I’ve suggested <tablename>_<fk-name>-tree.

I went into this exercise thinking it would be simple: just write a Lookup (or Transform), but it seems that Foreign Keys in django have a fair bit of special casing. There’s also a bit of lax code around the names of lookups: I may polish it up at some stage.

For now, though, you use it as:

class Node(models.Model):
    node_id = models.AutoField(primary_key=True)
    parent = RecursiveRelation(null=True, blank=True, related_name='children')