QuerySets of various models

In general, the first thing we try to do to reduce the load time of a page in Django is to reduce the number of queries we are making to the database. This often has the greatest impact, sometimes orders of magnitude greater than other improvements.

One problem I recently hit with a specific set of pages was that there are potentially seven different models that may have zero or more items for a given request. This could mean we do seven queries that are all empty.

These are all distinct models, but in this case they are used for the same purpose, and we always need all of them, if there are any relevant records. Whilst there are seven models now, there may be more in the future.

Each of these models has a specific method that validates some data against the rules applicable to that model and its stored attributes. But each model has a different set of attributes, and future models will more than likely have different attributes again.
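To make the shape of the problem concrete, here is an invented pair of such models (the names and attributes are hypothetical, not from the real codebase):

from django.db import models


class DiscountRule(models.Model):
    # Attributes specific to this model...
    percentage = models.DecimalField(max_digits=5, decimal_places=2)

    def validate(self, data):
        """Check the supplied data against this model's attributes."""
        ...


class ShippingRule(models.Model):
    # ...and a completely different set here.
    max_weight = models.IntegerField()
    region = models.TextField()

    def validate(self, data):
        """Same method name, different rules."""
        ...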

There are at least three different ways we could have solved this problem. It turns out we have solved similar problems in the past using the first two approaches; the third is the one I’ve used in this case.


Solution 1: Store all data in the same model, and use a JSON field to store the attributes specific to this “class”.

This also requires a “type” field of some sort. Then, when loading data, we have the model use this type field to work out which attributes should apply.
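A minimal sketch of this approach (model and field names are invented for illustration):

from django.contrib.postgres.fields import JSONField
from django.db import models


class Rule(models.Model):
    # The "type" field that tells us which logical class this row is.
    type = models.TextField()
    # Everything specific to that logical class lives in here.
    attributes = JSONField(default=dict)

    def validate(self, data):
        # Dispatch on the type field to find the applicable checks;
        # VALIDATORS would map each type value to a validation function.
        return VALIDATORS[self.type](self.attributes, data)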

This has a bunch of problems. First and foremost, it becomes far more difficult to get the database (postgres, in this case) to apply integrity constraints. It’s not impossible, but a constraint that checks the type field and then applies a JSON expression is much harder to read. Changing such a constraint, assuming it’s done using a check constraint and not a trigger, is still possible, but likely to be harder to understand.
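For example, requiring an attribute only for one “type” ends up as something like this (a sketch using the hypothetical model above and Django’s CheckConstraint):

from django.contrib.postgres.fields import JSONField
from django.db import models


class Rule(models.Model):
    type = models.TextField()
    attributes = JSONField(default=dict)

    class Meta:
        constraints = [
            # A 'discount' rule must have a percentage attribute;
            # other types are unconstrained. Readable - just - but it
            # grows with every type and every checked key.
            models.CheckConstraint(
                name='discount_requires_percentage',
                check=(
                    models.Q(type='discount', attributes__has_key='percentage')
                    | ~models.Q(type='discount')
                ),
            ),
        ]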

Secondly, we no longer get Django ModelForm support “for free”. It’s not impossible to use a ModelForm, but you need to work a bit harder to get the initial data in, and to ensure that the cleaned data for the fields is applied to the JSON field correctly.

Finally, as hinted above, using a “type” field means the logic for building the checks is more complex, unless you use class swizzling and proxy models or similar to have a different class for each type value. If an instance was accidentally updated to the wrong type, then all sorts of things could go wrong.

This was the first solution we used for our external integrations, and whilst convenient at some level, it turned out to be much harder to manage than distinct model classes. It does avoid the problem that is the basis of this article: we can always fetch objects of different logical types in one query, as they are all the same model class. Indeed, to fetch only a single type, we need to perform extra filtering.


Solution 2: Use concrete/multi-table inheritance.

This is the solution we moved to with our external integrations, and has been much nicer than the previous solution. Instead of having a JSON field with an arbitrary blob of data in it, we have distinct fields. This makes it much easier to have unique constraints, as well as requiring values, or values of a specific type. Knowing that the database is going to catch us accidentally putting a text value into the external id field for a system that requires an integer is reassuring.

This overcomes the second problem. We can now just use a Django ModelForm, and this makes writing views much easier. The validation for a given field or set of fields lives on the model, and where possible also in the database, as an exclusion or check constraint.

It also overcomes the third problem. We have distinct classes, which can have their own methods. We don’t need to try to use some magical proxy model code, and it’s easy for new developers to follow.

Finally, thanks to django-model-utils InheritanceManager, we can fetch all of our objects using the concrete parent model, and the .select_subclasses() method to downcast to our required class.
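For illustration, a sketch of that pattern with invented integration models (InheritanceManager comes from django-model-utils):

from django.db import models
from model_utils.managers import InheritanceManager


class Integration(models.Model):
    # The concrete parent has its own table, with one row per child.
    name = models.TextField()

    objects = InheritanceManager()


class SlackIntegration(Integration):
    # Real columns mean real database constraints.
    team_id = models.TextField(unique=True)


class JiraIntegration(Integration):
    external_id = models.IntegerField(unique=True)


# One query via the parent, with each row downcast to its subclass:
integrations = Integration.objects.all().select_subclasses()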

There are a couple of drawbacks to using concrete inheritance. Any fetch of an instance will perform a JOIN in your database, but more importantly, it’s impossible to perform a bulk_create() for models of these types.


Solution 3: Use a Postgres VIEW and JSONB to perform one query, and reconstruct models.

In the problem I’ve recently solved, we had a bunch of different models that, although used in similar ways, didn’t have much in common. They were pre-existing, and it wasn’t worth the effort to move them to concrete inheritance; and using JSON fields for all data is not something I would repeat.

Instead, I came up with an idea that seems novel, based on some previous work I’ve done converting JSON data into models:

CREATE VIEW combined_objects AS

SELECT 'foo.Foo' AS model,
       foo.id,
       foo.tenant_id,
       TO_JSONB(foo) AS data
  FROM foo

 UNION ALL

SELECT 'foo.Bar' AS model,
       bar.id,
       baz.tenant_id,
       TO_JSONB(bar) AS data
  FROM bar
 INNER JOIN baz ON (baz.id = bar.baz_id)

 UNION ALL

SELECT 'qux.Qux' AS model,
       qux.id,
       tenant.id,
       TO_JSONB(qux) AS data
  FROM tenant
  JOIN qux ON (true)

This builds up a postgres view that contains all of the data from each model, but in a generic way. It also contains a tenant_id, which in this case is the mechanism we’ll use to filter the objects required at any given time. This can be a field on the model itself (as shown in the first subquery), or a field on a related model (as shown in the second). It could even be every object in a table, for every tenant, as shown in the third.

From there, we need a model that will recreate the model instances correctly:

from django.apps import apps
from django.contrib.postgres.fields import JSONField
from django.db import models
from django.utils.functional import cached_property


class CombinedObjectQuerySet(models.query.QuerySet):
    def for_tenant(self, tenant):
        return self.filter(tenant=tenant)

    def as_model_instances(self):
        return [x.instance for x in self]


class CombinedObject(models.Model):
    model = models.TextField()
    tenant = models.ForeignKey('tenant.Tenant', on_delete=models.DO_NOTHING)
    data = JSONField()

    objects = CombinedObjectQuerySet.as_manager()

    class Meta:
        managed = False
        db_table = 'combined_objects'

    def __str__(self):
        return '{} wrapper ({})'.format(self.model, self.instance)

    def __eq__(self, other):
        return self.instance == other.instance

    @property
    def model_class(self):
        return apps.get_model(*self.model.split('.'))

    @cached_property
    def instance(self):
        return self.model_class(**self.data)

This works great - as long as you don’t need additional annotations, values type-cast to Python, or related objects. That is where it starts to get a bit tricky.

We can handle annotations and typecasting:

@cached_property
def instance(self):
    data = self.data
    model_class = self.model_class
    field_data = {
        field.name: field.to_python(data[field.name])
        for field in model_class._meta.fields
        if field.name in data
    }
    instance = model_class(**field_data)
    for attr, value in data.items():
        if attr not in field_data:
            setattr(instance, attr, value)
    return instance

There’s still the issue of foreign keys in the target models: in this case I know the code is not going to traverse these and trigger extra database hits. We could look at omitting those fields to prevent that being possible, but this is working well enough for now.
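If we did want to rule that out, a variant of the instance property could skip relation fields entirely - a sketch, not what is deployed:

@cached_property
def instance(self):
    data = self.data
    model_class = self.model_class
    field_data = {
        field.attname: field.to_python(data[field.attname])
        for field in model_class._meta.fields
        # Skip foreign keys (and their *_id columns), so nothing can
        # be traversed to trigger a query.
        if not field.is_relation and field.attname in data
    }
    return model_class(**field_data)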

Postgres VIEW from Django QuerySet

It’s already possible, given an existing Postgres (or other database) VIEW, to stick a Django Model in front of it, and have it fetch data from that instead of a table.

Creating the views can currently be done using raw SQL (and a RunSQL migration operation), or using some helpers to store the SQL in files for easy versioning.

It would be excellent if it was possible to use Django’s ORM to actually generate the VIEW, and even better if you could make the migration autodetector generate migrations.

But why would this be necessary? Surely, if you were able to create a QuerySet instance that contains the items in your view, that should be good enough?

Not quite, because it is currently not possible to perform the following type of query using the ORM:

SELECT foo.a,
       foo.b,
       bar.d
  FROM foo
  INNER JOIN (
    SELECT baz.a,
           ARRAY_AGG(baz.c) AS d
      FROM baz
     GROUP BY baz.a) bar ON (foo.a = bar.a)

That is, generating a join to a subquery is not possible in the ORM. You could probably get away with a correlated Subquery, but that would likely not perform as well as a join: a subquery in the SELECT clause is evaluated once for each row, whereas a joined subquery is evaluated once.
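For comparison, the correlated version is expressible in the ORM today - a sketch against hypothetical Foo and Baz models mirroring the tables above - but the inner query runs once per row of foo:

from django.contrib.postgres.aggregates import ArrayAgg
from django.db.models import OuterRef, Subquery

# One ARRAY_AGG subquery, evaluated for each row of foo.
foos = Foo.objects.annotate(
    d=Subquery(
        Baz.objects.filter(a=OuterRef('a'))
                   .values('a')
                   .annotate(d=ArrayAgg('c'))
                   .values('d')
    )
)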

So, we could use a VIEW for the subquery component:

CREATE OR REPLACE VIEW bar AS

SELECT baz.a,
       ARRAY_AGG(baz.c) AS d
  FROM baz
 GROUP BY baz.a;

And then stick a model in front of that, and join accordingly:

SELECT foo.a,
       foo.b,
       bar.d
  FROM foo
 INNER JOIN bar ON (foo.a = bar.a)

The Django model for the view would look something like:

class Bar(models.Model):
    a = models.OneToOneField(
        'foo.Foo',
        on_delete=models.DO_NOTHING,
        primary_key=True,
        related_name='bar'
    )
    d = django.contrib.postgres.fields.ArrayField(
        base_field=models.TextField()
    )

    class Meta:
        managed = False
        db_table = 'bar'  # Match the name of the VIEW created above.

The on_delete=models.DO_NOTHING is important: without it, a delete of a Foo instance would trigger an attempted delete of a Bar instance - which would cause a database error, because it’s coming from a VIEW instead of a TABLE.

Then, we’d be able to use:

queryset = Foo.objects.select_related('bar')

So, that’s the logic behind needing a join to a subquery, and it becomes even more compelling if you need that subquery/view to filter the objects, or perform some other expression or operation. So, how can we make Django emit the SQL that will enable us to handle this?

There are two problems:

  • Turn a queryset into a VIEW.
  • Get the migration autodetector to trigger VIEW creation.

The other day I came across Create Table As Select in Django, and it made me realise that we can use basically the same logic for creating a view. So, we can create a migration operation that will perform this for us:

from django.db.migrations.operations.base import Operation


class CreateOrReplaceView(Operation):
    def __init__(self, view_name, queryset):
        self.view_name = view_name
        self.queryset = queryset

    def database_forwards(self, app_label, schema_editor, from_state, to_state):
        queryset = self.queryset
        compiler = queryset.query.get_compiler(using=schema_editor.connection.alias)
        sql, params = compiler.as_sql()
        sql = 'CREATE OR REPLACE VIEW {view} AS {sql}'.format(
            view=schema_editor.connection.ops.quote_name(self.view_name),
            sql=sql
        )
        schema_editor.execute(sql, params)

    def state_forwards(self, app_label, state):
        pass

We can then use this operation in a migration; it needs to be passed a queryset.
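Used naively, that might look something like the following sketch - though, as the next paragraph explains, importing models directly into a migration is exactly what we can’t leave in place:

from django.contrib.postgres.aggregates import ArrayAgg
from django.db import migrations

from foo.models import Baz  # This import is the problem.


class Migration(migrations.Migration):
    dependencies = [
        ('foo', '0001_initial'),
    ]

    operations = [
        # Bake the queryset into the operation (for illustration only).
        CreateOrReplaceView('bar', Baz.objects.values('a').annotate(d=ArrayAgg('c'))),
    ]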

This doesn’t really solve how to define the queryset for the view, nor give us a mechanism for detecting changes to that queryset (so we can generate a new migration if necessary). It also means we have a queryset written directly in our migration. We won’t be able to leave it like that: due to app-loading issues, you can’t import model classes while migrations are being loaded - and even if you could, you shouldn’t access them during a migration anyway; you should use the models from the ProjectState tied to your current position in the migration graph.

What would be excellent is if we could write something like:

class Bar(models.Model):
    a = models.OneToOneField(
        'foo.Foo',
        on_delete=models.DO_NOTHING,
        primary_key=True,
        related_name='bar',
    )
    d = django.contrib.postgres.fields.ArrayField(
        base_field=models.TextField()
    )

    class Meta:
        managed = False
        db_table = 'bar'

    @property
    def view_queryset(self):
        return Baz.objects.values('a').annotate(d=ArrayAgg('c'))

And then, if we change our view definition:

@property
def view_queryset(self):
  return Baz.objects.values('a').filter(
      c__startswith='qux',
  ).annotate(
      d=ArrayAgg('c')
  )

… we would want a migration operation generated that includes the new queryset, or at least be able to know that it has changed. Ideally, we’d want to have a queryset attribute inside the Meta class in our model, which could itself be a property. However, that’s not possible without making changes to django itself.

In the meantime, we can borrow the pattern used by RunPython to have a callable that is passed some parameters during application of the migration, which returns the queryset. We can then have a migration file that looks somewhat like:

def view_queryset(apps, schema_editor):
    Baz = apps.get_model('foo', 'Baz')

    return Baz.objects.values('a').filter(
        c__startswith='qux'
    ).annotate(
        d=ArrayAgg('c')
    )


class Migration(migrations.Migration):
    dependencies = [
        ('foo', '0001_initial'),
    ]

    operations = [
        migrations.CreateModel(
            name='Bar',
            fields=[
                ('a', models.OneToOneField(...)),
                ('d', ArrayField(base_field=models.TextField(), ...)),
            ],
            options={
                'managed': False,
            }
        ),
        CreateOrReplaceView('Bar', view_queryset),
    ]

We still need to have the CreateModel statement so Django knows about our model, but the important bit in this file is the CreateOrReplaceView, which references the callable.

Now for the actual migration operation.

class CreateOrReplaceView(migrations.Operation):
    def __init__(self, model, queryset_factory):
        self.model = model
        self.queryset_factory = queryset_factory

    def state_forwards(self, app_label, state):
        # The view lives outside Django's model state, but the base
        # Operation requires this method to be implemented.
        pass

    def database_forwards(self, app_label, schema_editor, from_state, to_state):
        model = from_state.apps.get_model(app_label, self.model)
        queryset = self.queryset_factory(from_state.apps, schema_editor)
        compiler = queryset.query.get_compiler(using=schema_editor.connection.alias)
        sql, params = compiler.as_sql()
        sql = 'CREATE OR REPLACE VIEW {view_name} AS {query}'.format(
            view_name=schema_editor.connection.ops.quote_name(model._meta.db_table),
            query=sql,
        )
        schema_editor.execute(sql, params)

The backwards migration is not quite a solved problem yet: I do have a working solution that steps up the stack to determine what the current migration name is, and then finds the previous migration that contains one of these operations for this model, but that’s a bit nasty.


There’s no (clean) way to inject ourselves into the migration autodetector and “notice” when we need to generate a new version of the view, however we can leverage the checks framework to notify the user when a view queryset is out of date compared to the latest migration. This assumes each view-backed model exposes its current queryset via a get_view_queryset() classmethod (a class-level variant of the property shown earlier).

from django.apps import apps
from django.core.checks import register

@register()
def check_view_definitions(app_configs, **kwargs):
    errors = []

    if app_configs is None:
        app_configs = apps.app_configs.values()

    for app_config in app_configs:
        errors.extend(_check_view_definitions(app_config))

    return errors

And then we need to implement _check_view_definitions:

def get_out_of_date_views(app_config):
    app_name = app_config.name

    view_models = [
        model
        # We need the real app_config, not the migration one.
        for model in apps.get_app_config(app_name.split('.')[-1]).get_models()
        if not model._meta.managed and hasattr(model, 'get_view_queryset')
    ]

    for model in view_models:
        latest = get_latest_queryset(model)
        current = model.get_view_queryset()

        if latest is None or current.query.sql_with_params() != latest.query.sql_with_params():
            yield MissingViewMigration(
                model,
                current,
                latest,
                Warning(W003.format(app_name=app_name, model_name=model._meta.model_name), id='sql_helpers.W003'),
            )


def _check_view_definitions(app_config):
    return [x.warning for x in get_out_of_date_views(app_config)]
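MissingViewMigration and W003 aren’t shown here; they are small helpers that would look something like:

from collections import namedtuple

# A container for everything the check needs to report.
MissingViewMigration = namedtuple(
    'MissingViewMigration', ['model', 'current', 'latest', 'warning']
)

# Message template for the warning.
W003 = 'The view queryset for {app_name}.{model_name} does not match the latest migration.'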

The last puzzle piece there is get_latest_queryset, which is a bit more complicated:

def get_latest_queryset(model, before=None):
    from django.db.migrations.loader import MigrationLoader
    from django.db import connection

    migration_loader = MigrationLoader(None)
    migrations = dict(migration_loader.disk_migrations)
    migration_loader.build_graph()
    state = migration_loader.project_state()
    app_label = model._meta.app_label
    root_node = dict(migration_loader.graph.root_nodes())[app_label]
    # We want to skip any migrations in our reverse list until we have
    # hit a specific node; if no node was supplied, we don't skip any.
    # Initialising from `before` also avoids referencing seen_before
    # before it has been assigned.
    seen_before = not before
    for node in migration_loader.graph.backwards_plan((app_label, root_node)):
        if node == before:
            seen_before = True
            continue
        if not seen_before:
            continue
        migration = migrations[node]
        for operation in migration.operations:
            if (
                isinstance(operation, CreateOrReplaceView) and
                operation.model.lower() == model._meta.model_name.lower()
            ):
                return operation.queryset_factory(state.apps, connection.schema_editor())

This also has code to allow us to pass in a node (before), which limits the search to migrations that occur before that node in the forwards migration plan.

Since we already have the bits in place, we could also have a management command that creates a stub migration (without the queryset factory - that’s a problem I haven’t yet solved). I’ve built this into my related “load SQL from files” app.


This is still a bit of a work in progress, but writing it down helped me clarify some concepts.