QuerySets of various models

2020-03-09 @ 10:36:53

In general, the first thing we try to do to reduce the load time of a page in Django is to reduce the number of queries we are making to the database. This often has the greatest impact, sometimes orders of magnitude greater than other improvements.

One problem I recently hit with a specific set of pages was that there are potentially seven different models that may have zero or more items for a given request. This could mean we do seven queries that are all empty.

These are all distinct models, but in this case they are used for the same purpose, and we always need all of them if there are any relevant records. There are seven models now, and there may be more in the future.

Each of these models has a specific method that validates some data against the rules applicable to that model and its stored attributes. But each model has a different set of attributes, and future models will more than likely have different attributes again.

There are at least three different ways we could have solved this problem. It turns out we have solved similar problems in the past using the first two, and the third is the one I’ve used in this case.

Solution 1: Store all data in the same model, and use a JSON field to store the attributes specific to this “class”.

This also requires a “type” field of some sort. Then, when loading data, we have the model use this type field to work out which attributes should apply.

This has a bunch of problems. First and foremost, it becomes far more difficult to get the database (Postgres, in this case) to apply integrity constraints. It’s not impossible, but a constraint that checks the type field and then evaluates a JSON expression is much harder to read. Changing such a constraint, assuming it’s done using a check constraint and not a trigger, is still possible, but the result is likely to be harder to understand.

Secondly, Django Model Form support is no longer “automatic”. It’s not impossible to use a Django Model Form, but you need to work a bit harder to get the initial data in, and to ensure that the cleaned data for the fields is applied to the JSON field correctly.

Finally, as hinted above, using a “type” field means the logic for building the checks is more complex, unless you use class swizzling and proxy models or similar to have a different class for each type value. If an instance was accidentally updated to the wrong type, then all sorts of things could go wrong.
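To make the dispatch problem concrete, here is a minimal sketch in plain Python rather than Django. The Rule class, the two validator functions and the VALIDATORS mapping are all hypothetical names invented for illustration; in the real code the attributes lived in a JSON field on a model.

```python
from dataclasses import dataclass, field


def validate_email_rule(attributes):
    # Hypothetical check for the "email" type.
    return "@" in attributes.get("address", "")


def validate_limit_rule(attributes):
    # Hypothetical check for the "limit" type.
    limit = attributes.get("limit")
    return isinstance(limit, int) and limit > 0


# The type value decides which set of rules applies.
VALIDATORS = {"email": validate_email_rule, "limit": validate_limit_rule}


@dataclass
class Rule:
    type: str
    attributes: dict = field(default_factory=dict)

    def is_valid(self):
        return VALIDATORS[self.type](self.attributes)


good = Rule(type="limit", attributes={"limit": 5})
# Same attributes, but the type value was changed by mistake.
mistyped = Rule(type="email", attributes={"limit": 5})
```

Nothing stops the mistyped object from being stored: its attributes simply get checked against rules that were never meant for them.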

This was the first solution we used for our external integrations, and whilst convenient at some level, turned out to be much harder to manage than distinct model classes. It is not subject to the problem that is the basis of this article: we can always fetch objects of different logical types, as it’s all the same model class. Indeed, to only fetch a single class, we need to perform extra filtering.

Solution 2: Use concrete/multi-table inheritance.

This is the solution we moved to with our external integrations, and has been much nicer than the previous solution. Instead of having a JSON field with an arbitrary blob of data in it, we have distinct fields. This makes it much easier to have unique constraints, as well as requiring values, or values of a specific type. Knowing that the database is going to catch us accidentally putting a text value into the external id field for a system that requires an integer is reassuring.

This overcomes the second problem. We can now just use a Django ModelForm, and this makes writing views much easier. The validation for a given field or set of fields lives on the model, and where possible also in the database, as an exclusion or check constraint.

It also overcomes the third problem. We have distinct classes, which can have their own methods. We don’t need to try to use some magical proxy model code, and it’s easy for new developers to follow.

Finally, thanks to django-model-utils InheritanceManager, we can fetch all of our objects using the concrete parent model, and the .select_subclasses() method to downcast to our required class.

There are a couple of drawbacks to using concrete inheritance. Any fetch of an instance will perform a JOIN in your database, but more importantly, it’s impossible to perform a bulk_create() for models of these types.

Solution 3: Use a Postgres VIEW and JSONB to perform one query, and reconstruct models.

In the problem I’ve recently solved, we had a bunch of different models that, although used in similar ways, didn’t have much in common. They were pre-existing, and it wasn’t worth the effort to move them to concrete inheritance. Using JSON fields for all data is not something I would repeat.

Instead, I came up with an idea that seems novel, based on some previous work I’ve done converting JSON data into models:

CREATE VIEW combined_objects AS

SELECT 'foo.Foo' AS model,
       foo.id,
       foo.tenant_id,
       TO_JSONB(foo) AS data
  FROM foo

 UNION ALL

SELECT 'foo.Bar' AS model,
       bar.id,
       baz.tenant_id,
       TO_JSONB(bar) AS data
  FROM bar
 INNER JOIN baz ON (baz.id = bar.baz_id)

 UNION ALL

SELECT 'qux.Qux' AS model,
       qux.id,
       tenant.id,
       TO_JSONB(qux) AS data
  FROM tenant
  JOIN qux ON (true)

This builds up a Postgres view that contains all of the data from each model, but in a generic way. It also contains a tenant_id, which in this case is the mechanism we use to filter the rows that are required at any given time. This can be a field on the model (as shown in the first subquery), or a field on a related model (as shown in the second). It could even be every object in a table for every tenant, as shown in the third.
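The three subquery shapes can be tried out with SQLite from the Python standard library. This is only a sketch under a few assumptions: the tables and rows are made up, the data column is left out because TO_JSONB is specific to Postgres, CROSS JOIN stands in for JOIN ... ON (true), and objects_for_tenant is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tenant (id INTEGER PRIMARY KEY);
    CREATE TABLE foo (id INTEGER PRIMARY KEY, tenant_id INTEGER, name TEXT);
    CREATE TABLE baz (id INTEGER PRIMARY KEY, tenant_id INTEGER);
    CREATE TABLE bar (id INTEGER PRIMARY KEY, baz_id INTEGER);
    CREATE TABLE qux (id INTEGER PRIMARY KEY);

    INSERT INTO tenant (id) VALUES (1), (2);
    INSERT INTO foo (id, tenant_id, name) VALUES (10, 1, 'a'), (11, 2, 'b');
    INSERT INTO baz (id, tenant_id) VALUES (20, 1);
    INSERT INTO bar (id, baz_id) VALUES (30, 20);
    INSERT INTO qux (id) VALUES (40);

    CREATE VIEW combined_objects AS
    -- tenant_id is a column on the table itself
    SELECT 'foo.Foo' AS model, foo.id AS id, foo.tenant_id AS tenant_id
      FROM foo
     UNION ALL
    -- tenant_id comes from a related table
    SELECT 'foo.Bar', bar.id, baz.tenant_id
      FROM bar
     INNER JOIN baz ON baz.id = bar.baz_id
     UNION ALL
    -- every qux row is visible to every tenant
    SELECT 'qux.Qux', qux.id, tenant.id
      FROM tenant
     CROSS JOIN qux;
""")


def objects_for_tenant(tenant_id):
    """Fetch (model, id) pairs for one tenant with a single query."""
    rows = conn.execute(
        "SELECT model, id FROM combined_objects"
        " WHERE tenant_id = ? ORDER BY model, id",
        (tenant_id,),
    )
    return rows.fetchall()
```

One call to objects_for_tenant(1) returns rows from all three tables, where the original approach would have needed a query per table.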

From there, we need a model that will recreate the model instances correctly:

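from django.apps import apps
from django.contrib.postgres.fields import JSONField  # models.JSONField on Django 3.1+
from django.db import models
from django.utils.functional import cached_property


class CombinedObjectQuerySet(models.query.QuerySet):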
    def for_tenant(self, tenant):
        return self.filter(tenant=tenant)

    def as_model_instances(self):
        return [x.instance for x in self]


class CombinedObject(models.Model):
    model = models.TextField()
    tenant = models.ForeignKey('tenant.Tenant', on_delete=models.DO_NOTHING)
    data = JSONField()

    objects = CombinedObjectQuerySet.as_manager()

    class Meta:
        managed = False
        db_table = 'combined_objects'

    def __str__(self):
        return '{} wrapper ({})'.format(self.model, self.instance)

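    def __eq__(self, other):
        if not isinstance(other, CombinedObject):
            return NotImplemented
        return self.instance == other.instance

    # Defining __eq__ clears the inherited __hash__, so restore it.
    __hash__ = models.Model.__hash__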

    @property
    def model_class(self):
        return apps.get_model(*self.model.split('.'))

    @cached_property
    def instance(self):
        return self.model_class(**self.data)

This works great, as long as you don’t apply additional annotations, need values typecast to Python types, or want to deal with related objects. That is where it starts to get a bit tricky.
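The typecasting problem comes from JSON itself, which has no date or decimal types. A small sketch, using a hand-written payload as a stand-in for the data column of one row (the key names and values are made up):

```python
import json
from datetime import date
from decimal import Decimal

# Hand-written JSON, standing in for the "data" column of one row.
payload = '{"id": 1, "starts_on": "2020-03-09", "rate": 12.5}'
data = json.loads(payload)

# The date arrives as a str and the decimal as a float.
raw_types = {key: type(value).__name__ for key, value in data.items()}

# Converting back by hand, which is the job Field.to_python() does in Django.
starts_on = date.fromisoformat(data["starts_on"])
rate = Decimal(str(data["rate"]))
```

Passing the raw values straight to a model constructor would leave a string where the rest of the code expects a date.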

We can handle annotations and typecasting:

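@cached_property
def instance(self):
    data = self.data
    model_class = self.model_class
    # Convert each JSON value back to the Python type its model field expects.
    field_data = {
        field.name: field.to_python(data[field.name])
        for field in model_class._meta.fields
        if field.name in data
    }
    instance = model_class(**field_data)
    # Anything left over (annotations, raw foreign key columns such as baz_id)
    # is set directly on the instance.
    for attr, value in data.items():
        if attr not in field_data:
            setattr(instance, attr, value)
    return instance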
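The same split between declared fields and leftover keys can be tried without Django. In this sketch a dataclass stands in for the model, and the CONVERTERS mapping stands in for each field’s to_python(); Booking, CONVERTERS and build_instance are hypothetical names, not part of the real code.

```python
from dataclasses import dataclass, fields
from datetime import date


@dataclass
class Booking:
    # Hypothetical stand-in for a Django model.
    id: int
    starts_on: date


# Hypothetical converters, keyed by the declared type of each field.
CONVERTERS = {int: int, date: date.fromisoformat}


def build_instance(cls, data):
    # Declared fields are converted to their Python types first.
    field_data = {
        f.name: CONVERTERS[f.type](data[f.name])
        for f in fields(cls)
        if f.name in data
    }
    instance = cls(**field_data)
    # Keys that are not declared fields (annotations) are set as-is.
    for attr, value in data.items():
        if attr not in field_data:
            setattr(instance, attr, value)
    return instance


booking = build_instance(
    Booking,
    {"id": "7", "starts_on": "2020-03-09", "item_count": 3},
)
```

Here item_count plays the role of an annotation: it is not a declared field, so it bypasses conversion and ends up as a plain attribute.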

There’s still the issue of foreign keys in the target models: in this case I know the code is not going to traverse these and trigger extra database hits. We could look at omitting those fields to prevent that being possible, but this is working well enough for now.