Why CustomUser subclasses are not such a good idea

Background

The system I work on has People who may or may not be Users, and very infrequently Users who may not be a Person. In fact, an extension to the system has meant that there will be more of these: a User who needs to be able to generate reports (say, a Franchisor who needs to only be able to access aggregate data from franchises, that might belong to multiple companies) who is never rostered on for shifts, which is what the Person class is all about.

Anyway, the long and the short of this was that I thought it might be a good idea to look at sub-classing User for ManagementUser.

I guess I should have listened to those smarter than me who shouted that sub-classing User is not cool. Although they never gave any concrete reasons, but now I have one.

You cannot easily convert a superclass object to a specialised sub-class. Once a user is a User, it’s hard to make them into a ManagementUser.

It can be done: the following code will take a User (or any parent class) object, a User (or whatever) subclass, and any other keyword arguments that should be passed into the constructor. It saves the newly upgraded object, and returns it.

1 def create_subclass(SubClass, old_instance, **kwargs):
2     new_instance = SubClass()
3     for field in old_instance._meta.local_fields:
4         setattr(new_instance, field.name, getattr(old_instance, field.name))
5     new_instance.save()
6     return new_instance()

However, it really should check that there isn’t an existing instance, and maybe some other checks.

What advantages does sub-classing have?

The biggest advantage, or so I thought, was to have it so you can automatically downcast your models on user login, and then get access to the extended user details. For instance, if your authentication backend automatically converts User to Person, then you can get access to the Person’s attributes (like the company they work for, their shifts, etc) without an extra level of attribute access:

1 # request.user is always an auth.User instance:
2 request.user.person.company
3 # request.user might be a person, etc.
4 request.user.company

But it turns out that even this is bad. Now, in guard decorators on view functions, you cannot just test the value of an attribute, as not all users will have that attribute. Instead, you need to test to see if the attribute exists, and then test the attribute itself.

So, what do you do instead?

The preferred method in django for extending User is to use a UserProfile class. This is just a model that has a OneToOneField linked back to User. I would look at doing a very small amount of duck-punching just to make getting a hold of the profile class:

 1 import logging
 2 from django.contrib.auth.models import User
 3 from django.db import models
 4 
 5 class Person(models.Model):
 6     user = models.OneToOneField(User, related_name="_person")
 7     date_of_birth = models.DateField(null=True, blank=True)
 8 
 9 def get_person(user):
10     try:
11         return user._person
12     except Person.DoesNotExist:
13         pass
14 
15 def set_person(user, person):
16     user._person = person
17 
18 if hasattr(User, 'person'):
19     logging.error('Model User already has an attribute "person".')
20 else:
21     User.person = property(get_person, set_person)

By having the person’s related name attribute as _person, we can wrap read access to it in an exception handler, and then use a view decorator like:

1 @user_passes_test(lambda u:u.person)
2 def person_only_view(request, **kwargs):
3     pass

We know this view will only be available to logged in users who have a related Person object.

I will point out that I am duck-punching/monkey-patching here. However, I feel that this particular method of doing it is relatively safe. I check before adding the property, and in reality I probably would raise an exception rather than just log an error.

Postgres and Django

Frank Wiles gave a great talk Secrets of PostgreSQL Performance

Don’t do dumb things

  • Dedicate a single server to your database
  • Only fetch what you need

Do smart things

  • cache everything
  • limit number of queries

Tuning

  • shared_buffers : 25% of available RAM
  • effective_cache_size : OS disk cache size
  • work_mem : in-memory sort size

Less important

  • wal_buffers : set to 16MB
  • checkpoint_segments : at least 10
  • maintenance_work_mem : 50MB for every GB of RAM

Can also transactionally turn on grouping of transactions.

Hardware

  • As much RAM as you can afford - fit whole db if you can.
  • Faster disks.
    • Disk speed is important
    • RAID5 is bad
    • RAID-1+0 is good
    • WAL on own disk → 4x write performance
  • CPU speed - unlikely to be the limiting factor.

Other

  • use pg_bouncer to pool connections
  • use tablespaces to move tables/indexes onto other disks
    • ie, indexes on fastest disk
    • stuff that might run in background and hit only specific tables that are not used by other bits

Keyset Pagination in Django

Pagination is great. Nothing worse than having an HTML page that renders 25000 rows in a table.

Django Pagination is also great. It makes it super easy to declare that a view (that inherits from MultipleObjectMixin) should paginate its results:

class List(ListView):
    queryset = Foo.objects.order_by('bar', '-baz')
    paginate_by = 10
    template_name = 'foo.html'

Django pagination uses the LIMIT/OFFSET method. This is fine for smaller offsets, but once you start getting beyond a few pages, it can perform really badly. This is because the database needs to fetch all of the previous rows, even though it discards them.

Using Keyset Pagination allows for better performing “next page” fetches, at the cost of not being able to randomly fetch a page. That is, if you know the last element from page N-1, then you may fetch page N, but otherwise you really can’t.

Keyset Pagination, sometimes called the Seek Method, has been documented by Markus Winand and Joe Nelson. If you are not familiar with the concept, I strongly suggest you read the articles above.

Django’s pagination is somewhat pluggable: you may switch out the paginator in a Django ListView, for instance, allowing you to do things like switch the Page class, or how the various parts are computed. I used it recently to allow for a different query to be used when calculating the total number of objects in a queryset, to vastly improve performance of a particular paginated queryset.

However, there are limits. Both the view and the paginator expect (nay, demand) an integer page number, which, as we shall see shortly, will not work in this case. I also feel like the view is over-reaching it’s remit by casting the page number to an integer, as I’ll discuss below.

In order to get consistent results with any type of pagination, you must ensure that the ordering on the queryset is stable: that is, there are no rows that will be ‘tied’. Doing otherwise will mean that the database will “break the tie”, and not always in the same order. I’ve seen a bug that was extremely hard to track down that was caused by exactly this problem (and that was just with OFFSET pagination).

Often, to ensure stable ordering, the primary key is used as the “last” sort column. This is perfectly valid, but is not always necessary.

Because in many cases we will need to sort by multiple columns, we’ll need some mechanism for passing through to the paginator the “last value” in a given page for each of these columns. For instance, if we are ordering by timestamp and then group, we would need to pass through both the timestamp and the group of the last object. Because I like to use GET forms to allow me to paginate filtered results, I’ll want to have all of the values combined into one query parameter. If you were constructing links instead, you could look at having these as different parameters. However, you’d still need to be careful, because you aren’t filtering all results on these parameters. Having them serialised into a single parameter (using JSON) means that they are all in the one place, and you can just use that for the filtering to get the page results.

I’ve built a working implementation of keyset pagination, at least for forwards traversal, at django-keyset-pagination.

We can see from this that there really is not that much that we needed to do. We use a different Page object, which enables us to change what the next_page_number will generate. When I figure out how, it will also allow us to work out the previous_page_number

Likewise, we needed to change how we validate a page number, and how we fetch results for a page. That method, _get_page(number) is the one that does most of the work.

Ultimately, we wind up with a filter that looks like:

  WHERE (A < ?) OR (A = ? AND B > ?) OR (A = ? AND B = ? AND C < ?)

The direction of the test (< vs >) depends upon the sorting of that column, but hopefully you get the idea.

In order to enable the query planner to be able to use an index effectively (if one exists), we need to adjust this to (thanks Markus):

WHERE A <= ? AND (
  (A < ?) OR (A = ? AND B > ?) ...
)

It’s also possible, in Postgres at least, to use a ROW() constructor comparison to order rows. However, this only works if the direction of each column ordering is the same, which in my use case it was not. I have a proof of concept of using ROW() constructors, but I need to figure out how to detect if they are available to the database in use.

In order to use the new paginator, we need to work around some issues in the Django class based views: namely that they force an integer value (or use the special string last, neither of which are acceptable in this case):

class PaginateMixin(object):
    "Make pagination work for non integer page numbers"
    def paginate_queryset(self, queryset, page_size):
        # This is very similar to how django currently (2.1) does it: I may submit a PR to use this
        # mechanism instead, as it is more flexible.
        paginator = self.get_paginator(
            queryset, page_size, orphans=self.get_paginate_orphans(),
            allow_empty_first_page=self.get_allow_empty()
        )
        page_kwarg = self.page_kwarg
        page = self.kwargs.get(page_kwarg) or self.request.GET.get(page_kwarg) or 1

        try:
            page_number = paginator.validate_number(page_number)
        except ValueError:
            raise Http404(_('Page could not be parsed.'))

        try:
            page = paginator.page(page_number)
            return (paginator, page, page.object_list, page.has_other_pages())
        except InvalidPage as e:
            raise Http404(
                _('Invalid page (%(page_number)s): %(message)s') % {
                    'page_number': page_number,
                    'message': str(e)
                }
            )

There’s really only one change: instead of just casting the page number to an integer, we let the paginator handle that.

Okay, once all that is done, we can use our paginator:

class List(PaginateMixin, ListView):
    paginator_class = KeysetPaginator
    paginate_by = 10

    def get_queryset(self):
        return Foo.objects.order_by('timestamp', 'bar', 'baz')

We’ll need to change our template rendering to only render a next page link or button, rather than trying to render them for each page. We also don’t have any way to return to the previous page: I’m still working through a mechanism for that.


This post was originally written using the ROW() constructor, and this part of the post discussed the shortcomings. Now that has been resolved, the main shortcoming is that it is not yet possible to traverse to the previous page of results. In many cases that may not be necessary (we could use a browser’s back button, or rely on the fact that if it’s infinite scrolling the data is already in the document), however I would like to investigate how hard it is to actually get the previous page.

Set-returning and row-accepting functions in Django and Postgres

Postgres set-returning functions are an awesome thing. With them, you can do fun things like unnesting and array, and will end up with a new row for each item in the array. For example:

class Post(models.Model):
    author = models.ForeignKey(AUTH_USER_MODEL, related_name='posts')
    tags = ArrayField(base_field=TextField(), null=True, blank=True)
    created_at = models.DateTimeField()
    content = models.TextField()

The equivalent SQL might be something like:

CREATE TABLE blog_post (
  id SERIAL NOT NULL PRIMARY KEY,
  author_id INTEGER NOT NULL REFERENCES auth_user (id),
  tags TEXT[],
  created_at TIMESTAMPTZ NOT NULL,
  content TEXT NOT NULL
);

We can “explode” the table so that we have one tag per row:

SELECT author_id, UNNEST(tags) AS tag, created_at, content
FROM blog_post;

To do the same sort of thing in Django, we can use a Func:

from django.db.models import F, Func

Post.objects.annotate(tag=Func(F('tags'), function='UNNEST'))

In practice, just like in the Django docs, I’ll create a convenience function:

class Unnest(Func):
    function = 'UNNEST'

    @property
    def output_field(self):
        output_fields = [x.output_field for x in self.get_source_expressions()]
        if len(output_fields) == 1:
          return output_fields[0].base_field

        return super(Unnest, self).output_field

The opposite of this is aggregation: in the case of UNNEST, it’s almost ARRAY_AGG, although because of the handling of nested arrays, this doesn’t quite round-trip. We already know how to do aggregation in Django, so I will not discuss that here.

Hovewer, there is another related operation: what if you want to turn a row into something else. In my case, this was turning a row from a result into a JSON object.

SELECT id,
       to_jsonb(myapp_mymodel) - 'id' AS "json"
  FROM myapp_mymodel

This will get all of the columns except ‘id’, and put them into a new column called “json”.

But how do we get Django to output SQL that will enable us to use a Model as the argument to a function? Ultimately, we want to get to the following:

class ToJSONB(Func):
    function = 'TO_JSONB'
    output_field = JSONField()


MyModel.objects.annotate(
  json=ToJSONB(MyModel) - Value('id')
).values('id', 'json')

Our first attempt could be to use RawSQL. However, this has a couple of problems. The first is that we are writing lots of raw SQL, the second is that it won’t work so well if the table is aliased by the ORM. That is, if you use this in a join or subquery, where Django automatically assigns an alias to this table, then referring directly to the table name will not work.

MyModel.objects.annotate(json=Raw("to_jsonb(myapp_mymodel) - 'id'", [], output_field=JSONField()))

Instead, we need to dynamically find out what the current alias for the model is in this query, and use that. We’ll also want to figure out how to “subtract” the id key from the JSON object.

class Table(django.db.models.Expression):
    def __init__(self, model, *args, **kwargs):
        self.model = model
        self.query = None
        super(Table, self).__init__(*args, **kwargs)

    def resolve_expression(self, query, *args, **kwargs):
        clone = super(Table, self).resolve_expression(query, *args, **kwargs)
        clone.query = query
        return clone

    def as_sql(self, compiler, connection, **kwargs):
        if not self.query:
            raise ValueError('Unresolved Table expression')
        alias = self.query.table_map.get(self.model._meta.db_table, [self.model._meta.db_table])[0]
        return compiler.quote_name_unless_alias(alias), []

Okay, there’s a fair bit going on there. Let’s look through it. We’ll start with how we use it:

MyModel.objects.annotate(json=ToJSONB(Table(MyModel)))

We create a Table instance, which stores a reference to the model. Technically, all we need later on is the database table name that will be used, but we’ll keep the model for now.

When the ORM “resolves” the queryset, we grab the query object, and store a reference to that.

When the ORM asks us to generate some SQL, we look at the query object we have a reference to, and see if our model’s table name has an entry in the table_map dict: if so, we get the first entry from that, otherwise we just use the table name.

Okay, what about being able to remove the entry in the JSONB object for ‘id’?

We can’t just use the subtraction operator, because Postgres will try to convert the RHS value into JSONB first, and fail. So, we need to ensure it renders it as TEXT. We also need to wrap it in an ExpressionWrapper, so we can indicate what the output field type will be:

id_value = models.Func(models.Value('id'), template='%(expressions)s::TEXT')
MyModel.objects.annotate(
    json=ExpressionWrapper(
        ToJSONB(Table(MyModel)) - id_value, output_field=JSONField()
    )
)

I also often use a convenience Cast function, that automatically does this based on the supplied output_field, but this is a little easier to use here. Note there is a possible use for ToJSONB in a different context, where it doesn’t take a table, but some other primitive.


There’s one more way we can use this construct: the geo_unique_indexer function from a previous post needs a table name, but also the name of a field to omit from the index. So, we can wrap this up nicely:

class GeoMatch(models.Func):
    function = 'geo_unique_indexer'
    output_field = JSONField()

    def __init__(self, model, *args, **kwargs):
        table = Table(model)
        pk = models.Value(model._meta.pk.db_column or model._meta.pk.name)
        return super(GeoMatch, self).__init__(table, pk, *args, **kwargs)

This is really tidy: it takes the model class (or maybe an instance, I didn’t try), and builds a Table, and gets the primary key. These are just used as the arguments for the function, and then it all works.

Django ComputedField()

A very common pattern, at least in code that I’ve written (and read) is to annotate on a field that uses an expression that is based on one or more other fields. This could then be used to filter the objects, or just in some other way.

The usual method of doing this is:

from django.db import models
from django.db.models.expressions import F, Value
from django.db.models.function import Concat


class PersonQuerySet(models.query.QuerySet):
    def with_name(self):
        return self.annotate(
            name=Concat(F('first_name'), Value(' '), F('last_name'), output_field=models.TextField()),
        )


class Person(models.Model):
    first_name = models.TextField()
    last_name = models.TextField()

    objects = PersonQuerySet.as_manager()

Yes, I’m aware of falsehoods programmers believe about names, but this is an easy-to-follow example.

In order to be able to access the name field, we must use the with_name() queryset method. This is usually okay, but if it is something that we almost always want, it can be a little tiresome. Alternatively, you could override the get_queryset() method of a custom manager, but that makes it somewhat surprising to a reader of the code. There are also some places where a custom manager will not automatically be used, or where it will be cumbersome to include the fields from a custom manager (select_related, for instance).

It would be much nicer if we could write the field declaratively, and have it use the normal django mechanism of defer and only to remove it from the query if required.

class Person(models.Model):
    first_name = models.TextField()
    last_name = models.TextField()
    name = ComputedField(Concat(F('first_name'), Value(' '), F('last_name'), output_field=models.TextField()))

I’ve spent some time digging around in the django source code, and have a fairly reasonable understanding of how fields work, and how queries are built up. But I did wonder how close to a working proof of concept of this type of field we could get without having to change any of the django source code. After all, I was able to backport the entire Subquery expression stuff to older versions of django after writing that. It would be nice to repeat the process here.

There are a few things you need to do to get this to work:

  • store the expression
  • prevent the field from creating a migration
  • ensure the field knows how to interpret data from the database
  • ensure the field adds the expression to it’s serialised version
  • prevent the field from writing data back to the database
  • inject the expression into the query instead of the field name
class ComputedField(models.Field):
    def __init__(self, expression, *args, **kwargs):
        self.expression = expression.copy()
        kwargs.update(editable=False)
        super().__init__(*args, **kwargs)

There is already a mechanism for a field to prevent a migration operation from being generated: it can return a db_type of None.

    def db_type(self, connection):
        return None

We can delegate the responsibility of interpreting the data from the database to the output field of the expression - that’s how it works in the normal operation of expressions.

    def from_db_value(self, value, expression, connection):
        return self.expression.output_field.from_db_value(value, expression, connection)

Storing the expression in the serialised version of a field is explained in the documentation on custom fields:

    def deconstruct(self):
        name, path, args, kwargs = super().deconstruct()
        return name, path, [self.expression] + args, kwargs

To prevent the field from being included in the data we write back to the database turned out to be fairly tricky. There are a couple of mechanisms that can be used, but ultimately the only one that worked in the way I needed was something that is used by the inheritance mechanism. We have to indicate that it is a “private” field. I’m not 100% sure of what the other implications of this might be, but the outcome of making this field private is that it no longer appears in the list of local fields. There is one drawback to this, which I’ll discuss below.

    def contribute_to_class(self, cls, name, private_only=False):
        return super().contribute_to_class(cls, name, True)

So, we only have one task to complete. How do we inject the expression into the query instead of the column?

When django evaluates a queryset, it look at the annotations, and the expressions that are in these. It will then “resolve” these expressions (which means the expression gets told which “query” is being used to evaluate it, allowing it to do whatever it needs to do to make things work).

When a regular field is encountered, it is not resolved: instead it is turned into a Col. This happens in a few different places, but the problem is that a Col should not need to know which query it belongs to: at most it needs to know what the aliased table name is. So, we don’t have a query object we can pass to the resolve_expression method of our expression.

Instead, we’ll need to use Python’s introspection to look up the stack until we find a place that has a reference to this query.

    def get_col(self, alias, output_field=None):
        import inspect

        query = None

        for frame in inspect.stack():
            if frame.function in ['get_default_columns', 'get_order_by']:
                query = frame.frame.f_locals['self'].query
                break
            if frame.function in ['add_fields', 'build_filter']:
                query = frame.frame.f_locals['self']
                break
        else:
            # Aaargh! We don't handle this one yet!
            import pdb; pdb.set_trace()

        col = self.expression.resolve_expression(query=query)
        col.target = self
        return col

So, how does this code actually work? We go through each frame in the stack, and look for a function (or method, but they are really just functions in python) that matches one of the types we know about that have a reference to the query. Then, we grab that, stop iterating and resolve our expression. We have to set the “target” of our resolved expression to the original field, which is how the Col interface works.

This moves the resolve_expression into the get_col, which is where it needs to be. The (resolved) expression is used as the faked column, and it knows how to generate it’s own SQL, which will be put into the query in the correct location.

And this works, almost.

There is one more situation that needs to be taken into account: when we are referencing the field through a join (the x__y lookup syntax you often see in django filters).

Because F() expressions reference the local query, we need to first turn any of these that we find in our computed field’s expression (at any level) into a Col that refers to the correct model. We need to do this before the resolve_expression takes place.

    def get_col(self, alias, output_field=None):
        query = None

        for frame in inspect.stack():
            if frame.function in ['get_default_columns', 'get_order_by']:
                query = frame.frame.f_locals['self'].query
                break
            if frame.function in ['add_fields', 'build_filter']:
                query = frame.frame.f_locals['self']
                break
        else:
            # Aaargh! We don't handle this one yet!
            import pdb; pdb.set_trace()

        def resolve_f(expression):
            if hasattr(expression, 'get_source_expressions'):
                expression = expression.copy()
                expression.set_source_expressions([
                  resolve_f(expr) for expr in expression.get_source_expressions()
                ])
            if isinstance(expression, models.F):
                field = self.model._meta.get_field(expression.name)
                if hasattr(field, 'expression'):
                    return resolve_f(field.expression)
                return Col(alias, field)
            return expression

        col = resolve_f(self.expression).resolve_expression(query=query)
        col.target = self
        return col

There is a repo containing this, which has a bunch of tests showing how the different query types can use the computed field:

https://github.com/schinckel/django-computed-field


But wait, there is one more thing…

A very common requirement, especially if you are planning on using this column for filtering, would be to stick an index on there.

Unfortunately, that’s not currently possible: the mechanism for preventing the field name from being in the write queries, making it a private field, prevents using this field in an index. Anyway, function/expression indexes are not currently supported in Django.

It’s not all bad news though: Markus has a Pull Request that will enable this feauture; from there we could (if db_index is set) automatically add an expression index to Model._meta.indexes in contribute_to_class, but it would also be great to be able to use index_together.

I suspect to get that, though, we’ll need another mechansim to prevent it being in the write queries, but still be a local field.

(Thanks to FunkyBob for suggestions, including suggesting the field at all).

Postgres Generated Columns

A little while ago, I wrote about creating a nice way to have a Django ComputedField. It is pretty neat, except it needs to do some black magic to sniff up the stack to work around a limitation in the way a Ref/Col works in Django.

The way it works is that you define the expression in Python, and it evaluates it in the database, allowing you to query based on this, and have it automatically annotated on.

What it doesn’t do, however, is actually store that value in the database. Indeed, if you are actually querying on this column, you’d probably want to have a functional index that uses the same expression, so that the database can do a reasonable job of improving query times on that column.

New in Postgres 12 is a feature that really piqued my interest: Generated Columns.

These are basically what the ComputedField does, but at the database level. And, instead of it being an expression that is evaluated at query time, it is instead an expression that is evaluated at write time, and stored in an actual column (that could then have an index applied to it).

Let’s have a look at an example:

CREATE TABLE person (
  person_id integer PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
  first_name TEXT,
  last_name TEXT,
  full_name TEXT GENERATED ALWAYS AS (
    COALESCE(first_name, '') || ' ' || COALESCE(last_name, '')
  ) STORED
);

Again, I’m aware I’m failing to note at least one of the falsehoods programmers believe about names.

Notes about this:

  • I’ve used the similar (preferred) syntax for generating the primary key.
  • You must have the keyword STORED at the end of the column definition: or more specifically, the syntax must be <column> <type> GENERATED ALWAYS AS (<expression>) STORED.
  • You may only refer to other columns within the same row: similar to how a functional index would work.
  • You may not refer to other generated columns: that would likely require parsing the expressions to determine which one to calculate first. I’d love to see postgres implement that at some point though!

So, let’s have a look at that with some data:

INSERT INTO person (first_name, last_name)
VALUES
    ('alice', 'aardvark'),
    ('bob', 'burger'),
    ('chuck', NULL),
    (NULL, 'darris');

And when we query it:

SELECT * FROM person;
 person_id │ first_name │ last_name │   full_name
 ------------------------------------------------------
         1 │ alice      │ aardvark  │ alice aardvark
         2 │ bob        │ burger    │ bob burger
         3 │ chuck      │ <NULL>    │ chuck
         4 │ <NULL>     │ darris    │  darris
(4 rows)

Oh, bother. We didn’t want the space before ‘darris’ (or the one you can’t see, after ‘chuck’). We’ll have to fix that in a sec.

So, what happens when we try to write to the full_name column?

UPDATE person SET first_name = 'dave', full_name='foo' WHERE first_name IS NULL;
ERROR:  column "full_name" can only be updated to DEFAULT
DETAIL:  Column "full_name" is a generated column.

Okay, that’s nice to know. If the error was ignored, we could have just used a custom django field and ignored the value, but we’ll need something similar to how ComputedField prevents writing values. I’ll have to investigate that further.

But, back onto the fact I forgot to trim any leading or trailing spaces. It turns out that there is no way to alter the expression that is being used in a generated column. Which, when you think a little more about it, sort-of makes sense. At the very least, it would need to write new values to each column where the new value was different to the old value.

Instead, you need to drop the column, and re-add it with the correct expression. You’ll almost certainly want to do this in a transaction:

BEGIN;
ALTER TABLE person DROP COLUMN full_name;
ALTER TABLE person ADD COLUMN full_name TEXT
      GENERATED ALWAYS AS (TRIM(
        COALESCE(first_name, '') || ' ' ||
        COALESCE(last_name, '')
      )) STORED;
COMMIT;

And now we can query our table again:

SELECT * FROM person;
 person_id │ first_name │ last_name │   full_name
 ------------------------------------------------------
         1 │ alice      │ aardvark  │ alice aardvark
         2 │ bob        │ burger    │ bob burger
         3 │ chuck      │ <NULL>    │ chuck
         4 │ <NULL>     │ darris    │ darris
(4 rows)

Sweet.

Query Zen is no queries at all

Performing no queries is always going to be faster than performing a query.

Today I had two instances of the same problem: I have two tables, one of which essentially stores calculated data based on other data (and data in other tables, or involving a process that uses application code, and cannot be purely determined within the database).

In one case, we have an audit logging table (which is purely handled within postgres) and another related table that stores a string representation of what the audited object looked like according to the application at that point in time, which needs to be calculated after the fact in Django.

The other case stores some cached values that can be calculated in the database: basically some metadata about a shift according to the location that the shift is at. Changes to the shift table will cause this value to automatically be updated, however we have several million shifts that do not currently have this value, but we need to create items for all shifts that currently don’t have the annotation.

In both cases, we have a celery task that will create a (relatively small, to prevent locks and other performance issues) number of the related objects, but only for those that don’t already have one. The tricky bit is that we need to trigger another instance of the celery task if we still have remaining objects in the database that don’t yet have the related item.

@app.task
def update_missing_items(batch_size=100):
    missing_items = AuditLog.objects.filter(instance_repr=None)
    InstanceRepr.objects.bulk_create([
      InstanceRepr(
        audit_log=log,
        # ...
      ) for log in missing_items[:batch_size]
    ])

    if not missing.exists():
      update_missing_items.apply_async(kwargs={'batch_size': batch_size}, countdown=1)

Since we have some 15 million audit logs (so far), it turns out that this missing.exists() was taking several seconds to run. I tried to write an optimised version, but was not able to improve the performance.

Then, it occurred to me (thanks mlt- on #postgres), that we can look at the number of items we created, and see if it was the same as the batch_size. If it was smaller than the batch size, then we know we are up to date, and don’t need to reschedule our task.

@app.task
def update_missing_items(batch_size=100):
    missing_items = AuditLog.objects.filter(instance_repr=None)
    created = InstanceRepr.objects.bulk_create([
      InstanceRepr(
        audit_log=log,
        # ...
      ) for log in missing_items[:batch_size]
    ])

    if len(created) == batch_size:
      update_missing_items.apply_async(kwargs={'batch_size': batch_size}, countdown=1)

Bingo: since we needed to execute the query to fetch the objects to begin with, we are now doing no extra work to see if we need to run our task again.


The other situation can be done in the database, however a single query of several million rows will block other things from happening, so we want to run the update in batches. There is a trigger on the table so that new or updated rows will already have a value, which actually makes it the same problem, but in SQL:

WITH step_1 AS (
  SELECT shift_id, ...
    FROM shifts
    JOIN ... ON (...)
    LEFT OUTER JOIN annotations USING (shift_id)
    WHERE annotations.shift_id IS NULL
    LIMIT 1000
), step_2 AS (
  ...
),
..., step_N AS (
  ...
)
INSERT INTO annotations (shift_id, ...) SELECT * FROM step_N;

There’s actually a bunch more to this, but it’s not really important: building up the annotations hits a couple of other tables, and I’ve used a CTE because each value is based on a previous annotation.

We can put this query into a task too, but we need some way of determining how many inserts we did. Luckily, Postgres has the RETURNING clause on an INSERT. It would be really nice if we could do:

WITH step_1 AS (...)
INSERT INTO annotations (shift_id, ...)
SELECT * FROM step_N
RETURNING COUNT(*)

Alas, that’s not possible. However, we can just extend our CTE:

WITH step_1 AS (
  SELECT shift_id, ...
    FROM shifts
    ...
    LEFT OUTER JOIN annotations USING (shift_id)
    WHERE annotations.shift_id IS NULL
    -- NOTE: the LIMIT value is a parameter!
    LIMIT %s
),
...,
step_N AS (...),
inserts AS (
  INSERT INTO annotations(shift_id, ...)
  SELECT * FROM step_N
  RETURNING shift_id
)
SELECT COUNT(*) FROM inserts

Then, we can write our celery task:

from django.db import connection

@app.task
def update_missing_annotations(batch_size):
    with connection.cursor() as cursor:
        cursor.execute(QUERY, [batch_size])
        if cursor.fetchone()[0] == batch_size:
            update_missing_annotations.apply_async(kwargs={'batch_size': batch_size}, countdown=1)

Postgres ENUM types in Django

Postgres has the ability to create custom types. There are several kinds of CREATE TYPE statement:

  • composite types
  • domain types
  • range types
  • base types
  • enumerated types

I’ve used a metaclass that is based on Django’s Model classes to do Composite Types in the past, and it’s been working fairly well. The current stuff I have been working on made sense to use an Enumerated Type, because there are four possible values, and having a human readable version of them is going to be nicer than using a lookup table.

In the first iteration, I used just a TEXT column to store the data. However, when I then started to use an enum.Enum class for handling the values in python, I discovered that it was actually storing str(value) in the database, rather than value.value.

So, I thought I would implement something similar to my Composite Type class. Not long after starting, I realised that I could make a cleaner implementation (and easier to declare) using a decorator:

@register_enum(db_type='change_type')
class ChangeType(enum.Enum):
    ADDED = 'added'
    CHANGED = 'changed'
    REMOVED = 'removed'
    CANCELLED = 'cancelled'


ChangeType.choices = [
    (ChangeType.ADDED, _('hours added')),
    (ChangeType.REMOVED, _('hours subtracted')),
    (ChangeType.CHANGED, _('start/finish changed with no loss of hours')),
    (ChangeType.CANCELLED, _('shift cancelled')),
]

Because I’m still on an older version of Python/Django, I could not use the brand new Enumeration types, so in order to make things a bit easier, I then annotate onto the class some extra helpers. It’s important to do this after declaring the class, because otherwise the attributes you define will become “members” of the enumeration. When I move to Django 3.0, I’ll probably try to update this register_enum decorator to work with those classes.

So, let’s get down to business with the decorator. I spent quite some time trying to get it to work using wrapt, before realising that I didn’t actually need to use it. In this case, the decorator is only valid for decorating classes, and we just add things onto the class (and register some things), so it can just return the new class, rather than having to muck around with docstrings and names.

from psycopg2.extensions import (
    new_array_type,
    new_type,
    QuotedString,
    register_adapter,
    register_type,
)
known_types = set()


CREATE_TYPE = 'CREATE TYPE {0} AS ENUM ({1})'
SELECT_OIDS = 'SELECT %s::regtype::oid AS "oid", %s::regtype::oid AS "array_oid"'


class register_enum(object):
    def __init__(self, db_type, managed=True):
        self.db_type = db_type
        self.array_type = '{}[]'.format(db_type)
        self.managed = managed

    def __call__(self, cls):
        # Tell psycopg2 how to turn values of this class into db-ready values.
        register_adapter(cls, lambda value: QuotedString(value.value))

        # Store a reference to this instance's "register" method, which allows
        # us to do the magic to turn database values into this enum type.
        known_types.add(self.register)

        self.values = [
            member.value
            for member in cls.__members__.values()
        ]

        # We need to keep a reference to the new class around, so we can use it later.
        self.cls = cls

        return cls

    def register(self, connection):
        with connection.cursor() as cursor:
            try:
                cursor.execute(SELECT_OIDS, [self.db_type, self.array_type])
                oid, array_oid = cursor.fetchone()
            except ProgrammingError:
                if self.managed:
                    cursor.execute(self.create_enum(connection), self.values)
                else:
                    return

        custom_type = new_type(
            (oid,),
            self.db_type,
            lambda data, cursor: data and self.cls(data) or None
        )
        custom_array = new_array_type(
            (array_oid,),
            self.array_type,
            custom_type
        )
        register_type(custom_type, cursor.connection)
        register_type(custom_array, cursor.connection)

    def create_enum(self, connection):
        qn = connection.ops.quote_name
        return CREATE_TYPE.format(
            qn(self.db_type),
            ', '.join(['%s' for value in self.values])
        )

I’ve extracted out the create_enum method, because it’s then possible to use this in a migration (but I’m not totally happy with the code that generates this migration operation just yet). I also have other code that dynamically creates classes for a ModelField and FormField as attributes on the Enum subclass, but that complicates it a bunch.

Fallback values in Django

It’s not uncommon to have some type of cascading of values in a system. For instance, in our software, we allow a Brand to have some default settings, and then a Location may override some or all of these settings, or just fallback to the brand settings. I’m going to have a look at how this type of thing can be implemented using Django, and a way that this can be handled seamlessly.

We’ll start with our models:

class Brand(models.Model):
    brand_id = models.AutoField(primary_key=True)
    name = models.TextField()


class Location(models.Model):
    location_id = models.AutoField(primary_key=True)
    brand_id = models.ForeignKey(Brand, related_name='locations')
    name = models.TextField()


WEEKDAYS = [
  (1, _('Monday')),
  (2, _('Tuesday')),
  (3, _('Wednesday')),
  (4, _('Thursday')),
  (5, _('Friday')),
  (6, _('Saturday')),
  (7, _('Sunday')),
]


class BrandSettings(models.Model):
    brand = models.OneToOneField(Brand, primary_key=True, related_name='settings')
    opening_time = models.TimeField()
    closing_time = models.TimeField()
    start_day = models.IntegerField(choices=WEEKDAYS)


class LocationSettings(models.Model):
    location = models.OneToOneField(Location, primary_key=True, related_name='_raw_settings')
    opening_time = models.TimeField(null=True, blank=True)
    closing_time = models.TimeField(null=True, blank=True)
    start_day = models.IntegerField(choices=WEEKDAYS, null=True, blank=True)

We can’t use an abstract base model here, because the LocationSettings values are all optional, but the BrandSettings are not. We might have a look later at a way we can have a base model and inherit-and-change-null on the fields. In the place where we have used this, the relationship between Location and Brand is optional, which complicates things even further.

In practice, we’d have a bunch more settings, but this will make it much easier for us to follow what is going on.

To use these, we want to use a value from the LocationSettings object if it is set, else fall-back to the BrandSettings value for that column.

Location.objects.annotate(
    opening_time=Coalesce('settings__opening_time', 'brand__settings__opening_time'),
    closing_time=Coalesce('settings__closing_time', 'brand__settings__closing_time'),
    start_day=Coalesce('settings__start_day', 'brand__settings__start_day'),
)

And this is fine, but we can make it easier to manage: we want to be able to use Location().settings.start_day, and have that fall-back, but also build some niceness so that we can set values in a nice way in the UI.

We can use a postgres view, and then have a model in front of that:

CREATE OR REPLACE VIEW location_actualsettings AS (
  SELECT location_id,
         COALESCE(location.opening_time, brand.opening_time) AS opening_time,
         COALESCE(location.closing_time, brand.closing_time) AS closing_time,
         COALESCE(location.start_day, brand.start_day) AS start_day
    FROM location_location
   INNER JOIN location_brandsettings brand USING (brand_id)
   INNER JOIN location_locationsettings location USING (location_id)
)

Notice that we have used INNER JOIN for both tables: we are making the assumption that there will always be a settings object for each brand and location.

Now, we want a model in front of this:

class ActualSettings(models.Model):
    location = models.OneToOneField(Location, primary_key=True, related_name='settings')
    opening_time = models.TimeField(null=True, blank=True)
    closing_time = models.TimeField(null=True, blank=True)
    start_day = models.IntegerField(choices=WEEKDAYS, null=True, blank=True)

    class Meta:
        managed = False

We want to indicate that it should allow NULL values in the columns, as when we go to update it, None will be taken to mean “use the brand default”.

As for the ability to write to this model, we have a couple of options. The first is to make sure that when we edit instances of the model, we actually use the Location()._raw_settings instance instead of the Location().settings. The other is to make the ActualSettings view have an update trigger:

CREATE OR REPLACE FUNCTION update_location_settings()
RETURNS TRIGGER AS $$

BEGIN

  IF (TG_OP = 'DELETE') THEN
    RAISE NOTICE 'DELETE FROM location_locationsettings WHERE location_id = %', OLD.location_id;
    DELETE FROM location_locationsettings WHERE location_id = OLD.location_id;
    RETURN OLD;
  ELSIF (TG_OP = 'UPDATE') THEN
    UPDATE location_locationsettings
       SET opening_time = NEW.opening_time,
           closing_time = NEW.closing_time,
           start_day = NEW.start_day
     WHERE location_locationsettings.location_id = NEW.location_id;
    RETURN NEW;
  ELSIF (TG_OP = 'INSERT') THEN
    INSERT INTO location_locationsettings (SELECT NEW.*);
    RETURN NEW;
  END IF;
  RETURN NEW;
END;

$$ LANGUAGE plpgsql VOLATILE;

CREATE TRIGGER update_location_settings
       INSTEAD OF INSERT OR UPDATE OR DELETE
       ON location_actualsettings
       FOR EACH ROW EXECUTE PROCEDURE update_location_settings();

And this works as expected: however it is subject to a pretty significant drawback. If you add columns to the table/view, then you’ll need to update the function. Indeed, if you add columns to the tables, you’ll need to update the view too.

In many cases, this will be sufficient: those tables may not change much, and when they do, it’s just a matter of writing new migrations to update the view and function.


In practice, having the writeable view is probably overkill. You can just use a regular view, with a model in front of it, and then use that model when you need to use the coalesced values, but use the raw model when you are setting values.

You can even make it so that as a UI affordance, you show what the brand fallback value is instead of the None value:

class SettingsForm(forms.ModelForm):
    class Meta:
        model = LocationSettings
        fields = (
            'opening_time',
            'closing_time',
            'start_day'
        )

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # We'll probably want to make sure we use a select_related() for this!
        brand = self.instance.location.brand
        brand_settings = brand.settings

        for name, field in self.fields.items():
            # See if the model knows how to display a nice value.
            display = 'get_{}_display'.format(name)
            if hasattr(brand_settings, display):
                brand_value = getattr(brand_settings, display)()
            else:
                brand_value = getattr(brand_settings, name)

            # If we have a time, then we want to format it nicely:
            if isinstance(brand_value, datetime.time):
                brand_value = Template('').render(Context({
                  'value': brand_value
                }))

            blank_label = _('Default for {brand}: {value}').format(
                brand=brand.name,
                value=brand_value,
            )

            # If we have a select that is _not_ a multiple select, then we
            # want to make it obvious that the brand default value can be
            # selected, or an explicit choice made.
            if hasattr(field, 'choices') and field.choices[0][0] == '':
                field.widget.choices = field.choices = [
                    (_('Brand default'), [('', blank_label)]),
                    (_('Choices'), list(field.choices[1:]))
                ]
            else:
                # On all other fields, set the placeholder, so that no value
                # entered will show the brand default label.
                field.widget.attrs['placeholder'] = blank_label

As mentioned in a comment: this uses a couple of lookups to get to the BrandSettings, you’d want to make sure your view used a .select_related():

class LocationSettingsView(UpdateView):
    form_class = SettingsForm

    def get_object(self):
        return LocationSettings.objects.select_related('location__brand__settings').get(
            location=self.kwargs['location']
        )

Again, this is all simplified when we have the requirement that there is always a Brand associated with a Location, and each of these always has a related settings object. It’s the latter part of this that is a little tricky. You can have objects automatically created in a signal handler, but in that case it would have to use default values.


Just from a DRY perspective, it would be great if you could have all three models inherit from the one base class, and have the view and trigger function update automatically.

In order to do that, we’ll need to do a bit of magic.

class SettingsBase(models.Model):
    opening_time = models.TimeField()
    closing_time = models.TimeField()
    start_day = models.IntegerField(choices=WEEKDAYS)

    class Meta:
        abstract = True

    def __init_subclass__(cls):
        if getattr(cls, '_settings_optional', False):
            for field in cls._meta.fields:
                field.null = True
                field.blank = True


class BrandSettings(SettingsBase):
    brand = models.OneToOneField(
        Brand,
        primary_key=True,
        related_name='settings',
        on_delete=models.CASCADE,
    )


class LocationSettings(SettingsBase):
    location = models.OneToOneField(
        Location,
        primary_key=True,
        related_name='raw_settings',
        on_delete=models.CASCADE,
    )
    _settings_optional = True


class ActualSettings(SettingsBase):
    location = models.OneToOneField(
        Location,
        primary_key=True,
        related_name='settings',
        on_delete=models.DO_NOTHING,
    )
    _settings_optional

    class Meta:
        managed = False

The magic is all clustered in the one spot, and Django’s order it does things makes this easy. By the time __init_subclass__ is evaluated, the subclass exists, and has all of the inherited fields, but none of the non-inherited fields. So, we can update those fields to not be required, if we find a class attribute _settings_optional that is true.

Automatically creating or replacing the view is a bit more work.

class ActualSettings(BaseSettings):
    location = models.OneToOneField(
        Location,
        primary_key=True,
        related_name='settings',
        on_delete=models.DO_NOTHING,
    )
    _settings_optional = True

    class Meta:
        managed = False

    @classmethod
    def view_queryset(cls):
        settings = {
            attribute: Coalesce(
              'raw_settings__{}'.format(attribute),
              'brand__settings__{}'.format(attribute)
            ) for attribute in (f.name for f in cls._meta.fields)
            if attribute != 'location'
        }
        return Location.objects.annotate(**settings).values('pk', *settings.keys())

This would then need some extra machinery to put that into a migration, and then, when running makemigrations, we’d want to automatically look at the last rendered version of that view, and see if what we have now differs. However, intercepting makemigrations, and changing the operations it creates is something I have not yet figured out how to achieve.

Instead, for Versioning complex database migrations I wound up creating a new management command.

A nicer syntax might be to have some way of defining a postgres view by using a queryset.

ActualSettings = Location.objects.annotate(
    opening_time=Coalesce('_raw_settings__opening_time', 'brand__settings__opening_time'),
    closing_time=Coalesce('_raw_settings__closing_time', 'brand__settings__closing_time'),
    start_day=Coalesce('_raw_settings__start_day', 'brand__settings__start_day'),
).values('location_id', 'opening_time', 'closing_time', 'start_day').as_view()

The problem with this is that we can’t do that in a model definition, as the other models are not loaded at this point in time.

Another possible syntax could be:

class ActualSettings(View):
    location = models.F('location_id')
    opening_time = Coalesce('_raw_settings__opening_time', 'brand__settings__opening_time')
    closing_time = Coalesce('_raw_settings__closing_time', 'brand__settings__closing_time')
    start_day = Coalesce('_raw_settings__start_day', 'brand__settings__start_day')

    class Meta:
      queryset = Location.objects.all()

… but I’m starting to veer off into a different topic now.


Actually writing a trigger function that handles all columns seamlessly is something that we should be able to do. Be warned though, this one is a bit of a doozy:

CREATE OR REPLACE FUNCTION update_instead()
RETURNS TRIGGER AS $$
DECLARE
  primary_key TEXT;
  target_table TEXT;
  columns TEXT;

BEGIN
  -- You must pass as first parameter the name of the table to which writes should
  -- actually be made.
  target_table = TG_ARGV[0]::TEXT;

  -- We want to get the name of the primary key column for the target table,
  -- if that was not already supplied.
  IF (TG_ARGV[1] IS NULL) THEN
    primary_key = (SELECT column_name
                     FROM information_schema.table_constraints
               INNER JOIN information_schema.constraint_column_usage
                    USING (table_catalog, table_schema, table_name,
                           constraint_name, constraint_schema)
                    WHERE constraint_type = 'PRIMARY KEY'
                      AND table_schema = quote_ident(TG_TABLE_SCHEMA)
                      AND table_name = quote_ident(target_table));
  ELSE
    primary_key = TG_ARGV[1]::TEXT;
  END IF;

  -- We also need the names of all of the columns in the current view.
  columns = (SELECT STRING_AGG(quote_ident(column_name), ', ')
               FROM information_schema.columns
              WHERE table_schema = quote_ident(TG_TABLE_SCHEMA)
                AND table_name = quote_ident(TG_TABLE_NAME));

  IF (TG_OP = 'DELETE') THEN
    EXECUTE format(
      'DELETE FROM %1$I WHERE %2$I = ($1).%2$I',
      target_table, primary_key
    ) USING OLD;
    RETURN OLD;
  ELSIF (TG_OP = 'INSERT') THEN
    -- columns must be treated as a string, because we've already
    -- quoted the columns in the query above.
    EXECUTE format(
      'INSERT INTO %1$I (%2$s) (SELECT ($1).*)',
      target_table, columns
    ) USING NEW;
    RETURN NEW;
  ELSIF (TG_OP = 'UPDATE') THEN
    EXECUTE format(
      'UPDATE %1$I SET (%2$s) = (SELECT ($1).*) WHERE %3$I = ($1).%3$I',
      target_table, columns, primary_key
    ) USING NEW;
    RETURN NEW;
  END IF;

  RAISE EXCEPTION 'Unhandled.';
END;

$$ LANGUAGE plpgsql VOLATILE;

There are some things I learned about postgres when doing this: specifically that you can use the EXECUTE format('SELECT ... ($1).%s', arg) USING NEW syntax: the format() function makes it much neater than using string concatenation, and using the EXECUTE '...($1).%s' USING ... form was the only way I was able to access the values from the NEW and OLD aliases within an execute. There’s also a bunch of stuff you have to do to make sure that the columns line up correctly when updating or inserting into the target table.

We can then apply this to our view:

CREATE TRIGGER update_instead
INSTEAD OF UPDATE OR INSERT OR DELETE
ON location_actualsettings
FOR EACH ROW
EXECUTE PROCEDURE update_instead('location_locationsettings', 'location_id');

Handling overlapping values

One of the things that I enjoy most about Postgres are the rich types. Using these types can help reduce the amount of validation that the application needs to do.

Take for instance anything which contains a start date and a finish date. If you model this using two fields, then you also need to include validation about start <= finish (or perhaps start < finish, depending upon your requirements).

If you use a date range instead, then the database will do this validation for you. It is not possible to create a range value that is “backwards”. Sure, you’ll also need to do application-level (and probably client-side) validation, but there is something nice about having a reliable database that ensures you cannot possibly have invalid data.

Django is able to make good use of range types, and most of my new code seemingly has at least one range type: often a valid_period. So much so that I have a Mixin and a QuerySet that make dealing with these easier:

class ValidPeriodMixin(models.Model):
    valid_period = DateRangeField()

    class Meta:
        abstract = True

    @property
    def start(self):
        if self.valid_period.lower_inc:
            return self.valid_period.lower
        elif self.valid_period.lower is not None:
            return self.valid_period.lower + datetime.timedelta(1)

    @property
    def finish(self):
        if self.valid_period.upper_inc:
            return self.valid_period.upper
        elif self.valid_period.upper is not None:
            return self.valid_period.upper - datetime.timedelta(1)

    @property
    def forever(self):
        return self.valid_period.lower is None and self.valid_period.upper is None

    def get_valid_period_display(self):
        if self.forever:
            message = _('Always applies')
        elif self.start is None:
            message = _('{start} \u2092 no end date')
        elif self.finish is None:
            message = _('no start date \u2092 {finish}')
        else:
            message = _('{start} \u2092 {finish}')

        return message.format(
            start=self.start,
            finish=self.finish,
        )


def ensure_date_range(period):
    """
    If we have a 2-tuple of dates (or strings that are valid dates),
    ensure we turn that into a DateRange instance. This is because
    otherwise Django may mis-interpret this.
    """
    if not isintance(period, DateRange):
        return DateRange(period[0] or None, period[1] or None, '[]')
    return period


class OverlappingQuerySet(models.query.QuerySet):
    def overlapping(self, period):
        return self.filter(valid_period__overlap=ensure_date_range(period))

    def on_date(self, date):
        return self.filter(valid_period__contains=date)

    def today(self):
        return self.on_date(datetime.date.today())

As you may notice from this, it is possible to do some filtering based on range types: specifically, you can use the && Postgres operator using .filter(field__overlap=value), and the containment operators (<@ and @>) using .filter(field__contains=value) and .filter(field__contained_by=value). There are also other operators we will see a bit later using other lookups.


If you have a legacy table that stores a start and a finish, you would need to have a validator on the model (or forms that write to the model) that ensures start < finish, as mentioned above. Also, there is no way (without extra columns) to tell if the upper and lower values should be inclusive or exclusive of the bounds. In Postgres, we write range values using a notation like a mathematical range: using ‘[’, ‘]’ and ‘(‘, ‘)’ to indicate inclusive and exclusive bounds.

SELECT '[2019-01-01,2020-01-01)'::DATERANGE AS period;

One caveat when dealing with discrete range types (like dates and integers) is that Postgres will, if it is able to, convert the range to a normalised value: it will store (2019-01-01,2019-12-31] as [2019-01-02,2020-01-01). This can become a problem when showing the value back to the user, because depending upon context, it’s likely that you will want to use inclusive bounds when showing and editing the values.

You can manage this by using a form field subclass that detects an exclusive upper bound and subtracts one “unit” accordingly:

import datetime

from django.contrib.postgres.forms.ranges import (
    DateRangeField, IntegerRangeField
)


class InclusiveRangeMixin(object):
    _unit_value = None

    def compress(self, values):
        range_value = super().compress(values)
        if range_value:
          return self.range_type(
              range_value.lower,
              range_value.upper,
              bounds='[]'
          )

    def prepare_value(self, value):
        value = super().prepare_value(value)
        value = [
            field.clean(val)
            for field, val in zip(self.fields, value)

        ]
        if value[1] is not None:
            value[1] = value[1] - self._unit_value
        return value


class InclusiveDateRangeField(
    InclusiveRangeMixin, DateRangeField
):
      _unit_value = datetime.timedelta(1)


class InclusiveIntegerRangeField(
    InclusiveRangeMixin, IntegerRangeField
):
    _unit_value = 1

Back on to the topic of storing two values instead of a range: it’s possible to add an expression index on the table that uses DATERANGE:

CREATE INDEX thing_period_idx
          ON thing_thing (DATERANGE(start, finish));

You would be able to annotate on this value, do some querying, and it should use the index, allowing you to build querysets like:

Thing.objects.annotate(
    period=Func(
      F('start'),
      F('finish'),
      function='DATERANGE',
      output_field=DateRangeField())
).filter(period__overlap=other_period)

Range types show their full power when used with exclusion constraints. These allow you to prevent writing rows that violate the constraint. For instance, consider this model (and some largely irrelevant other models, Team and Player):

class TeamMembership(ValidPeriodMixin):
    ployer = models.ForeignKey(
        Player,
        related_name='team_memberships',
        on_delete=models.CASCADE,
    )
    team = models.ForeignKey(
        Team,
        related_name='player_memberships',
        on_delete=models.CASCADE,
    )

A player may only belong to one team at a time: that is, we may not have any overlapping valid_periods for a player.

You can do this using an exclusion constraint, but it does need the btree_gist extension installed:

CREATE EXTENSION IF NOT EXISTS btree_gist;

ALTER TABLE team_teammembership
        ADD CONSTRAINT prevent_overlapping_team_memberships
    EXCLUDE USING gist(person_id WITH =, valid_period WITH &&)
 DEFERRABLE INITIALLY DEFERRED;

Since this type of constraint is not yet supported in Django, you’ll have to do it in a RunSQL migration.

From here, we can attempt to write conflicting data, but the database will forbid it. You will still need to write code that checks before writing - this enables you to return a ValidationError to the user when you detect this conflict in a form, but having the exclusion constraint means that we can avoid the race condition where:

  • Check for overlapping ranges
  • Other process creates a range that will overlap
  • Save our data

You could possibly also use select_for_update in this context, but I prefer adding database constraints.

Note that the DEFERRABLE INITIALLY DEFERRED clause is important: it allows you, within a transaction, to write conflicting data, and it’s only when the transaction commits that the constraint is checked. This makes rewriting a bunch of values in one transaction much simpler: if you do not have this flag enabled then you will need to ensure you update them in an order that maintained no overlaps at each stage. I’m pretty confident this is always possible, but it’s a bunch of work (and it is possible that you might need to write some rows multiple times to maintain that).


So, now we can store range values (with database validation), and prevent overlapping data (with database validation).

What about a process that enables us to say “this row should replace, trim or split any that overlap with it”? I’m glad you asked.

It turns out given two rows, where one should “supersede” the other, there are five different conditions we need to take into account:

  • The rows do not overlap: no action required
  • The new row completely covers the old row: remove the old row
  • The old row has bounds that exceed the new row in both directions: split the old row into two rows
  • The old row has a lower bound that is smaller than the new row: trim the old row at the upper end
  • The old row has an upper bound that is larger than the new row: trim the old row at the lower end

It turns out we can perform this query with the Django range field lookups:

class OverlappingQuerySet(models.query.QuerySet):
    def with_overlap_type(self, period):
        period = ensure_date_range(period)
        return self.annotate(
            overlap_type=Case(
                # The objects do not overlap.
                When(~Q(valid_period__overlap=period,
                        then=Value(None))),
                # The existing value is covered by the new value
                When(valid_period__contained_by=period,
                     then=Value('replace')),
                # The existing value has no values
                # less than the new value
                When(valid_period__not_lt=period,
                     then=Value('trim:lower')),
                # The existing value has no values
                # greater than the new value
                When(valid_period__not_gt=period,
                     then=Value('trim:upper')),
                # The existing value contains the new value
                When(valid_period__contains=period,
                      then=Value('split')),
                output_field=models.TextField()
            )
        )

This works because a CASE WHEN stops evaluating when it finds a match: technically a trim:lower value could also match on containment (split), so we need to test that one earlier.

We are going to have to (possibly) perform multiple queries when writing back the data. If there are any than need to be “removed”, they will need a DELETE. Any that have a “trim” operation will require an UPDATE.

new_instance = Thing(valid_period=('2019-01-01', '2019-02-09'))
overlapping = Thing.objects.overlapping(
  new_instance.valid_period
).with_overlap_type(new_instance.valid_period)

overlapping.filter(overlap_type='replace').delete()
overlapping.filter(
    overlap_type__in=('trim:upper', 'trim:lower')
).update(
    valid_period=valid_period - new_instance.valid_period
)

But the tricky part is that any that are “split” will require at least two: either a DELETE followed by an INSERT (that inserts two rows), or a single UPDATE and a single INSERT. The tricky part here is that we also need to read the values first, if we are going to manipulate them in python. Instead, we can look at how to do it in raw SQL, with the benefit that we can perform this in a single operation.

WITH new_period AS (
  SELECT %s AS new_period
),
split AS (
  SELECT thing_id,
         valid_period,
         other_field,
         new.new_period
    FROM thing_thing old
    INNER JOIN new_period new ON (
          LOWER(old.valid_period) < LOWER(new.new_period)
      AND UPEER(old.valid_period) > UPEER(new.new_period)
    )
), new_rows AS (
  SELECT other_field,
         DATERANGE(LOWER(valid_period),
                   LOWER(new_period)) AS valid_period
    FROM split

   UNION ALL

  SELECT other_field,
         DATERANGE(UPPER(new_period),
                   UPPER(valid_period)) AS valid_period
),
removed AS (
  DELETE FROM thing_thing
   WHERE thing_id IN (SELECT thing_id FROM split)
)
INSERT INTO thing_thing (other_field, valid_period)
SELECT other_field, valid_period FROM new_rows;

This is less than ideal, because we need to enumerate all of the fields (instead of just other_field), so this code is not especially reusable as-is.

Let’s look at alternatives:

# Fetch the existing items.
splits = list(overlapping.filter(overlap_type='split').values())
to_create = []
to_delete = []
for overlap in splits:
    to_delete.append(overlap.pop('thing_id'))
    valid_period = overlap.pop('valid_period')
    to_create.append(Thing(
        valid_period=(valid_period.lower, new_instance.valid_period.lower),
        **overlap
    ))
    to_create.append(Thing(
        valid_period=(new_instance.valid_period.upper, valid_period.upper),
        **overlap
    ))
overlapping.filter(pk__in=to_delete).delete()
Thing.objects.bulk_create(to_create)

We can stick all of that into a queryset method, to make it easier to manage.

import copy


class OverlappingQuerySet(models.query.QuerySet):
    def trim_overlapping(self, period):
        """
        Trim/split/remove all overlapping objects.

        * Remove objects in the queryset that are
          "covered" by the period.
        * Split objects that completely cover the
          new period with overlap at both sides
        * Trim objects that intersect with the new
          period and extend in one direction or the
          other, but not both.

        This will do a single query to trim object that need
        trimming, another query that fetches those that need
        splitting, a single delete query to remove all
        split/replaced objects, and finally an optional query
        to create replacement objects for those split.

        That means this method _may_ perform 3 or 4 queries.

        This particular algorithm should work without a
        transaction needing to be present, but in practice
        this action and the create of a new one should be
        in the same transaction, so they can all roll-back
        if anything goes wrong.
        """
        period = ensure_date_range(period)

        overlapping = self.overlapping(period)\
                          .with_overlap_type(period)

        # Easy first: update those that we can just update.
        overlapping.filter(
            overlap_type__startswith=('trim')
        ).update(
            valid_period=models.F('valid_period') - period
        )

        # Create the new objects for each of the ones that
        # extend either side of the new value.
        # There will alwasy be two of them: one for the lower
        # section, and one for the upper section.
        to_create = []
        for instance in overlapping.filter(overlap_type='split'):
            # Setting the primary key to None will trigger a new
            # instance.
            instance.pk = None
            # We need to create two instances, each with a different
            # valid_period.
            valid_period = instance.valid_period
            # The one _before_ the new value.
            instance.valid_period = DateRange(
                valid_period.lower, period.lower, bounds='[)'
            )
            to_create.append(instance)
            # And a new copy to go _after_ the new value.
            instance = copy.deepcopy(instance)
            instance.valid_period = DateRange(
                period.upper, valid_period.upper, bounds='(]'
            )
            to_create.append(instance)


        # Now clean up any that we need to get rid of.
        overlapping.filter(
            overlap_type__in=('replace', 'split')
        ).delete()

        # And finally add back in any replacement objects
        # that extended either side of the new value.
        if to_create:
            self.model._default_manager.bulk_create(to_create)

Yeah, I think that will do for now.