Django bulk_update without upsert

Postgres 9.5 brings a fantastic feature that I’ve really been looking forward to: upsert, in the form of INSERT ... ON CONFLICT. However, I’m not on 9.5 in production yet, and I had a situation that would really have benefitted from being able to use it.

I have to insert lots of objects, but if there is already an object in a given “slot”, then I need to instead update that existing object.

Doing this using the Django ORM can be done on a “one by one” basis: iterate through the objects, find which one (if any) matches the criteria, then update that, or create a new one if there wasn’t a match.

However, this is really slow, as it does two queries for each object.

Instead, it would be great to:

  • fetch all of the instances that could possibly overlap (keyed by the matching criteria)
  • iterate through the new data, looking for a match
    • modify the instance if an existing match is made, and stash into pile “update”
    • create a new instance if no match is found, and stash into the pile “create”
  • bulk_update all of the “update” objects
  • bulk_create all of the “create” objects

Those familiar with Django may recognise that there is only one step here that cannot be done as of “now”.
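
The matching steps in the list above can be sketched in plain Python. This is only an illustration: plain dicts stand in for model instances, and split_for_upsert and its key argument are hypothetical names, not part of Django.

```python
def split_for_upsert(existing, incoming, key):
    """Sort incoming rows into an "update" pile and a "create" pile."""
    # Index the rows we already have by the matching criteria.
    by_key = {key(row): row for row in existing}
    to_update, to_create = [], []
    for row in incoming:
        match = by_key.get(key(row))
        if match is None:
            # Nothing in this slot yet: this is a brand new row.
            to_create.append(dict(row))
        else:
            # Slot already taken: modify the existing row instead.
            match.update(row)
            to_update.append(match)
    return to_update, to_create


existing = [{'id': 1, 'slot': 'a', 'value': 10}]
incoming = [{'slot': 'a', 'value': 11}, {'slot': 'b', 'value': 20}]
to_update, to_create = split_for_upsert(
    existing, incoming, key=lambda row: row['slot']
)
```

With real models, the "create" pile would go to bulk_create, and the "update" pile is the part that still needs a bulk update mechanism.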

So, how can we do a bulk update?

There are two ways I can think of doing it (at least with Postgres):

  • create a temporary table (cloning the structure of the table)
  • insert all of the data into this table
  • update the rows in the original table from the temporary table, based on pk column

and:

  • come up with some mechanism of using the UPDATE the_table SET ... FROM () sq WHERE sq.pk = the_table.pk syntax

It’s possible to use some of the really nice features of Postgres to create a temporary table, that clones an existing table, and will automatically be dropped at the end of the transaction:

BEGIN;

CREATE TEMPORARY TABLE upsert_source (LIKE my_table INCLUDING ALL) ON COMMIT DROP;

-- Bulk insert into upsert_source

UPDATE my_table
   SET foo = upsert_source.foo,
       bar = upsert_source.bar
  FROM upsert_source
 WHERE my_table.id = upsert_source.id;

The drawback of this is that it does two extra queries, but it is possible to implement fairly simply:

from django.db import transaction, connection

@transaction.atomic
def bulk_update(model, instances, *fields):
    cursor = connection.cursor()
    db_table = model._meta.db_table

    try:
        cursor.execute(
            'CREATE TEMPORARY TABLE update_{0} (LIKE {0} INCLUDING ALL) ON COMMIT DROP'.format(db_table)
        )

        model._meta.db_table = 'update_{}'.format(db_table)
        model.objects.bulk_create(instances)

        query = ' '.join([
            'UPDATE {table} SET ',
            ', '.join(
                ('%(field)s=update_{table}.%(field)s' % {'field': field})
                for field in fields
            ),
            'FROM update_{table}',
            'WHERE {table}.{pk}=update_{table}.{pk}'
        ]).format(
            table=db_table,
            pk=model._meta.pk.get_attname_column()[1]
        )
        cursor.execute(query)
    finally:
        model._meta.db_table = db_table

The advantage of this is that it mostly just uses the ORM. There’s limited scope for SQL injection (although you’d probably want to validate the field names).

It’s also possible to do the update directly from a subquery, but without the nice column names:

UPDATE my_table
   SET foo = upsert_source.column2,
       column2 = upsert_source.column3
  FROM (
    VALUES (...), (...)
  ) AS upsert_source
 WHERE upsert_source.column1 = my_table.id;

Note that you must make sure your values are in the correct order (with the primary key first).
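
One way to get the nice column names back is to give the VALUES list a column alias list, which Postgres allows after the table alias. The sketch below only builds the query string and the flattened parameter list. build_values_update and the src alias are hypothetical names, the identifiers are assumed to be trusted (for example, taken from model metadata), and depending on the column types Postgres may still need explicit casts on the placeholders.

```python
def build_values_update(table, pk, fields, rows):
    """Return (query, params) for an UPDATE ... FROM (VALUES ...) statement.

    Assumes `table`, `pk` and `fields` are trusted identifiers, and that
    each row is ordered as (pk, *fields).
    """
    columns = [pk] + list(fields)
    row_sql = '({})'.format(', '.join(['%s'] * len(columns)))
    values_sql = ', '.join([row_sql] * len(rows))
    set_sql = ', '.join('{0} = src.{0}'.format(field) for field in fields)
    query = (
        'UPDATE {table} SET {set_sql} '
        'FROM (VALUES {values_sql}) AS src ({columns}) '
        'WHERE {table}.{pk} = src.{pk}'
    ).format(
        table=table,
        set_sql=set_sql,
        values_sql=values_sql,
        columns=', '.join(columns),
        pk=pk,
    )
    # Flatten the rows so they line up with the placeholders, pk first.
    params = [value for row in rows for value in row]
    return query, params


query, params = build_values_update(
    'my_table', 'id', ['foo', 'bar'], [(1, 'a', 'b'), (2, 'c', 'd')]
)
```

The values themselves never touch the query string: they are passed separately as parameters.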

Attempting to prevent some likely SQL injection vectors, we want to build up the fixed parts of the query (and the parts that are controlled by the django model, like the table and field names), and then pass the values in as query parameters.

from django.db import connection

def bulk_update(model, instances, *fields):
    set_fields = ', '.join(
        ('%(field)s=update_{table}.column%(i)s' % {'field': field, 'i': i + 2})
        for i, field in enumerate(fields)
    )
    value_placeholder = '({})'.format(', '.join(['%s'] * (len(fields) + 1)))
    values = ','.join([value_placeholder] * len(instances))
    query = ' '.join([
        'UPDATE {table} SET ',
        set_fields,
        'FROM (VALUES ', values, ') update_{table}',
        'WHERE {table}.{pk} = update_{table}.column1'
    ]).format(table=model._meta.db_table, pk=model._meta.pk.get_attname_column()[1])
    params = []
    for instance in instances:
        params.append(instance.pk)
        for field in fields:
            params.append(getattr(instance, field))

    connection.cursor().execute(query, params)

This feels like a reasonable first draft; however, I’d probably want to go and look at how the query for bulk_create is created, and modify that. There’s a fair bit going on there that I haven’t followed as yet, though. Note that this does not need the @transaction.atomic decorator, as it is only a single statement.

From here, we can build an upsert that assumes all objects with a PK need to be updated, and those without need to be inserted:

from django.utils.functional import partition
from django.db import transaction

@transaction.atomic
def bulk_upsert(model, instances, *fields):
    update, create = partition(lambda obj: obj.pk is None, instances)
    if update:
        bulk_update(model, update, *fields)
    if create:
        model.objects.bulk_create(create)

Django second AutoField

Sometimes, your ORM just seems to be out to get you.

For instance, I’ve been investigating a technique for the most important data structure in a system to be essentially immutable.

That is, instead of updating an existing instance of the object, we always create a new instance.

This requires a handful of things to be useful (and useful for querying).

  • We probably want to have a self-relation so we can see which object supersedes another. A series of objects that supersede one another is called a lifecycle.
  • We want to have a timestamp on each object, so we can view a snapshot at a given time: that is, which phase of the lifecycle was active at that point.
  • We should have a column that is unique per-lifecycle: this makes querying all objects of a lifecycle much simpler (although we can use a recursive query for that).
  • There must be a facility to prevent multiple heads on a lifecycle: that is, at most one phase of a lifecycle may be non-superseded.
  • The lifecycle phases needn’t be in the same order, or really have any differentiating features (like status). In practice they may, but for the purposes of this, they are just “what it was like at that time”.

I’m not sure these ideas will ever get into a released product, but the work behind them was fun (and all my private work).

The basic model structure might look something like:

class Phase(models.Model):
    phase_id = models.AutoField(primary_key=True)
    lifecycle_id = models.AutoField(primary_key=False, editable=False)

    superseded_by = models.OneToOneField('self',
        related_name='supersedes',
        null=True, blank=True, editable=False
    )
    timestamp = models.DateTimeField(auto_now_add=True)

    # Any other fields you might want...

    objects = PhaseQuerySet.as_manager()

So, that looks nice and simple.

Our second AutoField will have a sequence generated for it, and the database will give us a unique value from a sequence when we try to create a row in the database without providing this column in the query.

However, there is one problem: Django will not let us have a second AutoField in a model. And, even if it did, there would still be some problems. For instance, every time we attempt to create a new instance, the AutoField values are not sent to the database, which breaks our ability to keep the lifecycle_id between phases.

So, we will need a custom field. Luckily, all we really need is the SERIAL database type: that creates the sequence for us automatically.

class SerialField(models.Field):
    def db_type(self, connection):
        return 'serial'

So now, using that field type instead, we can write a bit more of our model:

class Phase(models.Model):
    phase_id = models.AutoField(primary_key=True)
    lifecycle_id = SerialField(editable=False)
    superseded_by = models.OneToOneField('self', ...)
    timestamp = models.DateTimeField(auto_now_add=True)

    def save(self, **kwargs):
        self.pk = None
        super(Phase, self).save(**kwargs)

This now ensures each time we save our object, a new instance is created. The lifecycle_id will stay the same.

Still not totally done though. We currently aren’t handling a newly created lifecycle (which should be handled by the associated postgres sequence), nor are we marking the previous instance as superseded.

It’s possible, using some black magic, to get the default value for a database column, and, in this case, execute a query with that default to get the next value. However, that’s pretty horrid: not to mention it also runs an extra two queries.

Similarly, we want to get the phase_id of the newly created instance, and set that as the superseded_by of the old instance. This would require yet another query, after the INSERT, but also has the sinister side-effect of making us unable to apply the not-superseded-by-per-lifecycle requirement.

As an aside, we can investigate storing the self-relation on the other end - this would enable us to just do:

    def save(self, **kwargs):
        self.supersedes = self.pk
        self.pk = None
        super(Phase, self).save(**kwargs)

However, this turns out to be less useful when querying: we are much more likely to be interested in phases that are not superseded, as they are the “current” phase of each lifecycle. Although we could query, it would be running sub-queries for each row.

Our two issues, setting the lifecycle and storing the superseding data, can be handled with one Postgres BEFORE INSERT trigger function:

CREATE FUNCTION lifecycle_and_supersedes()
RETURNS TRIGGER AS $$

  BEGIN
    IF NEW.lifecycle_id IS NULL THEN
      NEW.lifecycle_id = nextval('phase_lifecycle_id_seq'::regclass);
    ELSE
      NEW.phase_id = nextval('phase_phase_id_seq'::regclass);
      UPDATE app_phase
        SET superseded_by_id = NEW.phase_id
        WHERE lifecycle_id = NEW.lifecycle_id
        AND superseded_by_id IS NULL;
    END IF;
    -- A BEFORE trigger must return the (possibly modified) row.
    RETURN NEW;
  END;

$$ LANGUAGE plpgsql VOLATILE;

CREATE TRIGGER lifecycle_and_supersedes
  BEFORE INSERT ON app_phase
  FOR EACH ROW
  EXECUTE PROCEDURE lifecycle_and_supersedes();

So, now all we need to do is prevent multiple-headed lifecycles. We can do this using a UNIQUE INDEX:

CREATE UNIQUE INDEX prevent_hydra_lifecycles
ON app_phase (lifecycle_id)
WHERE superseded_by_id IS NULL;

Wow, that was simple.
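
To see the partial unique index doing its job without a Postgres server to hand, here is a rough sketch using Python's sqlite3 module, since SQLite also supports partial indexes. The table is a cut-down stand-in for app_phase with no trigger, so the "supersede" step is done by hand.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE app_phase ('
    ' phase_id INTEGER PRIMARY KEY,'
    ' lifecycle_id INTEGER NOT NULL,'
    ' superseded_by_id INTEGER)'
)
conn.execute(
    'CREATE UNIQUE INDEX prevent_hydra_lifecycles'
    ' ON app_phase (lifecycle_id) WHERE superseded_by_id IS NULL'
)

# The first head of lifecycle 1.
conn.execute('INSERT INTO app_phase (phase_id, lifecycle_id) VALUES (1, 1)')

# A second head on the same lifecycle is rejected by the index.
try:
    conn.execute('INSERT INTO app_phase (phase_id, lifecycle_id) VALUES (2, 1)')
    second_head_allowed = True
except sqlite3.IntegrityError:
    second_head_allowed = False

# Once the old head is marked as superseded, the new phase is accepted.
conn.execute('UPDATE app_phase SET superseded_by_id = 2 WHERE phase_id = 1')
conn.execute('INSERT INTO app_phase (phase_id, lifecycle_id) VALUES (2, 1)')
heads = conn.execute(
    'SELECT phase_id FROM app_phase WHERE superseded_by_id IS NULL'
).fetchall()
```
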

So, we have most of the db-level code written. How do we use our model? We can write some nice queryset methods to make getting the various bits easier:

class PhaseQuerySet(models.query.QuerySet):
    def current(self):
        return self.filter(superseded_by=None)

    def superseded(self):
        return self.exclude(superseded_by=None)

    def initial(self):
        return self.filter(supersedes=None)

    def snapshot_at(self, timestamp):
        return self.filter(timestamp__lte=timestamp).order_by('lifecycle_id', '-timestamp').distinct('lifecycle_id')

The queries generated by the ORM for these should be pretty good: we could look at sticking an index on the lifecycle_id column.
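
The snapshot_at method leans on .distinct() with a field name, which Django only supports on Postgres (it becomes DISTINCT ON). What it is meant to return can be sketched in plain Python, with (phase_id, lifecycle_id, timestamp) tuples standing in for rows. The data and the integer timestamps are made up for illustration.

```python
def snapshot_at(phases, timestamp):
    """Return the latest phase per lifecycle at or before `timestamp`.

    `phases` is an iterable of (phase_id, lifecycle_id, timestamp) tuples.
    """
    latest = {}
    for phase in phases:
        phase_id, lifecycle_id, created = phase
        if created > timestamp:
            continue  # Not yet created at the time of the snapshot.
        current = latest.get(lifecycle_id)
        if current is None or created > current[2]:
            latest[lifecycle_id] = phase
    return sorted(latest.values(), key=lambda phase: phase[1])


phases = [
    (1, 1, 10),
    (2, 1, 20),
    (3, 2, 15),
    (4, 1, 30),
]
snapshot = snapshot_at(phases, 25)
early_snapshot = snapshot_at(phases, 12)
```
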

There is one more thing to say on the lifecycle: we can add a model method to fetch the complete lifecycle for a given phase, too:

    def lifecycle(self):
        return type(self).objects.filter(lifecycle_id=self.lifecycle_id)

(That was why I used the lifecycle_id as the column).


Whilst building this prototype, I came across a couple of things that were also interesting. The first was a mechanism to get the default value for a column:

from django.db import connection


def database_default(table, column):
    cursor = connection.cursor()
    QUERY = """SELECT d.adsrc AS default_value
               FROM   pg_catalog.pg_attribute a
               LEFT   JOIN pg_catalog.pg_attrdef d ON (a.attrelid, a.attnum)
                                                   = (d.adrelid,  d.adnum)
               WHERE  NOT a.attisdropped   -- no dropped (dead) columns
               AND    a.attnum > 0         -- no system columns
               AND    a.attrelid = %s::regclass
               AND    a.attname = %s"""
    cursor.execute(QUERY, [table, column])
    cursor.execute('SELECT {}'.format(*cursor.fetchone()))
    return cursor.fetchone()[0]

You can probably see why I didn’t want to use this. Other than the aforementioned two extra queries, it’s executing a query with data that comes back from the database. It may be possible to inject a default value into a table that causes it to do Very Bad Things™. We could sanitise it, perhaps ensure it matches a regular expression:

NEXTVAL = re.compile(r"^nextval\('(?P<sequence>[a-zA-Z_0-9]+)'::regclass\)$")
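
As a rough sketch of how that check might be used: sequence_from_default is a hypothetical helper, and the default strings are made-up examples of what the catalog query could return.

```python
import re

NEXTVAL = re.compile(r"^nextval\('(?P<sequence>[a-zA-Z_0-9]+)'::regclass\)$")


def sequence_from_default(default):
    """Return the sequence name if `default` is a plain nextval() call."""
    match = NEXTVAL.match(default)
    return match.group('sequence') if match else None


safe = sequence_from_default(
    "nextval('app_phase_lifecycle_id_seq'::regclass)"
)
# Anything beyond a bare nextval() call is rejected outright.
unsafe = sequence_from_default(
    "nextval('x'::regclass); DROP TABLE app_phase"
)
```
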

However, the trigger-based approach is nicer in every way.

The other thing I discovered, and this one is really nice, is a way to create an exclusion constraint that only applies if a column is NULL. For instance, ensure that no two classes for a given student overlap, but only if they are not superseded (or deleted).

ALTER TABLE "student_enrolments"
ADD CONSTRAINT "prevent_overlaps"
EXCLUDE USING gist(period WITH &&, student_id WITH =)
WHERE (
  superseded_by_id IS NULL
  AND
  status <> 'deleted'
);

Django Proxy Model Relations

I’ve got lots of code I’d do a different way if I were to start over, but often, we have to live with what we have.

One situation I would seriously reconsider is the structure I use for storing data related to how I interact with external systems. I have an Application object, and I create instances of this for each external system I interact with. Each new Application gets a UUID, and is created as part of a migration. Code in the system uses this UUID to determine if something is for that system.

But that’s not the worst of it.

I also have an AppConfig object, and other related objects that store a relation to an Application. This was fine initially, but as my code got more complex, I hit upon the idea of using Django’s Proxy models, and using the related Application to determine the subclass. So, I have AppConfig subclasses for a range of systems. This is nice: we can even ensure that we only get the right instances (using a lookup to the application to get the discriminator, which I’d probably do a different way next time).

However, we also have other bits of information that we need to store, that has a relation to this AppConfig object.

And here is where we run into problems. Eventually, I had the need to subclass these other objects, and deal with them. That gives a similar benefit to above for fetching filtered lists of objects, however when we try to follow the relations between these, something annoying happens.

Instead of getting the subclass of AppConfig, that we probably want to use because the business logic hangs off that, we instead get the actual AppConfig instances. So, in order to get the subclass, we have to fetch the object again, or swizzle the __class__. And, going back the other way would have the same problem.
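
For anyone who hasn’t seen the __class__ swizzle before, here it is in plain Python, with no Django involved. SpecialConfig is a hypothetical stand-in for one of the proxy subclasses.

```python
class AppConfig(object):
    def __init__(self, name):
        self.name = name

    def describe(self):
        return 'generic config for {}'.format(self.name)


class SpecialConfig(AppConfig):
    # A "proxy" subclass: same data, different behaviour.
    def describe(self):
        return 'special config for {}'.format(self.name)


config = AppConfig('example')
before = config.describe()

# Swap the class on the existing instance: no refetch, same attributes.
config.__class__ = SpecialConfig
after = config.describe()
```

It works, but every place that follows a relation has to remember to do it, which is what the rest of this post tries to avoid.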

Python is a dynamic language, so we should be able to do better.

In theory, all we have to do is replace the attributes on the two classes with ones that will do what we want them to do. In practice, we need to muck around a little bit more to make sure it all works out right.

It would be nice to be able to decorate the declaration of the overridden field, but that’s not valid python syntax:

>>> class Foo(object):
...   @override
...   bar = object()
  File "<stdin>", line 3
    bar = object()
SyntaxError: invalid syntax

So, we’ll have to do one of two things: alter the class after it has been defined, or leverage the metaclass magic Django already does.

class Foo(models.Model):
    bar = models.ForeignKey('bar.Bar')


class FooProxy(Foo):
    bar = ProxyForeignKey('bar.BarProxy')  # Note the proxy class

    class Meta:
        proxy = True

However, we can’t just use the contribute_to_class(cls, name) method as-is, as the Proxy model attributes get dealt with before the parent model. So, we need to register a signal, and get the framework to run our code after the class has been prepared:

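from django.db import models
from django.db.models import ForeignKey
from django.dispatch import receiver


class ProxyField(object):
    def __init__(self, field):
        self.field = field

    def contribute_to_class(self, model, name):
        # Proxy model attributes are handled before the parent model's, so
        # defer the override until the class has been fully prepared.
        @receiver(models.signals.class_prepared, sender=model, weak=False)
        def late_bind(sender, *args, **kwargs):
            override_model_field(model, name, self.field)


class ProxyForeignKey(ProxyField):
    def __init__(self, *args, **kwargs):
        super(ProxyForeignKey, self).__init__(ForeignKey(*args, **kwargs))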

Then, it’s a matter of working out what needs to happen to override_model_field.

It turns out: not much. Until we start thinking about edge cases, anyway:

def override_model_field(model, name, field):
    original_field = model._meta.get_field(name)

    model.add_to_class(name, field)
    if field.rel:
      field.rel.to.add_to_class(
        field.related_name,
        ForeignRelatedObjectsDescriptor(field.related)
      )

There is a little more to it than that:

  • We need to use the passed-in related_name if one was provided in the new field definition, else we want to use what the original field’s related name was. However, if not explicitly set, then neither field will actually have a related_name attribute.
  • We cannot allow an override of a foreign key to a non-proxy model: that would hijack the original model’s related queryset.
  • Similarly, we can only allow a single proxy to override-relate to another proxy: any subsequent override-relations would likewise hijack the related object queryset.
  • For non-related fields, we can only allow an override if the field is compatible. What that means I’m not completely sure just yet, but for now, we will only allow the same field class (or a subclass). Things that would require a db change would be verboten.

So, we need to guard our override_model_field somewhat:

def override_model_field(model, name, field):
    original_field = model._meta.get_field(name)

    if not isinstance(field, original_field.__class__):
        raise TypeError('...')

    # Must do these checks before the `add_to_class`, otherwise it breaks tests.
    if field.rel:
        if not field.rel.to._meta.proxy:
            raise TypeError('...')

        related_name = getattr(field, 'related_name', original_field.related.get_accessor_name())
        related_model = getattr(field.rel.to, related_name).related.model

        # Do we already have an overridden relation to this model?
        if related_model._meta.proxy:
            raise TypeError('...')

    model.add_to_class(name, field)

    if field.rel:
        field.rel.to.add_to_class(
          related_name,
          ForeignRelatedObjectsDescriptor(field.related)
        )

There is an installable app that includes tests: django-proxy-overrides.

Avoiding SQL Antipatterns using Django (and Postgres)

The book SQL Antipatterns is one of my favourite books. I took the opportunity to reread it on a trip to Xerocon in Sydney, and as usual it enlightened me to things I am probably doing wrong in my database interactions.

So, I’m going to look at these Antipatterns, and discuss how you can avoid them when using Django. This post is intended to be read with each chapter of the book. I’ve used the section headings, but instead of the chapter headings, I’ve used the Antipattern headings. They are still in the same order, though.

It seems the printed version of this book is on sale now: I’m tempted to buy a few extra copies for gifts. Ahem, cow-orkers.

Logical Database Design Antipatterns

Format Comma-Separated Lists

This one is pretty simple: use a relation instead of a Comma Separated field. In the cases described in the book, a ManyToManyField is in fact simpler than a Comma Separated field. Django gets a gold star here, both for ease of use and for its documentation about relations.

However, there may be times when a relation is overkill, and a real array is better. For instance, when storing data related to which days of the week are affected by a certain condition, it may make sense to store it in this way.

But we can do better than a simple Comma Separated field. Storing the data in a Postgres Array means we can rely on the database to validate the data, and allows searching. Similarly, we could store it in JSON, too.

I’ve maintained a JSONField for Django, although it’s not easily queryable. However, an ArrayField is coming in Django 1.8. There are alternatives already available if you need to use one now. I’ve got a project to mostly backport the django.contrib.postgres features to 1.7: django-postgres.

Things like JSON, Array and Hstore are a better solution than storing otherwise-delimited values in a straight text column too. With Django 1.7, it became possible to have custom lookups, which can leverage the DBMS’ ability to query these datatypes.

Always Depend on One’s Parent

Read chapter online.

Straight into a trickier one! Django’s documentation points out how to create this type of relation, but does not call out the possible issues. This book is worth it for this section alone.

So, how do we deal with trees in Django?

We can use django-mptt. This gives us (from what I can see) the “Nested Sets” pattern outlined in the book, but under the name “Modified Preorder Tree Traversal”.

I’m quite interested in the idea of using a Closure Table, and there are a couple of projects with quite different approaches to this:

  • django-ctt: uses a Model class you inherit from.
  • django-ct: better documented, but uses an unusual pattern of a pseudo-manager-thing.

Knowing me, I’m probably going to spend some time building a not-complete implementation at some point.

Update: Whilst I haven’t built an implementation of a Closure Table, I did implement recursive queries for an Adjacency List.

One Size Fits All

Using a field id for all tables by default is probably one of the biggest mistakes I think Django makes. And, as we shall see, we can’t yet avoid them, for at least a subset of situations.

Indeed, Django can use any single column for the primary key, and doesn’t require the use of a key column of name id. So, in my mind, it would have been better to use the <tablename>_id, as suggested in the book. Especially since you may also access the primary key attribute using the pk shortcut.

class Foo(models.Model):
    foo_id = models.AutoField(primary_key=True)

However, it’s not currently possible to do composite primary keys (but may be soon), which makes doing the best thing for a plain ManyToManyField impossible: indeed, you don’t control that table anyway, and if you remove the id column (and create a proper primary key), things don’t work. In practice, you can just ignore this issue, since you (mostly) don’t deal with this table, or the objects from it.

So, assuming we are changing the id column into the name suggested in the book, what does that give us?

Nothing, until we actually need to write raw SQL code, and specifically code that joins multiple tables.

Then, we are able to use a slightly less verbose way of defining the join, and not worry about duplicate columns named id:

SELECT * FROM foo_foo JOIN foo_bar USING (foo_id);
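
A quick way to see what USING buys you is to run that join against throwaway tables. The sketch below uses Python's sqlite3 module purely for convenience (SQLite also collapses the shared column); the extra columns are made up.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE foo_foo (foo_id INTEGER PRIMARY KEY, name TEXT)')
conn.execute(
    'CREATE TABLE foo_bar ('
    ' bar_id INTEGER PRIMARY KEY, foo_id INTEGER, label TEXT)'
)
conn.execute("INSERT INTO foo_foo VALUES (1, 'first')")
conn.execute("INSERT INTO foo_bar VALUES (10, 1, 'child')")

cursor = conn.execute('SELECT * FROM foo_foo JOIN foo_bar USING (foo_id)')
# The join column appears once in the result, not once per table.
columns = [description[0] for description in cursor.description]
rows = cursor.fetchall()
```
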

I’m still not sure if it’s actually worthwhile doing this or not. I’m going to start doing it, just to see whether there are any drawbacks (already found one in some of my own code, that hard-coded an id field), or any great benefits.

Leave out the Constraints

Within Django, it’s more work to create relations without the relevant constraints, and it’s not possible to create a table without a primary key, so we can just pass this one by with a big:

smile and wave, boys

Use a Generic Attribute Table

Again, it’s possible to create this type of monstrosity in Django, but not easy. A better solution, if your table’s requirements change, is to use migrations (included in Django 1.7), or a more flexible store, like JSON or Hstore. This also has the added advantage of being a column, rather than a related table, which means you can fetch it in one go, simply. Similarly, with Postgres 9.3, you can do all sorts of querying, and even more in 9.4.

Document or key stores are no substitute for proper attributes, but they do have their uses.

The other solution is to use Model inheritance, which Django does well. You can choose either abstract or concrete table inheritance, and with something like django-model-utils, even get some nice features like fetching only the subtypes when fetching a queryset of superclass models.

Use Dual-Purpose Foreign Key

Unfortunately, Django comes with a built-in way to do this: so-called Generic Relations.

Using this, it’s possible to have an association from a given model instance to any other object of any other model class.

“You may find that this antipattern is unavoidable if you use an object-relational programming framework […]. Such a framework may mitigate the risks introduced by Polymorphic Associations by encapsulating application logic to maintain referential integrity. If you choose a mature and reputable framework, then you have some confidence that its designers have written the code to implement the association without error.”

I guess we’ll just have to rely on the fact Django is a mature and reputable framework.

In all reality, I’ve used this type of relation once: for notifications that need to be able to refer to any given object. It’s also possible to use, say, a tagging app that had generic relations. But, I’m struggling to think of too many situations where it would be better than a proper relation.

I’ve also come across it in django-reversion, and running queries against objects from it is a pain in the arse.

Create Multiple Columns

Interestingly, the example for this Antipattern is the example I just used above: tags. And, this type of situation should be done in a better way: a proper relation, or perhaps an Array type. It all depends how good your database is at querying arrays. django.contrib.postgres makes this rather easy:

class Post(models.Model):
    name = models.CharField(...)
    tags = ArrayField(models.CharField(...), blank=True)

Post.objects.filter(tags__contains=['foo'])

What may not be so easy is getting all of the tags in use. This may be possible: I just haven’t thought of a way to do this yet. A nice syntax might be:

Post.objects.aggregate(All('tags'))

The SQL you might be able to use to get this could look like:

SELECT
  array_agg(DISTINCT t.tag) AS tags
FROM (
  SELECT unnest(tags) AS tag FROM posts
) t;

I’m not sure if there’s a better way to get this data.

Clone Tables or Columns

I can’t actually see that doing this in Django would be easy, or likely. It’s gotten me interested in some method of seamlessly doing Horizontal Partitioning as a method of archiving old data, and perhaps moving it to a different database. Specifically, moving old audit data into a separate store may become necessary at some point.

Partitioning using a multi-tenancy approach using Postgres’ schemata is another of my interests, and I’ve been working on a django-specific way to do this: django-boardinghouse. Note, this is a partial-segmentation approach, where some tables are shared, but others are per-schema.

Physical Database Design Antipatterns

Use FLOAT Data Type

Just don’t.

There’s a DecimalField, and no reason not to use it.
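
If you need convincing, the classic demonstration works just as well in Python as it does in SQL. This is only an illustration of binary floating point versus decimal arithmetic, not anything Django-specific.

```python
from decimal import Decimal

# Binary floats cannot represent 0.1 or 0.2 exactly.
float_total = 0.1 + 0.2

# Decimals built from strings keep the exact values.
decimal_total = Decimal('0.1') + Decimal('0.2')

float_is_exact = (float_total == 0.3)
decimal_is_exact = (decimal_total == Decimal('0.3'))
```
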

Specify Values in the Column Definition

The example the book uses is to define check constraints on a given question. Django’s approach is a bit different: the valid choices are defined in the column definition, but can be changed in code at any time. Any existing values that are no longer valid are fine, but any attempt to save an object will require it to have one of the newly valid choices.

This is both better and worse than the problem described in the book. There’s no way (short of a migration) to change the existing data, but maybe that’s actually just better.

Again, the best solution is just to use a related field, but in some cases this is indeed overkill: specifically if values are unlikely to change.

Assume You Must Use Files

I’m still 50-50 on this one. Basically, storing binary files in your database (a) makes the database much bigger, which means it takes longer to back it up (and restore it), and (b) means that it’s harder to do things like use the web server, rather than the application server, to serve static files (even those user-supplied, that must be authenticated).

The main disadvantage, of not having backups, is purely an operations issue.

The secondary disadvantage, the lack of transactionality, is also easily solved: don’t delete files (unless necessary), and don’t overwrite them. If you really must, then use a Postgres NOTIFY delete-file <filepath> or similar, and have a listener that handles that.

The other disadvantage, about SQL privileges, is mostly moot under Django anyway, as you are always running as the one database user.

Using Indexes Without a Plan

Indexes are fairly tangential to an ORM: I’m going to pass over this one without too much comment. I’ve been doing a fair bit of index-level optimisation on my production database lately, in an effort to improve performance. Mostly, it’s better to optimise the query, as the likely targets for indexes probably already have them.

Query Antipatterns

Use Null as an Ordinary Value, or Vice Versa.

Python has its own None type/value, and using it in queries basically converts it into NULL. Django is a little annoying in that, at times, it stores empty strings instead of NULL in string fields. I was playing around with making these into proper NULLs, but it seemed to create other problems.

At least there is no established pattern to use other values instead of NULL.

Reference Non-grouped Columns

Since I’m dealing with Postgres, I understand this one is not much of an issue. Your query will fail if you build it wrong. Which should be the way databases work.

Sort Data Randomly

Read this chapter online.

The problem of how to fetch a single random instance from a Model comes up every now and then on IRC, indeed, it did again last weekend. Unsurprisingly, I provided a link to this chapter.

One solution that is presented in the book is to select a single row, using a random offset:

import random
# Note: the initial version of this would fail since queryset.count()
# is the number of elements, randint(a, b) includes the value 'b',
# and queryset[b] would be out of range.
index = random.randint(0, queryset.count() - 1)
instance = queryset.all()[index]

This converts to the query:

SELECT * FROM "table" LIMIT 1 OFFSET %s;

However, without an ordering, I believe this will still do a complete table seek. Instead, you want to order on a column with an index. Like the primary key:

instance = queryset.order_by('pk')[index]

It does take two queries, but sometimes two queries is better than one. Obviously, if your table was always going to be small, it may be better to do the random ordering:

instance = queryset.order_by('?')[0]
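
The two-query approach can be tried outside Django as well. This sketch uses sqlite3 and a made-up item table. random.randrange(count) is used instead of randint so that the offset can never equal the row count.

```python
import random
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE item (id INTEGER PRIMARY KEY, name TEXT)')
conn.executemany(
    'INSERT INTO item (name) VALUES (?)', [('a',), ('b',), ('c',)]
)

# Query one: how many rows are there?
count = conn.execute('SELECT COUNT(*) FROM item').fetchone()[0]

# randrange(count) yields 0 .. count - 1, so the offset is always in range.
offset = random.randrange(count)

# Query two: fetch a single row at that offset, ordered by an indexed column.
row = conn.execute(
    'SELECT id, name FROM item ORDER BY id LIMIT 1 OFFSET ?', (offset,)
).fetchone()
```

An empty table needs handling separately, as randrange(0) raises ValueError.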

Pattern Matching Predicates

I’m sorry to say Django makes it far too easy to do this:

queryset.filter(foo__contains='bar')

Becomes something like:

SELECT * FROM "table" WHERE "table"."foo" LIKE '%bar%';

In many cases, this will be fine, but as you can imagine, you may get surprising matches, or performance may really suck.
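
The “surprising matches” part is easy to demonstrate. The sketch below uses sqlite3 and some made-up post titles; note that SQLite’s LIKE is case-insensitive for ASCII by default, unlike Postgres, which is why the data is all lower case.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE post (title TEXT)')
conn.executemany(
    'INSERT INTO post (title) VALUES (?)',
    [
        ('bar opening hours',),
        ('trade embargo lifted',),
        ('crowbar review',),
        ('nothing to see here',),
    ],
)

# A substring match has no idea where words begin or end.
matches = [
    title for (title,) in conn.execute(
        "SELECT title FROM post WHERE title LIKE '%bar%' ORDER BY title"
    )
]
```

Only one of those titles is actually about a bar, but three of them match.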

Using Postgres’s full-text search is relatively simple: you can quite easily make a custom field that handles this, and with Django 1.7 or later, you can even create your own lookups:

from django.db import models


class TSVectorField(models.Field):
    def db_type(self, connection):
        return 'tsvector'


class TSVectorMatches(models.lookups.BuiltinLookup):
    lookup_name = 'matches'

    def process_lhs(self, qn, connection, lhs=None):
        lhs = lhs or self.lhs
        return qn.compile(lhs)

    def get_rhs_op(self, connection, rhs):
        # rhs is the parameter placeholder, so this yields '@@ to_tsquery(%s)'
        return '@@ to_tsquery(%s)' % rhs

TSVectorField.register_lookup(TSVectorMatches)

Then, on a correctly defined field, you are able to do:

queryset.filter(foo__matches='bar')

Which roughly translates to:

SELECT * FROM "table" WHERE (foo @@ to_tsquery('bar'));

It’s actually a little more complicated than that, but I have a working prototype at https://bitbucket.org/schinckel/django-postgres/. There is a field class, but also an example within the search sub-app.

Clearly, you’ll want to be creating the right indexes.

Solve a Complex Problem in One Step

By their very nature, ORMs tend to make this a little less easy to do. Because you don’t normally write custom code, this scenario is less common than you might see with direct SQL access.

With Django, though, it is possible to write over-complicated queries, and also to use things like .raw() and .extra() to write “Spaghetti Queries”.

That said, with judicious use of these features, you can write queries that perform exceptionally well: far better than anything the ORM is able to generate for you. It’s also worth noting that you can write really, really bad queries that take a very long time, just using the ORM (without even doing things like N+1 queries for related objects).

Indeed, the “how to recognize” section of this chapter shows the biggest red flag I have noticed lately: “Just stick another DISTINCT in there”.

I’ve seen first-hand how a .distinct() can cause a query to take a very long time. Removing the need for a distinct by removing the join, and instead using subqueries, caused a query that was taking around 17 seconds with a given data set to suddenly take less than 200ms.

That alone has forced me to reconsider each and every time I use .distinct() in my code (and probably explains why our code that runs queries against django-reversion performs so horribly).
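
The shape of that change can be sketched with sqlite3 from the standard library. This is not the query from the story above: the tables and data are made up, and a data set this small shows nothing about timing. It only shows why the join needs DISTINCT while the subquery does not:

```python
import sqlite3

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE person_location (person_id INTEGER, location_id INTEGER);
    INSERT INTO person VALUES (1, 'alice'), (2, 'bob'), (3, 'carol');
    INSERT INTO person_location VALUES (1, 10), (1, 11), (2, 10), (3, 12);
""")

# Join version: alice matches two locations, so she would appear twice
# unless DISTINCT collapses the duplicates afterwards.
joined = conn.execute("""
    SELECT DISTINCT p.name FROM person p
    JOIN person_location pl ON pl.person_id = p.id
    WHERE pl.location_id IN (10, 11)
    ORDER BY p.name
""").fetchall()

# Subquery version: each person row is tested once, so no duplicates arise
# and no DISTINCT is needed.
subquery = conn.execute("""
    SELECT p.name FROM person p
    WHERE p.id IN (
        SELECT pl.person_id FROM person_location pl
        WHERE pl.location_id IN (10, 11)
    )
    ORDER BY p.name
""").fetchall()
```

Both queries return the same people; the difference is that the second never builds the duplicated rows in the first place.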

A Shortcut That Gets You Lost

I’ve used, in my SQL snippets in this post, the shortcut that is mentioned here: SELECT * FROM .... Luckily, Django doesn’t use this shortcut, and instead lists out every column it expects to see.

This has a really nice side-effect: if your database tables have not been migrated to add that new column, then whenever you try to run any queries against that table, you will have an error. Which is much more likely to happen immediately, rather than at 3am when that column is first actually used.
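
The fail-fast behaviour is easy to see outside Django too. A minimal sketch with sqlite3, assuming a hypothetical person table that has not yet been given an email column:

```python
import sqlite3

# Hypothetical table that is "missing" a newer email column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO person (name) VALUES ('alice')")

# SELECT * happily returns whatever columns exist today.
star_row = conn.execute("SELECT * FROM person").fetchone()

# Naming every expected column fails as soon as the query runs.
try:
    conn.execute("SELECT id, name, email FROM person")
    explicit_error = None
except sqlite3.OperationalError as exc:
    explicit_error = str(exc)
```

The explicit column list turns a missing migration into an immediate error on the first query, instead of a missing attribute much later.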

Application Development Antipatterns

Store Password in Plain Text

There is no, I repeat, no reason you should ever be doing this. It’s a cardinal sin, and Django has a great authentication and authorisation framework that you can extend however you need.
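
In a Django project you would not write this yourself: the user model's set_password() and check_password() methods handle it, using salted PBKDF2 with SHA-256 by default. Purely to show what "not plain text" means, here is a standard-library sketch of the same idea. The function names and the iteration count are assumptions for illustration, not Django's implementation:

```python
import hashlib
import hmac
import os


def hash_password(password, salt=None, iterations=200_000):
    """Return (salt, iterations, digest); store these, never the password."""
    salt = salt or os.urandom(16)  # a fresh random salt per password
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations
    )
    return salt, iterations, digest


def verify_password(password, salt, iterations, digest):
    """Re-derive the digest from the candidate and compare in constant time."""
    candidate = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations
    )
    return hmac.compare_digest(candidate, digest)
```

Only the salt, the iteration count and the digest are stored, so a leaked table does not directly reveal anyone's password.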

As noted in the legitimate uses section: if you are accessing a third-party system, you may need to store the password in a readable format. In this case, something like OAuth, if available, may make things a little safer.

Execute Unverified Input As Code

Read this chapter online.

Most of the risks of SQL Injection are mitigated when you use an ORM like Django’s. Of course, if you write .raw() or .extra() queries that don’t properly escape user-provided data, then you may still be at risk. .extra() in particular has arguments that allow you to pass an iterable of parameters, which will then be correctly escaped as they are added to the query.
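
The difference between splicing user input into the SQL text and passing it as a parameter can be shown with sqlite3 from the standard library (the account table is hypothetical). The same principle applies to Django's .raw(sql, params) and .extra(where=[...], params=[...]), which use %s placeholders rather than sqlite3's ?:

```python
import sqlite3

# Hypothetical table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT, secret TEXT)")
conn.executemany(
    "INSERT INTO account VALUES (?, ?)",
    [("alice", "a-secret"), ("bob", "b-secret")],
)

user_input = "nobody' OR '1'='1"

# Unsafe: the input becomes part of the SQL text and rewrites the WHERE clause.
unsafe_sql = "SELECT name FROM account WHERE name = '%s'" % user_input
leaked = conn.execute(unsafe_sql).fetchall()

# Safe: the input is passed separately and is only ever treated as a value.
safe = conn.execute(
    "SELECT name FROM account WHERE name = ?", (user_input,)
).fetchall()
```

The string-formatted query returns every account, while the parameterised one correctly finds nobody with that name.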

Filling in the Corners

Educate your manager if (s)he thinks it’s a bad thing to have non-contiguous primary keys. Transaction rollbacks, deleted objects: there’s all sorts of reasons why there may be gaps.

Making Bricks Without Straw

It goes without saying that you should have error handling within your Python code.

Make SQL a Second-Class Citizen

This is kind of the point of an ORM: to remove from you the need to deal with creating complex queries in raw SQL.

Your Django models are the documentation of your table structure, or documentation can be generated from them. Your migration files show the changes that have been made over time. Naturally, both of these will be stored in your Source Code Management system.

Clearly, as soon as you are doing anything in raw SQL, then you should follow the practices you do with the rest of your code.

Testing in-database is something I am a little bit interested in. As I move more code into the database (often for performance reasons, sometimes because it’s just fun), it would be nice to have tests for these functions. I have a long list of things in my Reading List about Postgres Unit Testing. Perhaps I’ll get around to them at some point. Integrating these with the Django test runner would be really neat.

The Model Is an Active Record

Django’s use of the Active Record is slightly different to Rails. In Rails, the column types in the database control what attributes are on the model, but in Django, the Python object is the master. I think this is more meaningful, because it means that everything you need to know about an object is in the model definition: you don’t need to follow the migrations to see what attributes you have.

I do like the concept of a Domain Model: it’s an approach I’ve lightly tried in the past. Perhaps it is an avenue I’ll push down further at some point. In some ways, Django’s Form classes allow you to encapsulate this, but mostly business logic still lives on our Model classes.

Summary

So, how did Django do?

Pretty good, I’d say. The ones that were less successful either don’t really matter most of the time (primary key column is always called id, choices defined in the model), or you don’t really need to use them (Generic Relations, searching using LIKE %foo%, using raw SQL).

We do fall down a bit with files stored in the database, and fat models, but I would argue that those patterns work just fine, at least for me right now.

Trust your tools, or how django's ORM bested me

Within my system, there is a complicated set of rules for determining if a person is “inactive”.

They may have been explicitly marked as inactive, or their company may have been marked as inactive. These are simple to discover and filter to only get active people:

Person.objects.filter(active=True, company__active=True)

The other clause for inactive users is if they only work at locations that have been marked as inactive. This means we can disable a location (within a company that remains active), and not have to manually deactivate the staff who only work at that location; it also means when we reactivate a location, staff will automatically be restored to an active state.

I’ve written the code that determines the activity status several times, but have never really been that happy with it. It generally degenerates into something that uses N+1 queries to discover the activity status of N people, or requires using Django’s queryset.extra() method to run queries within the database.

Now, I have cause to fetch all active staff from the entire system. I had written a query to do this, but it was mistakenly including staff who are only active at inactive units. I tried playing around with .extra(select={...}), but was not able to filter on the pseudo-fields that were generated.

Then, I had the idea to do the following:

active = Location.objects.active()
inactive = Location.objects.inactive()
Person.objects.filter(
  Q(locations__in=active) | ~Q(locations__in=inactive)
)

As long as the objects active and inactive are querysets, they will be lazily evaluated, and the SQL that is generated is relatively concise:

SELECT ... 
FROM "people" 
LEFT OUTER JOIN "people_locations" 
ON ("people"."id" = "people_locations"."person_id") 
WHERE (
  "people_locations"."location_id" IN (
    SELECT U0."id" FROM "location" U0 WHERE U0."status" = 0
  )
  OR NOT ((
    "people"."id" IN (
      SELECT U1."person_id" FROM "people_locations" U1 WHERE (
        U1."location_id" IN (
          SELECT U0."id" FROM "location" U0 WHERE U0."status" = 1
        )
        AND U1."person_id" IS NOT NULL
      )
    ) 
    AND "people"."id" IS NOT NULL)
  )
)
ORDER BY "..." ASC

This is much better than how I had previously done it, and has the bonus of being db-agnostic, whereas my previous solution used Postgres ARRAY types to aggregate the statuses of locations into a list.
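
The rule that the filter expresses can be restated per person in plain Python, which makes the edge cases easier to see. This is only a sketch of the logic, with location ids held in sets; is_active is a made-up name and is not part of the actual code:

```python
def is_active(person_locations, active, inactive):
    """Mirror Q(locations__in=active) | ~Q(locations__in=inactive)."""
    works_at_active = bool(person_locations & active)
    works_at_inactive = bool(person_locations & inactive)
    # Active if at least one location is active, or none are inactive.
    return works_at_active or not works_at_inactive


# Hypothetical location ids, for illustration only.
active_ids = {1, 2}
inactive_ids = {3, 4}
mixed = is_active({1, 3}, active_ids, inactive_ids)
only_inactive = is_active({3, 4}, active_ids, inactive_ids)
no_locations = is_active(set(), active_ids, inactive_ids)
```

Someone with a mix of locations stays active and someone with only inactive locations does not. Under this reading, a person with no locations at all also counts as active, which appears to match the LEFT OUTER JOIN in the generated SQL.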

The moral of the story: trust your high-level abstraction tools, and use them first. If you still have performance issues, then look at optimising.