Versioning complex database migrations

Recently, I’ve been writing lots of raw SQL code that is either a complex VIEW, or a FUNCTION. Much of the time these will be used as the “source” for a Django model, but not always. Sometimes, there are complex functions that need to be run as a trigger in Postgres, or even a rule to execute when attempting a write operation on a view.

Anyway, these definitions are all code, and should be stored within the project they belong to. Using Django’s migrations you can apply them at the appropriate time, using a RunSQL operation.

However, you don’t really want to have the raw SQL in the migration file. Depending upon the text editor, it may not syntax highlight correctly, and finding the correct definition can be difficult.

Similarly, you don’t want to have just a single file per definition, because to recreate the database from the migration sequence, each migration needs to apply the version that was current at that point (otherwise, other migrations may fail to apply).

Some time ago, I adopted a policy of manually versioning these files. I have a naming pattern that seemed to be working well:

special_app/
  migrations/
    __init__.py
    0001_initial.py
    0002_update_functions.py
  sql/
    foo.function.0001.sql
    foo.function.0002.sql
    foo.trigger.0001.sql
    bar.view.0001.sql

The contents of the SQL files are irrelevant, and the migrations mostly so. There is a custom migration operation I wrote that loads the content from a file:

    LoadSQLScript('special_app', 'foo.function', 1)

The mechanics of how it works are not important.
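If you are curious, though, a simplified sketch of such an operation might look something like this (this is not the actual implementation; it just assumes the sql/ directory layout shown above):

import os

from django.apps import apps
from django.db import migrations


def LoadSQLScript(app_label, name, version):
    # Simplified sketch: find <app>/sql/<name>.<version>.sql and wrap it in a RunSQL operation.
    app_path = apps.get_app_config(app_label).path
    filename = os.path.join(app_path, 'sql', '{}.{:04d}.sql'.format(name, version))
    with open(filename) as script:
        return migrations.RunSQL(script.read())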

So, this had been working well for several months, but I had a nagging feeling that the workflow was not ideal. This came to a head when I recognised that doing code review on database function/view changes was next to impossible.

See, the problem is that there is a completely new file each time you create a new version of a definition, so a diff shows the whole definition as new, rather than just the lines that changed.

Instead, I’ve come up with a similar, but different solution. You still need to have versioned scripts for the purpose of historical migrations, but the key difference is that you don’t actually write these. Instead, you have something that notices that the “current” version of a definition is different to the latest version snapshot. You then also have a tool that copies the current version to a new snapshot, and creates a migration.

You can still modify a snapshot (for instance, if you’ve already created one, but it’s only in your current development tree), but mostly you don’t need to.
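The check itself can be an ordinary Django system check. A rough sketch of how it might work (again, not the actual implementation; it assumes each “current” definition lives in the app’s sql/ directory alongside its numbered snapshots):

import glob
import os

from django.apps import apps
from django.core import checks


@checks.register()
def check_sql_versions(app_configs=None, **kwargs):
    warnings = []
    for app in (app_configs or apps.get_app_configs()):
        sql_dir = os.path.join(app.path, 'sql')
        if not os.path.isdir(sql_dir):
            continue
        for current in glob.glob(os.path.join(sql_dir, '*.sql')):
            stem = os.path.splitext(current)[0]
            if stem.rpartition('.')[2].isdigit():
                # This is a numbered snapshot, not a "current" definition.
                continue
            snapshots = sorted(glob.glob(stem + '.*.sql'))
            with open(current) as fp:
                current_content = fp.read()
            latest_content = None
            if snapshots:
                with open(snapshots[-1]) as fp:
                    latest_content = fp.read()
            if current_content != latest_content:
                warnings.append(checks.Warning(
                    'The versioned migration file for {}: {} is out of date, '
                    'and needs to be updated or a new version created.'.format(
                        app.label, os.path.basename(current),
                    ),
                    id='sql_helpers.W002',
                ))
    return warnings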

$ ./manage.py check
System check identified some issues:

WARNINGS:
?: (sql_helpers.W002) The versioned migration file for core: iso8601.function.sql is out of date,
and needs to be updated or a new version created.

Oh, thanks for that. Checking the file, I see that it does indeed need a new version:

$ ./manage.py make_sql_migrations core
...
Copied <project>/core/sql/iso8601.function.sql to version 0002

You still need to ensure that any dependencies between SQL files are dealt with appropriately (for instance, a function that relies on a newly added column to a view needs to have that view’s definition updated before the function can be updated). But this is a much smaller problem, and something that your database should complain about when you try to apply the migrations.

I haven’t packaged this up yet; it’s currently only an internal app, as I want to use it a bit for the next little while and see what else shakes out. I was pretty sure the way I was doing it before was “the best way” just after I thought that up, so we’ll see what happens.

JavaScript Array Widget

I’ve been making more and more use of the django.contrib.postgres classes, and will often store data in an ArrayField where appropriate.

There are two form fields that are supplied with Django for handling these types: one of which has the array values in a single text input (comma separated), and the other has a different text input element for each value.

However, the latter does not really work that well with a dynamic length array (it could work with up to N items, but in my case there usually isn’t a fixed N).

It would be possible to build functionality similar to what you see with the Django Admin’s formset handling (like here), however this turns out to be lots of mucking around.

It might be simpler to just have the simple array field rendered, and have JS bolted on that builds the dynamic list of text inputs based on this.
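The Django side of that approach can stay tiny: something like the following (a hypothetical form; the array-widget class is just a hook for the script to find):

from django import forms
from django.contrib.postgres.forms import SimpleArrayField


class TagsForm(forms.Form):
    # Renders as a single comma-separated text input; the JS replaces it
    # with one text input per value.
    tags = SimpleArrayField(
        forms.CharField(max_length=100),
        required=False,
        widget=forms.TextInput(attrs={'class': 'array-widget'}),
    )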

In this instance, I am actually storing the state in the widgets themselves: this means it’s relatively easy to add in the ability to re-order. I’ve done this with the Sortable library.

Django Dynamic Formsets

Django forms are one of the most important parts of the stack: they enable us to write declarative code that will validate user input, and ensure we protect ourselves from malicious input.

Formsets are an extension of this: they deal with a set of homogeneous forms, and will ensure that all of the forms are valid independently (and possibly do some inter-form validation, but that’s a topic for a later day).

The Django Admin contains an implementation of a dynamic formset: that is, it handles adding and removing forms from a formset, and maintains the management form accordingly. This post details an alternative implementation.


A Formset contains a Form (and has zero or more instances of that Form). It also contains a “Management Form”, which has metadata about the formset: the number of instances of the form that were provided initially, the number that were submitted by the user, and the maximum number of forms that should be accepted.

A Formset has a “prefix”, which is prepended to each element within the management form:

<input type="hidden" name="prefix-INITIAL_FORM_COUNT" value="...">
<input type="hidden" name="prefix-TOTAL_FORM_COUNT" value="...">
<input type="hidden" name="prefix-MIN_NUM_FORM_COUNT" value="...">
<input type="hidden" name="prefix-MAX_NUM_FORM_COUNT" value="...">

Each Form within the Formset uses the prefix, plus its index within the list of forms. For instance, if we have a Formset that contains three forms, each containing a single “name” field, we would have something similar to:

<input type="text" name="prefix-0-name" value="Alice">
<input type="text" name="prefix-1-name" value="Bob">
<input type="text" name="prefix-2-name" value="Carol">

Note that the form’s prefix is <formset_prefix>-<form_index>.

To make a Formset dynamic, we just need to be able to add (and possibly remove, but there’s a little more complexity there) extra forms. The management form needs to be updated to reflect this, and we need to ensure that the new form’s fields are named appropriately.

A Formset also contains an empty_form. This is an unbound form, where the form’s “index” is set to __prefix__. Thus, the empty form for the above formset might look somewhat like:

<input type="text" name="prefix-__prefix__-name" value="">

We can leverage this to allow us to have simpler code: instead of having to duplicate elements and remove the value, we can just duplicate the empty form, and replace the string __prefix__ with whatever the index of the newly created form should be.

Here’s an implementation that has no dependencies, but does make some assumptions:

On Fences and Functions

I grew up on a farm.

We had fences on the farm.

Whilst the jobs associated with fences and fencing are less than fun, the fences themselves are extremely important. They keep the livestock in the correct location. When you have a damaged or incomplete fence, even if it is only damaged in a small way, it can cost significant amounts of money, or even human lives. This can range from keeping Rams away from a flock of Ewes that you don’t want them to mate with (because you need to know which Ram mated with which Ewes in order to track progeny), to livestock escaping onto a public road and causing accidents.

Fences are a good thing.


My first career was as a Design and Technology Teacher.

We use fences in woodwork. They are attachments to fixed power tools, such as drill presses and circular saws. They allow us to work safely and to get accurate, easily repeatable results. For instance, we can use a fence to cut sheets of MDF to exactly the same width, ensuring the bookcase we are making is square. Without a fence, it can still be done, but it will certainly be much harder.

Fences are a good thing.


I’d heard people describe Postgres’s CTEs (Common Table Expressions) as an “optimisation fence”. Given my previous uses of the word “fence”, I assumed that this was widely regarded as a good thing.

However, after spending some time writing really complex queries (that are most easily described using a CTE), I happened to read PostgreSQL’s CTEs are optimisation fences. It had become plain to me, through my work with Postgres, that each term in a CTE is materialised in full (if it is referenced at all), before any filtering in the outer query can be applied to it. Postgres is pretty good about pushing conditions down into a sub-query, but it will not do so across a CTE boundary, which can mean that a CTE performs worse, as it might have to do more work. This article points this out in some detail, and it occurred to me that perhaps some people see fences (in general) as an obstacle. Perhaps fencing something in has negative connotations?

I’m not sure that that’s exactly what the author meant (I wonder if it was sarcasm, perhaps), but it did get me thinking about how different backgrounds could result in opposite interpretations of the same terms.


I do want to veer back into more technical matters, and discuss how I have been working around the fact that filtering from the outer query cannot be pushed down into a CTE.

Largely, the issue exists in my code because I have lots of complex queries (as I just mentioned) that are far easier to write, reason about and debug when written using a CTE. I would like to write them as a VIEW, and then stick Django models in front of them, and I’d be able to query them using the ORM, and just have the view as the db_table of the model. It would be really nice.

But this doesn’t work, because some of the queries require aggregating data across models of which there are millions of rows, and some of the database tables are less than optimal. For instance, I have several tables that store an effective_from field; when a row is superseded, the combination of the other fields (person, for instance) tells us which row applies on a given date. Querying this requires a more complex query than if the valid period were stored in the table as a range (where a simple date <@ daterange test would do). I’ve learned from this in newer models, but some stuff is too deeply ingrained to be able to be changed just yet.
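For comparison, a newer-style table that stores the validity period directly makes this query trivial (a hypothetical model, not one of the actual tables):

import datetime

from django.contrib.postgres.fields import DateRangeField
from django.db import models


class Allocation(models.Model):
    # Hypothetical: the valid period is stored as a range, rather than just effective_from.
    person = models.ForeignKey('people.Person')
    valid_period = DateRangeField()


def current_allocations(person):
    # "Which rows apply today?" becomes a single date <@ daterange test.
    return Allocation.objects.filter(person=person, valid_period__contains=datetime.date.today())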

So, I have a VIEW that turns this into data that actually contains dateranges, and I can query against that. But, if I use this in a CTE, then it can materialise the whole lot, which can be slow. So, I needed to come up with a way to filter the data earlier.

Functions.

I’ve been writing SQL functions that take parameters, and then filter as early as possible. This then means that it’s a real possibility that we can get <100ms queries for stuff that is really, really complicated (and joins a couple of dozen or more tables in really funky ways). It does mean I can’t query using the Django ORM, but that’s okay: the data I’m getting back doesn’t necessarily map onto a model anyway, and we need to use it as a dict.

More recently, I’ve extended this so that the function (with the relevant parameters, extracted out of the queryset WHERE clauses) can be used as the db_table for a Model. It’s still somewhat hacky, but is very interesting, nonetheless.

Weekday Multi-Select Widget

I knocked together a little widget that allows for selecting multiple days of the week.

Multi-table Inheritance and the Django Admin

Django’s admin interface is a great way to be able to interact with your models without having to write any view code, and, within limits, it’s useful in production too. However, it can quickly get very crowded when you register lots of models.

Consider the situation where you are using Django’s multi-table inheritance:

from django.db import models

from model_utils.managers import InheritanceManager

class Sheep(models.Model):
    sheep_id = models.AutoField(primary_key=True)
    tag_id = models.CharField(max_length=32)
    date_of_birth = models.DateField()
    sire = models.ForeignKey('sheep.Ram', blank=True, null=True, related_name='progeny')
    dam = models.ForeignKey('sheep.Ewe', blank=True, null=True, related_name='progeny')

    objects = InheritanceManager()

    class Meta:
        verbose_name_plural = 'sheep'

    def __str__(self):
        return '{}: {}'.format(self._meta.verbose_name, self.tag_id)


class Ram(Sheep):
    sheep = models.OneToOneField('sheep.Sheep', parent_link=True, primary_key=True)

    class Meta:
        verbose_name = 'ram'
        verbose_name_plural = 'rams'


class Ewe(Sheep):
    sheep = models.OneToOneField('sheep.Sheep', parent_link=True, primary_key=True)

    class Meta:
        verbose_name = 'ewe'
        verbose_name_plural = 'ewes'

Ignore the fact there is no specialisation on those child models: in practice you’d normally have some.

Also note that I’ve manually included the primary key, and the parent link fields. This has been done so that the actual columns in the database match, and in this case will all be sheep_id. This will make writing joins slightly simpler, and avoids the (not specific to Django) ORM anti-pattern of “always have a column named id”.

We can use the models like this, but it might be nice to have all sheep in the one admin changelist, and just allow filtering by subclass model.

First, we’ll put some extra stuff onto the parent model, to make obtaining the subclasses simpler. Some of these will use a new decorator, which creates a class version of the @property decorator.

class classproperty(property):
    def __get__(self, cls, owner):
        return self.fget.__get__(None, owner)()


class Sheep(models.Model):
    # Fields, etc. defined as above.

    @classproperty
    @classmethod
    def SUBCLASS_OBJECT_CHOICES(cls):
        "All known subclasses, keyed by a unique name per class."
        return {
            rel.name: rel.related_model
            for rel in cls._meta.related_objects
            if rel.parent_link
        }

    @classproperty
    @classmethod
    def SUBCLASS_CHOICES(cls):
        "Available subclass choices, with nice names."
        return [
            (name, model._meta.verbose_name)
            for name, model in cls.SUBCLASS_OBJECT_CHOICES.items()
        ]

    @classmethod
    def SUBCLASS(cls, name):
        "Given a subclass name, return the subclass."
        return cls.SUBCLASS_OBJECT_CHOICES.get(name, cls)

Note that we don’t need to enumerate the subclasses: adding a new subclass later in development will automatically add it to these properties, even though in this case it would be unlikely to happen.
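To make that concrete, with the models above these evaluate to something like:

>>> Sheep.SUBCLASS_OBJECT_CHOICES
{'ram': <class 'sheep.models.Ram'>, 'ewe': <class 'sheep.models.Ewe'>}
>>> Sheep.SUBCLASS_CHOICES
[('ram', 'ram'), ('ewe', 'ewe')]
>>> Sheep.SUBCLASS('ewe')
<class 'sheep.models.Ewe'>
>>> Sheep.SUBCLASS('goat')  # unknown names fall back to the parent model
<class 'sheep.models.Sheep'>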

From these, we can write some nice neat stuff to enable using these in the admin.

from django import forms
from django.conf.urls import url
from django.contrib import admin
from django.utils.translation import ugettext as _

from .models import Sheep


class SubclassFilter(admin.SimpleListFilter):
    title = _('gender')
    parameter_name = 'gender'

    def lookups(self, request, model_admin):
        return Sheep.SUBCLASS_CHOICES

    def queryset(self, request, queryset):
        if self.value():
            return queryset.exclude(**{self.value(): None})
        return queryset


@admin.register(Sheep)
class SheepAdmin(admin.ModelAdmin):
    list_display = [
        'tag_id',
        'date_of_birth',
        'gender'
    ]
    list_filter = [SubclassFilter]

    def get_queryset(self, request):
        return super(SheepAdmin, self).get_queryset(request).select_subclasses()

    def gender(self, obj):
        return obj._meta.verbose_name

    def get_form(self, request, obj=None, **kwargs):
        if obj is None:
            Model = Sheep.SUBCLASS(request.GET.get('gender'))
        else:
            Model = obj.__class__

        # When we change the selected gender in the create form, we want to reload the page.
        RELOAD_PAGE = "window.location.search='?gender=' + this.value"
        # We should also grab all existing field values, and pass them as query string values.

        class ModelForm(forms.ModelForm):
            if not obj:
                gender = forms.ChoiceField(
                    choices=[('', _('Please select...'))] + Sheep.SUBCLASS_CHOICES,
                    widget=forms.Select(attrs={'onchange': RELOAD_PAGE})
                )

            class Meta:
                model = Model
                exclude = ()

        return ModelForm

    def get_fields(self, request, obj=None):
        # We want gender to be the first field.
        fields = super(SheepAdmin, self).get_fields(request, obj)

        if 'gender' in fields:
            fields.remove('gender')
            fields = ['gender'] + fields

        return fields

    def get_urls(self):
        # We want to install named urls that match the subclass ones, but bounce to the relevant
        # superclass ones (since they should be able to handle rendering the correct form).
        urls = super(SheepAdmin, self).get_urls()
        existing = '{}_{}_'.format(self.model._meta.app_label, self.model._meta.model_name)
        subclass_urls = []
        for name, model in Sheep.SUBCLASS_OBJECT_CHOICES.items():
            opts = model._meta
            replace = '{}_{}_'.format(opts.app_label, opts.model_name)
            subclass_urls.extend([
                url(pattern.regex.pattern, pattern.callback, name=pattern.name.replace(existing, replace))
                for pattern in urls if pattern.name
            ])

        return urls + subclass_urls

Wow. There’s quite a lot going on there, but the summary is:

  • We create a custom filter that filters according to subclass.
  • The .select_subclasses() means that objects are downcast to their subclass when fetched.
  • There is a custom form, that, when in create mode, has a selector for the desired subclass.
  • When the subclass is changed (only on the create form), the page is reloaded. This is required in a situation where there are different fields on each of the subclass models.
  • We register the subclass admin url paths, but use the superclass admin views.

I’ve had ideas about this for some time, and have just started using something like this in development: in my situation, there will be an arbitrary number of subclasses, all of which will have several new fields. The code in this page is extracted (and changed) from those ideas, so may not be completely correct. Corrections welcome.

(Directly) Testing Django Formsets

Django Forms are excellent: they offer a really nice API for validating user input. You can quite easily pass a dict of data instead of a QueryDict, which is what the request handling mechanism provides. This makes it trivial to write tests that exercise a given Form’s validation directly. For instance:

def test_my_form(self):
    form = MyForm({
        'foo': 'bar',
        'baz': 'qux'
    })
    self.assertFalse(form.is_valid())
    self.assertTrue('foo' in form.errors)

Formsets are also really nice: they expose a neat way to update a group of homogeneous objects. It’s possible to pass a list of dicts to the formset for the initial argument, but, alas, you may not do the same for passing data. Instead, it needs to be structured as the QueryDict would be:

def test_my_formset(self):
    formset = MyFormSet({
        'formset-INITIAL_FORMS': '0',
        'formset-TOTAL_FORMS': '2',
        'formset-0-foo': 'bar1',
        'formset-0-baz': 'qux1',
        'formset-1-foo': 'spam',
        'formset-1-baz': 'eggs'
    })
    self.assertTrue(formset.is_valid())

This is fine if you only have a couple of forms in your formset, but it’s a bit tiresome to have to include all of the prefixes, and it’s far noisier.

Here’s a nice little helper, that takes a FormSet class, and a list (of dicts), and instantiates the formset with the data coerced into the correct format:

def instantiate_formset(formset_class, data, instance=None, initial=None):
    prefix = formset_class().prefix
    formset_data = {}
    for i, form_data in enumerate(data):
        for name, value in form_data.items():
            if isinstance(value, list):
                for j, inner in enumerate(value):
                    formset_data['{}-{}-{}_{}'.format(prefix, i, name, j)] = inner
            else:
                formset_data['{}-{}-{}'.format(prefix, i, name)] = value
    formset_data['{}-TOTAL_FORMS'.format(prefix)] = len(data)
    formset_data['{}-INITIAL_FORMS'.format(prefix)] = 0

    if instance:
        return formset_class(formset_data, instance=instance, initial=initial)
    else:
        return formset_class(formset_data, initial=initial)

This handles a formset or a model formset. Much easier to use:

def test_my_formset(self):
    formset = instantiate_formset(MyFormSet, [
      {
        'foo': 'bar1',
        'baz': 'qux1',
      },
      {
        'foo': 'spam',
        'baz': 'eggs',
      },
    ])
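It handles inline (model) formsets too, since the instance is passed straight through; with a hypothetical AddressFormSet related to a Person:

def test_my_inline_formset(self):
    person = Person.objects.create(name='Alice')  # hypothetical related model
    formset = instantiate_formset(AddressFormSet, [
      {
        'street': '1 Example Street',
        'suburb': 'Somewhere',
      },
    ], instance=person)
    self.assertTrue(formset.is_valid())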

Using other Python versions with Codeship.

Codeship is pretty cool, other than their requirement to log in to view even public builds. They support Python to some extent, even going as far as creating and activating a virtualenv for your test environment.

However, I like to use tox to do matrix testing against packages, and try to cover as many cases as possible. For instance, for django-boardinghouse, I currently test against:

  • Python 2.7
  • Python 3.3
  • Python 3.4
  • Python 3.5
  • pypy
  • pypy3

…and Django 1.7 through 1.9. In most cases, each version of python should be tested with each version of django. In practice, there are some exceptions.

However, Codeship only have Python 2.7.6 and 3.4.0 installed.

You can run arbitrary code as part of your test/setup, but you can’t install stuff using sudo. Instead, I wrote a script that can be called from within the test setup that installs other pythons:

# We already have some versions of python, but want some more...
cd ~/src

mkdir -p pypy
cd pypy
wget https://bitbucket.org/squeaky/portable-pypy/downloads/pypy-5.0.1-linux_x86_64-portable.tar.bz2
tar --strip-components 1 -xvf pypy-5.0.1-linux_x86_64-portable.tar.bz2
cd ..

mkdir -p pypy3
cd pypy3
wget https://bitbucket.org/squeaky/portable-pypy/downloads/pypy3-2.4-linux_x86_64-portable.tar.bz2
tar --strip-components 1 -xvf pypy3-2.4-linux_x86_64-portable.tar.bz2
cd ..

mkdir -p ~/.local
wget https://www.python.org/ftp/python/3.5.1/Python-3.5.1.tar.xz
tar xvf Python-3.5.1.tar.xz
cd Python-3.5.1
./configure --prefix=/home/$USER/.local/
make
make install

# You actually need to put this line in the tests section. Not sure of a better solution.
export PATH=$PATH:~/src/pypy3/bin:~/src/pypy/bin:~/.local/bin/

I have this as a reusable snippet on BitBucket: codeship helper scripts; however, as mentioned, you need to grab the export PATH=... section and stick that in the tests section. Also notably, you get a different URL for the raw version of each revision, which is actually really good, because it means someone cannot change the code between you inspecting it and executing it.

In my case, I have a line in the test setup:

curl https://bitbucket.org/\!api/2.0/snippets/schinckel/oKXKy/c7cc02bcd96d4a8f444cd997d5c3bc0bb92106d6/files/install-python.sh | sh

Also of note is that pypy* have a pre-built version, which is much faster than building from source, however there doesn’t seem to be a non-rpm version of Python 3.5.

Token Input Widget

I spent a fair bit of time on the weekend tweaking a Token Input widget.

I really like the way I show/hide the editing field, using no JS at all. That’s not to say there isn’t quite a bit of JS, but I am happy to say it’s using no jQuery, Knockout or anything else.

Indeed, I first wrote it in Knockout, but I think this one is actually simpler (and has some behaviour I wasn’t able to obtain with my Knockout models).

Django Trees via Closure View

After writing up a method of using a Postgres View that generates a materialised path within the context of a Django model, I came across some queries of my data that were getting rather troublesome to write. It occurred to me that having a closure table would be useful. Specifically, I needed all of the descendants of a given set of nodes.

I couldn’t find an existing Postgres extension that will manage the closure table, and didn’t feel like writing my own implementation using triggers just yet. However, it occurred to me that I could use a similar trick to the recursive materialised path view. Thus, we have a Closure View.

We will start with the Django models:

class Node(models.Model):
    node_id = models.AutoField(primary_key=True)
    parent = models.ForeignKey('tree.Node', related_name='children', null=True, blank=True)

    descendants = models.ManyToManyField('tree.Node', related_name='ancestors', through='tree.Closure')

    class Meta:
        app_label = 'tree'


class Closure(models.Model):
    path = ArrayField(base_field=models.IntegerField(), primary_key=True)
    ancestor = models.ForeignKey('tree.Node', related_name='+')
    descendant = models.ForeignKey('tree.Node', related_name='+')
    depth = models.IntegerField()

    class Meta:
        app_label = 'tree'
        managed = False

You may notice I have a path column. I’m using this for the primary key, and it may turn out to be useful later.

Let’s have a look at the View:

CREATE RECURSIVE VIEW tree_closure(path, ancestor_id, descendant_id, depth) AS

SELECT ARRAY[node_id], node_id, node_id, 0 FROM tree_node

UNION ALL

SELECT parent_id || path, parent_id, descendant_id, depth + 1
FROM tree_node INNER JOIN tree_closure ON (ancestor_id = node_id)
WHERE parent_id IS NOT NULL;

This uses a recursive query. The first part builds the zero-depth self-reference rows, and the second part is the recursive term: for each row already in the result (including those added in previous iterations), where the ancestor node itself has a parent, it adds a row linking that parent to the same descendant, with the depth increased by one.
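Since the Closure model is unmanaged, the view itself still needs to be created somewhere: a RunSQL migration is the obvious home. A sketch (the dependency here is whichever migration created tree_node):

from django.db import migrations

CLOSURE_VIEW = """
CREATE RECURSIVE VIEW tree_closure(path, ancestor_id, descendant_id, depth) AS
SELECT ARRAY[node_id], node_id, node_id, 0 FROM tree_node
UNION ALL
SELECT parent_id || path, parent_id, descendant_id, depth + 1
FROM tree_node INNER JOIN tree_closure ON (ancestor_id = node_id)
WHERE parent_id IS NOT NULL;
"""


class Migration(migrations.Migration):
    dependencies = [('tree', '0001_initial')]

    operations = [
        migrations.RunSQL(CLOSURE_VIEW, reverse_sql='DROP VIEW tree_closure;'),
    ]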

Now, because we are using the in-built Django Many to Many features, we have some nice queries ready to go:

  • node.ancestors.all() : All ancestors of a given Node instance.
  • node.descendants.all() : All descendants of a given Node instance.
  • Node.objects.filter(ancestors=queryset) : All descendants of all nodes in a queryset.
  • Node.objects.filter(descendants=queryset) : All ancestors of all nodes in a queryset.

Of particular note are the bottom two: these are rather cumbersome to write in older versions of Django.