Using Postgres Composite Types in Django

Note: this post turned out to be far more complicated than I had hoped. I may write another one that deals with a less complicated type!

Postgres comes with a pretty large range of column types, and the ability to use these types in an ARRAY. There’s also JSON(B) and Hstore, which are useful for storing structured (but possibly varying) data. There is also a range of, well, range types.

However, sometimes you actually want to store data in a strict column, but that isn’t a simple scalar type, or one of the standard range types. Postgres allows you to define your own composite types.

There is a command CREATE TYPE that can be used to create an arbitrary type. There are four forms: for now we will just look at Composite Types.

We will create a Composite type that represents the opening hours for a store, or more specifically, the default opening hours. For instance, a store may have the following default opening hours:

+------------+--------+---------+
|    Day     |  Open  |  Close  |
+------------+--------+---------+
|  Monday    |  9 am  |  5 pm   |
|  Tuesday   |  9 am  |  5 pm   |
|  Wednesday |  9 am  |  5 pm   |
|  Thursday  |  9 am  |  9 pm   |
|  Friday    |  9 am  |  5 pm   |
|  Saturday  | 10 am  |  5 pm   |
|  Sunday    | 11 am  |  5 pm   |
+------------+--------+---------+

During the Christmas season this store may be open longer (perhaps even 24 hours). There may also be differences at Easter time, or other public holidays, where the store is closed, or closes early.

It would be nice to be able to store the default opening hours for a store, and then, when creating a week, use these to create concrete (TIMESTAMP) values for each day, which could be overridden on any given day.

There are a few ways we could model this. Postgres has no timerange type, so that’s out. We could create a RANGE type, or we could use (start-time, finish-time). But what about when a store is open after midnight, or for 24 hours? Storing this data implicitly is a real pain, because you need to always check to see if the finish time is less than (or equal to) the start time whenever doing anything. Trust me, this is not the best approach.

An alternative I’ve been toying with is (start-time, interval). You could limit it so that the interval’s maximum is '1 day', but not (from what I can tell) when you define the type. Anyway, the syntax for creating this type is:

CREATE TYPE opening_hours AS (
  start time,
  length interval
);
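To see why (start-time, interval) is easier to work with than (start-time, finish-time), here is a minimal Python sketch. The OpeningHours namedtuple and the finish helper are names made up for illustration; the namedtuple just mirrors the two attributes of the composite type.

```python
import datetime
from collections import namedtuple

# Hypothetical stand-in mirroring the composite type's two attributes.
OpeningHours = namedtuple('OpeningHours', ['start', 'length'])


def finish(hours, day):
    """Return the concrete closing datetime for `hours` on the given date."""
    opened = datetime.datetime.combine(day, hours.start)
    return opened + hours.length


thursday = OpeningHours(datetime.time(9, 0), datetime.timedelta(hours=12))
late_bar = OpeningHours(datetime.time(20, 0), datetime.timedelta(hours=6))

day = datetime.date(2014, 12, 18)
print(finish(thursday, day))  # closes later the same day
print(finish(late_bar, day))  # closes after midnight, on the following day
```

Because the length is stored explicitly, there is no need to compare the finish time against the start time to work out whether the store closes on the following day.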

As an aside, every table in the database also has an associated type (of the same name as the table).

Now, we have our type: we can use it in a table:

CREATE TABLE store (
  store_id SERIAL PRIMARY KEY,
  name TEXT
);

CREATE TABLE default_opening_hours (
  store_id INTEGER REFERENCES store (store_id),
  monday opening_hours,
  tuesday opening_hours,
  wednesday opening_hours,
  thursday opening_hours,
  friday opening_hours,
  saturday opening_hours,
  sunday opening_hours
);

An alternative way of storing this information might be to use an array of opening_hours, directly on the store model. We’ll use this one instead, as it’s a little neater (and means we will look at how to use opening_hours[] later too).

CREATE TABLE store (
  store_id SERIAL PRIMARY KEY,
  name TEXT,
  default_opening_hours opening_hours[7]
);

Now, we can put data in there:

INSERT INTO store (name, default_opening_hours) VALUES
(
  'John Martins',
  ARRAY[
    ('09:00', '08:00')::opening_hours,
    ('09:00', '08:00')::opening_hours,
    ('09:00', '08:00')::opening_hours,
    ('09:00', '12:00')::opening_hours,
    ('09:00', '08:00')::opening_hours,
    ('10:00', '07:00')::opening_hours,
    ('11:00', '06:00')::opening_hours
  ]
);

Note how we need to cast all of the values from record to opening_hours.


In practice, we would probably also want to have some type of restriction where the opening time from one day, plus the default open hours is less than or equal to the starting time on the next day. I’m still not sure of the best way to do this in Postgres, but it is possible to do it in Django.


Speaking of Django, we want to be able to access this data type there. We can leverage a really nice feature of Psycopg2 to have these values automatically turned into a Python namedtuple. We do this by registering the type within Psycopg2, using the Django cursor.

from django.db import connection
from psycopg2.extras import register_composite

register_composite('opening_hours', connection.cursor().cursor)

But, this is only half of the pattern. We also need to register an adapter so that values going back the other way are also automatically cast into opening_hours.

from django.db import connection
from psycopg2.extras import register_composite
from psycopg2.extensions import register_adapter, adapt, AsIs

# Get a reference to the namedtuple class
OpeningHours = register_composite(
  'opening_hours',
  connection.cursor().cursor,
  globally=True
).type

def adapt_opening_hours(value):
  return AsIs("(%s, %s)::opening_hours" % (
    adapt(value.start).getquoted(),
    adapt(value.length).getquoted()
  ))

register_adapter(OpeningHours, adapt_opening_hours)

Now, we can fetch data from the database, and know that we will get OpeningHours instances, and, when passing an OpeningHours instance back to the database, know it will be converted into the correct type.

Obviously, in order to do this, the type must exist in the database. We did that manually in this case; in a real situation you would want to do that as a database migration. And that is where things get tricky: you can’t run the register_composite function until the type exists in the database, because it looks the type up in order to build the namedtuple. I did come up with a relatively neat workaround for this when writing a framework for generic Composite fields. The registration of the composite type attempts to execute, and if it fails, it stores the data for later registration. The actual migration operation then fires off a signal, which is handled by a listener that performs the registration.

The final piece of the puzzle is the Django Field subclass, which is actually not that complicated. In essence, we are relying on Psycopg to handle the adaptation in both directions, so it can be a bare field (perhaps with a formfield method to get a custom form field). In practice, I wrote the generic CompositeField subclass, which uses some metaclass magic to handle the late registration:

from django.db.models import fields
from django.db import connection
from django.dispatch import receiver, Signal

from psycopg2.extras import register_composite
from psycopg2.extensions import register_adapter, adapt, AsIs
from psycopg2 import ProgrammingError


_missing_types = {}

class CompositeMeta(type):
    def __init__(cls, name, bases, clsdict):
        super(CompositeMeta, cls).__init__(name, bases, clsdict)
        cls.register_composite()

    def register_composite(cls):
        db_type = cls().db_type(connection)
        if db_type:
            try:
                cls.python_type = register_composite(
                    db_type,
                    connection.cursor().cursor,
                    globally=True
                ).type
            except ProgrammingError:
                _missing_types[db_type] = cls
            else:
                def adapt_composite(composite):
                    return AsIs("(%s)::%s" % (
                        ", ".join([
                            adapt(getattr(composite, field)).getquoted() for field in composite._fields
                        ]), db_type
                    ))

                register_adapter(cls.python_type, adapt_composite)


class CompositeField(fields.Field):
    """
    A handy base class for defining your own composite fields.

    It registers the composite type.
    """
    # Python 2 metaclass syntax, as used throughout this post.
    __metaclass__ = CompositeMeta


composite_type_created = Signal(providing_args=['db_type'])

@receiver(composite_type_created)
def register_composite_late(sender, db_type, **kwargs):
    _missing_types.pop(db_type).register_composite()

We also want to have a custom migration operation:

from django.db.migrations.operations.base import Operation

# Or wherever the code above is located.
from .fields.composite import composite_type_created


class CreateCompositeType(Operation):
    def __init__(self, name=None, fields=None):
        self.name = name
        self.fields = fields

    @property
    def reversible(self):
        return True

    def state_forwards(self, app_label, state):
        pass

    def database_forwards(self, app_label, schema_editor, from_state, to_state):
        schema_editor.execute('CREATE TYPE %s AS (%s)' % (
            self.name, ", ".join(["%s %s" % field for field in self.fields])
        ))
        composite_type_created.send(sender=self.__class__, db_type=self.name)

    def state_backwards(self, app_label, state):
        pass

    def database_backwards(self, app_label, schema_editor, from_state, to_state):
        schema_editor.execute('DROP TYPE %s' % self.name)

This is a bit manual, however. You need to create your own migration that creates the composite type, and then begin to use the field.

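# migrations/XXXX_create_opening_hours.py
from django.db import migrations

# CreateCompositeType is the operation defined above: import it from
# wherever you have put it.


class Migration(migrations.Migration):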
    dependencies = []

    operations = [
        CreateCompositeType(
            name='opening_hours',
            fields=[
                ('start', 'time'),
                ('length', 'interval')
            ],
        ),
    ]
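To make it clear what that operation sends to the database, here is the same string formatting pulled out into a standalone function. The create_type_sql name is made up for this sketch; the formatting mirrors database_forwards above.

```python
def create_type_sql(name, fields):
    """Build the CREATE TYPE statement the same way database_forwards does."""
    columns = ", ".join("%s %s" % field for field in fields)
    return "CREATE TYPE %s AS (%s)" % (name, columns)


sql = create_type_sql(
    'opening_hours',
    [('start', 'time'), ('length', 'interval')],
)
print(sql)  # CREATE TYPE opening_hours AS (start time, length interval)
```

Note that the names are interpolated directly into the SQL, so this should only ever be fed trusted values from your own migration files.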

The place this pattern falls down is that this migration must be manually created: we don’t have any way to automatically create the migration from the Field subclass, which just looks like:

class OpeningHoursField(CompositeField):

    def db_type(self, connection):
        return 'opening_hours'

    def formfield(self, **kwargs):
        defaults = {
            'form_class': OpeningHoursFormField
        }
        defaults.update(**kwargs)
        return super(OpeningHoursField, self).formfield(**defaults)

I think in the future I’ll attempt to use further metaclass magic to allow defining the fields of the Composite type. This could then be used to automatically create a form field (which is a subclass of forms.MultiValueField).

class OpeningHoursField(CompositeField):
    start = models.TimeField()
    length = IntervalField()

    def db_type(self, connection):
        return 'opening_hours'

However, in the meantime, we can still get by. I’m not sure it’s going to be possible to inject extra operations into the migration based upon the field types anyway.

Finally, we can use this in a model:

class Store(models.Model):
    store_id = models.AutoField(primary_key=True)
    name = models.CharField(max_length=128)
    default_opening_hours = ArrayField(
        base_field=OpeningHoursField(null=True, blank=True),
        size=7
    )

I’ve used the ArrayField from django.contrib.postgres, purely for illustration purposes.

The CompositeField and associated operation are part of my django-postgres project: once I have worked out some more kinks, I may submit a pull request to django.contrib.postgres, unless someone else beats me to it.

Oh, and a juicy little extra. Above I mentioned something about preventing overlaps. The logic I use in my form is:

import datetime

from django import forms
from django.utils.translation import string_concat, ugettext_lazy as _

import postgres.forms

from .fields import OpeningHoursFormField
from .models import Store


def finish(obj):
    "Given an OpeningHours value, get the finish time"
    date = datetime.date(1, 1, 1)
    return (datetime.datetime.combine(date, obj.start) + obj.duration).time()


class StoreForm(forms.ModelForm):
    OVERLAPS_PREVIOUS = _('Open hours overlap previous day.')

    default_opening_hours = postgres.forms.SplitArrayField(
        base_field=OpeningHoursFormField(required=False),
        size=7,
    )

    class Meta:
        model = Store

    def clean_default_opening_hours(self):
        opening_hours = self.cleaned_data['default_opening_hours']
        field = self.fields['default_opening_hours']

        # Ensure consecutive days do not overlap.
        errors = []

        for i in range(7):
            today = opening_hours[i]
            if today.start is None or today.duration is None:
                continue

            yesterday = opening_hours[(i + 6) % 7]

            if yesterday.start is None or yesterday.duration is None:
                continue

            if finish(yesterday) <= yesterday.start:
                if today.start < finish(yesterday):
                    errors.append(forms.ValidationError(
                        string_concat(
                          field.error_messages['item_invalid'],
                          self.OVERLAPS_PREVIOUS
                        ),
                        code='item_invalid',
                        params={'nth': i}
                    ))

        if errors:
            raise forms.ValidationError(errors)

        return opening_hours

I’m currently not displaying the duration/length: I dynamically calculate it based on the entered start/finish pair, but that’s getting quite complicated.
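The overlap test itself does not depend on Django, so it can be sketched (and tested) as a plain function. This is a simplified version under a few assumptions: the values are namedtuples with start and length attributes (matching the composite type, rather than the duration attribute used in the form above), and a closed day is represented by None. The overlapping_days name is made up.

```python
import datetime
from collections import namedtuple

OpeningHours = namedtuple('OpeningHours', ['start', 'length'])


def finish(hours):
    """Closing time of day, ignoring which day it falls on."""
    anchor = datetime.datetime.combine(datetime.date(1, 1, 1), hours.start)
    return (anchor + hours.length).time()


def overlapping_days(week):
    """Return indexes of days that open before the previous day has closed."""
    clashes = []
    for i, today in enumerate(week):
        yesterday = week[i - 1]  # index -1 wraps around to the last day
        if today is None or yesterday is None:
            continue
        # Only a day that runs past midnight can overlap the next one.
        runs_past_midnight = finish(yesterday) <= yesterday.start
        if runs_past_midnight and today.start < finish(yesterday):
            clashes.append(i)
    return clashes


regular = [OpeningHours(datetime.time(9), datetime.timedelta(hours=8))] * 7

late = list(regular)
# Monday opens at 10 pm for six hours, so it closes at 4 am on Tuesday...
late[0] = OpeningHours(datetime.time(22), datetime.timedelta(hours=6))
# ...but Tuesday opens at 3 am.
late[1] = OpeningHours(datetime.time(3), datetime.timedelta(hours=8))

print(overlapping_days(regular))  # []
print(overlapping_days(late))     # [1]
```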

Long Live Adjacency Lists

I recently wrote about the excellent book SQL Antipatterns, and in it briefly discussed the tree structures. I’ve been thinking about trees in Postgres a fair bit lately, and a discussion on #django gave me further incentive to revisit this topic.

The book discusses four methods of storing a tree in a database.

Adjacency Lists, apart from the inability to grab a full or partial tree easily, are the simplest to understand. The child object stores a reference to its parent. Because this is a foreign key, it always maintains referential integrity. Fetching a parent is simple, as is fetching all children, or siblings. It’s only when you need to fetch an arbitrary depth that things become problematic, unless you use a recursive query. More on that later.

Postgres has an extension called ltree, which provides an implementation of a Path Enumeration, but one thing that really bothers me about this type of structure is the lack of referential integrity. In practice, I’m not sure what having this ltree structure would give you over simply storing the keys in an ARRAY type. Indeed, if Postgres ever gets Foreign Key constraints for ARRAY elements (which there is a patch floating around for), this becomes even more compelling. It also seems to me that restructuring a tree becomes a bit more challenging in a Path Enumeration than an Adjacency List.

Nested Sets are also interesting, and maintain FK integrity, but require potentially rewriting lots of data when any change is made to the tree. They aren’t that appealing to me: perhaps I fail to see any big advantages of this structure.

Finally, Closure Tables are perhaps the most interesting. This stores all ancestor-descendant relationships, rather than just parent-child, which again requires more work when adding or removing. Again, Referential Integrity is preserved, but it seems like there is lots of work to maintain them.

From all of these, there are some significant advantages, in my mind, to using a simple Adjacency List.

  1. Adding a new row never requires you to alter any other rows in the database.
  2. Moving a subtree to a different location only requires a change to one row in the database.
  3. It’s never possible to end up with Referential Integrity errors: the database will prevent you from deleting a parent row whilst it still has children (or, you may set it to CASCADE or SET NULL the children automatically).
  4. It’s conceptually very simple. Everyone understands the parent-child relationship (and all of the relationships that follow, like grand-parents). It’s a similar mental model to how we think about our own families, except we don’t have exactly one parent.

There are really only two things that are hard to do:

  1. Given a node, select all descendants of that node.
  2. Given a node, select all ancestors of that node.

But, as we shall see shortly, it is possible to do these in Postgres using some nice recursive features.

There is another advantage to using an Adjacency List, this time from the perspective of Django. We can do it without needing to install a new package, or subclass or mix-in a new Model:

class Node(models.Model):
    node_id = models.AutoField(primary_key=True)
    parent = models.ForeignKey('self', null=True, blank=True, related_name='children')

That’s it.

Now, using Postgres, it’s possible to build a recursive VIEW that contains the whole tree:

CREATE RECURSIVE VIEW tree (node_id, ancestors) AS (
    SELECT node_id, '{}'::integer[]
    FROM nodes WHERE parent_id IS NULL
  UNION ALL
    SELECT n.node_id, t.ancestors || n.parent_id
    FROM nodes n, tree t
    WHERE n.parent_id = t.node_id
);

We can then query this (replacing %s with the parent node id):

SELECT node_id
FROM nodes INNER JOIN tree USING (node_id)
WHERE %s = ANY(ancestors);

Or, if you want to select for multiple parents:

SELECT node_id
FROM nodes INNER JOIN tree USING (node_id)
WHERE ARRAY[%s, %s] && ancestors;

This actually performs relatively well, and, if it doesn’t do well enough, we could create a MATERIALIZED VIEW based on the recursive view, and query that instead (refreshing it whenever we need to, perhaps using a trigger).

CREATE MATERIALIZED VIEW tree_m AS (SELECT * FROM tree);

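CREATE FUNCTION refresh_tree_m() RETURNS trigger AS $$
  BEGIN
    REFRESH MATERIALIZED VIEW tree_m;
    -- The return value of a statement-level AFTER trigger is ignored,
    -- but the function must still return.
    RETURN NULL;
  END;
$$ LANGUAGE plpgsql;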

CREATE TRIGGER trig_refresh_tree_m AFTER TRUNCATE OR INSERT OR UPDATE OR DELETE
ON nodes FOR EACH STATEMENT
EXECUTE PROCEDURE refresh_tree_m();

This view is still not perfect though. We can improve it to allow us to limit depth of ancestry:

CREATE RECURSIVE VIEW tree (node_id, ancestors, depth) AS (
    SELECT node_id, '{}'::integer[], 0
    FROM nodes WHERE parent_id IS NULL
  UNION ALL
    SELECT n.node_id, t.ancestors || n.parent_id, t.depth + 1
    FROM nodes n, tree t
    WHERE n.parent_id = t.node_id
);

SELECT node_id FROM nodes INNER JOIN tree USING (node_id)
WHERE %s = ANY(ancestors) AND depth < %s;

This is pretty good now, but if we have cycles in our tree (yes, this makes it technically no longer a tree, but a graph, of which a tree is a restricted kind), this query will run forever. There’s a pretty neat trick to prevent cycles:

CREATE RECURSIVE VIEW tree (node_id, ancestors, depth, cycle) AS (
    SELECT node_id, '{}'::integer[], 0, FALSE
    FROM nodes WHERE parent_id IS NULL
  UNION ALL
    SELECT
      n.node_id, t.ancestors || n.parent_id, t.depth + 1,
      n.parent_id = ANY(t.ancestors)
    FROM nodes n, tree t
    WHERE n.parent_id = t.node_id
    AND NOT t.cycle
);

You don’t need to use the cycle column outside of the view.
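If it helps to see what rows the view produces, here is a small Python emulation of the way the recursive query is evaluated, level by level. It is only a sketch: the tree_rows function and the dict-of-parents input are made up for illustration, and Postgres does not literally work this way internally.

```python
def tree_rows(parents):
    """Emulate the recursive view over a {node_id: parent_id} mapping.

    Returns (node_id, ancestors, depth, cycle) tuples, like the view's rows.
    """
    # Non-recursive term: the root nodes.
    frontier = [
        (node, [], 0, False)
        for node, parent in sorted(parents.items())
        if parent is None
    ]
    rows = []
    while frontier:
        rows.extend(frontier)
        next_frontier = []
        for node_id, ancestors, depth, cycle in frontier:
            if cycle:
                continue  # AND NOT t.cycle: never expand a looping row
            for child, parent in sorted(parents.items()):
                if parent == node_id:
                    next_frontier.append((
                        child,
                        ancestors + [parent],   # t.ancestors || n.parent_id
                        depth + 1,              # t.depth + 1
                        parent in ancestors,    # n.parent_id = ANY(t.ancestors)
                    ))
        frontier = next_frontier
    return rows


# 1 is the root; 2 and 3 are its children; 4 is a child of 2.
rows = tree_rows({1: None, 2: 1, 3: 1, 4: 2})
for row in rows:
    print(row)
```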

The query used for the view can be repurposed into a Common Table Expression, which is basically a way of defining a view that only exists for the query we are executing (but will itself only be executed once, even if it’s referred to lots of times):

WITH RECURSIVE tree (node_id, ancestors, depth, cycle) AS (
    SELECT node_id, '{}'::integer[], 0, FALSE
    FROM nodes WHERE parent_id IS NULL
  UNION ALL
    SELECT
      n.node_id, t.ancestors || n.parent_id, t.depth + 1,
      n.parent_id = ANY(t.ancestors)
    FROM nodes n, tree t
    WHERE n.parent_id = t.node_id
    AND NOT t.cycle
) SELECT n.* FROM nodes n INNER JOIN tree USING (node_id)
WHERE %s = ANY(ancestors);

You can see that this syntax basically defines the view before running the real query.
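If you want to play with recursive CTEs without a Postgres server handy, Python’s built-in sqlite3 module supports WITH RECURSIVE too. The sketch below is a stand-in rather than the Postgres query above: SQLite has no array type, so instead of building an ancestors array it walks down (or up) from a given node. The table layout matches the nodes table used above; the function names are made up.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE nodes ('
    '  node_id INTEGER PRIMARY KEY,'
    '  parent_id INTEGER REFERENCES nodes (node_id))'
)
# 1 is the root; 2 and 3 are its children; 4 is a child of 2.
conn.executemany(
    'INSERT INTO nodes VALUES (?, ?)',
    [(1, None), (2, 1), (3, 1), (4, 2)],
)

DESCENDANTS = """
WITH RECURSIVE descendants (node_id) AS (
    SELECT node_id FROM nodes WHERE parent_id = ?
  UNION ALL
    SELECT n.node_id FROM nodes n
    JOIN descendants d ON n.parent_id = d.node_id
) SELECT node_id FROM descendants ORDER BY node_id
"""

ANCESTORS = """
WITH RECURSIVE ancestors (node_id) AS (
    SELECT parent_id FROM nodes
    WHERE node_id = ? AND parent_id IS NOT NULL
  UNION ALL
    SELECT n.parent_id FROM nodes n
    JOIN ancestors a ON n.node_id = a.node_id
    WHERE n.parent_id IS NOT NULL
) SELECT node_id FROM ancestors ORDER BY node_id
"""


def descendants(node_id):
    return [row[0] for row in conn.execute(DESCENDANTS, (node_id,))]


def ancestors(node_id):
    return [row[0] for row in conn.execute(ANCESTORS, (node_id,))]


print(descendants(1))  # [2, 3, 4]
print(ancestors(4))    # [1, 2]
```

These two queries cover the two "hard" operations for an Adjacency List: all descendants of a node, and all ancestors of a node.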


Looking at it from the perspective of Django, we would like to be able to spell a query something like:

Node.objects.filter(parent__recursive=node)
Node.objects.filter(parent__recursive__in=nodes)
Node.objects.filter(children__recursive__contains=node)

The problem we have with using the CTE immediately above is that we don’t have access to the full query at the time we are dealing with the filter. We could define the view prior to running the query (perhaps in a migration), but this means it’s more than just a simple field: although with the new migrations framework, we could make it so that makemigrations automatically adds a migration operation to create the recursive view.

The other solution is to still use a recursive CTE, but use it as a subquery. I’m still investigating if this will have poor performance characteristics.

Here is an implementation of doing just that:

from django.db import models

SQL = """
WITH RECURSIVE "tree" ("{pk}", "related", "cycle") AS (
    SELECT "{pk}", ARRAY[]::integer[], FALSE
    FROM "{table}" WHERE "{fk}" IS NULL
  UNION ALL
    SELECT a."{pk}", b."related" || a."{fk}", a."{fk}" = ANY(b."related")
    FROM "tree" b, "{table}" a
    WHERE a."{fk}" = b."{pk}" AND NOT b."cycle"
) {query}
"""


class RecursiveRelation(models.ForeignKey):
    def __init__(self, *args, **kwargs):
        super(RecursiveRelation, self).__init__('self', *args, **kwargs)

    def get_lookup_constraint(self, constraint_class, alias, targets, sources, lookups,
                              raw_value):
        if lookups[0] == 'recursive':
            # With a recursive query, we want to build up a subquery that creates
            # the simplest possible tree we can deal with.
            data = {
                'fk': self.get_attname(),
                'pk': self.related_fields[0][1].get_attname(),
                'table': self.model._meta.db_table
            }
            if lookups[-1] == 'in':
                if targets[0] == self:
                    raw_value = ForeignKeyRecursiveInLookup(raw_value, **data)
                else:
                    raw_value = ForeignKeyRecursiveReverseInLookup(raw_value, **data)
            else:
                if targets[0] == self:
                    raw_value = ForeignKeyRecursiveLookup(raw_value, **data)
                else:
                    raw_value = ForeignKeyRecursiveReverseLookup(raw_value, **data)

            # Rewrite some variables so we get correct behaviour.

            # This makes the query based on the original table, not the joined version,
            # which was skipping a level of relation. It still joins the table, however,
            # which can't be great for performance
            alias = self.model._meta.db_table
            # This sets the correct lookup type, removing the recursive bit.
            lookups = lookups[1:] or ['exact']

        return super(RecursiveRelation, self).get_lookup_constraint(
            constraint_class, alias, targets, sources, lookups, raw_value
        )


class ForeignKeyRecursiveLookup(object):
    query = 'SELECT "{pk}" FROM "tree" WHERE %s = ANY("related")'

    def __init__(self, value, **kwargs):
        self.value = value
        self.data = kwargs

    def get_compiler(self, *args, **kwargs):
        return self

    def as_subquery_condition(self, alias, columns, qn):
        sql = SQL.format(
            query=self.query.format(**self.data),
            **self.data
        )
        return '%s.%s IN (%s)' % (qn(alias), qn(self.data['pk']), sql), [self.value]


class ForeignKeyRecursiveInLookup(ForeignKeyRecursiveLookup):
    query = 'SELECT "{pk}" FROM "tree" WHERE %s && "related"'


class ForeignKeyRecursiveReverseLookup(ForeignKeyRecursiveLookup):
    query = 'SELECT unnest("related") FROM "tree" WHERE "{pk}" = %s'


class ForeignKeyRecursiveReverseInLookup(ForeignKeyRecursiveLookup):
    query = 'SELECT unnest("related") FROM "tree" WHERE "{pk}" IN %s'
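To see what actually ends up in the WHERE clause, here is the same string composition done by hand, outside of Django. The table and column names (app_node, node_id, parent_id) are assumptions for the sake of the example; in the real code they come from the model’s metadata, and the quoting is done by the qn callable.

```python
SQL = """
WITH RECURSIVE "tree" ("{pk}", "related", "cycle") AS (
    SELECT "{pk}", ARRAY[]::integer[], FALSE
    FROM "{table}" WHERE "{fk}" IS NULL
  UNION ALL
    SELECT a."{pk}", b."related" || a."{fk}", a."{fk}" = ANY(b."related")
    FROM "tree" b, "{table}" a
    WHERE a."{fk}" = b."{pk}" AND NOT b."cycle"
) {query}
"""

# The query template from ForeignKeyRecursiveLookup.
query = 'SELECT "{pk}" FROM "tree" WHERE %s = ANY("related")'

# Assumed names: Django would supply these from the model.
data = {'pk': 'node_id', 'fk': 'parent_id', 'table': 'app_node'}

subquery = SQL.format(query=query.format(**data), **data)
condition = '"%s"."%s" IN (%s)' % (data['table'], data['pk'], subquery)
print(condition)
```

The single %s that survives in the output is the placeholder for the node id, which is passed to the database separately as a query parameter.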

If we were to use an existing view (created using a migration), then the structure would be largely the same: simply the SQL constant would be simpler:

SQL = 'SELECT {pk} FROM "{table}_{fk}_tree" WHERE {where}'

But then we would need some sort of name mangling for the view: I’ve suggested <tablename>_<fk-name>_tree.

I went into this exercise thinking it would be simple: just write a Lookup (or Transform), but it seems that Foreign Keys in Django have a fair bit of special casing. There’s also a bit of lax code around the names of lookups: I may polish it up at some stage.

For now, though, you use it as:

class Node(models.Model):
    node_id = models.AutoField(primary_key=True)
    parent = RecursiveRelation(null=True, blank=True, related_name='children')

Postgres VIEW meet Django Model

Postgres VIEWs are a nice way to expose a subset of a table in a way that can itself be queried, or to slightly (or radically) change the shape of your table. The syntax is fairly simple:

CREATE VIEW "foo" AS
SELECT "bar", "baz", "qux"
FROM "corge"
WHERE "grault" IS NULL;

You may use any valid SELECT query as the source of a VIEW, including one that contains UNION or UNION ALL. You can use this form to create a view that takes two similarly formatted tables and combines them into one logical table. Note that for a UNION to work, both parts of the query must return the same number of columns, with compatible types in the same order. A UNION will do extra work to remove duplicate rows: UNION ALL may perform better, especially if you know your rows will already be unique (or you need duplicates).
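The difference between the two is easy to see with a tiny example. This sketch uses Python’s built-in sqlite3 module as a stand-in for Postgres (the behaviour of UNION and UNION ALL is the same here); the table and view names are made up.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE current_staff (name TEXT)')
conn.execute('CREATE TABLE former_staff (name TEXT)')
conn.executemany('INSERT INTO current_staff VALUES (?)',
                 [('alice',), ('bob',)])
# bob appears in both tables.
conn.executemany('INSERT INTO former_staff VALUES (?)',
                 [('bob',), ('carol',)])

conn.execute(
    'CREATE VIEW staff_distinct AS '
    'SELECT name FROM current_staff UNION SELECT name FROM former_staff'
)
conn.execute(
    'CREATE VIEW staff_all AS '
    'SELECT name FROM current_staff UNION ALL SELECT name FROM former_staff'
)

distinct_count = conn.execute(
    'SELECT COUNT(*) FROM staff_distinct').fetchone()[0]
all_count = conn.execute('SELECT COUNT(*) FROM staff_all').fetchone()[0]

print(distinct_count)  # 3: the duplicate bob has been removed
print(all_count)       # 4: every row from both tables
```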

By default, a Postgres VIEW is dynamic, and read-only. With the use of the CREATE MATERIALIZED VIEW form, it’s possible to have a cached copy stored on disk, which requires a REFRESH MATERIALIZED VIEW "viewname" in order to cause an update.

It’s also possible to create a writeable VIEW, but I’m not going to discuss those now.


There is a feature of Django that makes it really simple to use a VIEW as the read-only source of a Django Model subclass: managed = False.

Given the VIEW defined above, we can write a Model that will allow us to query it:

from django.db import models

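class Foo(models.Model):
    # CharField requires max_length; 255 is an arbitrary choice here.
    bar = models.CharField(max_length=255)
    baz = models.CharField(max_length=255)
    qux = models.CharField(max_length=255)

    class Meta:
        managed = False
        # Point the model at the view, rather than the default <app>_foo.
        db_table = 'foo'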

Psycopg2 also has the ability to automatically convert values as it fetches them, so you don’t even really need to set the fields as the correct type: but you will probably want to where possible, as an aid to code readability.

In my case, I was returning a two-dimensional ARRAY of TIMESTAMPTZ, but didn’t want to have to include the full code for an ArrayField. So, I just defined it as a CharField, and psycopg2 just gave me the type of object I actually wanted anyway.

There is one little catch, and the code above will not quite work.

Django requires a primary key, even though in this case it makes no sense. You could define any field as a primary key, include a relevant key field from the parent model, or even a dummy value that is the same on every row. Relying on the same psycopg2 trick as above, you could use <tablename>-<id> so as to ensure uniqueness, even though that is not normally a valid value for a Django AutoField.

You probably need to be a little careful here, as if you are doing comparisons, Django will test __class__ and pk for testing equality, so you could hurt yourself if you aren’t careful.

You may also want to prevent write access at the Django level. Overriding save() and delete() on the Model class would be a good start, as well as writing a custom Manager/QuerySet that does the same. You could raise an exception that makes sense, like NotImplementedError, or just leave it as a database error.
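One way to package up the model-level part of that is a small mixin. This is only a sketch: ReadOnlyMixin is a made-up name, and it is shown here on a plain Python class so it can run without Django. With Django you would list it before models.Model in the bases, so that its save() and delete() win.

```python
class ReadOnlyMixin(object):
    """Refuse to save or delete instances backed by a read-only VIEW."""

    def _refuse(self, action):
        raise NotImplementedError(
            'Cannot %s %s: it is backed by a read-only VIEW.'
            % (action, type(self).__name__)
        )

    def save(self, *args, **kwargs):
        self._refuse('save')

    def delete(self, *args, **kwargs):
        self._refuse('delete')


# Plain stand-in for a model class, purely for demonstration.
class FakeFoo(ReadOnlyMixin):
    pass


try:
    FakeFoo().save()
except NotImplementedError as exc:
    message = str(exc)

print(message)  # Cannot save FakeFoo: it is backed by a read-only VIEW.
```

This does nothing about QuerySet.update() or QuerySet.delete(), which bypass the model methods: that is what the custom Manager/QuerySet would be for.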

Avoiding SQL Antipatterns using Django (and Postgres)

The book SQL Antipatterns is one of my favourite books. I took the opportunity to reread it on a trip to Xerocon in Sydney, and as usual it enlightened me to things I am probably doing wrong in my database interactions.

So, I’m going to look at these Antipatterns, and discuss how you can avoid them when using Django. This post is intended to be read with each chapter of the book. I’ve used the section headings, but instead of the chapter headings, I’ve used the Antipattern headings. They are still in the same order, though.

It seems the printed version of this book is on sale now: I’m tempted to buy a few extra copies for gifts. Ahem, cow-orkers.

Logical Database Design Antipatterns

Format Comma-Separated Lists

This one is pretty simple: use a relation instead of a Comma Separated field. In the cases described in the book, a ManyToManyField is in fact simpler than a Comma Separated field. Django gets a gold star here, both for ease of use and for its documentation about relations.

However, there may be times when a relation is overkill, and a real array is better. For instance, when storing data related to which days of the week are affected by a certain condition, it may make sense to store it in this way.

But we can do better than a simple Comma Separated field. Storing the data in a Postgres Array means we can rely on the database to validate the data, and allows searching. Similarly, we could store it in JSON, too.

I’ve maintained a JSONField for Django, although it’s not easily queryable. However, an ArrayField is coming in Django 1.8. There are alternatives already available if you need to use one now. I’ve got a project to mostly backport the django.contrib.postgres features to 1.7: django-postgres.

Things like JSON, Array and Hstore are also a better solution than storing otherwise-delimited values in a straight text column. With Django 1.7, it became possible to write custom lookups, which can leverage the DBMS’ ability to query these datatypes.

Always Depend on One’s Parent

Read chapter online.

Straight into a trickier one! And, Django’s documentation points out how to create this type of relation, but does not call out the possible issues. This book is worth it for this section alone.

So, how do we deal with trees in Django?

We can use django-mptt. This gives us (from what I can see) the “Nested Sets” pattern outlined in the book, but under the name “Modified Preorder Tree Traversal”.

I’m quite interested in the idea of using a Closure Table, and there are a couple of projects with quite different approaches to this:

  • django-ctt: uses a Model class you inherit from.
  • django-ct: better documented, but uses an unusual pattern of a pseudo-manager-thing.

Knowing me, I’m probably going to spend some time building a not-complete implementation at some point.

Update: Whilst I haven’t built an implementation of a Closure Table, I did implement recursive queries for an Adjacency List.

One Size Fits All

Using a field id for all tables by default is probably one of the biggest mistakes I think Django makes. And, as we shall see, we can’t yet avoid them, for at least a subset of situations.

Indeed, Django can use any single column for the primary key, and doesn’t require the use of a key column of name id. So, in my mind, it would have been better to use the <tablename>_id, as suggested in the book. Especially since you may also access the primary key attribute using the pk shortcut.

class Foo(models.Model):
    foo_id = models.AutoField(primary_key=True)

However, it’s not currently possible to do composite primary keys (but may be soon), which makes doing the best thing for a plain ManyToManyField impossible: indeed, you don’t control that table anyway, and if you remove the id column (and create a proper primary key), things don’t work. In practice, you can just ignore this issue, since you (mostly) don’t deal with this table, or the objects from it.

So, assuming we are changing the id column into the name suggested in the book, what does that give us?

Nothing, until we actually need to write raw SQL code, and specifically code that joins multiple tables.

Then, we are able to use a slightly less verbose way of defining the join, and not worry about duplicate columns named id:

SELECT * FROM foo_foo JOIN foo_bar USING (foo_id);
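You can see the "no duplicate columns" effect without a Postgres server, using Python’s built-in sqlite3 module as a stand-in. The column layout of the two tables is made up for this sketch; only the table names and the join come from the query above.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE foo_foo (foo_id INTEGER PRIMARY KEY, name TEXT)'
)
conn.execute(
    'CREATE TABLE foo_bar ('
    '  bar_id INTEGER PRIMARY KEY,'
    '  foo_id INTEGER REFERENCES foo_foo (foo_id),'
    '  label TEXT)'
)
conn.execute("INSERT INTO foo_foo VALUES (1, 'first')")
conn.execute("INSERT INTO foo_bar VALUES (10, 1, 'child')")

cursor = conn.execute('SELECT * FROM foo_foo JOIN foo_bar USING (foo_id)')
columns = [column[0] for column in cursor.description]
rows = cursor.fetchall()

# The join column appears once in the result, not once per table.
print(columns.count('foo_id'))
print(len(rows))
```

Had both tables used a plain id column, the join would need an explicit ON clause, and SELECT * would return two columns that are both called id.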

I’m still not sure if it’s actually worthwhile doing this or not. I’m going to start doing it, just to see whether there are any drawbacks (already found one in some of my own code, that hard-coded an id field), or any great benefits.

Leave out the Constraints

Within Django, it’s more work to create relations without the relevant constraints, and it’s not possible to create a table without a primary key, so we can just pass this one by with a big:

smile and wave, boys

Use a Generic Attribute Table

Again, it’s possible to create this type of a monstrosity in Django, but not easy. A better solution, if your table’s requirements change, is to use migrations (included in Django 1.7), or a more flexible store, like JSON or Hstore. This also has the added advantage of being a column, rather than a related table, which means you can fetch it in one go, simply. Similarly, with Postgres 9.3, you can do all sorts of querying, and even more in 9.4.

Document or key stores are no substitute for proper attributes, but they do have their uses.

The other solution is to use Model inheritance, which Django does well. You can choose either abstract or concrete table inheritance, and with something like django-model-utils, even get some nice features like fetching only the subtypes when fetching a queryset of superclass models.

Use Dual-Purpose Foreign Key

Unfortunately, Django comes with a built-in way to do this: so-called Generic Relations.

Using this, it’s possible to have an association from a given model instance to any other object of any other model class.

“You may find that this antipattern is unavoidable if you use an object-relational programming framework […]. Such a framework may mitigate the risks introduced by Polymorphic Associations by encapsulating application logic to maintain referential integrity. If you choose a mature and reputable framework, then you have some confidence that its designers have written the code to implement the association without error.”

I guess we’ll just have to rely on the fact Django is a mature and reputable framework.

In all reality, I’ve used this type of relation once: for notifications that need to be able to refer to any given object. It’s also possible to use, say, a tagging app that had generic relations. But, I’m struggling to think of too many situations where it would be better than a proper relation.

I’ve also come across it in django-reversion, and running queries against objects from it is a pain in the arse.

Create Multiple Columns

Interestingly, the example for this Antipattern is the example I just used above: tags. And, this type of situation should be done in a better way: a proper relation, or perhaps an Array type. It all depends how good your database is at querying arrays. django.contrib.postgres makes this rather easy:

class Post(models.Model):
    name = models.CharField(...)
    tags = ArrayField(models.CharField(...), blank=True)

Post.objects.filter(tags__contains=['foo'])

What may not be so easy is getting all of the tags in use. This may be possible: I just haven’t thought of a way to do this yet. A nice syntax might be:

Post.objects.aggregate(All('tags'))

The SQL you might be able to use to get this could look like:

SELECT
  array_agg(DISTINCT tag) AS tags
FROM (
  SELECT unnest(tags) AS tag FROM posts
) t;

I’m not sure if there’s a better way to get this data.

Clone Tables or Columns

I can’t actually see that doing this in Django would be easy, or likely. It’s gotten me interested in some method of seamlessly doing Horizontal Partitioning as a method of archiving old data, and perhaps moving it to a different database. Specifically, moving old audit data into a separate store may become necessary at some point.

Partitioning using a multi-tenancy approach using Postgres’ schemata is another of my interests, and I’ve been working on a django-specific way to do this: django-boardinghouse. Note, this is a partial-segmentation approach, where some tables are shared, but others are per-schema.

Physical Database Design Antipatterns

Use FLOAT Data Type

Just don’t.

There’s a DecimalField, and no reason not to use it.
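
If you need convincing, a short sketch in plain Python (no Django required) shows the rounding problem that binary floats bring with them, and that decimal values avoid:

```python
from decimal import Decimal

# Binary floating point cannot represent 0.1 or 0.2 exactly.
float_total = 0.1 + 0.2
# Decimal values built from strings keep the digits you wrote.
decimal_total = Decimal("0.1") + Decimal("0.2")

print(float_total == 0.3)                # False
print(decimal_total == Decimal("0.3"))   # True
```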

Specify Values in the Column Definition

The example the book uses is to define check constraints on a given question. Django’s approach is a bit different: the valid choices are defined in the column definition, but can be changed in code at any time. Any existing values that are no longer valid are fine, but any attempt to save an object will require it to have one of the newly valid choices.

This is both better and worse than the problem described in the book. There’s no way (short of a migration) to change the existing data, but maybe that’s actually just better.

Again, the best solution is just to use a related field, but in some cases this is indeed overkill: specifically if values are unlikely to change.

Assume You Must Use Files

I’m still 50-50 on this one. Basically, storing binary files in your database (a) makes the database much bigger, which means it takes longer to back it up (and restore it), and (b) means that it’s harder to do things like use the web server, rather than the application server, to serve static files (even those user-supplied, that must be authenticated).

The main disadvantage, of not having backups, is purely an operations issue.

The secondary disadvantage, the lack of transactionality, is also easily solved: don’t delete files (unless necessary), and don’t overwrite them. If you really must, then use a Postgres NOTIFY delete-file <filepath> or similar, and have a listener that handles that.

The other disadvantage, about SQL privileges, is mostly moot under Django anyway, as you are always running as the one database user.

Using Indexes Without a Plan

Indexes are fairly tangential to an ORM: I’m going to pass over this one without too much comment. I’ve been doing a fair bit of index-level optimisation on my production database lately, in an effort to improve performance. Mostly, it’s better to optimise the query, as the likely targets for indexes probably already have them.

Query Antipatterns

Use Null as an Ordinary Value, or Vice Versa.

Python has its own None type/value, and using it in queries basically converts it into NULL. Django is a little annoying in how it at times stores empty strings instead of NULL in string fields. I was playing around with making these into proper NULLs, but it seemed to create other problems.

At least there is no established pattern to use other values instead of NULL.
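
The difference between NULL and an empty string is easy to see in a small sketch. It uses Python’s built-in sqlite3 module rather than Django and Postgres (an assumption, for the sake of being runnable anywhere), and the person table and its columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (name TEXT, nickname TEXT)")
# Python's None is stored as NULL; '' is stored as an empty string.
conn.executemany(
    "INSERT INTO person VALUES (?, ?)",
    [("Alice", None), ("Bob", ""), ("Carol", "Caz")],
)


def names(where):
    """Return the names of people matching a hard-coded WHERE clause."""
    rows = conn.execute(
        "SELECT name FROM person WHERE " + where + " ORDER BY name"
    )
    return [row[0] for row in rows]


missing = names("nickname IS NULL")          # only the real NULL
empty = names("nickname = ''")               # only the empty string
compared_to_null = names("nickname = NULL")  # never true, so no rows

print(missing, empty, compared_to_null)
```

A query for people without a nickname has to check both cases when empty strings are stored instead of NULL.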

Reference Non-grouped Columns

Since I’m dealing with Postgres, I understand this one is not much of an issue. Your query will fail if you build it wrong. Which should be the way databases work.

Sort Data Randomly

Read this chapter online.

The problem of how to fetch a single random instance from a Model comes up every now and then on IRC. Indeed, it did again last weekend. Unsurprisingly, I provided a link to this chapter.

One solution that is presented in the book is to select a single row, using a random offset:

import random
# Note: the initial version of this would fail since queryset.count()
# is the number of elements, randint(a, b) includes the value 'b',
# and queryset[b] would be out of range.
index = random.randint(0, queryset.count() - 1)
instance = queryset.all()[index]

This converts to the query:

SELECT * FROM "table" LIMIT 1 OFFSET %s;

However, without an ordering, I believe this will still do a complete table seek. Instead, you want to order on a column with an index. Like the primary key:

instance = queryset.order_by('pk')[index]

It does take two queries, but sometimes two queries is better than one. Obviously, if your table was always going to be small, it may be better to do the random ordering:

instance = queryset.order_by('?')[0]
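
The two-query approach can also be tried outside Django. The following is a minimal sketch using the built-in sqlite3 module (an assumption: the item table is made up, and SQLite stands in for Postgres). It uses random.randrange(count), which excludes the upper bound, so there is no need for the "- 1" adjustment noted above.

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO item (name) VALUES (?)",
    [("item %d" % n,) for n in range(50)],
)


def random_row(connection):
    """Fetch one random row using a count query and an offset query."""
    (count,) = connection.execute("SELECT COUNT(*) FROM item").fetchone()
    if count == 0:
        return None
    # randrange(count) yields 0 .. count - 1, always a valid offset.
    offset = random.randrange(count)
    return connection.execute(
        "SELECT id, name FROM item ORDER BY id LIMIT 1 OFFSET ?",
        (offset,),
    ).fetchone()


row = random_row(conn)
print(row)
```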

Pattern Matching Predicates

I’m sorry to say Django makes it far too easy to do this:

queryset.filter(foo__contains='bar')

Becomes something like:

SELECT * FROM "table" WHERE "table"."foo" LIKE '%bar%';

In many cases, this will be fine, but as you can imagine, you may get surprising matches, or performance may really suck.
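
As a small illustration of the surprising matches, here is a sketch using the built-in sqlite3 module with a made-up venue table (both are assumptions, not part of the original example). The data is all lower case, because SQLite’s LIKE ignores ASCII case whereas a contains lookup on Postgres does not.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE venue (name TEXT)")
conn.executemany(
    "INSERT INTO venue VALUES (?)",
    [("bar",), ("rebar supplies",), ("embargo",), ("cafe",)],
)

# A substring match finds 'bar' anywhere, not just as a whole word.
matches = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM venue WHERE name LIKE '%bar%' ORDER BY name"
    )
]

print(matches)  # ['bar', 'embargo', 'rebar supplies']
```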

Using Postgres’s full-text search is relatively simple: you can quite easily make a custom field that handles this, and with Django 1.7 or later, you can even create your own lookups:

from django.db import models


class TSVectorField(models.Field):
    def db_type(self, connection):
        return 'tsvector'


class TSVectorMatches(models.lookups.BuiltinLookup):
    lookup_name = 'matches'
    def process_lhs(self, qn, connection, lhs=None):
        lhs = lhs or self.lhs
        return qn.compile(lhs)

    def get_rhs_op(self, connection, rhs):
        return '@@ to_tsquery(%s)' % rhs

TSVectorField.register_lookup(TSVectorMatches)

Then, on a correctly defined field, you are able to do:

queryset.filter(foo__matches='bar')

Which roughly translates to:

SELECT * FROM "table" WHERE (foo @@ to_tsquery('bar'));

It’s actually a little more complicated than that, but I have a working prototype at https://bitbucket.org/schinckel/django-postgres/. There is a field class, but also an example within the search sub-app.

Clearly, you’ll want to be creating the right indexes.

Solve a Complex Problem in One Step

By their very nature, ORMs tend to make this a little less easy to do. Because you don’t normally write custom code, this scenario is less common than you might see in a normal SQL access.

However, with Django, it is possible to write over-complicated queries, but also to use things like .raw(), and .extra() to write “Spaghetti Queries”.

However, it is worth noting that with judicious use of these features, you can indeed write queries that perform exceptionally well, indeed, far better than the ORM is able to generate for you. It’s also worth noting that you can write really, really bad queries that take a very long time, just using the ORM (without even doing things like N+1 queries for related objects).

Indeed, the “how to recognize” section of this chapter shows the biggest red flag I have noticed lately: “Just stick another DISTINCT in there”.

I’ve seen first-hand how a .distinct() can cause a query to take a very long time. Removing the need for a distinct by removing the join, and instead using subqueries, caused a query that was taking around 17 seconds with a given data set to suddenly take less than 200ms.

That alone has forced me to reconsider each and every time I use .distinct() in my code (and probably explains why our code that runs queries against django-reversion performs so horribly).
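
The shape of that rewrite can be sketched with the built-in sqlite3 module. The author and book tables are hypothetical, and this only shows that the two queries return the same rows: it says nothing about timing, which depends on your data and your database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE book (
        id INTEGER PRIMARY KEY,
        author_id INTEGER REFERENCES author (id),
        title TEXT
    );
    INSERT INTO author VALUES (1, 'Ann'), (2, 'Ben'), (3, 'Cat');
    INSERT INTO book VALUES (1, 1, 'A1'), (2, 1, 'A2'), (3, 2, 'B1');
""")


def names(sql):
    return [row[0] for row in conn.execute(sql)]


# The join produces one row per book, so Ann shows up twice...
joined = names(
    "SELECT author.name FROM author "
    "JOIN book ON book.author_id = author.id ORDER BY author.name"
)
# ...which is what tempts you to stick a DISTINCT in there.
with_distinct = names(
    "SELECT DISTINCT author.name FROM author "
    "JOIN book ON book.author_id = author.id ORDER BY author.name"
)
# A subquery never creates the duplicates in the first place.
with_subquery = names(
    "SELECT name FROM author "
    "WHERE id IN (SELECT author_id FROM book) ORDER BY name"
)

print(joined, with_distinct, with_subquery)
```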

A Shortcut That Gets You Lost

I’ve used, in my SQL snippets in this post, the shortcut that is mentioned here: SELECT * FROM .... Luckily, Django doesn’t use this shortcut, and instead lists out every column it expects to see.

This has a really nice side-effect: if your database tables have not been migrated to add that new column, then whenever you try to run any queries against that table, you will have an error. Which is much more likely to happen immediately, rather than at 3am when that column is first actually used.
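
A quick sketch of that fail-fast behaviour, again using the built-in sqlite3 module and a made-up person table (assumptions for illustration; the exact error you see from Postgres will differ):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The table has not been "migrated" yet: there is no email column.
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT)")

# SELECT * happily returns whatever columns exist right now.
star_columns = [
    description[0]
    for description in conn.execute("SELECT * FROM person").description
]

# Listing every expected column fails as soon as the query runs.
try:
    conn.execute("SELECT id, name, email FROM person")
    error = None
except sqlite3.OperationalError as exc:
    error = str(exc)

print(star_columns)
print(error)
```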

Application Development Antipatterns

Store Password in Plain Text

There is no, I repeat, no reason you should ever be doing this. It’s a cardinal sin, and Django has a great authentication and authorisation framework, that you can extend however you need it.

As noted in the legitimate uses section: if you are accessing a third-party system, you may need to store the password in a readable format. In this case, something like OAuth, if available, may make things a little safer.

Execute Unverified Input As Code

Read this chapter online.

Most of the risks of SQL Injection are mitigated when you use an ORM like Django’s. Of course, if you write .raw() or .extra() queries that don’t properly escape user-provided data, then you may still be at risk. .extra() in particular has arguments that allow you to pass an iterable of parameters, which will then be correctly escaped as they are added to the query.
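
The difference between interpolating user input and passing it as a parameter is worth seeing once. This sketch uses the built-in sqlite3 module and a hypothetical account table (assumptions for illustration). Django’s .raw() accepts a params argument that plays the same role as the tuple passed to execute() here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT)")
conn.executemany("INSERT INTO account VALUES (?)", [("alice",), ("bob",)])

# Input crafted to turn the WHERE clause into something always true.
user_input = "' OR '1'='1"

# Unsafe: the input becomes part of the SQL text itself.
unsafe_sql = "SELECT name FROM account WHERE name = '%s'" % user_input
unsafe_rows = conn.execute(unsafe_sql).fetchall()

# Safe: the input is passed separately, and is only ever a value.
safe_rows = conn.execute(
    "SELECT name FROM account WHERE name = ?", (user_input,)
).fetchall()

print(len(unsafe_rows))  # 2: every account leaks
print(len(safe_rows))    # 0: nobody has that name
```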

Filling in the Corners

Educate your manager if (s)he thinks it’s a bad thing to have non-contiguous primary keys. Transaction rollbacks, deleted objects: there’s all sorts of reasons why there may be gaps.

Making Bricks Without Straw

It goes without saying that you should have error handling within your python code.

Make SQL a Second-Class Citizen

This is kind-of the point of an ORM: to remove from you the need to deal with creating complex queries in raw SQL.

Your Django models are the documentation of your table structure, or documentation can be generated from them. Your migrations files show the changes that have been made over time. Naturally, both of these will be stored in your Source Code Management system.

Clearly, as soon as you are doing anything in raw SQL, then you should follow the practices you do with the rest of your code.

Testing in-database is something I am a little bit interested in. As I move more code into the database (often for performance reasons, sometimes because it’s just fun), it would be nice to have tests for these functions. I have a long list of things in my Reading List about Postgres Unit Testing. Perhaps I’ll get around to them at some point. Integrating these with the Django test runner would be really neat.

The Model Is an Active Record

Django’s use of the Active Record is slightly different to Rails. In Rails, the column types in the database control what attributes are on the model, but in Django, the python object is the master. I think this is more meaningful, because it means that everything you need to know about an object is in the model definition: you don’t need to follow the migrations to see what attributes you have.

I do like the concept of a Domain Model: it’s an approach I’ve lightly tried in the past. Perhaps it is an avenue I’ll push down further at some point. In some ways, Django’s Form classes allow you to encapsulate this, but mostly business logic still lives on our Model classes.

Summary

So, how did Django do?

Pretty good, I’d say. The ones that were less successful either don’t really matter most of the time (primary key column is always called id, choices defined in the model), or you don’t really need to use them (Generic Relations, searching using LIKE %foo%, using raw SQL).

We do fall down a bit with files stored in the database, and fat models, but I would argue that those patterns work just fine, at least for me right now.

Liquid Templates and Django Templates

Note to self: when I get the error:

/Library/Ruby/Gems/1.8/gems/jekyll-0.11.2/bin/../lib/jekyll/convertible.rb:81:in `do_layout': undefined method `name' for #<Jekyll::Post:0x10f842e88> (NoMethodError)
	from /Library/Ruby/Gems/1.8/gems/jekyll-0.11.2/bin/../lib/jekyll/post.rb:189:in `render'
	from /Library/Ruby/Gems/1.8/gems/jekyll-0.11.2/bin/../lib/jekyll/site.rb:193:in `render'
	from /Library/Ruby/Gems/1.8/gems/jekyll-0.11.2/bin/../lib/jekyll/site.rb:192:in `each'
	from /Library/Ruby/Gems/1.8/gems/jekyll-0.11.2/bin/../lib/jekyll/site.rb:192:in `render'
	from /Library/Ruby/Gems/1.8/gems/jekyll-0.11.2/bin/../lib/jekyll/site.rb:40:in `_draft_process'
	from /Users/matt/Dropbox/Blog/_plugins/drafts.rb:15:in `process'
	from /Library/Ruby/Gems/1.8/gems/jekyll-0.11.2/bin/jekyll:250
	from /usr/local/bin/jekyll:19:in `load'
	from /usr/local/bin/jekyll:19

It’s probably because I have used django template tags in the {% (end)highlight %} blocks, and omitted the {% (end)raw %} stuff.

Leveraging HTML and Django Forms: Pagination of Filtered Results

Django’s forms are fantastic for parsing user input, but I’ve come up with a nice way to use them, in conjunction with HTML forms, for pagination, using the inbuilt Django pagination features.

It all stems from the fact that I’ve begun using forms quite heavily for GET purposes, rather than just for POST. Basically, anytime you have a URL that may have some parts of the query string that may need to be built, it’s simpler to use a form element, than to manually build up the url in your template.

Thus, where you may have something like:

<a href="{% url 'foo' %}?page={{ page }}">

It may be better to do something more like:

<form action="{% url 'foo' %}">
  <input type="hidden" name="page" value="{{ page }}">
</form>

Indeed, you can even use named buttons for submission, which will refer to the page. That is the key to the process outlined below.


Django comes with lots of “batteries”, including form handling and pagination. The Class Based Views (CBV) that deal with collections of objects will include pagination, although it is possible to use this pagination in your own views. For simplicity, we’ll stick with a simple ListView.

Let’s begin with that simple view: in our views.py:

from django.views.generic import ListView, DetailView

from .models import Person

person_list = ListView.as_view(
    queryset=Person.objects.all(),
    template_name='person/list.html',
    paginate_by=10,
)

person_detail = DetailView.as_view(
    queryset=Person.objects.all(),
    template_name='person/detail.html',
)

Essentially, that’s all you need to do. You could use implied template names, but I almost never do this. The takeaway from this block is that we are stating the queryset that our ListView will use as the base, the template it should render, and the number of items per page.

I’ve stubbed out the person_detail view, just so we can refer to it in our urlconf, and then in turn in the template. Because of the simplicity of it, we could have just done all of this in our urls.py.

Speaking of our urls.py, we have something like:

from django.conf.urls import url

from . import views

urlpatterns = [
    url(r'^people/$', views.person_list, name='person_list'),
    url(r'^people/(?P<pk>\d+)/', views.person_detail, name='person_detail'),
]

Then, in our template, we can render it as (ignoring the majority of the page):

<ul class="people">
  {% for object in object_list %}
    <li>
      <a href="{% url 'person_detail' pk=object.pk %}">
        {{ object }}
      </a>
    </li>
  {% endfor %}
</ul>

But this doesn’t give us our pagination. It will only show the first ten results, with no way to access the others. All we need to do to access the others is to append ?page=X, but, as we will see, there is another way.

Typically, your pagination block might look something like:

<ul class="pagination">
  <li>
    {% if page_obj.has_previous %}
      <a href="?page={{ page_obj.previous_page_number }}">
        prev
      </a>
    {% else %}
      <span>prev</span>
    {% endif %}
  </li>

  {% for page_number in paginator.page_range %}
    {% if page_number == page_obj.number %}
      <li class="active">
        <span>{{ page_number }}</span>
      </li>
    {% else %}
      <li>
        <a href="?page={{ page_number }}">
          {{ page_number }}
        </a>
      </li>
    {% endif %}
  {% endfor %}


  <li>
    {% if page_obj.has_next %}
      <a href="?page={{ page_obj.next_page_number }}">
        next
      </a>
    {% else %}
      <span>next</span>
    {% endif %}
  </li>
</ul>

Depending upon your CSS framework, if you use one, there may already be some pre-prepared styles to help you out with this.

This is all well and good, until you want paginated search results. Then, you can no longer rely on using ?page=N, as this would remove any search terms you were already using. Also, if you were using ajax to fetch and display stuff, you may need to use the whole URL, rather than just the query string.
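
To make the problem concrete: keeping the search terms means rebuilding the whole query string every time the page changes. A minimal sketch of that chore in plain Python follows; the with_page helper is hypothetical, and is not something Django provides.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def with_page(url, page):
    """Return url with its page parameter replaced, keeping the rest."""
    parts = urlsplit(url)
    params = parse_qs(parts.query, keep_blank_values=True)
    params["page"] = [str(page)]
    new_query = urlencode(params, doseq=True)
    return urlunsplit(parts._replace(query=new_query))


print(with_page("/people/?query=smith&page=2", 3))
# /people/?query=smith&page=3
```

Letting a form carry both the search terms and the page number avoids having to do this by hand in every template.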

Instead, we can use a Django form for searching, and just add in the pagination bits.

We will build a page that displays an optionally filtered list of people.

Our Person model will be deliberately simple:

from django.db import models

class Person(models.Model):
    name = models.CharField(max_length=256)

Likewise, our form will be simple. All we need to do is have the form able to filter our queryset.

from django import forms
from django.utils.translation import gettext_lazy as _

class PersonSearchForm(forms.Form):
    query = forms.CharField(label=_('Filter'), required=False)

    def filter_queryset(self, request, queryset):
        if self.cleaned_data['query']:
            return queryset.filter(name__icontains=self.cleaned_data['query'])
        return queryset

Finally, we will need to subclass a ListView. We’ll mixin from FormMixin, so we get the form-handling capabilities:

from django.views.generic.edit import FormMixin
from django.views.generic import ListView

class FilteredListView(FormMixin, ListView):
    def get_form_kwargs(self):
        return {
          'initial': self.get_initial(),
          'prefix': self.get_prefix(),
          'data': self.request.GET or None
        }

    def get(self, request, *args, **kwargs):
        self.object_list = self.get_queryset()

        form = self.get_form(self.get_form_class())

        if form.is_valid():
            self.object_list = form.filter_queryset(request, self.object_list)

        context = self.get_context_data(form=form, object_list=self.object_list)
        return self.render_to_response(context)

There’s a little bit to comment on there: we override the get_form_kwargs so we pull our form’s data from request.GET, instead of the default.

We also override get, so we filter results if the form validates (which it will if there was data provided). We delegate responsibility for the actual filtering to the form class.

Everything else is just standard.

We will want to actually use this view:

people_list = FilteredListView.as_view(
    form_class=PersonSearchForm,
    template_name='person/list.html',
    queryset=Person.objects.all(),
    paginate_by=10
)

Now we need to render this.

<form id="person-list-filter" action="{% url 'person_list' %}">
  <input name="{{ form.query.html_name }}" value="{{ form.query.value }}" type="search">
  <button type="submit" name="page" value="1">{% trans 'Search' %}</button>
</form>

<div class="results">
  {% include 'person/list-results.html' %}
</div>

You may notice that the search button will result in page=1 being used. This is deliberate.

Our person/list-results.html is just the same as what our person/list.html looked like before, with the addition of the pagination template inclusion.

{% include 'pagination.html' with form_target='person-list-filter' %}

<ul class="people">
  {% for object in object_list %}
    <li>
      <a href="{% url 'person_detail' pk=object.pk %}">
        {{ object }}
      </a>
    </li>
  {% endfor %}
</ul>

Our pagination.html is very similar to how our other template above looked too, but using <button> elements instead of <a>, and we will disable those that should not be clickable. Also, the buttons contain an attribute indicating which form they should be bound to.

<ul class="pagination">
  <li>
    <button
      form="{{ form_target }}"
      {% if page_obj.has_previous %}
        name="page"
        value="{{ page_obj.previous_page_number }}"
        type="submit"
      {% else %}
        disabled="disabled"
      {% endif %}>
      prev
    </button>
  </li>

  {% for page_number in paginator.page_range %}
    <li class="{% if page_number == page_obj.number %}active{% endif %}">
      <button
        name="page"
        value="{{ page_number }}"
        type="submit"
        form="{{ form_target }}"
        {% if page_number == page_obj.number %}
          disabled="disabled"
        {% endif %}>
        {{ page_number }}
      </button>
    </li>
  {% endfor %}

  <li>
    <button
      form="{{ form_target }}"
      {% if page_obj.has_next %}
        name="page"
        value="{{ page_obj.next_page_number }}"
        type="submit"
      {% else %}
        disabled="disabled"
      {% endif %}>
      next
    </button>
  </li>
</ul>

We are getting close now. This is enough for a click on the next/previous or page number buttons to resubmit our search form, reloading the page with the correct results.

But we can do a bit better. We can easily load the results using AJAX, and just insert them into the page.

We just need one additional method on our View class:

class FilteredListView(FormMixin, ListView):
    # ...

    def get_template_names(self):
        if self.request.is_ajax():
            return [self.ajax_template_name]
        return [self.template_name]

    # ...

and one addition to our view declaration:

people_list = FilteredListView.as_view(
    form_class=PersonSearchForm,
    template_name='person/list.html',
    ajax_template_name='person/list-results.html',
    queryset=Person.objects.all(),
    paginate_by=10,
)

I’ll use jQuery, as it makes for easier to follow code:

// Submit handler for our form: submit it using AJAX instead.
$('#person-list-filter').on('submit', function(evt) {
  evt.preventDefault();

  var form = evt.target;

  $.ajax({
    url: form.action,
    data: $(form).serialize(),
    success: function(data) {
      // The results container in the template has a class, not an id.
      $('.results').html(data);
    }
  });
});

// Because we are using buttons, which ajax submit will not send,
// we need to add a hidden field with the relevant page number
// when we send our request.
$('#person-list-filter').on('click', '[name=page]', function(evt) {
  var $button = $(evt.target).closest('button');
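  // Wrap the DOM form element so that the jQuery methods below work.
  var $form = $($button[0].form);

  // A jQuery object is always truthy, so check its length instead.
  if (!$form.find('[type=hidden][name=page]').length) {
    $form.append('<input type="hidden" name="page">');
  }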

  $form.find('[type=hidden][name=page]').val($button.val());

  $form.submit();
});

That should do nicely.


There is another thing that we need to think about. If we leave the next/prev buttons, then we need to handle multiple clicks on those buttons, which fetch the subsequent page, and possibly cancel the existing AJAX request.

I do have a solution for this, too, although it complicates things a fair bit. First, we need to add some attributes to the next/prev buttons:

<ul class="pagination">
  <li>
    <button
      form="{{ form_target }}"
      {% if page_obj.has_previous %}
        name="page"
        value="{{ page_obj.previous_page_number }}"
        type="submit"
        data-increment="-1"
        data-stop-at="1"
      {% else %}
        disabled="disabled"
      {% endif %}>
      prev
    </button>
  </li>

  {% for page_number in paginator.page_range %}
    <li class="{% if page_number == page_obj.number %}active{% endif %}">
      <button
        name="page"
        value="{{ page_number }}"
        type="submit"
        form="{{ form_target }}"
        {% if page_number == page_obj.number %}
          disabled="disabled"
        {% endif %}>
        {{ page_number }}
      </button>
    </li>
  {% endfor %}

  <li>
    <button
      form="{{ form_target }}"
      {% if page_obj.has_next %}
        name="page"
        value="{{ page_obj.next_page_number }}"
        type="submit"
        data-increment="1"
        data-stop-at="{{ paginator.num_pages }}"
      {% else %}
        disabled="disabled"
      {% endif %}>
      next
    </button>
  </li>
</ul>

And our click handler changes a bit too:

$('#person-list-filter').on('click', 'button[name=page]', function() {
  var page = parseInt(this.value, 10);
  var $form = $(this.form);
  // Only update the value of the hidden field.
  // A jQuery object is always truthy, so check its length instead.
  if (!$form.find('[name=page][type=hidden]').length) {
    $form.append('<input name=page type=hidden>');
  }
  $form.find('[name=page][type=hidden]').val(page);
  // Increment any prev/next buttons values by their increment amount,
  // and set the disabled flag on any that have reached their stop-at
  $form.find('[data-increment]').each(function() {
    this.value = parseInt(this.dataset.increment, 10) + page;
    // We want to disable the button if we get to the 'stop-at' value,
    // but this needs to happen after any submit events have occurred.
    if (this.dataset.stopAt) {
      setTimeout(function() {
        this.disabled = (this.value == this.dataset.stopAt);
      }.bind(this), 0);
    }
  });

  $form.submit();
});

Since this was posted, I have written a number of pages that use this pattern. Some of the improvements that could be made are listed below:

It’s possible to have these results automatically update as the user types. Obviously, this only makes sense if we have AJAX submission happening!

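$('#person-list-filter').on('keyup', function() {
  // Trigger the submit through jQuery, so our AJAX submit handler runs;
  // the native this.submit() would bypass it and reload the page.
  $(this).submit();
});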

If you have lots and lots of results, you probably won’t want to show every button. Often you will see the first few, and a couple either side of the current page (and sometimes the last few). This is almost possible to do with pure CSS, but not quite. I do have a solution for this, but it’s probably worthy of a complete post of its own.
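
This is not the CSS approach mentioned above, but one server-side way to decide which buttons to render can be sketched in a few lines of Python. The visible_pages function, and its edge and around parameters, are hypothetical names chosen for this sketch. A None in the result marks a gap, where the template could render an ellipsis.

```python
def visible_pages(current, total, edge=2, around=2):
    """Return the page numbers to show, with None marking each gap."""
    keep = set(range(1, min(edge, total) + 1))               # first few
    keep.update(range(max(total - edge + 1, 1), total + 1))  # last few
    keep.update(
        range(max(current - around, 1), min(current + around, total) + 1)
    )                                                        # neighbours

    pages = []
    previous = 0
    for number in sorted(keep):
        if number - previous > 1:
            pages.append(None)  # pages were skipped before this one
        pages.append(number)
        previous = number
    return pages


print(visible_pages(10, 20))
# [1, 2, None, 8, 9, 10, 11, 12, None, 19, 20]
```

The resulting list could be added to the template context and looped over in place of paginator.page_range.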

Another situation that is likely to happen is this:

  • User clicks on a page other than page 1 of results. Let’s say page N.
  • User enters text in search field which results in fewer than N pages of results being available.
  • User gets error message.

We can fix this with an overridden method:

from django.http import Http404


class FilteredListView(FormMixin, ListView):
    # ...

    def paginate_queryset(self, queryset, page_size):
        try:
            return super(FilteredListView, self).paginate_queryset(queryset, page_size)
        except Http404:
            self.kwargs['page'] = 'last'
            return super(FilteredListView, self).paginate_queryset(queryset, page_size)

    # ...
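
The override above leans on Django understanding the special page value 'last'. The arithmetic behind falling back to the last page is simple enough to sketch in plain Python; clamp_page is a hypothetical helper for illustration only, not part of Django.

```python
import math


def clamp_page(requested, item_count, per_page):
    """Return the nearest valid page number for the requested page."""
    # Even an empty result set is treated as having one (empty) page.
    last_page = max(1, math.ceil(item_count / per_page))
    return min(max(requested, 1), last_page)


# The user was on page 5, but the new search only has 23 results.
print(clamp_page(5, 23, 10))  # 3
```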

You’ll also need to add in a get_prefix() method if you are using an old Django, but really you should just upgrade.


Updated: I’ve added in some more error checking into the templates, to prevent exceptions when attempting to render previous and next page links (thanks inoks).

Updated: I’ve changed to use the preferred urlpattern syntax. (thanks knbk).

Updated: Delegate to the form for filtering. Add discussion of other extensions. Add button[form] attributes.

Review Django Essentials

Django Essentials. Note it appears the name of this book has been changed from “Getting started with Django”.

I’ll be clear from the outset: I have some pretty strong issues about the first part of this book, and I’m going to be quite specific with the things that I think are wrong with it. Having said that, the later chapters are far better than the earlier ones.

I am not sure, however, that it’s any more accessible than the official documentation. There’s probably a market for a more thorough tutorial than the one on the Django website, however, I’m not sure this book, as it stands, is that tutorial.

How could this book be better?

I think it gets bogged down providing detail in areas that are just not that important at that point in time. I also think it misses a good overview of the product that is being built: indeed it’s never clear, even after completing the book, exactly what the product is supposed to do.

In my opinion, the code examples are hard to read. This is a combination of the styling of the source code, and the layout. That bold, blue is quite jarring in comparison to the rest of the text, and the repeated lack of PEP8 compliance, especially when coupled with reading it on a narrow device, make it hard to follow the code. Multiple code blocks (which should be in separate files) flow together, making it hard to see where one stops and the next begins.

The book fails early on to push some basic Python standards and best practices. In some cases these are addressed later on, however it is not obvious what is gained by not starting from this point. Similarly, there are some security issues that should never have passed through editing. Again, these are addressed later, but I feel that the damage has already been done. Friends don’t let friends store passwords in plain text; and very little is gained by disabling the CSRF protection.

But it’s not just the source code that seems lacking. The technical translation at times varies between the obtuse and the absurd. Early chapters in particular (the ones that I think are more important when teaching basic concepts) contain sentences or paragraphs that required me to re-read several times in order for me to be able to translate it into something that made sense to me. And I’ve been writing Django code for about 6 years (and Python code for probably another 6 before it).

Would I recommend it?

After hitting the plain-text-password section, I said no. I actually have a couple of guys much newer to Django than me at work, and I did not want them to read the book at that point.

However, after I’d cooled down, and actually started to draft this review, I re-read the start, and read the rest. There is some good information, but I’m not sure that it’s presented in a way that is better than the official documentation, or some other resources out there.

So, I’m really not sure I’d recommend it to a beginner. There are too many things early in the book that set up for future failures (or at least, unlearning). And I’m not sure I’d recommend it to an intermediate developer. It’s not that it’s bad (with the caveats below), it’s just not as good as what you can read on the Django website.

Some of the more important specific issues that I feel are wrong with this book follow. These are often things that beginners struggle with. You’ll notice less stuff about the later chapters. That’s because they are better.


Code standards.

Throughout the book, there are inconsistencies in how individual models and modules are named. Whilst this seems pedantic, computers are pedantic when it comes to textual source code. It does matter if you use Work_manager in one place, and Workmanager in another.

Further, in Python, we always (unless the project we are working on has different standards) use snake_case for module names, TitleCase for class names, and snake_case again for variables, methods and functions, and ANGRY_SNAKE_CASE for constants. There’s just no reason to go against these guidelines.

Okay, I may have made up the name ANGRY_SNAKE_CASE.
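
As a minimal sketch of those conventions in one place (every name here is hypothetical, chosen only to show the casing):

```python
# work_manager.py: module names use snake_case (hypothetical file name).

# Constants use upper case with underscores ("ANGRY_SNAKE_CASE").
MAX_OPEN_TASKS = 3


class WorkManager:
    """Class names use TitleCase (CapWords in PEP 8 terms)."""

    def __init__(self):
        # Attributes and local variables use snake_case.
        self.open_tasks = []

    def add_task(self, task_name):
        """Methods and functions use snake_case too."""
        if len(self.open_tasks) >= MAX_OPEN_TASKS:
            raise ValueError("too many open tasks")
        self.open_tasks.append(task_name)
        return len(self.open_tasks)
```

With names like these, there is only ever one spelling to remember: work_manager for the module, WorkManager for the class.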

Finally, Python code should be compliant to PEP8. I’m not sure that a single line of code in this book would pass through a PEP8 checker.


MVC/MVT

The section on “The MVC Framework” (tip: Django isn’t) seems superfluous. It would be far better to avoid this term, and instead describe the typical flow of data that one might see in a request-response cycle handled by Django:

  1. The client sends a request to the server
  2. The server passes the request to the correct view function (according to the url)
  3. The view function performs the required work, and returns an HttpResponse object.
  4. The HttpResponse object is sent back to the client.

Depending upon the view, it may do any or all of the following:

  • Process data provided by the client using a Form
  • Load and/or save data to/from the database
  • Render an HTML template or return a JSON (or XML) response.
  • Perform any other action that is required

The whole concept of a Controller doesn’t really make sense in the context of a web page, although purely within the client-side of a Single-Page-Application it could.


Installation.

I’ve written about installation before, notably discussing how every project should be installed into a new virtualenv. Indeed, I even install every command-line application in its own environment. And most of the experienced Pythonistas I have come across always use a new virtualenv for each project, both in development and in deployment. So it was worrisome to see a non-best-practice approach used for installation.

Although this is addressed later in the book (in the chapter on deployment), I fail to understand the benefit of not mentioning it now. There are so many reasons to use virtualenv in development, and none I can think of for avoiding it.


Security

There are two things in this book that set off alarm bells for me, with respect to security. I’ve mentioned them above, but I’ll go into a little more detail.

The more minor error is the disabling of CSRF checking. The inbuilt Django CSRF protection ensures a range of attacks are ineffective, and the mental cost of using this protection is fairly low: in any view that you are POSTing back to the server, you need to include the CSRF token. This is usually done as a form field, using the csrf_token template tag.

Disabling it is almost never a good idea.

Suggesting that you disable it “just for now” as the only thing you change in the initial settings file is even worse. A beginning programmer may begin routinely disabling CSRF protection as they start a new project, and not re-enabling it. Bad form.

The severe error is storing user passwords in plain text. This flaw is so basic that, even though it is “fixed” later in the book, as is CSRF protection, by then I feel it is too late. Even hinting that either of these things is acceptable to do as an interim measure (do you have any idea how much “interim” or temporary code I have in production still, years after it was written?) makes me really struggle to continue reading.

However, I am glad I did.
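
For contrast with the plain-text approach criticised above, here is a rough sketch of salted password hashing using only the standard library. In a real Django project you would not write this yourself: User.set_password() and check_password() already handle it. The function names and the iteration count below are my own assumptions for illustration, not Django's implementation.

```python
import hashlib
import hmac
import os

ITERATIONS = 100000  # assumed work factor for this sketch; tune for real use


def hash_password(password, salt=None):
    """Return (salt, derived_key); store these instead of the password."""
    if salt is None:
        salt = os.urandom(16)  # a fresh random salt per password
    key = hashlib.pbkdf2_hmac(
        'sha256', password.encode('utf-8'), salt, ITERATIONS
    )
    return salt, key


def check_password(password, salt, expected_key):
    """Re-derive the key from the candidate password; compare in constant time."""
    _, key = hash_password(password, salt)
    return hmac.compare_digest(key, expected_key)
```

The point is that the database only ever holds the salt and the derived key, so a leaked table does not hand over anyone's actual password.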


URL routing and regular expressions

This book contains a reasonable explanation of regular expressions, but I think it would have been better suited to have a more concrete set of examples of how regular expressions can be used for URL routing in Django. For instance:
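
A sketch of what such a set of examples might look like. These patterns are my own illustrative assumptions (a hypothetical articles app), and they are exercised here with Python's re module so they can be tried outside of a Django project:

```python
import re

# Hypothetical URL patterns, written as they would appear in a regex-based
# urls.py. Django matches against the path without its leading slash.
ARTICLE_LIST = r'^articles/$'
ARTICLE_YEAR = r'^articles/(?P<year>\d{4})/$'
ARTICLE_DETAIL = r'^articles/(?P<year>\d{4})/(?P<slug>[\w-]+)/$'


def captured(pattern, path):
    """Return the named groups captured from path, or None if no match."""
    match = re.search(pattern, path)
    return match.groupdict() if match else None


# ^ and $ anchor the pattern to the whole path.
print(captured(ARTICLE_LIST, 'articles/'))
# \d{4} means exactly four digits, captured under the name "year".
print(captured(ARTICLE_YEAR, 'articles/2014/'))
# [\w-]+ means one or more letters, digits, underscores or hyphens.
print(captured(ARTICLE_DETAIL, 'articles/2014/django-review/'))
```

Each named group becomes a keyword argument to the view, which is the link between the regular expression and the view's parameters.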

You could use a series of examples, like these, to describe some of the key rules of regular expressions, and at the same time discuss parameters. Alternatively, you could skip regular expressions altogether at this point in time, and use simple strings.

When discussing URL routing, the following paragraph is a great example of a failure to explain what is essentially a simple process.

“After having received a request from a web client, the controller goes through the list of URLs linearly and checks whether the URL is correct with regular expressions. If it is not in conformity, the controller keeps checking the rest of the list. If it is in conformity, the controller will call the method of the corresponding view by sending the parameters in the URL.”

Phrased in a simpler manner:

“The URL resolver looks at each pattern in the order it is defined. If the regular expression of the url route matches the request’s path, the matching view method is called with the request object and any other parameters defined, otherwise it is passed on to the next route.”
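
That description is simple enough to sketch in a few lines. This is a toy, not Django's actual resolver: the views, the ROUTES list and the resolve_and_call() helper are all hypothetical, and returning None stands in for a 404.

```python
import re


# Hypothetical views: each takes the request plus any captured parameters.
def article_list(request):
    return 'list of articles'


def article_detail(request, year, slug):
    return 'article %s from %s' % (slug, year)


# Routes are checked in the order they are defined.
ROUTES = [
    (r'^articles/$', article_list),
    (r'^articles/(?P<year>\d{4})/(?P<slug>[\w-]+)/$', article_detail),
]


def resolve_and_call(request, path):
    """Call the first view whose pattern matches path, or return None."""
    for pattern, view in ROUTES:
        match = re.search(pattern, path)
        if match:
            # Named groups from the pattern become keyword arguments.
            return view(request, **match.groupdict())
    return None
```

Calling resolve_and_call(request, 'articles/2014/hello-world/') skips the first route, matches the second, and calls article_detail with year and slug filled in.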


Templates

This book presents a reasonable discussion of the Django template language. There are some parts that made me do a double-take (legacy of templates? Oh, you mean inheritance), and there are lots of important typos, missing characters, or just plain wrong source code.

And then there’s render_to_response.

Back in the day, we used to use a function called render_to_response(), which required you to manually pass a RequestContext instance to it: we have since moved on to render(). There is no need really to mention render_to_response() in anything other than a footnote: “You might see older code that uses…”

Talking about the context itself is good, but I think it should be more explicit: “You pass three arguments to render(): the request object, the template path and a dict containing the variables from your view that you want available in the rendering context”.

Oh, and later in the book, locals() is passed as the context. The Zen of Python: explicit is better than implicit. Yes, in the box immediately afterward, it is suggested that you don’t do this.

Doing something, and then suggesting that you don’t do it is counterproductive.
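
To make the difference concrete, here is a small sketch in plain Python (no Django required). Both functions are hypothetical stand-ins for a view that builds a template context:

```python
def explicit_context(request, article_id):
    article = {'id': article_id, 'title': 'Hello'}  # stand-in for a model lookup
    secret_token = 'not for templates'  # internal detail, deliberately not passed on
    # Only what the template needs is passed, and the reader can see it.
    return {'article': article}


def implicit_context(request, article_id):
    article = {'id': article_id, 'title': 'Hello'}
    secret_token = 'not for templates'
    # Every local name, including request and secret_token, ends up in the context.
    return locals()
```

The explicit version exposes exactly one name. The locals() version exposes four, and that set changes silently every time someone adds a variable to the function.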


Models

Django’s ORM gets some criticism at times. I find it’s mostly good enough for my needs; indeed, it sometimes does a better job of writing queries than me. However, it is an Object Relational Mapper, and discussing how that works in simple terms would probably be useful. It’s not strictly necessary to have a strong background in relational databases and SQL to use it correctly, but understanding some of the implications of how accessing things from the ORM can cause issues, or indeed, how the data is even represented in the database, can only be a positive.

“To make a connection between databases and SQL, we can say that a model is represented by a table in the database, and a model property is represented by a field in the table.”

Cumbersome language again, and not totally wrong, but probably slightly misleading. Perhaps:

”…a Model class is represented by a table in the database, and an instance of that Model is represented by a row/tuple. Fields on the model (which appear as special attributes) are the columns of that row.”
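
A toy illustration of that mapping, using sqlite3 and a dataclass from the standard library rather than Django's ORM. The Developer class and its fields are hypothetical:

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Developer:
    """Hypothetical model: the class corresponds to a table."""
    id: int
    name: str
    email: str


connection = sqlite3.connect(':memory:')
# One table per model class; one column per field.
connection.execute(
    'CREATE TABLE developer (id INTEGER PRIMARY KEY, name TEXT, email TEXT)'
)
connection.execute(
    'INSERT INTO developer (id, name, email) VALUES (?, ?, ?)',
    (1, 'Alice', 'alice@example.com'),
)

# One row per instance: each column value becomes a field on the object.
row = connection.execute('SELECT id, name, email FROM developer').fetchone()
developer = Developer(*row)
```

An ORM automates both directions of this translation, but the correspondence (class to table, instance to row, field to column) is the same.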

A discussion of south is also somewhat welcome. Even though the soon to be released Django 1.7 contains a superior (but written by the same person) implementation of migrations, it’s certainly still worth understanding a little about how south works.

However, there is one false statement when discussing south:

“Never perform the Django syncdb command. After running syncdb --migrate for the first time, never run it again. Use migrate afterwards.”

This is a broken statement. If you were to add a new app that did not have migrations, then without a syncdb command, the tables for its models would not be created.

This chapter suddenly gets a whole lot worse as soon as a model is defined with a plain-text password field, but I’ve already discussed that.


Django.contrib.admin

I spend a lot of time trying to talk people out of using the admin module as anything other than a top-level admin tool. Really, this is a tool that is fantastic for viewing and manipulating data in the early stages of development, and great for emergency access for a developer or trusted admin to view or change data on a live system, but trying to push too much into it is problematic. I say that as someone who has a project that relies far too much on the admin.

It’s also hard to not discuss the admin, as it really is a great tool, but it’s really important to understand its limitations.

I quote Django core contributor Russ Keith-Magee:

“Django’s admin is not meant to be the interface for your website”


QuerySets

Interestingly, the chapters on QuerySets and Forms are actually far better than those preceding. The source code isn’t formatted any better, but it really does seem that the translations make (mostly) more sense.

I do think the manner of adding data to the database is bunkum, however. Given that we just covered the admin interface, it would make sense to use this to add some data, before looking at QuerySets. And we could delve into manage.py shell in order to illustrate how QuerySets, their various methods, and some model methods actually work.

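And while we are on anti-patterns: queryset[:1].get() is pointless. You might as well just use queryset[0]. It is exactly the same SQL, and easier to read. The only difference is the exception raised when there are no results: IndexError rather than DoesNotExist.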


Forms

And then we get to Forms. I’m a really big fan of Django’s form handling: it’s something that makes dealing with user input much safer, and simpler. And this chapter explains that, but, from an educational perspective, I’m cautious that showing someone the wrong way to do something first is counter-productive.

Sure, I understand that it makes a point, and having done something a laborious, error-prone way for some time, and then being shown a safer, faster, easier method is eye-popping, but I fear that for some percentage of readers, they will get a takeaway that not using Forms is a valid choice.

Even beginning with a ModelForm is probably a nice approach, as you can get a lot of functionality with almost no code at all.


CBV

The section on Class Based Views is okay too. These are something else that are often hard to understand, and the initial official documentation on them was sadly lacking. Once you have your head around how they work they can be really powerful, and this book takes the right approach in suggesting caution: they are not always the right tool. Similarly, it is great that these were not used as a starting point.

However, I find that the explanations and descriptions are not always clear. Certainly as an experienced Django user I can read and understand what is going on, but as a beginner I think this chapter would be hard to follow. Perhaps a simple discussion about what the different CBV are used for, how the ViewClass.as_view() pattern works and why it is required, followed by some examples, would help.

Perhaps a better approach would have been to write the Model and Form classes earlier, then write function-based views to CRUD those objects, and then rewrite the exact same views using CBV.
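
For what it’s worth, the as_view() pattern mentioned above can be sketched in plain Python. This is a simplified stand-in for django.views.generic.base.View, not its real implementation: the request is just a dict here, and HelloView is hypothetical.

```python
class View:
    """Simplified stand-in for django.views.generic.base.View."""

    def __init__(self, **kwargs):
        # Keyword arguments given to as_view() become instance attributes.
        for key, value in kwargs.items():
            setattr(self, key, value)

    @classmethod
    def as_view(cls, **initkwargs):
        # The URLconf needs a plain callable, so as_view() returns a
        # function that closes over the class.
        def view(request, *args, **kwargs):
            # A fresh instance per request: no state shared between requests.
            self = cls(**initkwargs)
            return self.dispatch(request, *args, **kwargs)
        return view

    def dispatch(self, request, *args, **kwargs):
        # Route to a method named after the HTTP verb, e.g. get() or post().
        handler = getattr(self, request['method'].lower())
        return handler(request, *args, **kwargs)


class HelloView(View):
    def get(self, request, name):
        return 'Hello, %s' % name


hello = HelloView.as_view()  # this callable is what goes in the URLconf
```

That is the "why it is required" part: URL routing wants a function, the class wants to be instantiated once per request, and as_view() is the adapter between the two.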


Django.contrib.auth

Although far less impressive than the admin, I think that auth is a more important module. Especially now, given we can easily swap out auth.User to get the desired user functionality, I think this is something that should be given more weight. It doesn’t need to necessarily come before the chapter about the admin, but it should be discussed, or at least introduced, before anything is done with a User-ish model.

I think this book does not do justice to Django.contrib.auth. There are lots of views and forms that can (and should) be used to save writing your own code, which is more likely to have bugs. Also, even if the basic User model is used in the example, a discussion of how easy it is to swap out, and get “email as username” functionality is certainly deserved.


AJAX

I’m probably 50-50 on the AJAX chapter. I guess I understand why you’d want to include a chapter on it, but I worry that this chapter maybe doesn’t do enough. If it’s meant as an introduction to AJAX, I’m not sure it is up to the task.

I do often use jQuery, but it’s probably not too tricky to rewrite the code to delete an object using vanilla Javascript. And if you are going to use jQuery, you should get the idioms right.

var cases = $('nav ul li').each(function() {
  $(this).addClass('nav_item');
})

Can easily be written:

$('nav ul li').addClass('nav_item');

And we probably shouldn’t use $(foo).html($(foo).html() + bar), when really we want to use $(foo).append(bar).

Also, I don’t think that using csrf_exempt is a great idea: the official documentation has details about how to use AJAX and still keep CSRF protection.


Thanks to:

  • Maior in #django for proofreading.

Get the class of a Django view function

I needed to be able to get the class of a view function, once it had been instantiated via MyView.as_view(). I’d done something similar in the past to get the base callable view, but this was slightly different.

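from django.views.generic.base import View

def get_class(func):
    # __closure__ holds the cells a function closes over (Python 2.6+ and 3;
    # func_closure is the Python 2-only spelling). It is None or missing for
    # anything that is not a closure.
    closure = getattr(func, '__closure__', None)
    if not closure:
        return None

    for cell in closure:
        contents = cell.cell_contents

        if not contents:
            continue

        # as_view() closes over the view class itself: that is what we want.
        if getattr(contents, '__bases__', None) and issubclass(contents, View):
            return contents

        # Otherwise recurse: decorators wrap the view in further closures.
        result = get_class(contents)
        if result:
            return result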

This is a recursive function that does a depth-first search on the function object, until it finds an object that is a class, and is a subclass of django.views.generic.base.View.

You can use it like:

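# On Django 1.10 and later, use: from django.urls import resolve
from django.core.urlresolvers import resolve

# resolve() returns a ResolverMatch; its .func attribute is the view callable.
match = resolve('/path/to/url')

view_class = get_class(match.func)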
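
To see why the closure search works, here is a self-contained sketch that runs without Django. The View class is a minimal stand-in of my own, and ShiftList and login_required are hypothetical; get_class is a condensed variant of the function above.

```python
class View:
    """Stand-in for django.views.generic.base.View, so this runs without Django."""

    @classmethod
    def as_view(cls):
        def view(request):
            return cls().get(request)
        return view


class ShiftList(View):
    def get(self, request):
        return 'shifts'


def login_required(func):
    """Hypothetical decorator: adds one more closure layer around the view."""
    def wrapper(request):
        return func(request)
    return wrapper


def get_class(func):
    # Walk every cell the callable closes over, depth first.
    for cell in getattr(func, '__closure__', None) or ():
        contents = cell.cell_contents
        if isinstance(contents, type) and issubclass(contents, View):
            return contents
        found = get_class(contents)
        if found:
            return found


view = login_required(ShiftList.as_view())
print(get_class(view))
```

The wrapper closes over the inner view function, which in turn closes over cls, so the search has to go two levels deep before it reaches ShiftList.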

KnockoutJS HTML binding

TL;DR: Don’t use KnockoutJS html binding lots of times in your page.

I’m in the middle of rewriting a large part of our application in HTML: for a lot of the interactivity stuff, anything more than just a simple behaviour, I’m turning to KnockoutJS.

Mostly, it’s been awesome. Being able to use two-way binding is the obvious big winner, but dependency tracking is also fantastic.

However, I have had some concerns with performance in the past, and this was always on my mind as I moved into quite a complicated part of the system.

Our approach is that we are not creating a single page application: different parts of the system are at different URLs, and visiting that page loads up the relevant javascript. This is a deliberate tradeoff, mostly because for the foreseeable future, our software will not work without a connection to our server: most of the logic related to shift selection is handled by that. We aren’t about to change that.

While rewriting the rostering interface, I initially had Django render the HTML, and I added behaviours. This was possible, and quite fast, however as the behaviours became more complex, I was doing things like sending back scripts that caused other parts of the page to refresh themselves. It was all rather fragile.

So, I went back to KnockoutJS. After a while, I noticed significant slowdowns when dealing with pages that really shouldn’t have been that slow. I’d optimised the database access for the fetching of shifts (and indeed, it is much faster than before), but it felt like Knockout was very sluggish.

I do have quite a few ko.computed() objects, perhaps they were slowing it down? Notably, the function that filters which shifts should be shown where on the page.

So I put some console.time()/timeEnd() calls in place.

Nope: the initial parse of the data runs in less than half a millisecond: instantiating the objects took a while, but the filtering of shifts was taking much less than 100ms.

However, the initial call to ko.applyBindings() was taking several seconds.

The most annoying thing was that when the developer tools were open, it was taking far, far longer!

Eventually, through using the developer tools profiling, I discovered that the slowdown was because of repeated code like:

foo.innerHTML = bar;

Initially, I had thought this slowdown was in KnockoutJS itself, and played around with other ways of binding (such as using the knockout-repeat plugin). Still slow.

Eventually, however, I worked out that it was the act of interacting with the DOM in this manner that was slow. More specifically, the assignment to innerHTML was occurring in the html: binding.

Looking through my source code, I discovered code that looked like:

<span data-bind="html: icon"></span>

And, icon contained the HTML I wanted to put in there:

<i class="icon-ok"></i>

Which was a bad idea to begin with: it conflated UI with data. So, I replaced the code that looked like:

this.icon = '<i class="icon-ok"></i>';

With:

this.icon = {
  'icon-ok': true
};

And then, in the HTML:

<i data-bind="css: icon"></i>

Bingo. All of a sudden, a page that took several seconds to re-render does so in around a second.

It’s important to note that this pattern was repeated several times for each shift: and we have possibly dozens of shifts on a page. When you really need to use the html binding that’s fine, just don’t stick it inside a loop (or worse still, inside a nested loop).

django-boardinghouse

I wrote a heap of code last April, under the name Multi-tenanted Django. It was fairly complete, but not especially well documented, and not really that well tested.

Recently, I’ve been having to write some reporting code at work that dealt with objects that are generated by django-reversion. If I was using tenancy-based partitioning, it would be really easy for me to just fetch the changes that were made to data from a given company: instead I need to do heaps of queries, and lots of filtering.

Which got me enthused on django-multi-schema, which has since been renamed to django-boardinghouse. And, it now has its own documentation, and an example project.

I’m still a bit cagey about releasing it to pypi, as the example project is pretty simple, and I’d like to build that (or another project) up a bit to see if I’ve made any more bad decisions: I’ve already changed it from opt-in to separate schema to opt-out, and added in a configurable SCHEMA_MODEL.

It currently passes all tests under django 1.4 - 1.6, and has some functionality under django 1.7, but the migration handling code is not well tested just yet.