Django second AutoField

2015-05-17 @ 22:43:12

here.

Sometimes, your ORM just seems to be out to get you.

For instance, I’ve been investigating a technique for the most important data structure in a system to be essentially immuatable.

That is, instead of updating an existing instance of the object, we always create a new instance.

This requires a handful of things to be useful (and useful for querying).

We probably want to have a self-relation so we can see which object supersedes another. A series of objects that supersede one another is called a lifecycle.
We want to have a timestamp on each object, so we can view a snapshot at a given time: that is, which phase of the lifecycle was active at that point.
We should have a column that unique per-lifecycle: this makes for querying all objects of a lifecycle much simpler (although we can use a recursive query for that).
There must be a facility to prevent multiple heads on a lifecycle: that is, at most one phase of a lifecycle may be non-superseded.
The lifecycle phases needn’t be in the same order, or really have any differentiating features (like status). In practice they may, but for the purposes of this, they are just “what it was like at that time”.

I’m not sure these ideas will ever get into a released product, but the work behind them was fun (and all my private work).

The basic model structure might look something like:

class Phase(models.Model):
    phase_id = models.AutoField(primary_key=True)
    lifecycle_id = models.AutoField(primary_key=False, editable=False)

    superseded_by = models.OneToOneField('self',
        related_name='supersedes',
        null=True, blank=True, editable=False
    )
    timestamp = models.DateTimeField(auto_now_add=True)

    # Any other fields you might want...

    objects = PhaseQuerySet.as_manager()

So, that looks nice and simple.

Our second AutoField will have a sequence generated for it, and the database will give us a unique value from a sequence when we try to create a row in the database without providing this column in the query.

However, there is one problem: Django will not let us have a second AutoField in a model. And, even if it did, there would still be some problems. For instance, every time we attempt to create a new instance, every AutoField is not sent to the database. Which breaks our ability to keep the lifecycle_id between phases.

So, we will need a custom field. Luckily, all we really need is the SERIAL database type: that creates the sequence for us automatically.

class SerialField(object):
    def db_type(self, connection):
        return 'serial'

So now, using that field type instead, we can write a bit more of our model:

class Phase(models.Model):
    phase_id = models.AutoField(primary_key=True)
    lifecycle_id = SerialField(editable=False)
    superseded_by = models.OneToOneField('self', ...)
    timestamp = models.DateTimeField(auto_now_add=True)

    def save(self, **kwargs):
        self.pk = None
        super(Phase, self).save(**kwargs)

This now ensures each time we save our object, a new instance is created. The lifecycle_id will stay the same.

Still not totally done though. We currently aren’t handling a newly created lifecycle (which should be handled by the associated postgres sequence), nor are we marking the previous instance as superseded.

It’s possible, using some black magic, to get the default value for a database column, and, in this case, execute a query with that default to get the next value. However, that’s pretty horrid: not to mention it also runs an extra two queries.

Similarly, we want to get the phase_id of the newly created instance, and set that as the superseded_by of the old instance. This would require yet another query, after the INSERT, but also has the sinister side-effect of making us unable to apply the not-superseded-by-per-lifecycle requirement.

As an aside, we can investigate storing the self-relation on the other end - this would enable us to just do:

    def save(self, **kwargs):
        self.supersedes = self.pk
        self.pk = None
        super(Phase, self).save(**kwargs)

However, this turns out to be less useful when querying: we are much more likely to be interested in phases that are not superseded, as they are the “current” phase of each lifecycle. Although we could query, it would be running sub-queries for each row.

Our two issues: setting the lifecycle, and storing the superseding data, can be done with one Postgres BEFORE UPDATE trigger function:

CREATE FUNCTION lifecycle_and_supersedes()
RETURNS TRIGGER AS $$

  BEGIN
    IF NEW.lifecycle_id IS NULL THEN
      NEW.lifecycle_id = nextval('phase_lifecycle_id_seq'::regclass);
    ELSE
      NEW.phase_id = nextval('phase_phase_id_seq'::regclass);
      UPDATE app_phase
        SET superseded_by_id = NEW.phase_id
        WHERE group_id = NEW.group_id
        AND superseded_by_id IS NULL;
    END IF;
  END;

$$ LANGUAGE plpgsql VOLATILE;

CREATE TRIGGER lifecycle_and_supersedes
  BEFORE INSERT ON app_phase
  FOR EACH ROW
  EXECUTE PROCEDURE lifecycle_and_supersedes();

So, now all we need to do is prevent multiple-headed lifecycles. We can do this using a UNIQUE INDEX:

CREATE UNIQUE INDEX prevent_hydra_lifecycles
ON app_phase (lifecycle_id)
WHERE superseded_by_id IS NULL;

Wow, that was simple.

So, we have most of the db-level code written. How do we use our model? We can write some nice queryset methods to make getting the various bits easier:

class PhaseQuerySet(models.query.QuerySet):
    def current(self):
        return self.filter(superseded_by=None)

    def superseded(self):
        return self.exclude(superseded_by=None)

    def initial(self):
        return self.filter(supersedes=None)

    def snapshot_at(self, timestamp):
        return filter(timestamp__lte=timestamp).order_by('lifecycle_id', '-timestamp').distinct('lifecycle_id')

The queries generated by the ORM for these should be pretty good: we could look at sticking an index on the lifecycle_id column.

There is one more thing to say on the lifecycle: we can add a model method to fetch the complete lifecycle for a given phase, too:

    def lifecycle(self):
        return self.model.objects.filter(lifecycle_id=self.lifecycle_id)

(That was why I used the lifecycle_id as the column).

Whilst building this prototype, I came across a couple of things that were also interesting. The first was a mechanism to get the default value for a column:

def database_default(table, column):
    cursor = connection.cursor()
    QUERY = """SELECT d.adsrc AS default_value
               FROM   pg_catalog.pg_attribute a
               LEFT   JOIN pg_catalog.pg_attrdef d ON (a.attrelid, a.attnum)
                                                   = (d.adrelid,  d.adnum)
               WHERE  NOT a.attisdropped   -- no dropped (dead) columns
               AND    a.attnum > 0         -- no system columns
               AND    a.attrelid = %s::regclass
               AND    a.attname = %s"""
    cursor.execute(QUERY, [table, column])
    cursor.execute('SELECT {}'.format(*cursor.fetchone()))
    return cursor.fetchone()[0]

You can probably see why I didn’t want to use this. Other than the aforementioned two extra queries, it’s executing a query with data that comes back from the database. It may be possible to inject a default value into a table that causes it to do Very Bad Things™. We could sanitise it, perhaps ensure it matches a regular expression:

NEXTVAL = re.compile(r"^nextval\('(?P<sequence>[a-zA-Z_0-9]+)'::regclass\)$")

However, the trigger-based approach is nicer in every way.

The other thing I discovered, and this one is really nice, is a way to create an exclusion constraint that only applies if a column is NULL. For instance, ensure that no two classes for a given student overlap, but only if they are not superseded (or deleted).

ALTER TABLE "student_enrolments"
ADD CONSTRAINT "prevent_overlaps"
EXCLUDE USING gist(period WITH &&, student_id WITH =)
WHERE (
  superseded_by_id IS NULL
  AND
  status <> 'deleted'
);