Read-only data from your database in Django

I needed to create a column in some database tables that is completely controlled by the database, but whose value is sometimes needed by the Django object.

It should never be presented in a Form, and never, ever be written to by the Django infrastructure.

So, we need a way to fetch the data from the database, but ensure that, even if the value is changed and the object saved, it is never written back.

The detail of how this data is set in the database is irrelevant: it’s a column that gets its value from a sequence (and, incidentally, this sequence is shared across multiple tables).

But, we need a way to get this data.

A nice technique is to leverage two parts of Django: the QuerySet.extra(select={}) method to actually add this field to the query, and Manager.get_query_set() (get_queryset() in newer versions of Django) to make this apply to every fetch of the objects.

Our extra column will be called sequence_number.

from django.db import models


class SequenceNumberManager(models.Manager):
    def get_query_set(self):
        return super(SequenceNumberManager, self).get_query_set().extra(select={
            'sequence_number': '"%s"."sequence_number"' % self.model._meta.db_table
        })


class Thing(models.Model):
    # Column definitions. Do not define sequence_number!

    objects = SequenceNumberManager()

That’s it.

Now, Thing.objects.all()[0].sequence_number will give you the value from the database. Because it’s not a Django field, it will never be written back to the database.

The SequenceNumberManager, as it stands, could be applied to multiple models, as it dynamically looks up the database table to use for the field. You do need to include the table name in the select: otherwise, if you join to another table that has a column with the same name, your database will (correctly) reject the query as ambiguous.

I’ve done a similar thing, but using django-model-utils’ PassThroughManager in conjunction with a QuerySet subclass, since I’m a big fan of the approach described at Building a higher-level query API: the right way to use Django’s ORM, which I link to at least once a week in IRC.

I’d be interested to hear of alternative methods of achieving this, or if there are any possible drawbacks. I think you could do it with a custom field (that never wrote to the database), but I think that would actually be a whole lot more code. There is also the benefit in this case that you could have the .extra() call as a queryset method, so you only add it when you need it, although the performance gains there would be insignificant, I suspect.
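As a rough sketch of that opt-in variant (the queryset class and method names here are just placeholders, not anything from the post above):

from django.db import models
from django.db.models.query import QuerySet
from model_utils.managers import PassThroughManager


class ThingQuerySet(QuerySet):
    def with_sequence_number(self):
        # Only add the extra column when it is actually wanted.
        return self.extra(select={
            'sequence_number': '"%s"."sequence_number"' % self.model._meta.db_table
        })


class Thing(models.Model):
    # Column definitions, as before. Still no sequence_number field.

    objects = PassThroughManager.for_queryset_class(ThingQuerySet)()

With that, Thing.objects.with_sequence_number()[0].sequence_number works, and plain queries skip the extra select entirely.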

In-Database Audit Trail

I’ve been thinking about audit trails, object versioning and the like a bit lately. We have situations where it’s actually important to be able to know what changes were made, when, and by whom.

The tool we are currently using in Django for this is django-reversion, but I’m not that happy with it. Part of the problem is that it adds a fair amount of overhead into each request, which we have noticed, and also that querying the data it produces is not simple, unless you are just looking for changes made to a specific object.

For our case, we want to know all the changes made to a set of objects (that may not all be of the same type), made within a given range, where there is (eventually) a foreign key relationship back to our Company model.

Part of the problem is just that: a custom solution could store the Company relationship explicitly, but also, we really care about what the changes were, rather than what the object looks like after the changes. Well, technically, we want both.

However, it has gotten me thinking about other ways to solve this problem.

In most DBMSs, it’s possible to get a trigger function to run when an update happens to each row, which makes it possible to get access to this information. Indeed, googling “Postgres audit trigger” pops up some hits that are highly relevant. I looked at Audit trigger 91plus, since it’s “improved”. It has some really nice features like storing the audit data in an hstore column, which means you can query it.

However, one place where this type of trigger breaks down is that it’s not generally possible to get the “application user” associated with a request, only the “database user”, which for our system (and most other web applications) is fixed for all access.

One way to get around this might be to, at the start of every database transaction, inject a call that creates a temporary table, with the extra bits of data that you want to log, and then use that in the trigger function.

CREATE TEMP TABLE IF NOT EXISTS
  "_app_user" (user_id integer, ip_address inet);

Then we need to add (or update) the one row that will contain our data. We must ensure that we only ever have one row in this table.

UPDATE _app_user SET user_id=%s, ip_address=%s;
INSERT INTO _app_user (user_id, ip_address)
  SELECT %s, %s WHERE NOT EXISTS (SELECT * FROM _app_user);

This code will ensure that the first statement (UPDATE) will affect all rows in the table (of which there will be at most one), and the second statement (INSERT ... SELECT ... WHERE NOT EXISTS ...) will only create a new row if there are no rows currently in the table.

It’s up to you to then pass the correct data to this. I’m currently looking at doing this using Django middleware, although I suspect this may fall down using the newer transaction handling, as otherwise we could have just ensured our middleware ran after the TransactionMiddleware. It may be possible to do it with a custom database backend, but it needs to somehow get access to the request object (which contains the user, and the ip address). Obviously, you could log other data about the request, too.
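For what it’s worth, here is a sketch of the middleware I have in mind. It’s untested, it assumes AuthenticationMiddleware has already run (so request.user exists), and whether it plays nicely with the newer transaction handling is exactly the open question above.

from django.db import connection


class AuditUserMiddleware(object):
    """Stash the application user and IP address in a temporary table,
    so that the audit trigger can copy them into the audit row."""

    def process_request(self, request):
        # AnonymousUser has pk None, so this is safe for unauthenticated requests.
        user_id = request.user.pk
        ip_address = request.META.get('REMOTE_ADDR')

        cursor = connection.cursor()
        cursor.execute(
            'CREATE TEMP TABLE IF NOT EXISTS '
            '"_app_user" (user_id integer, ip_address inet)'
        )
        cursor.execute(
            'UPDATE _app_user SET user_id=%s, ip_address=%s',
            [user_id, ip_address]
        )
        cursor.execute(
            'INSERT INTO _app_user (user_id, ip_address) '
            'SELECT %s, %s WHERE NOT EXISTS (SELECT * FROM _app_user)',
            [user_id, ip_address]
        )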

The final part of the puzzle is to inject this data into the row that will be used for the audit table entry. I modified the table definition so it included columns for the data I wanted: app_user_id and app_ip_address.

Then, inside the actual trigger function, after the audit_row object has been created, but before it is written to the table, we inject the data we want.

We need to be a little careful, as it’s possible the table does not exist:

BEGIN
  PERFORM 
    n.nspname, c.relname 
  FROM
    pg_catalog.pg_class c 
  LEFT JOIN 
    pg_catalog.pg_namespace n
  ON n.oid = c.relnamespace
  WHERE
    n.nspname like 'pg_temp_%' 
  AND
    c.relname = '_app_user';

  IF FOUND THEN
    FOR r IN SELECT * FROM _app_user LIMIT 1 LOOP
      audit_row.app_user_id = r.user_id;
      audit_row.app_ip_address = r.ip_address;
    END LOOP;
  END IF;
END;

This checks to see if the _app_user table exists in any of the valid temporary table namespaces, and if so, grabs the first (and only, from above) entry, using the values to update the row.

This function then works: if there is a temporary table with this name, it uses those fields when creating the audit row; if not, it creates the audit row with empty values. This means that some audit entries may not contain proper user data, but in the case of Django, it’s possible to make changes outside of the request-response cycle anyway. You could require that a user starting a shell session authenticates with a valid django username+password, but that still leaves management commands. I guess you could have a system account, but leaving these entries blank serves much the same purpose as an explicit system user.

I haven’t got any production code using anything like this: I’d still want to test that it works as expected as part of the request, and would want to build up some method of querying it. There’s probably no reason you couldn’t do the table definition as a Django model (and indeed, have the function definition as a migration).
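To give a flavour of that, a model might look something like this (the field and table names are guesses modelled loosely on the audit trigger’s table, not a tested definition):

from django.db import models


class AuditEntry(models.Model):
    # A hypothetical subset of the columns the audit trigger writes.
    table_name = models.TextField()
    action = models.TextField()
    action_timestamp = models.DateTimeField()
    changed_fields = models.TextField()  # really an hstore column in the database
    app_user_id = models.IntegerField(null=True)
    app_ip_address = models.GenericIPAddressField(null=True)

    class Meta:
        db_table = 'logged_actions'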

On Django and Check Constraints

I’ve come to love Django, and now I seem to be spending not only my work time, but also my own time working on one django project or another. Today at work I came across the need to have check constraints in a database. For instance, this can be used to ensure that the start of an event is before the end of the event, or that a certain value has a particular range of values.

For the SQL savvy, it can look like:

CREATE TABLE event (
  …more columns here…
  start datetime,
  finish datetime,
  count integer CHECK (count >= 0),
  CONSTRAINT event_starts_before_finish CHECK (start < finish)
);

There are two types of constraint here: the column-level constraint, which in this instance can be done with django, and the table-level constraint, which in many systems cannot be a part of a column definition. In PostgreSQL, my DBMS of choice, it is possible to define a constraint that affects more than one column as part of a column, but I prefer not to. Note also that I am using the syntax form that allows me to name the constraint. This means I can alter or drop it later.

As I mentioned, django can do the first type of constraint, as long as it is a >= 0 constraint, using PositiveIntegerField or one of its relatives. However, there is no functionality built into django to do the latter. And I would like to use them.

It is possible to just add the constraints to an already existing database: indeed, that is what I did for work. Having the constraints at the database level matters because I have more than one interface to my datastore (don’t ask: one of them is an old SOAP interface I need to keep around), and I want to ensure that even if someone accesses it outside of django, they cannot put in data that breaks these constraints. Similarly, if I use an interface other than the admin interface, or, heaven forbid, open up the database in raw form, I cannot accidentally put in data that breaks this validation.

But, pure database level constraints don’t give very nice feedback in django. It is nicer to have the pretty red boxes around my field than the traceback. So, I want the validation to occur on the field level as well. As long as I am using django’s forms (and my API for RESTful access will use them for validation), then I will have these errors nicely presented.

So, to that end, I have created some code that allows for the definition and enforcement of both of these types of constraint.

The column form allows for a new keyword argument to a field:

count = models.IntegerField(constraint=">= 0")

Notably, the string that is passed in must conform to the pattern “cmp value”, where cmp is one of the comparison operators (<, >, <=, >=, =, !=), and value is a valid value for this column type. It will be passed to the to_python() method of this field when comparing. There must be a space between the two parts.
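Purely to illustrate the format (this is not the patch itself), parsing such a string is straightforward:

import operator

OPERATORS = {
    '<': operator.lt, '>': operator.gt,
    '<=': operator.le, '>=': operator.ge,
    '=': operator.eq, '!=': operator.ne,
}

def parse_constraint(constraint, to_python):
    # Split "cmp value" on the first space, and convert the value
    # using the field's to_python() so comparisons are type-correct.
    cmp_op, value = constraint.split(' ', 1)
    return OPERATORS[cmp_op], to_python(value)

# e.g. parse_constraint('>= 0', int) gives (operator.ge, 0)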

The other form is a new attribute on the Meta class of a model.

check_constraints = ('start < finish',)

This must be a tuple of strings, where each string is of the form “column cmp column”. Again, there must be a space on either side of cmp, and each column name must be a valid column of this model. Not meeting these criteria will result in a validation error.

From these definitions, the database constraints will be applied, and validation of forms will also occur.

I have deliberately made the constraints simple (ie, not callable objects) so that they can easily be converted to database constraints. For instance, they can effectively be transferred straight through to the database (with the addition of the column name in the case of the column constraint).

I am going to create a ticket in the django trac, and submit a patch. Guessing I should write up some test cases, though!

Navicat now does Oracle.

According to an item that just popped into my feed, there is now a version of Navicat that connects to Oracle DBs.

Cool. I like Navicat.

Python Properties

One of the criticisms of many languages is that they are so complex that people do not use all of their features, meaning that one person might write a program that another fluent reader/writer of that language may not be able to understand. I think this is just rubbish, at least to some extent. If you see some code in a language that doesn’t make sense, then hopefully you should be able to either figure it out, or look up what is going on in the documentation (what, your language doesn’t have complete online, searchable documentation?).

A similar thing happened to me the other day - and I didn’t even need to look it up in the Python documentation to see what the story was.

People often criticise python because it lacks the ability to mark a variable as private or protected. All attributes of a class are automatically public, and it is only really a convention that prefixing a variable with one or two underscores marks it as protected or private. Name mangling means that the double-underscore form becomes slightly more than just a convention, IIRC. Basically, the theory is that if you prefix an attribute, that is a marker to users of your API that they really shouldn’t use it.
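A quick sketch of what that mangling actually does:

class Account(object):
    def __init__(self):
        self._internal = 1    # single underscore: purely a convention
        self.__secret = 2     # double underscore: name-mangled by Python

acct = Account()
print(acct._internal)           # 1 - nothing actually stops you
print(acct._Account__secret)    # 2 - the mangled name is still reachable
print(acct.__secret)            # raises AttributeError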

I’ve often been frustrated when writing Java programs that attributes aren’t accessible, but I can see the value in only allowing protected access, through getters and setters. You can do this to prevent user classes/objects from putting garbage data into your attributes. This becomes even more important in Python, as its dynamic (but still strong) typing means I could put a “GOO” in where you expected an integer.

I discovered yesterday, as I was implementing a database in SQLAlchemy, that python has the ability to force attribute access through getters and setters, using the property function.

So, for this project, I have a database where Sheep objects are stored, along with a variety of attributes. Sheep can be either Rams or Ewes, but the data that is stored about a Ram is virtually identical to that stored about a Ewe. However, using polymorphic mapping, I can have Ram and Ewe objects both stored in the Sheep table, since they are both subclasses. Furthermore, I can use the gender attribute to determine if the object retrieved from the database should be a Ram or a Ewe. This is much better than anything I was able to do with J2EE, at least as far as I could find.
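Roughly, that mapping looks something like this (a simplified declarative sketch with a single-column key, rather than my actual schema):

from sqlalchemy import Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Sheep(Base):
    __tablename__ = 'sheep'
    id = Column(Integer, primary_key=True)
    gender = Column(String(1), nullable=False)
    # The gender column decides which class a fetched row becomes.
    __mapper_args__ = {'polymorphic_on': gender}

class Ram(Sheep):
    __mapper_args__ = {'polymorphic_identity': 'M'}

class Ewe(Sheep):
    __mapper_args__ = {'polymorphic_identity': 'F'}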

Because I am using a composite primary key (again, not something that is easy to do with J2EE without defining a Primary Key class, ugh), because a Sheep can have a .sire and a .dam (each stored internally as three columns in the table), and because a sire/dam must be older than the sheep in question (and of a particular gender), some way of restricting which data can be assigned to the sire and dam attributes (which, using SQLAlchemy, are references to other objects in the table) is needed.

Thus, I can have the following lines in the class definition for Sheep:

    sire = property(_getSire, _setSire)
    dam = property(_getDam, _setDam)

The crux of this post is that now, if you attempt to set a sire using sheep.sire = otherSheep, then otherSheep will be passed to the _setSire method of the Sheep class.

This goes even further - it now allows me to use either a Ram object, or a Sheep object (with some checking to ensure that its gender is M), or even just a string that is the string representation of a sheep - the value that I am using for the three parts of the primary key. It will then look up the sheep in question, or create a new one if it doesn’t exist in the database.
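The shape of that setter is roughly as follows (the helper name and the checks here are illustrative, not my actual code):

class Sheep(object):
    def _getSire(self):
        return self._sire

    def _setSire(self, value):
        # Accept a Ram, a male Sheep, or an identifier string.
        if isinstance(value, str):
            value = lookup_or_create_sheep(value)   # hypothetical helper
        if getattr(value, 'gender', None) != 'M':
            raise ValueError("A sire must be male")
        self._sire = value

    sire = property(_getSire, _setSire)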

Thus, a sheep is identified to the real world by three things: its flock number, such as 160188, the year it was born, like 2004, and the tag number, such as 040001, which is not unique between flocks, or necessarily years. For the flock this system is designed for, the tag number is unique for sheep of a given century, which would generally suffice, but in a situation where there were more than 9999 sheep born every year in a flock, then you wouldn’t just be able to use the last two digits of the year for the start of the tag number.

Thus a sheep might be declared as 160188-2007-070001, which is a unique identifier of that sheep in the world (or at least Australia!), but having this as a primary key on its own means more jiggery-pokery would be required to get at the flock number and year. By storing the three parts separately it is possible to keep an easy reference to the flock and the year. There is also no chance of the data integrity error there might be if I stored the combined id alongside the year and flock number: it would then be possible to change one without the others being updated.

I’ve also done something quite clever with the mapping of the sire/dam relationship. By having different classes for Ram and Ewe, and mapping sire/dam to these, it is possible to use the same backreference term, offspring, to generate a list of sheep who are offspring of the Ram or Ewe in question. If you don’t use subclasses, then it isn’t nearly as neat.

If you can’t tell, I’m loving SQLAlchemy. I had some issues figuring out what the hell was going wrong with my relations between sheep and sire, but it turned out that you have to use the remote_side argument to the relation function to do what I wanted. Funnily, it was working with dam, but not with sire, but now it’s much more solid. No more circular reference constraints either.
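In the simplified single-column-key form again, the self-referential part of the mapping looks roughly like this (a sketch, using the newer relationship() spelling of relation()):

from sqlalchemy import Column, ForeignKey, Integer, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import relationship

Base = declarative_base()

class Sheep(Base):
    __tablename__ = 'sheep'
    id = Column(Integer, primary_key=True)
    gender = Column(String(1))
    sire_id = Column(Integer, ForeignKey('sheep.id'))
    dam_id = Column(Integer, ForeignKey('sheep.id'))

    # remote_side marks the "parent" side of the self-referential join;
    # foreign_keys disambiguates the two parent columns. With Ram and Ewe
    # subclasses, both backrefs can simply be called offspring.
    sire = relationship('Sheep', remote_side='Sheep.id',
                        foreign_keys=[sire_id], backref='sired')
    dam = relationship('Sheep', remote_side='Sheep.id',
                       foreign_keys=[dam_id], backref='birthed')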

By the way, this project is going to be an online database for livestock information, allowing potential purchasers to research pedigrees online. If anyone else is interested in purchasing it, send me an email. At this stage, it is sheep only, but it will be easy to change to, say, cattle if they are your thing…