On Django and Check Constraints

I’ve come to love Django, and now I seem to be spending not only my work time, but also my own time working on one django project or another. Today at work I came across the need to have check constraints in a database. For instance, this can be used to ensure that the start of an event is before the end of the event, or that a certain value has a particular range of values.

For the SQL savvy, it can look like:

CREATE TABLE event (
…more columns here…
start datetime,
finish datetime,
count integer CHECK (count >= 0),
CONSTRAINT event_starts_before_finish CHECK (start < finish)
);

There are two types of constraint here: the column-level constraint, which in this instance can be done with django, and the table-level constraint, which in many systems cannot be a part of a column definition. In PostgreSQL, my DBMS of choice, it is possible to define a constraint that affects more than one column as part of a column, but I prefer not to. Note also that I am using the syntax form that allows me to name the constraint. This means I can alter or drop it later.

As I mentioned, django can do the first type of constraint, as long as it is a > 0 constraint, using one of the field types that subclasses PositiveInteger. However, there is no functionality built into django to do the latter. And I would like to use them.

It is possible to just add the constraints to an already existing database: indeed, that is what I did for work. Having the constraints at the database level means that, since I have more than one interface to my datastore (don’t ask: one of them is for an old SOAP interface I need to keep around), I want to ensure that even if someone accesses it outside of django, they cannot put in data that breaks these constraints. Similarly, if I use an interface other than the admin interface, or heaven forbid, open up the database in raw form, I cannot accidentally put in data that breaks this validation.

But, pure database level constraints don’t give very nice feedback in django. It is nicer to have the pretty red boxes around my field than the traceback. So, I want the validation to occur on the field level as well. As long as I am using django’s forms (and my API for RESTful access will use them for validation), then I will have these errors nicely presented.

So, to that end, I have created some code that allows for the definition and enforcement of both of these types of constraint.

The column form allows for a new keyword argument to a field:

count = models.IntegerField(constraint=">= 0")

Notably, the string that can be passed in must conform to the pattern “cmp value”, where cmp is one of the comparison operators (<, >, <=, >=, =, !=), and value is a valid value for this column type. It will be passed to the to_python() method of this field when comparing. There must be a string between the two parts.

The other form is a new attribute on the Meta class of a model.

check_constraints = (‘start < finish’,)  

This must be a tuple of strings, where a string is of the form “column cmp column”. Again, there must be a space either side of cmp, and each column name must be a valid column. Not meeting these criteria will result in a validation error.

From these definitions, the database constraints will be applied, and validation of forms will also occur.

I have deliberately made the constraints simple (ie, not callable objects) so that they can easily be converted to database constraints. For instance, they can effectively be transferred straight through to the database (with the addition of the column name in the case of the column constraint).

I am going to create a ticket in the django trac, and submit a patch. Guessing I should write up some test cases, though!

Navicat now does Oracle.

According to an item that just popped into my feed, there is now a version of Navicat that connects to Oracle DBs.

Cool. I like Navicat.

Python Properties

One of the criticisms of many languages is that they are so complex, that people do not use all of the features of them. Meaning that one person might write a program in a language that another, fluent reader/writer of that language may not be able to understand. I think this is just rubbish, at least to some respect. If you see some code in a language that doesn’t make sense, then hopefully you should be able to either figure it out, or look up in the documentation (what, your language doesn’t have complete online, searchable documentation?) what is going on.

A similar thing happened to me the other day – and I didn’t even need to look up in the Python documentation to see what the story was.

People often criticise python because it lacks the ability to mark a variable as private, or protected. All attributes of a class are automatically public, and it is only really a guideline that if you prefix a variable with one or two underscores, you are marking it as protected or private. Namespace mangling means that this becomes slightly more than just a guideline, IIRC, although I’m not really sure of how this all works. Basically, the theory is that if you prefix an attribute, that is a marker to users of your API that you really shouldn’t use this.

I’ve often been frustrated when writing Java programs that attributes aren’t accessible, but I can see the value in only allowing protected access, through getters and setters. You can do this to prevent user classes/objects from putting garbage data into your attributes. This becomes even more important in Python, as its dynamic (but still strong) typing means I could put a “GOO” in where you expected an integer.

I discovered yesterday, as I was implementing a database in SQL Alchemy, that python has the ability to only allow protected access, through getters and setters, using the property function.

So, for this project, I have a database, where Sheep objects are stored, along with a variety of attributes. Sheep can be either Rams or Ewes, but the data that is stored about a Ram is virtually identical as that stored about a Ewe. However, using polymorphic mapping, I can have Ram and Ewe objects both stored in the Sheep table, since they are both sub-classes. Furthermore, I can use the gender attribute to determine if the object retrieved from the database should be a Ram or a Ewe. This is much better than anything I was able to do with J2EE, at least as far as I could find.

Because I am using a composite primary key (again, not something that is easy to do with J2EE, without defining a Primary Key class, ugh), and a Sheep can have a .sire and a .dam, which internally is stored with three columns in the table each, and that a sire/dam must be older than the sheep in question (and of particular gender), then some form of restricting which data can be assigned to the sire and dam attributes (which, using SQL Alchemy are references to other objects in the table).

Thus, I can have the following lines in the class definition for Sheep:

1         sire = property(_getSire, _setSire)
2         dam = property(_getDam, _setDam)

The crux of this post is that now, if you attempt to set a sire, using: sheep.sire = otherSheep; then it will pass the contents of otherSheep to the _setSire method of the Sheep class.

This goes even further – it now allows me to either use a Ram object, or a Sheep object (and uses some checking to ensure that it is a Gender=M), or even just a string that is the string representation of a sheep – the value that I am using for the three parts of the primary key. It will then look up the sheep in question, or create a new one if it doesn’t exist in the database.

Thus, a sheep is identified to the real world by three things: it’s flock number, such as 160188, the year it was born, like 2004, and the tag number, such as 040001, which is not unique between flocks, or necessarily years. For the flock this system is designed for, the tag number is unique for sheep of a given century, which would generally suffice, but in a situation where there were more than 9999 sheep born every year in a flock, then you wouldn’t just be able to use the last two digits of the year for the start of the tag number.

Thus a sheep might be declared as 160188-2007-070001, which is a unique identifier of that sheep in the world (or at least Australia!), but having this as a primary key on it’s own means more jigglery would be required to get the flock number and year. By storing as three seperate parts it is possible to keep an easy reference to the flock, and the year. There is no chance of a data integrity error as there might be if I stored the id, then the year and flock number. It would be possible then to change one without the others being udpated.

I’ve also done something quite clever with the mapping of the sire/dam relationship. By having different classes for Ram and Ewe, and mapping sire/dam to these, it is possible to use the same backreference term, offspring, to generate a list of sheep who are offspring of the Ram or Ewe in question. If you don’t use subclasses, then it isn’t nearly as neat.

If you can’t tell, I’m loving SQL Alchemy. I had some issues figuring out what the hell was going wrong with my relations between sheep and sire, but it turned out that you have to use the remote_side argument to the relation function to do what I wanted. Funnily, it was working with dam, but not with sire, but now it’s much more solid. No more circular reference constraints either.

By the way, this project is going to be an online database for livestock information, allowing potential purchasers to research pedigrees online. If anyone else is interested in purchasing it, send me an email. At this stage, it is sheep only, but will be easy to change to, say cattle if they are your thing…