Multi-table Inheritance and the Django Admin

Django’s admin interface is a great way to interact with your models without writing any view code, and, within limits, it’s useful in production too. However, it can quickly get very crowded when you register lots of models.

Consider the situation where you are using Django’s multi-table inheritance:

from django.db import models

from model_utils.managers import InheritanceManager

class Sheep(models.Model):
    sheep_id = models.AutoField(primary_key=True)
    tag_id = models.CharField(max_length=32)
    date_of_birth = models.DateField()
    sire = models.ForeignKey('sheep.Ram', blank=True, null=True, related_name='progeny')
    dam = models.ForeignKey('sheep.Ewe', blank=True, null=True, related_name='progeny')

    objects = InheritanceManager()

    class Meta:
        verbose_name_plural = 'sheep'

    def __str__(self):
        return '{}: {}'.format(self._meta.verbose_name, self.tag_id)


class Ram(Sheep):
    sheep = models.OneToOneField(Sheep, parent_link=True)

    class Meta:
        verbose_name = 'ram'
        verbose_name_plural = 'rams'


class Ewe(Sheep):
    sheep = models.OneToOneField(Sheep, parent_link=True)

    class Meta:
        verbose_name = 'ewe'
        verbose_name_plural = 'ewes'

Ignore the fact there is no specialisation on those child models: in practice you’d normally have some.

Also note that I’ve manually included the primary key, and the parent link fields. This has been done so that the actual columns in the database match, and in this case will all be sheep_id. This will make writing joins slightly simpler, and avoids the (not specific to Django) ORM anti-pattern of “always have a column named id”.

We can use the models like this, but it might be nice to have all sheep in the one admin changelist, and just allow filtering by subclass model.

First, we’ll put some extra stuff onto the parent model, to make obtaining the subclasses simpler. Some of these will use a new decorator, which creates a class version of the @property decorator.

class classproperty(property):
    def __get__(self, cls, owner):
        return self.fget.__get__(None, owner)()


class Sheep(models.Model):
    # Fields, etc. defined as above.

    @classproperty
    @classmethod
    def SUBCLASS_OBJECT_CHOICES(cls):
        "All known subclasses, keyed by a unique name per class."
        return {
          rel.name: rel.related_model
          for rel in cls._meta.related_objects
          if rel.parent_link
        }

    @classproperty
    @classmethod
    def SUBCLASS_CHOICES(cls):
        "Available subclass choices, with nice names."
        return [
            (name, model._meta.verbose_name)
            for name, model in cls.SUBCLASS_OBJECT_CHOICES.items()
        ]

    @classmethod
    def SUBCLASS(cls, name):
        "Given a subclass name, return the subclass."
        return cls.SUBCLASS_OBJECT_CHOICES.get(name, cls)

Note that we don’t need to enumerate the subclasses: adding a new subclass later in development will automatically add it to these properties, even though in this case it would be unlikely to happen.
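To see the classproperty trick in isolation, here is a minimal sketch that runs without Django. The Animal class and its label attribute are hypothetical stand-ins, and direct Python subclasses (via `__subclasses__()`) are used in place of the `_meta.related_objects` lookup, so it only illustrates the mechanism rather than reproducing the model code above.

```python
class classproperty(property):
    """Read-only property that is resolved on the class, not the instance."""

    def __get__(self, instance, owner):
        # self.fget is the wrapped classmethod object: bind it to the
        # class that was accessed, then call it.
        return self.fget.__get__(None, owner)()


class Animal(object):
    # Hypothetical stand-in for Sheep; 'label' plays the role of verbose_name.
    label = 'animal'

    @classproperty
    @classmethod
    def SUBCLASS_OBJECT_CHOICES(cls):
        # Stand-in for the parent_link lookup: direct Python subclasses.
        return {sub.__name__.lower(): sub for sub in cls.__subclasses__()}

    @classproperty
    @classmethod
    def SUBCLASS_CHOICES(cls):
        return sorted(
            (name, sub.label)
            for name, sub in cls.SUBCLASS_OBJECT_CHOICES.items()
        )


class Ram(Animal):
    label = 'ram'


class Ewe(Animal):
    label = 'ewe'


# No parentheses: the value is computed when the attribute is accessed.
print(Animal.SUBCLASS_CHOICES)
```

Because the lookup happens at access time, a subclass defined later shows up without any registration step, which is the same property the model version relies on.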

With these in place, we can write a fairly tidy admin class that makes use of them.

from django import forms
from django.conf.urls import url
from django.contrib import admin
from django.utils.translation import ugettext as _

from .models import Sheep


class SubclassFilter(admin.SimpleListFilter):
    title = _('gender')
    parameter_name = 'gender'

    def lookups(self, request, model_admin):
        return Sheep.SUBCLASS_CHOICES

    def queryset(self, request, queryset):
        if self.value():
            # Keep only rows that have a matching child-table row for this subclass.
            return queryset.exclude(**{self.value(): None})
        return queryset


@admin.register(Sheep)
class SheepAdmin(admin.ModelAdmin):
    list_display = [
        'tag_id',
        'date_of_birth',
        'gender'
    ]
    list_filter = [SubclassFilter]

    def get_queryset(self, request):
        return super(SheepAdmin, self).get_queryset(request).select_subclasses()

    def gender(self, obj):
        return obj._meta.verbose_name

    def get_form(self, request, obj=None, **kwargs):
        if obj is None:
            Model = Sheep.SUBCLASS(request.GET.get('gender'))
        else:
            Model = obj.__class__

        # When we change the selected gender in the create form, we want to reload the page.
        RELOAD_PAGE = "window.location.search='?gender=' + this.value"
        # We should also grab all existing field values, and pass them as query string values.

        class ModelForm(forms.ModelForm):
            if not obj:
                gender = forms.ChoiceField(
                    choices=[('', _('Please select...'))] + Sheep.SUBCLASS_CHOICES,
                    widget=forms.Select(attrs={'onchange': RELOAD_PAGE})
                )

            class Meta:
                model = Model
                exclude = ()

        return ModelForm

    def get_fields(self, request, obj=None):
        # We want gender to be the first field.
        fields = super(SheepAdmin, self).get_fields(request, obj)

        if 'gender' in fields:
            fields.remove('gender')
            fields = ['gender'] + fields

        return fields

    def get_urls(self):
        # We want to install named urls that match the subclass ones, but bounce to the relevant
        # superclass ones (since they should be able to handle rendering the correct form).
        urls = super(SheepAdmin, self).get_urls()
        existing = '{}_{}_'.format(self.model._meta.app_label, self.model._meta.model_name)
        subclass_urls = []
        for name, model in Sheep.SUBCLASS_OBJECT_CHOICES.items():
            opts = model._meta
            replace = '{}_{}_'.format(opts.app_label, opts.model_name)
            subclass_urls.extend([
                url(pattern.regex.pattern, pattern.callback, name=pattern.name.replace(existing, replace))
                for pattern in urls if pattern.name
            ])

        return urls + subclass_urls

Wow. There’s quite a lot going on there, but the summary is:

  • We create a custom filter that filters according to subclass.
  • The .select_subclasses() means that objects are downcast to their subclass when fetched.
  • There is a custom form, that, when in create mode, has a selector for the desired subclass.
  • When the subclass is changed (only on the create form), the page is reloaded. This is required in a situation where there are different fields on each of the subclass models.
  • We register the subclass admin url paths, but use the superclass admin views.
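
The URL aliasing in get_urls() is really just a string replacement on the pattern names. The following sketch shows that step on its own, without Django; the function name subclass_url_names and the sample URL names are assumptions made for illustration.

```python
def subclass_url_names(names, app_label, parent_model, subclass_models):
    """Map each subclass-style URL name to the parent URL name it aliases.

    Hypothetical helper: 'names' is a list of URL pattern names (None for
    unnamed patterns), the other arguments are lowercase name strings.
    """
    existing = '{}_{}_'.format(app_label, parent_model)
    aliases = {}
    for model_name in subclass_models:
        replace = '{}_{}_'.format(app_label, model_name)
        for name in names:
            if name:  # Unnamed patterns are skipped, as in get_urls().
                aliases[name.replace(existing, replace)] = name
    return aliases


aliases = subclass_url_names(
    ['sheep_sheep_changelist', None, 'sheep_sheep_change'],
    app_label='sheep',
    parent_model='sheep',
    subclass_models=['ram', 'ewe'],
)
for alias in sorted(aliases):
    print(alias, '->', aliases[alias])
```

Every alias resolves to the same view as the parent name, which is why the superclass admin has to be able to render the right form for each subclass.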

I’ve had ideas about this for some time, and have just started using something like this in development: in my situation, there will be an arbitrary number of subclasses, all of which will have several new fields. The code in this page is extracted (and changed) from those ideas, so may not be completely correct. Corrections welcome.

(Directly) Testing Django Formsets

Django Forms are excellent: they offer a really nice API for validating user input. You can quite easily pass a dict of data instead of a QueryDict, which is what the request handling mechanism provides. This makes it trivial to write tests that exercise a given Form’s validation directly. For instance:

def test_my_form(self):
    form = MyForm({
        'foo': 'bar',
        'baz': 'qux'
    })
    self.assertFalse(form.is_valid())
    self.assertTrue('foo' in form.errors)

Formsets are also really nice: they expose a neat way to update a group of homogeneous objects. It’s possible to pass a list of dicts to the formset for the initial argument, but, alas, you may not do the same for passing data. Instead, it needs to be structured as the QueryDict would be:

def test_my_formset(self):
    formset = MyFormSet({
        'formset-INITIAL_FORMS': '0',
        'formset-TOTAL_FORMS': '2',
        'formset-0-foo': 'bar1',
        'formset-0-baz': 'qux1',
        'formset-1-foo': 'spam',
        'formset-1-baz': 'eggs'
    })
    self.assertTrue(formset.is_valid())

This is fine if you only have a couple of forms in your formset, but writing out all of the prefixes by hand is tiresome, and the result is far noisier.

Here’s a nice little helper, that takes a FormSet class, and a list (of dicts), and instantiates the formset with the data coerced into the correct format:

def instantiate_formset(formset_class, data, instance=None, initial=None):
    prefix = formset_class().prefix
    formset_data = {}
    for i, form_data in enumerate(data):
        for name, value in form_data.items():
            if isinstance(value, list):
                for j, inner in enumerate(value):
                    formset_data['{}-{}-{}_{}'.format(prefix, i, name, j)] = inner
            else:
                formset_data['{}-{}-{}'.format(prefix, i, name)] = value
    formset_data['{}-TOTAL_FORMS'.format(prefix)] = len(data)
    formset_data['{}-INITIAL_FORMS'.format(prefix)] = 0

    if instance:
        return formset_class(formset_data, instance=instance, initial=initial)
    else:
        return formset_class(formset_data, initial=initial)

This handles a formset or a model formset. Much easier to use:

def test_my_formset(self):
    formset = instantiate_formset(MyFormSet, [
      {
        'foo': 'bar1',
        'baz': 'qux1',
      },
      {
        'foo': 'spam',
        'baz': 'eggs',
      },
    ])
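
The key-building part of the helper can be tried without Django at all. This is a minimal sketch of just that step; the name formset_post_data is an assumption, and the prefix is passed in explicitly rather than read from a formset instance.

```python
def formset_post_data(prefix, data, initial_forms=0):
    """Flatten a list of dicts into the flat key/value shape a formset expects."""
    formset_data = {}
    for i, form_data in enumerate(data):
        for name, value in form_data.items():
            if isinstance(value, list):
                # List values are spread over name_0, name_1, ... keys.
                for j, inner in enumerate(value):
                    formset_data['{}-{}-{}_{}'.format(prefix, i, name, j)] = inner
            else:
                formset_data['{}-{}-{}'.format(prefix, i, name)] = value
    # The management form fields the formset needs.
    formset_data['{}-TOTAL_FORMS'.format(prefix)] = len(data)
    formset_data['{}-INITIAL_FORMS'.format(prefix)] = initial_forms
    return formset_data


flat = formset_post_data('formset', [
    {'foo': 'bar1', 'baz': 'qux1'},
    {'foo': 'spam', 'baz': 'eggs'},
])
for key in sorted(flat):
    print(key, '=', flat[key])
```

Apart from the counts being integers rather than strings, this is the same dict that was written out by hand in the earlier test.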

Using other Python versions with Codeship.

Codeship is pretty cool, other than their requirement to log in to view even public builds. They support Python to some extent, even going as far as creating and activating a virtualenv for your test environment.

However, I like to use tox to do matrix testing against packages, and try to cover as many cases as possible. For instance, for django-boardinghouse, I currently test against:

  • Python 2.7
  • Python 3.3
  • Python 3.4
  • Python 3.5
  • pypy
  • pypy3

…and Django 1.7 through 1.9. In most cases, each version of python should be tested with each version of django. In practice, there are some exceptions.

However, Codeship only have Python 2.7.6 and 3.4.0 installed.

You can run arbitrary code as part of your test/setup, but you can’t install stuff using sudo. Instead, I wrote a script that can be called from within the test setup that installs other pythons:

# We already have some versions of python, but want some more...
cd ~/src

mkdir -p pypy
cd pypy
wget https://bitbucket.org/squeaky/portable-pypy/downloads/pypy-5.0.1-linux_x86_64-portable.tar.bz2
tar --strip-components 1 -xvf pypy-5.0.1-linux_x86_64-portable.tar.bz2
cd ..

mkdir -p pypy3
cd pypy3
wget https://bitbucket.org/squeaky/portable-pypy/downloads/pypy3-2.4-linux_x86_64-portable.tar.bz2
tar --strip-components 1 -xvf pypy3-2.4-linux_x86_64-portable.tar.bz2
cd ..

mkdir -p ~/.local
wget https://www.python.org/ftp/python/3.5.1/Python-3.5.1.tar.xz
tar xvf Python-3.5.1.tar.xz
cd Python-3.5.1
./configure --prefix=/home/$USER/.local/
make
make install

# You actually need to put this line in the tests section. Not sure of a better solution.
export PATH=$PATH:~/src/pypy3/bin:~/src/pypy/bin:~/.local/bin/

I have this as a reusable snippet on BitBucket: codeship helper scripts. However, as mentioned, you need to grab the export PATH=... section and stick that in the tests section. Notably, you also get a different URL for the raw version of each revision, which is actually really good, because it means someone cannot change the code between you inspecting it and executing it.

In my case, I have a line in the test setup:

curl https://bitbucket.org/\!api/2.0/snippets/schinckel/oKXKy/c7cc02bcd96d4a8f444cd997d5c3bc0bb92106d6/files/install-python.sh | sh

Also of note is that pypy* have a pre-built version, which is much faster than building from source, however there doesn’t seem to be a non-rpm version of Python 3.5.

Token Input Widget

I spent a fair bit of time on the weekend tweaking a Token Input widget.

I really like the way I show/hide the editing field, using no JS at all. That’s not to say there isn’t quite a bit of JS, but I am happy to say it’s using no jQuery, Knockout or anything else.

Indeed, I first wrote it in Knockout, but I think this one is actually simpler (and has some behaviour I wasn’t able to obtain with my Knockout models).

Django Trees via Closure View

After writing up a method of using a Postgres View that generates a materialised path within the context of a Django model, I came across some queries of my data that were getting rather troublesome to write. It occurred to me that having a closure table would be useful. Specifically, I needed all of the descendants of a given set of nodes.

I couldn’t find an existing Postgres extension that will manage the closure table, and didn’t feel like writing my own implementation using triggers just yet. However, it occurred to me that I could use a similar trick to the recursive materialised path view. Thus, we have a Closure View.

We will start with the Django models:

class Node(models.Model):
    node_id = models.AutoField(primary_key=True)
    parent = models.ForeignKey('tree.Node', related_name='children', null=True, blank=True)

    descendants = models.ManyToManyField('tree.Node', related_name='ancestors', through='tree.Closure')

    class Meta:
        app_label = 'tree'


class Closure(models.Model):
    path = ArrayField(base_field=models.IntegerField(), primary_key=True)
    ancestor = models.ForeignKey('tree.Node', related_name='+')
    descendant = models.ForeignKey('tree.Node', related_name='+')
    depth = models.IntegerField()

    class Meta:
        app_label = 'tree'
        managed = False

You may notice I have a path column. I’m using this for the primary key, and it may turn out to be useful later.

Let’s have a look at the View:

CREATE RECURSIVE VIEW tree_closure(path, ancestor_id, descendant_id, depth) AS

SELECT ARRAY[node_id], node_id, node_id, 0 FROM tree_node

UNION ALL

SELECT parent_id || path, parent_id, descendant_id, depth + 1
FROM tree_node INNER JOIN tree_closure ON (ancestor_id = node_id)
WHERE parent_id IS NOT NULL;

This uses a recursive query. The first part builds the self-reference relations (every node is its own ancestor, at depth 0). The second part joins tree_node back to the view itself: for each row already in the view (or added in a previous iteration of this part), it adds a row linking the parent of that row’s ancestor to the same descendant, one level deeper.
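
To make the iteration concrete, here is a rough Python equivalent of what the recursive view computes, assuming the adjacency list is available as a plain dict of node id to parent id. The function name closure_rows and the four-node sample tree are assumptions for illustration only.

```python
def closure_rows(parents):
    """Return (path, ancestor, descendant, depth) rows for a parent mapping."""
    # Non-recursive term: every node is its own ancestor at depth 0.
    rows = [((node,), node, node, 0) for node in sorted(parents)]
    frontier = rows
    while frontier:
        new_rows = []
        for path, ancestor, descendant, depth in frontier:
            parent = parents[ancestor]
            if parent is not None:
                # Recursive term: step up from the current ancestor to its parent.
                new_rows.append(((parent,) + path, parent, descendant, depth + 1))
        rows = rows + new_rows
        frontier = new_rows  # Only rows added in this pass feed the next one.
    return rows


rows = closure_rows({1: None, 2: 1, 3: 1, 4: 2})
descendants_of_1 = sorted(d for _, a, d, _ in rows if a == 1)
print(len(rows), descendants_of_1)
```

Note that the self-reference rows mean a node appears among its own descendants (and ancestors), which is worth keeping in mind when using the many-to-many accessors.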

Now, because we are using the in-built Django Many to Many features, we have some nice queries ready to go:

  • node.ancestors.all() : All ancestors of a given Node instance.
  • node.descendants.all() : All descendants of a given Node instance.
  • Node.objects.filter(ancestors=queryset) : All descendants of all nodes in a queryset.
  • Node.objects.filter(descendants=queryset) : All ancestors of all nodes in a queryset.

Of particular note are the bottom two: these are rather cumbersome to write in older versions of Django.

Adjacency Lists in Django with Postgres

Today, I’m going to walk through modelling a tree in Django, using an Adjacency List, and a Postgres View that dynamically creates the materialised path of ancestors for each node.

With this, we will be able to query the tree for a range of operations using the Django ORM.

We will start with our model:

class Node(models.Model):
    node_id = models.AutoField(primary_key=True)
    parent = models.ForeignKey('tree.node', related_name='children', null=True, blank=True)

    class Meta:
        app_label = 'tree'

We will also build an unmanaged model that will be backed by our view.

from django.contrib.postgres.fields import ArrayField

class Tree(models.Model):
    root = models.ForeignKey(Node, related_name='+')
    node = models.OneToOneField(Node, related_name='tree_node', primary_key=True)
    ancestors = ArrayField(base_field=models.IntegerField())

    class Meta:
        app_label = 'tree'
        managed = False

You’ll notice I’ve included a root relation. This could be obtained by using ancestors[0] if ancestors else node_id, but that’s a bit cumbersome.

So, on to the View:

CREATE RECURSIVE VIEW tree_tree(root_id, node_id, ancestors) AS

SELECT node_id, node_id, ARRAY[]::INTEGER[]
FROM tree_node WHERE parent_id IS NULL

UNION ALL

SELECT tree.root_id, node.node_id, tree.ancestors || node.parent_id
FROM tree_node node INNER JOIN tree_tree tree ON (node.parent_id = tree.node_id)

I’ve written this view before, so I won’t go into any detail.

We can create a tree. Normally I wouldn’t specify the primary key, but since we want to talk about those values shortly, I will. It also means you can delete them, and recreate with this code, and not worry about the sequence values.

from tree.models import Node

Node.objects.bulk_create([
  Node(pk=1),
  Node(pk=2, parent_id=1),
  Node(pk=3, parent_id=1),
  Node(pk=4, parent_id=2),
  Node(pk=5, parent_id=2),
  Node(pk=6, parent_id=3),
  Node(pk=7, parent_id=3),
  Node(pk=8, parent_id=4),
  Node(pk=9, parent_id=8),
  Node(pk=10),
  Node(pk=11, parent_id=10),
  Node(pk=12, parent_id=11),
  Node(pk=13, parent_id=11),
  Node(pk=14, parent_id=12),
  Node(pk=15, parent_id=12),
  Node(pk=16, parent_id=12),
])

Okay, let’s start looking at how we might perform some operations on it.

We’ve already seen how to create a node, either root or leaf nodes. No worries there.

What about inserting an intermediate node, say between 11 and its existing children (12 and 13)?

node = Node.objects.create(parent_id=11)
node.parent.children.exclude(pk=node.pk).update(parent=node)

I’m not sure if it is possible to do it in a single statement.

Okay, let’s jump to some tree-based statements. We’ll start by finding a sub-tree.

Node.objects.filter(tree_node__ancestors__contains=[2])

Oh, that’s pretty nice. It’s not necessarily sorted, but it will do for now.

We can also query directly for a root:

Node.objects.filter(tree_node__root=10)

We could spell that one as tree_node__ancestors__0=10, but I think this is more explicit. Also, that one will not include the root node itself.

Deletions are also simple: if we can build a queryset, we can delete it. Thus, deleting a full tree could be done by following any queryset by a .delete()

Fetching a node’s ancestors is a little trickier: because we only have an array of node ids, it takes two queries.

Node.objects.filter(pk__in=Node.objects.get(pk=15).tree_node.ancestors)

The count of ancestors doesn’t require the second query:

len(Node.objects.get(pk=15).tree_node.ancestors)

Getting ancestors to a given depth is also simple, although it still requires two queries:

Node.objects.filter(pk__in=Node.objects.get(pk=15).tree_node.ancestors[-2:])

This is a fairly simple way to enable relatively performance-aware queries of tree data. There are still places where it’s not perfect, and in reality, you’d probably look at building up queryset or model methods for wrapping common operations.
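
As a sanity check on what those queries should return, the same operations can be mirrored in plain Python against the sample tree. This is only a sketch: the dict PARENTS restates the bulk_create data, and the helper names are assumptions, not part of the models above.

```python
# node id -> parent id, matching the sample tree created earlier.
PARENTS = {
    1: None, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3, 8: 4, 9: 8,
    10: None, 11: 10, 12: 11, 13: 11, 14: 12, 15: 12, 16: 12,
}


def ancestors_of(node, parents):
    """Materialised path for one node, root first (like the view's array)."""
    path = []
    parent = parents[node]
    while parent is not None:
        path.insert(0, parent)
        parent = parents[parent]
    return path


TREE = {node: ancestors_of(node, PARENTS) for node in PARENTS}


def subtree(node_id):
    # Mirrors tree_node__ancestors__contains=[node_id].
    return sorted(n for n, anc in TREE.items() if node_id in anc)


def with_root(root_id):
    # Mirrors tree_node__root=root_id (the root itself is included).
    return sorted(n for n, anc in TREE.items() if (anc[0] if anc else n) == root_id)


print(subtree(2))
print(with_root(10))
print(TREE[15], len(TREE[15]), TREE[15][-2:])
```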

Postgres Tree Shootout part 3: Adjacency List using Views

It’s been a while, but I’ve finally found some time to revisit this series. As promised last time, I’m going to rewrite the queries from the Adjacency List “solutions” using a View. Indeed, there will be two versions of the view, one of which is a MATERIALIZED VIEW. There will also be discussion of when each type of view might be the better choice.

One of the reasons this post took so long to write was that I was sidetracked by writing an SVG generator that would allow for graphically seeing what the different operations discussed in this series look like in terms of an actual tree. That didn’t eventuate.

We will start by defining what our tree view will actually look like. You’ll notice it is rather like the CTE that we saw in the previous post.

CREATE TABLE nodes (
  node_id SERIAL PRIMARY KEY,
  parent_id INTEGER REFERENCES nodes(node_id)
);

CREATE RECURSIVE VIEW tree (node_id, ancestors) AS (
  SELECT node_id, ARRAY[]::integer[] AS ancestors
  FROM nodes WHERE parent_id IS NULL

  UNION ALL

  SELECT nodes.node_id, tree.ancestors || nodes.parent_id
  FROM nodes, tree
  WHERE nodes.parent_id = tree.node_id
);

INSERT INTO nodes VALUES
  (1, NULL),
  (2, 1),
  (3, 1),
  (4, 2),
  (5, 2),
  (6, 3),
  (7, 3),
  (8, 4),
  (9, 8),
  (10, NULL),
  (11, 10),
  (12, 11),
  (13, 11),
  (14, 12),
  (15, 12),
  (16, 12);

Insertions

All of the insertions do not require access to the tree view, since the beauty of an Adjacency List model is that you only ever need to operate on the immediate parent-child.

Removals

Similarly, we will skip over the simple operations: those don’t require access to any more of the tree than just the parent-child relationship. It’s not until we need to remove a subtree that it becomes interesting.

DELETE FROM nodes
WHERE node_id IN (
  SELECT node_id FROM tree WHERE 2 = ANY(ancestors)
) OR node_id = 2;

If you are paying attention, you will notice that this is virtually identical to the CTE version, except that we no longer need to redeclare the CTE each time. The full tree deletion is the same, as is removing all descendants:

DELETE FROM nodes
WHERE node_id IN (
  SELECT node_id FROM tree WHERE 2 = ANY(ancestors)
);

Moves

Again, the operations that don’t require the actual tree are unchanged: this is where the Adjacency List really shines.

Fetches

Since we are starting with the “full” tree, we should be able to use it for all of the queries. It is possible that these queries (unlike those we have seen before) may be slightly slower than the CTE version (specifically, those where the CTE is customised for that operation).

Descendants

Let’s get all of node 10’s descendants:

SELECT node_id FROM tree WHERE 10 = ANY(ancestors);

This query is far less complicated than the CTE version, as expected. However, when dealing with very large datasets, it performs far worse. I have a data set with 221000 nodes, in 1001 different trees. Performing this query takes around 5 seconds, but the customised CTE version takes about 750ms.

Turning this view into a materialised view:

CREATE MATERIALIZED VIEW tree_mat AS
SELECT node_id, ancestors FROM tree;

and then querying that turns this into around 75ms.

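Limiting the query to descendants within a given depth requires slightly more work. Here, we keep only nodes that are at most two levels below node 10 (its children and grandchildren):

SELECT node_id, ancestors FROM tree
WHERE ARRAY_POSITION(ancestors, 10) > ARRAY_LENGTH(ancestors, 1) - 2;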
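
The arithmetic relating array position to depth is easy to get backwards, so here is a small Python check of it. It assumes the ancestors array is stored root first, as the view builds it; the function names are hypothetical.

```python
def relative_depth(ancestors, node_id):
    """How many levels below node_id a row sits, or None if it is not below it."""
    if node_id not in ancestors:
        return None
    position = ancestors.index(node_id) + 1  # ARRAY_POSITION is 1-based.
    return len(ancestors) - position + 1


def within_two_levels(ancestors, node_id):
    # Same comparison as: ARRAY_POSITION(...) > ARRAY_LENGTH(...) - 2
    if node_id not in ancestors:
        return False
    return ancestors.index(node_id) + 1 > len(ancestors) - 2


# Rows from the sample data: node 11, node 12 and node 14.
for ancestors in ([10], [10, 11], [10, 11, 12]):
    print(ancestors, relative_depth(ancestors, 10), within_two_levels(ancestors, 10))
```

A child of node 10 has relative depth 1, a grandchild 2, and so on, so "position greater than length minus two" keeps exactly the first two levels.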

Ancestors

Fetching ancestors of a node is again trivial:

SELECT unnest(ancestors) FROM tree WHERE node_id = 15;

And the count of ancestors:

SELECT ARRAY_LENGTH(ancestors, 1) FROM tree WHERE node_id=15;

Getting a set of ancestors to a given depth is actually a little tricky: because we can’t just reverse the end that we add the parent node to the ancestors array, we can’t use that trick. We’ll have to enumerate the rows, and then extract those we care about. You can’t use OFFSET with a variable, otherwise that would be a nice trick.

WITH ancestors AS (
  SELECT unnest(ancestors) AS node_id
  FROM tree
  WHERE node_id = 15
), enumerated AS (
  SELECT
    row_number() OVER () AS row,
    count(*) OVER () AS ancestor_count,
    node_id
  FROM ancestors
)
SELECT node_id
FROM enumerated
WHERE "row" > ancestor_count - 2;

Ugh. That’s way worse than the CTE version.

Special queries

None of the special queries access the tree in any way, so can be omitted for now.

Discussion

So how does using a view stack up to the ad-hoc CTE queries?

Mostly pretty well. In the case where you have only small data sets, then the view that builds up the complete tree each time is not that much of a problem (I ran some tests with tens of thousands of items, and it still performed relatively well). When it moves up to hundreds of thousands, then the ad-hoc CTE versions can greatly outperform the full tree view.

However, using a materialised view changes everything. It now becomes just as fast as querying a table: indeed, that’s just what it is. You could have triggers based on changes to the nodes table causing a REFRESH MATERIALIZED VIEW, but it is worth keeping in mind that this will take some time: in my case, a full refresh of 221000 rows took upwards of 4.5 seconds.

Using a materialised view gets us most of the way to (and leads nicely into) the next method: storing a materialised path. The similarity of the names here should be a trigger, but now I’m just making foreshadowing jokes.

Tree data as a nested list

One of the nice things about Adjacency Lists as a method of storing tree structures is that there is not much redundancy: you only store a reference to the parent, and that’s it.

It does mean that getting that data in a nested object is a bit complicated. I’ve written before about getting data out of a database, and I’m sure I’ll revisit that, but for now I’m going to deal with data that has already been built up into a Materialized Path, and has the following shape:

[
  {
    "node": 1,
    "ancestors": [],
    "label": "Australia"
  },
  {
    "node": 2,
    "ancestors": [1],
    "label": "South Australia"
  },
  {
    "node": 3,
    "ancestors": [1],
    "label": "Victoria"
  },
  {
    "node": 4,
    "ancestors": [1, 2],
    "label": "South-East"
  },
  {
    "node": 5,
    "ancestors": [1, 3],
    "label": "Western Districts"
  },
  {
    "node": 6,
    "ancestors": [],
    "label": "New Zealand"
  },
  {
    "node": 7,
    "ancestors": [1, 2],
    "label": "Barossa Valley"
  },
  {
    "node": 8,
    "ancestors": [1, 2],
    "label": "Riverland"
  }
]

From here, we want to build up something that looks like:

  • Australia
    • South Australia
      • Barossa Valley
      • Riverland
      • South-East
    • Victoria
      • Western Districts
  • New Zealand

Or, a nested python data structure:

[
  ('Australia', [
    ('South Australia', [
      ('Barossa Valley', []),
      ('Riverland', []),
      ('South-East', [])
    ]),
    ('Victoria', [
      ('Western Districts', [])
    ])
  ]),
  ('New Zealand', [])
]

You’ll see that each node is a 2-tuple, and each set of siblings is a list. Even a node with no children still gets an empty list.

We can build up this data structure in two steps: based on the fact that a dict, as key-value pairs, matches a 2-tuple. That is, we will start by creating:

{
  1: {
    2: {
      4: {},
      7: {},
      8: {},
    },
    3: {
      5: {},
    }
  },
  6: {},
}

You might be reaching for python’s defaultdict class at this point, but there is a slightly nicer way:

class Tree(dict):
    def __missing__(self, key):
        value = self[key] = type(self)()
        return value

(Note: This class, and the seed of the idea, came from this answer on StackOverflow).

We can also create a recursive method on this class that creates a node and all of its ancestors:

    def insert(self, key, ancestors):
        if ancestors:
            self[ancestors[0]].insert(key, ancestors[1:])
        else:
            self[key]

>>> tree = Tree()
>>> for node in data:
...     tree.insert(node['node'], node['ancestors'])
>>> print tree
{1: {2: {8: {}, 4: {}, 7: {}}, 3: {5: {}}}, 6: {}}

Looking good.

Let’s make another method that allows us to actually insert the labels (and apply a sort, if we want):

    def label(self, label_dict, sort_key=lambda x: x[0]):
        return sorted([
          (label_dict.get(key), value.label(label_dict, sort_key))
          for key, value in self.items()
        ], key=sort_key)

We also need to build up the simple key-value store to pass as label_dict, but that’s pretty easy.

Let’s look at the full code: first the complete class:

class Tree(dict):
    """Simple Tree data structure

    Stores data in the form:

    {
        "a": {
            "b": {},
            "c": {},
        },
        "d": {
            "e": {},
        },
    }

    And can be nested to any depth.
    """

    def __missing__(self, key):
        value = self[key] = type(self)()
        return value

    def insert(self, node, ancestors):
        """Insert the supplied node, creating all ancestors as required.

        This expects a list (possibly empty) containing the ancestors,
        and a value for the node.
        """
        if not ancestors:
            self[node]
        else:
            self[ancestors[0]].insert(node, ancestors[1:])

    def label(self, labels, sort_key=lambda x: x[0]):
        """Return a nested 2-tuple with just the supplied labels.

        Realistically, the labels could be any type of object.
        """
        return sorted([
            (
                labels.get(key),
                value.label(labels, sort_key)
            ) for key, value in self.items()
        ], key=sort_key)

Now, using it:

>>> tree = Tree()
>>> labels = {}
>>>
>>> for node in data:
...     tree.insert(node['node'], node['ancestors'])
...     labels[node['node']] = node['label']
>>>
>>> from pprint import pprint
>>> pprint(tree.label(labels))

[('Australia',
  [('South Australia',
    [('Barossa Valley', []), ('Riverland', []), ('South-East', [])]),
   ('Victoria', [('Western Districts', [])])]),
 ('New Zealand', [])]

Awesome. Now use your template rendering of choice to turn this into a nicely formatted list.
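
If you don’t have a template engine handy, a few lines of recursion will do for plain text. This is a minimal sketch: the render function and the choice of two-space indentation with "*" bullets are assumptions, and the nested list is the output shown above.

```python
def render(nodes, indent=0):
    """Turn nested (label, children) 2-tuples into indented bullet lines."""
    lines = []
    for label, children in nodes:
        lines.append('{}* {}'.format('  ' * indent, label))
        # Children are rendered one level deeper, directly under their parent.
        lines.extend(render(children, indent + 1))
    return lines


nested = [
    ('Australia', [
        ('South Australia', [
            ('Barossa Valley', []),
            ('Riverland', []),
            ('South-East', []),
        ]),
        ('Victoria', [
            ('Western Districts', []),
        ]),
    ]),
    ('New Zealand', []),
]

print('\n'.join(render(nested)))
```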

slugify() for postgres (almost)

A recent discussion in #django suggested “what we need is a PG slugify function”.

The actual algorithm in Django for this is fairly simple, and easy to follow. Shouldn’t be too hard to write it in SQL.

Function slugify(value, allow_unicode=False).

  • Convert to ASCII if allow_unicode is false
  • Remove characters that aren’t alphanum, underscores, hyphens
  • Strip leading/trailing whitespace
  • Convert to lowercase
  • Convert spaces to hyphens
  • Remove repeated hyphens

(As an aside, the comment in the django function is slightly misleading: if you followed the algorithm there, you’d get a different result with respect to leading/trailing whitespace. I shall submit a PR).
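
For reference while reading the SQL, the steps listed above can be sketched in Python 3 roughly as follows. This is modelled on the list, not copied from Django, and it assumes Python 3, where \w matches Unicode word characters by default. The ASCII conversion here uses NFKD normalisation, which drops characters that have no ASCII decomposition instead of transliterating them, so it will not always agree with an unaccent-based approach.

```python
import re
import unicodedata


def slugify(value, allow_unicode=False):
    """Sketch of the slugify steps; not Django's actual implementation."""
    if allow_unicode:
        value = unicodedata.normalize('NFKC', value)
    else:
        # Decompose accented characters, then drop anything non-ASCII.
        value = unicodedata.normalize('NFKD', value)
        value = value.encode('ascii', 'ignore').decode('ascii')
    # Remove unwanted characters, then strip and lowercase.
    value = re.sub(r'[^\w\s-]', '', value).strip().lower()
    # Collapse runs of whitespace and hyphens into a single hyphen.
    return re.sub(r'[-\s]+', '-', value)


print(slugify('Hello, World!'))
print(slugify('    foo ıç bar', allow_unicode=True))
```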

We can write an SQL function that uses the Postgres unaccent extension to get pretty close:

CREATE OR REPLACE FUNCTION slugify("value" TEXT, "allow_unicode" BOOLEAN)
RETURNS TEXT AS $$

  WITH "normalized" AS (
    SELECT CASE
      WHEN "allow_unicode" THEN "value"
      ELSE unaccent("value")
    END AS "value"
  ),
  "remove_chars" AS (
    SELECT regexp_replace("value", E'[^\\w\\s-]', '', 'gi') AS "value"
    FROM "normalized"
  ),
  "lowercase" AS (
    SELECT lower("value") AS "value"
    FROM "remove_chars"
  ),
  "trimmed" AS (
    SELECT trim("value") AS "value"
    FROM "lowercase"
  ),
  "hyphenated" AS (
    SELECT regexp_replace("value", E'[-\\s]+', '-', 'gi') AS "value"
    FROM "trimmed"
  )
  SELECT "value" FROM "hyphenated";

$$ LANGUAGE SQL STRICT IMMUTABLE;

I’ve used a CTE to get each step as a separate query: you can do it with just two levels if you don’t mind looking at nested function calls:

CREATE OR REPLACE FUNCTION slugify("value" TEXT, "allow_unicode" BOOLEAN)
RETURNS TEXT AS $$

  WITH "normalized" AS (
    SELECT CASE
      WHEN "allow_unicode" THEN "value"
      ELSE unaccent("value")
    END AS "value"
  )
  SELECT regexp_replace(
    trim(
      lower(
        regexp_replace(
          "value",
          E'[^\\w\\s-]',
          '',
          'gi'
        )
      )
    ),
    E'[-\\s]+', '-', 'gi'
  ) FROM "normalized";

$$ LANGUAGE SQL STRICT IMMUTABLE;

To get the default value for the second argument, we can have an overloaded version with only a single argument:

CREATE OR REPLACE FUNCTION slugify(TEXT)
RETURNS TEXT AS 'SELECT slugify($1, false)' LANGUAGE SQL IMMUTABLE STRICT;

Now for some tests. I’ve been using pgTAP lately, so here are some tests using that:

BEGIN;

SELECT plan(7);

SELECT is(slugify('Hello, World!', false), 'hello-world');
SELECT is(slugify('Héllø, Wørld!', false), 'hello-world');
SELECT is(slugify('spam & eggs', false), 'spam-eggs');
SELECT is(slugify('spam & ıçüş', true), 'spam-ıçüş');
SELECT is(slugify('foo ıç bar', true), 'foo-ıç-bar');
SELECT is(slugify('    foo ıç bar', true), 'foo-ıç-bar');
SELECT is(slugify('你好', true), '你好');

SELECT * FROM finish();

ROLLBACK;

And we get one failing test:

=# SELECT is(slugify('你好', true), '你好');

          is
──────────────────────
 not ok 7            ↵
 # Failed test 7     ↵
 #         have:     ↵
 #         want: 你好
(1 row)

Time: 2.004 ms

It seems there is no way to get the equivalent of the Python re.U flag on a Postgres regular expression function, so that is as close as we can get.
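For comparison, here is a minimal Python sketch that follows the same steps as the SQL function. It is not Django's implementation: the NFKD/ASCII step is only a rough stand-in for unaccent() (my assumption), and it will not transliterate every character that unaccent() does. It does show the Unicode-aware \w behaviour that the failing test relies on.

```python
import re
import unicodedata


def slugify(value, allow_unicode=False):
    """Mirror the SQL pipeline step by step, in Python."""
    if allow_unicode:
        # Keep non-ASCII characters, only normalising their representation.
        value = unicodedata.normalize("NFKC", value)
    else:
        # Rough stand-in for unaccent(): decompose, then drop non-ASCII.
        value = unicodedata.normalize("NFKD", value)
        value = value.encode("ascii", "ignore").decode("ascii")
    # "remove_chars": Python 3 str patterns are Unicode-aware by default,
    # so \w also matches word characters outside ASCII.
    value = re.sub(r"[^\w\s-]", "", value)
    # "lowercase" and "trimmed"
    value = value.lower().strip()
    # "hyphenated"
    return re.sub(r"[-\s]+", "-", value)
```

With this version the last pgTAP case has a Python counterpart that passes, because Python's \w treats those characters as word characters.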

Row Level Security in Postgres and Django

Postgres keeps introducing new things that pique my attention. One of the latest of these is Row Level Security, which essentially hides rows that a given database user is not permitted to view. There’s a bit more to it than that: a good writeup is Postgres 9.5 feature highlight: Row-Level Security and Policies.

However, it’s worth noting that the way Django connects to a database uses a single database user. Indeed, if your users table is in your database, then you’ll need some way to connect to it to authenticate. I haven’t come up with a nice way to use the Postgres users for authentication within Django just yet.

I did have an idea about a workflow that may just work.

  • Single Postgres User is used for authentication (Login User).
  • Every Django user gets an associated Postgres User (Session User), that may not log in.
  • This Session User is automatically created using a Postgres trigger, whenever the Django users table is updated.
  • After authentication, a SET SESSION ROLE (or SET SESSION AUTHORIZATION) statement is used to change to the correct Session User for the remainder of the session.

Then, we can implement the Postgres Row Level Security policies as required.

Initially, I had thought that perhaps the Session Users would have the same level of access as the Login User. However, tonight it occurred to me that perhaps this could replace the whole Django permissions concept.


We do have a few things we need to work out before this is a done deal.

  • The trigger function that performs the CREATE ROLE statement when the django users table is updated.
  • Some mechanism of handling GRANT and REVOKE statements.
  • Similarly, some mechanism for showing current permissions for the given user.
  • A middleware that sets the SESSION USER according to the django user.

The simplest part of this is the last one, so we will start there. We can (in the meantime) manually create the users and their permissions to see how well it all goes. No point doing work that doesn’t work.

from django.db import connection


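class SetSessionAuthorization(object):
    def process_view(self, request, *args, **kwargs):
        # Only switch roles for authenticated users: AnonymousUser has no pk.
        if request.user.pk:
            # With the default auth_user table the pk is an integer, so
            # formatting it into the quoted role name is safe here.
            connection.cursor().execute(
                'SET SESSION SESSION AUTHORIZATION "django:{}"'.format(request.user.pk)
            )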

We need to add this to our project’s middleware.

You’ll see we are using roles of the form django:<id>, which need to be quoted. We use the user id rather than the username, because usernames may be changed.
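As a small sketch of that naming scheme, the helpers below build the role name and quote it as a Postgres identifier. The function names are mine, not part of Django or any library. I have also included the matching RESET statement: my assumption is that, if connections are reused between requests, you would want to run it once the response is done so that the next request does not inherit the previous user's role.

```python
def role_name(user_pk):
    """Name of the shadow Postgres role for a Django user id."""
    # int() raises for values such as 'abc', so arbitrary text never
    # ends up inside the SQL statement.
    return "django:{}".format(int(user_pk))


def quote_ident(name):
    """Quote an identifier: wrap it in double quotes and double any
    double quotes it contains."""
    return '"{}"'.format(name.replace('"', '""'))


def set_authorization_sql(user_pk):
    """Statement that switches the session to the user's shadow role."""
    return "SET SESSION SESSION AUTHORIZATION {}".format(
        quote_ident(role_name(user_pk))
    )


# Statement that hands the connection back to the Login User.
RESET_AUTHORIZATION_SQL = "RESET SESSION AUTHORIZATION"
```

The middleware could then execute set_authorization_sql(request.user.pk) instead of building the string inline.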

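We’ll want to create a user for each of the existing Django users: I currently have a single user in this database, with id 1. I also have an existing SUPERUSER with the name django. We need to connect as a superuser if we are using SET SESSION AUTHORIZATION, which seems to be the better option. I haven’t found anything which really does a good job of explaining how this and SET SESSION ROLE differ. As far as I can tell from the Postgres documentation, SET SESSION AUTHORIZATION changes both session_user and current_user, while SET ROLE changes only current_user and requires membership of the target role.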

CREATE USER "django:1" NOLOGIN;
GRANT "django:1" TO django;
GRANT ALL ON ALL TABLES IN SCHEMA public TO public;
GRANT ALL ON ALL SEQUENCES IN SCHEMA public TO public;

Note that, for now, we have enabled full access to all tables and sequences. This will remain until we find a good way to handle permissions.
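With more than one existing user, typing those first two statements by hand gets tedious. Here is a minimal sketch that generates them for a list of user ids. The function name and the login_role parameter are my own, and the login role defaults to the django superuser used above.

```python
def shadow_role_statements(user_ids, login_role="django"):
    """Build the CREATE USER / GRANT statements for existing users."""
    statements = []
    for user_id in user_ids:
        # int() guards against anything other than a numeric id.
        role = '"django:{}"'.format(int(user_id))
        statements.append("CREATE USER {} NOLOGIN".format(role))
        statements.append("GRANT {} TO {}".format(role, login_role))
    return statements
```

In a Django shell the ids could come from something like User.objects.values_list('pk', flat=True), and the resulting statements would need to be run by a role that is allowed to create roles.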

We can start up a project using this, and see if it works. Unless I’ve missed something, it should.

Next, we will turn on row-level security for the auth_user table, and see if it works.

ALTER TABLE auth_user ENABLE ROW LEVEL SECURITY;

Then try to view the list of users (even as a django superuser). You should see an empty list.

We’ll turn on the ability to see our own user object:

CREATE POLICY read_own_data ON auth_user FOR
SELECT USING ('django:' || id = current_user);

Phew, that was close. Now we can view our user.

However, we can’t update it. Let’s fix that:

CREATE POLICY update_own_user_data ON auth_user FOR
UPDATE USING ('django:' || id = current_user)
WITH CHECK ('django:' || id = current_user);

We should be able to do some magic there to prevent a user toggling their own superuser status.

Let’s investigate writing a trigger function that creates a new ROLE when we create a Django user.

CREATE OR REPLACE FUNCTION create_shadow_role()
RETURNS TRIGGER AS $$

BEGIN

  EXECUTE 'CREATE USER "django:' || NEW.id || '" NOLOGIN';
  EXECUTE 'GRANT "django:' || NEW.id || '" TO django';

  RETURN NULL;

END;

$$
LANGUAGE plpgsql
SECURITY DEFINER
SET search_path = public, pg_temp
VOLATILE;


CREATE TRIGGER create_shadow_role
  AFTER INSERT ON auth_user
  FOR EACH ROW
  EXECUTE PROCEDURE create_shadow_role();

Note we still can’t create users from the admin (due to the RLS restrictions that are there), so we need to resort to ./manage.py createsuperuser again.

Having done that, we should see that our new user gets a ROLE:

# \du
                                        List of roles
 Role name │                         Attributes                         │      Member of
───────────┼────────────────────────────────────────────────────────────┼─────────────────────
 django    │ Superuser                                                  │ {django:1,django:6}
 django:1  │ Cannot login                                               │ {}
 django:6  │ Cannot login                                               │ {}
 matt      │ Superuser, Create role, Create DB                          │ {}
 postgres  │ Superuser, Create role, Create DB, Replication, Bypass RLS │ {}

We should be able to write similar triggers for update. We can, for example, shadow the Django is_superuser attribute to the Postgres SUPERUSER attribute. I’m not sure if that’s a super idea or not.

But we can write a simple function that allows us to see if the current django user is a superuser:

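CREATE FUNCTION is_superuser()
RETURNS BOOLEAN AS $$

-- SECURITY DEFINER: runs as the function owner, which needs to bypass RLS
-- on auth_user; otherwise the policies below would call back into this
-- function while it reads auth_user. Inside the function current_user is
-- the owner, so we match on session_user, which SET SESSION AUTHORIZATION
-- also changes.
SELECT is_superuser
FROM auth_user
WHERE 'django:' || id = session_user

$$ LANGUAGE SQL STABLE SECURITY DEFINER
SET search_path = public, pg_temp;

We can now use this to allow superuser access to all records. Note the parentheses: a bare is_superuser inside a policy on auth_user would refer to the column of the row being checked, not to our function.

CREATE POLICY superuser_user_select ON auth_user
FOR SELECT USING (is_superuser());

CREATE POLICY superuser_user_update ON auth_user
FOR UPDATE USING (is_superuser())
WITH CHECK (is_superuser());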
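To check my own understanding of what the SELECT policies are meant to add up to, here is a toy Python model. It is only a sketch of the intended behaviour, assuming that Postgres combines these policies with OR: a row is visible if any one of them passes. Nothing here talks to a database, and the helper names are mine.

```python
def role_for(row):
    """Shadow role name for a row of auth_user."""
    return "django:{}".format(row["id"])


def visible_rows(rows, current_user):
    """Rows of auth_user that current_user should be able to SELECT."""
    # Intent of the is_superuser function: the flag on the current
    # user's own row.
    current_is_superuser = any(
        row["is_superuser"] for row in rows if role_for(row) == current_user
    )

    def read_own_data(row):
        return role_for(row) == current_user

    def superuser_user_select(row):
        return current_is_superuser

    # A row is visible when at least one policy allows it.
    policies = [read_own_data, superuser_user_select]
    return [row for row in rows if any(policy(row) for policy in policies)]


users = [
    {"id": 1, "is_superuser": True},
    {"id": 6, "is_superuser": False},
]
```

Here visible_rows(users, "django:1") returns both rows, while visible_rows(users, "django:6") returns only the row for user 6.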

That gives us a fair bit of functionality. We still don’t have any mechanism for viewing or setting permissions. Because of the way Django’s permissions work, we can’t quite use the same trick on the auth_user_user_permissions table alone, because we’d also need to look at the auth_user_groups and auth_group_permissions tables.
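To make that concrete, here is a small Python sketch of how the three tables combine. It models each table as a plain dict, which is purely an illustration on my part, and it ignores anything else Django does when checking permissions. A user's effective permissions are their direct ones plus those of every group they belong to.

```python
def effective_permissions(user_id, user_permissions, user_groups,
                          group_permissions):
    """Union of a user's direct permissions and their groups' permissions.

    user_permissions:  {user_id: permission ids}  (auth_user_user_permissions)
    user_groups:       {user_id: group ids}       (auth_user_groups)
    group_permissions: {group_id: permission ids} (auth_group_permissions)
    """
    permissions = set(user_permissions.get(user_id, ()))
    for group_id in user_groups.get(user_id, ()):
        permissions |= set(group_permissions.get(group_id, ()))
    return permissions
```

Any policy or function that answers "may this user do X" would have to perform the equivalent of this union in SQL, which is why looking at a single table is not enough.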

I’m still not sure if this is a good idea or not, but it is a fun thought process.