More JSONB querying

Occasionally, I get emails from people regarding specific queries in Postgres, usually because I have blogged about JSONB querying before.

Today, I got one: rather than just reply, I thought I’d blog about how queries could be written to solve this problem.

For the purposes of this post, our table can be a single column of JSONB data.

CREATE TABLE priority (data JSONB);

We also need a bit of data to query:

INSERT INTO priority (data) VALUES (
'{
  "id": "02e32a14-904c-4153-a32b-fe8d1f1bbbe1",
  "entity": "activity",
  "fields": {
    "subject": [
      {"val": "Subject", "priority": 7}
    ]
  },
  "recordStatusType": "active"
}'), (
'{
  "id": "b33498b2-32f6-4575-b2cd-9e9a1ae2059d",
  "entity": "activity",
  "fields": {
    "subject": [
      {"val": "Subject", "priority": 4}
    ]
  },
  "recordStatusType": "active"
}'), (
'{
  "id": "a2d327d2-7668-4dc0-ae1d-d6144130e3ec",
  "entity": "activity",
  "fields": {
    "object": [],
    "subject": [
      {"val": "Object", "priority": 1},
      {"val": "Target", "priority": 7}
    ]
  }
}'), (
'{
  "id": "3bc8b536-00af-4fc7-881e-b88b620ac436",
  "entity": "activity",
  "fields": {
    "object": [
      {"val": "Object", "priority": 9}
    ]
  }
}'
);

The problem requires selection of the data rows where priority is greater than 5.

I’ve extended the data provided: I’m not sure if there will be multiple “fields”, but I assume so. I also assume that a match for any priority within a subject field will be required.

Let’s start with a simpler version: get the records where the first fields->subject priority is greater than 5 (I’ll return just the id, to keep it simple):

SELECT data->'id'
  FROM priority
 WHERE (data#>>'{fields,subject,0,priority}')::INTEGER > 5;

 "02e32a14-904c-4153-a32b-fe8d1f1bbbe1"

This uses the #>> operator, which does a path lookup and returns a string value, which we then cast to an integer for the comparison. Note that the path lookup differs from normal Postgres array indexing, in that it uses 0 as the first index, rather than 1.

But, we want to query for all rows where any subject field has a priority greater than 5.

We’ll want to use the jsonb_array_elements function (which is the JSONB equivalent of unnest). We can use that to get the fields themselves:

SELECT jsonb_array_elements(data#>'{fields,subject}') FROM priority;

Note this uses the #> operator, because we still want JSONB data:

       jsonb_array_elements
──────────────────────────────────
 {"val": "Subject", "priority": 7}
 {"val": "Subject", "priority": 4}
 {"val": "Object", "priority": 1}
 {"val": "Target", "priority": 7}
(4 rows)

We can get a bit further too:

SELECT jsonb_array_elements(data#>'{fields,subject}')->'priority' FROM priority;

Indeed, we can get all the way to our boolean test:

SELECT (jsonb_array_elements(data#>'{fields,subject}')->>'priority')::INTEGER > 5 FROM priority;
 ?column?
─────────
 t
 f
 f
 t
(4 rows)

But we want the data rows themselves, not just the matching subject fields, so this on its own is not that useful. We can, however, use the fact that jsonb_array_elements returns a set, and use that as a subquery in our WHERE clause, in a <value> <operator> ANY(<subquery>) construct:

SELECT data->'id'
  FROM priority
 WHERE 5 < ANY(SELECT (jsonb_array_elements(data#>'{fields,subject}')->>'priority')::INTEGER)

This means that we want only the records where 5 is less than at least one of the priority values in the subject fields.

                ?column?
────────────────────────────────────────
 "02e32a14-904c-4153-a32b-fe8d1f1bbbe1"
 "a2d327d2-7668-4dc0-ae1d-d6144130e3ec"

I hope this helps, Paulo!

Django multitenancy using Postgres Row Level Security

Quite some time ago, I did some experiments in using Postgres Row Level Security (RLS) from within Django.

It occurred to me that this philosophy could be used to model a multi-tenant application.

The main big problem with django-boardinghouse is that you have to apply migrations to multiple schemata. With many tenants, this can take a long time. It’s not easy to do this in a way that would be conducive to having limited downtime.

On the other hand, RLS means that the database restricts which rows of specific tables need to be shown in a given circumstance. Normally, examples of RLS show this by using a different user, but this is not necessary.

In fact, in most modern web applications, a single database user is used for all connections. This has some big benefits (in that a connection to the database can belong to a pool, and be shared by different requests). Luckily, there are other ways to have RLS applied.

One method is to use Postgres’ session variables. This is outlined quite well in Application users vs. Row Level Security. I’m just going to use simple session variables, as the facility for doing this will be encapsulated, and based on a key in the Django session - which users cannot set directly. (If someone has access to this, or access to setting a Postgres session variable directly, then they have enough access to do whatever they want.)

There are some caveats: specifically, the Postgres user must not be a SUPERUSER, but that’s easy to sort out. We’ll be able to continue to use PgBouncer or similar, but only if we use session pooling (not transaction pooling).


Now, mirroring the previous post, we have a few things that need to happen:

  • We will need some middleware that sets the (postgres) session variable.
  • We may want to have some mechanism for switching tenants (unless a user is tied to a single tenant).
  • We must have a Tenant model of some sort (because we’ll be using foreign keys to this to indicate a given row belongs to a given tenant).
  • We’ll want to be able to enable/force/disable RLS for a given table.
  • We should be able to detect the USING clause (and WITH CHECK clause) for a given table.
  • We must allow the user to overwrite the USING/WITH CHECK clauses for a given table.

It turns out this is much simpler than all of the things that django-boardinghouse needs to do.

It also turns out that we can cascade the USING/WITH CHECK clauses for dependent tables, but we’ll get to that. I’m not sure how well that will perform, but it might be reasonable.


Since all good projects need a clever name, I’ve chosen django-occupation for this one (as a play on multi-tenancy). Thus, you may see the name occupation used in a few places. Also, this will be a strictly Django 2.0+ (and therefore Python 3) app!

Let’s start with the easy bits:

# occupation/middleware.py
from django.db import connection


def ActivateTenant(get_response):
    def middleware(request):
        connection.cursor().execute(
            'SET occupation.active_tenant = %s',
            [request.session.get('active_tenant', '')]
        )
        return get_response(request)
    return middleware

This middleware will set a session variable. Importantly, it always sets this variable, because the policies we will create later rely on it being present: a missing current_setting will result in an exception. Setting it to an empty string means that no rows will be returned when no tenant is selected, which is acceptable.

The code for switching tenants is a bit more complicated, and it probably needs to be. It will need some method of detecting if the given user is indeed permitted to switch to the target tenant, which could depend on a range of other things. For instance, in our multi-tenant application, an employee needs to be currently (or in the future) employed in order to get access, but some users may get access for other reasons (e.g., a payroll company).

We can use a view that specifically handles this, but with django-boardinghouse I also came up with a middleware that can handle this. There are, in that project, three mechanisms for switching tenants: a query parameter, an HTTP header, and a raw view. The rationale for this was that a URL (containing a query parameter) could be used to have a permanent link to an object (which works across tenants). The drawback is that it does leak some information (about the tenant id). In practice, having this as a UUID may be nice.

Having a view that switches tenant makes doing a switch (and getting a success code if it works) easy, and having a header might make it easier for an API to switch.

Anyway, we can ignore this requirement for now.

I’ve used the same “swappable” concept in django-boardinghouse that django.contrib.auth uses for swappable user models. This has some nice side effects, but an understanding of how this works is not necessary for understanding what is about to happen next. Instead, let’s look at the definition of some models. Please keep in mind that this is a simplified example, and some parts have been omitted for clarity.

class School(models.Model):
    "This is our Tenant model."
    name = models.CharField(unique=True, max_length=64)

    def __str__(self):
        return self.name


class Student(models.Model):
    name = models.CharField(max_length=128)
    student_number = models.CharField(max_length=16)
    school = models.ForeignKey('School', related_name='students', on_delete=models.CASCADE)

    class Meta:
        unique_together = (
            ('school', 'student_number'),
        )

    def __str__(self):
        return self.name


class Subject(models.Model):
    name = models.CharField(unique=True, max_length=64)

    def __str__(self):
        return self.name

GRADES = [
  # ...
]


class Enrolment(models.Model):
    student = models.ForeignKey(Student, related_name='enrolments', on_delete=models.CASCADE)
    subject = models.ForeignKey(Subject, related_name='enrolments', on_delete=models.CASCADE)
    grade = models.CharField(choices=GRADES, max_length=3, null=True, blank=True)

    def __str__(self):
        if self.grade:
            return '{student} studied {subject}. Grade was {grade}.'.format(
                student=self.student.name,
                subject=self.subject.name,
                grade=self.get_grade_display(),
            )
        return '{student} is enrolled in {subject}.'.format(
            student=self.student.name,
            subject=self.subject.name,
        )

Okay, wall of code done. There are a few things to note about these models:

  • School is the tenant model.
  • Student has a direct relationship to the tenant model. This is a candidate for RLS.
  • Subject has no relationship to the tenant model. This is a non-tenant (ie, global) model. All instances will be visible to all users.
  • Enrolment has a chained relationship to the tenant model. Because of this, it’s likely that this will also be an RLS model (if the prior models in the chain have RLS restrictions).

Now a digression into some mechanics of RLS.

Enabling RLS for a given table is quite simple. We’ll also do a FORCE, because we are probably the table owner, and without FORCE, table owners may view all rows.

ALTER TABLE school_student ENABLE ROW LEVEL SECURITY;
ALTER TABLE school_student FORCE ROW LEVEL SECURITY;

In the case of a student, the user should only be able to view them if they are currently viewing the school the student belongs to:

CREATE POLICY access_tenant_data ON school_student
USING (school_id::TEXT = current_setting('occupation.active_tenant'))
WITH CHECK (school_id::TEXT = current_setting('occupation.active_tenant'))

Notice that we used the current_setting('occupation.active_tenant') that we configured before. As I mentioned, this policy will throw an exception if the setting is not set, so our middleware sets it to an empty string - which should not match any rows.

The other thing that may look out of place is that we are coercing the school_id to a TEXT. This is because current_setting() returns a text value, even if it was set using a number.

So, what does this actually do?

It restricts the query to only rows that match the USING clause (in the case of a SELECT, UPDATE or DELETE), and then ensures that any rows that are being written (in the case of UPDATE or INSERT) meet the same restriction. This prevents a user accidentally (or on purpose) writing a row that they could not currently view.
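
To see the policy in action, something like the following should hold (hypothetical data: two schools, with ids 1 and 2, each having some students):

SET occupation.active_tenant = '1';
SELECT count(*) FROM school_student;  -- only school 1's students are counted

SET occupation.active_tenant = '';
SELECT count(*) FROM school_student;  -- no tenant selected: 0 rows

-- Writing a row that belongs to a different tenant fails the WITH CHECK
-- clause, with an error along the lines of:
-- ERROR: new row violates row-level security policy for table "school_student"
SET occupation.active_tenant = '1';
INSERT INTO school_student (name, student_number, school_id)
VALUES ('Fred', '1234', 2);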

So, that’s the SQL. Can we generate this in a nice way from our Django models?

CREATE_POLICY = '''
CREATE POLICY access_tenant_data ON {table}
USING ({fk}::TEXT = current_setting('occupation.active_tenant'))
WITH CHECK ({fk}::TEXT = current_setting('occupation.active_tenant'))'''

def build_policy_clause(model):
    for field in model._meta.fields:
        if field.related_model is School:
            return CREATE_POLICY.format(fk=field.column, table=model._meta.db_table)

Again, this is simplified: it only handles a direct link to the tenant model. In practice there’s more to it than that. But that will do for now.

So, given our knowledge of our models, we don’t need to enable RLS for our Subject model, but we want to enable it for our Enrolment model. In fact, we will need to - otherwise a user would be able to load up an Enrolment object, but not be able to see the related Student.

In fact, we use this relation (and the fact that the restriction is already applied to all queries) to make our policy for that table simpler:

CREATE POLICY access_tenant_data ON school_enrolment
USING (student_id IN (SELECT id FROM school_student))
WITH CHECK (student_id IN (SELECT id FROM school_student))

Notably, this sort of CHECK happens every time Postgres writes a FOREIGN KEY reference: we need to repeat it because FK references are not subject to RLS, but we basically want to make it so they are.

Interestingly, because of the cascading nature of this configuration, we don’t need to include the current_setting call at all, because that happens in the inner query.

However, it does concern me that this will result in more work in the database. I’ll have to run some tests on larger data sets to see how this performs.

Building up the SQL to use there is slightly more complicated: we need to look at every foreign key on the model and see which of them can trace a chain up to the tenant model. Then we’d need a clause in the USING/WITH CHECK for each of those foreign keys.

I do have some code that does this, but it’s not very pretty.

Also, I’d like to be able to come up with a way that generates this SQL using more of the ORM, but I’m not sure it’s really necessary, since the resulting code is quite simple.

As for applying these changes, there are two solutions: create a RunSQL call for each required statement and write it directly into the migration file, or have a migration operation that executes the SQL. I’m not sure which way I’ll go with that just yet.
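
For illustration, the RunSQL version might look something like this (a sketch only; the app name, migration name and dependencies here are hypothetical):

# school/migrations/0002_row_level_security.py
from django.db import migrations


class Migration(migrations.Migration):

    dependencies = [
        ('school', '0001_initial'),
    ]

    operations = [
        migrations.RunSQL(
            sql=[
                'ALTER TABLE school_student ENABLE ROW LEVEL SECURITY',
                'ALTER TABLE school_student FORCE ROW LEVEL SECURITY',
                """CREATE POLICY access_tenant_data ON school_student
                USING (school_id::TEXT = current_setting('occupation.active_tenant'))
                WITH CHECK (school_id::TEXT = current_setting('occupation.active_tenant'))""",
            ],
            reverse_sql=[
                'DROP POLICY access_tenant_data ON school_student',
                'ALTER TABLE school_student DISABLE ROW LEVEL SECURITY',
            ],
        ),
    ]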


I do have a proof of concept for this up and running (code is available at django-occupation). There are still some things I want to figure out.

  • Cross-tenant queries are a thing in my domain - what is the best mechanism for doing this? Should there be a postgres session variable that disables the restriction, or could we enumerate permitted tenants? That would allow restricted cross-tenant queries.
  • Just how well does this perform at scale?
  • How much of this stuff is not really related to multi-tenancy, and could be extracted out into a more generic RLS package?

Tree data as a nested list redux

Some time ago, I wrote about using python to aggregate data that is stored with a Materialized Path into a Nested List structure.

But we should be able to do that same aggregation using Postgres, and from an Adjacency List structure.

Let’s start with a table definition:

CREATE TABLE location (
  node_id SERIAL PRIMARY KEY,
  name TEXT,
  parent_id INTEGER REFERENCES location(node_id)
);

And some data:

INSERT INTO location (node_id, name, parent_id) VALUES
  (1, 'Australia', NULL),
  (2, 'South Australia', 1),
  (3, 'Victoria', 1),
  (4, 'South-East', 2),
  (5, 'Western Districts', 3),
  (6, 'New Zealand', NULL),
  (7, 'Barossa Valley', 2),
  (8, 'Riverland', 2),
  (9, 'South Island', 6),
  (10, 'North Island', 6),
  (11, 'Eastern Bay of Plenty', 10);

To begin with, we need to get all of the items, and their depth:

WITH RECURSIVE location_with_level AS (
  SELECT *,
         0 AS lvl
    FROM location
   WHERE parent_id IS NULL

  UNION ALL

  SELECT child.*,
         parent.lvl + 1
    FROM location child
    JOIN location_with_level parent ON parent.node_id = child.parent_id
)
SELECT * FROM location_with_level;
 node_id │         name          │ parent_id │ lvl
─────────┼───────────────────────┼───────────┼─────
       1 │ Australia             │    <NULL> │   0
       6 │ New Zealand           │    <NULL> │   0
       2 │ South Australia       │         1 │   1
       3 │ Victoria              │         1 │   1
       9 │ South Island          │         6 │   1
      10 │ North Island          │         6 │   1
       4 │ South-East            │         2 │   2
       5 │ Western Districts     │         3 │   2
       7 │ Barossa Valley        │         2 │   2
       8 │ Riverland             │         2 │   2
      11 │ Eastern Bay of Plenty │        10 │   2
(11 rows)

Because of the way recursive queries work, we need to find the deepest node(s), and start there:

WITH RECURSIVE location_with_level AS (
  SELECT *,
         0 AS lvl
    FROM location
   WHERE parent_id IS NULL

  UNION ALL

  SELECT child.*,
         parent.lvl + 1
    FROM location child
    JOIN location_with_level parent ON parent.node_id = child.parent_id
),
maxlvl AS (
  SELECT max(lvl) maxlvl FROM location_with_level
)

SELECT * FROM maxlvl;
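
For our data, the deepest level is 2:

 maxlvl
────────
      2
(1 row)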

We then need to build up the tree (this clause is the next one in our CTE chain, I’ve omitted the first two for clarity):

c_tree AS (
  SELECT location_with_level.*,
         NULL::JSONB children
    FROM location_with_level, maxlvl
   WHERE lvl = maxlvl

   UNION

   (
     SELECT (branch_parent).*,
            jsonb_agg(branch_child)
       FROM (
         SELECT branch_parent,
                to_jsonb(branch_child) - 'lvl' - 'parent_id' - 'node_id' AS branch_child
           FROM location_with_level branch_parent
           JOIN c_tree branch_child ON branch_child.parent_id = branch_parent.node_id
       ) branch
       GROUP BY branch.branch_parent

       UNION

       SELECT c.*,
              NULL::JSONB
       FROM location_with_level c
       WHERE NOT EXISTS (SELECT 1
                           FROM location_with_level hypothetical_child
                          WHERE hypothetical_child.parent_id = c.node_id)
   )
)

The first part of this query gets all of the deepest leaf nodes.

This is then combined with another recursive subquery, which creates the branches. It relies on the fact that it’s possible to use a table’s record “type”, and have whole records as column values in a query. The second part of this subquery finds all remaining leaf nodes, and combines them in. This subquery will keep executing until it doesn’t find any new rows, which will happen when all root nodes have been processed.

We can see from the results of this last clause that we just need to limit this to root nodes:

 node_id │         name          │ parent_id │ lvl │                                                   children
─────────┼───────────────────────┼───────────┼─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────
       4 │ South-East            │         2 │   2 │ <NULL>
       5 │ Western Districts     │         3 │   2 │ <NULL>
       7 │ Barossa Valley        │         2 │   2 │ <NULL>
       8 │ Riverland             │         2 │   2 │ <NULL>
      11 │ Eastern Bay of Plenty │        10 │   2 │ <NULL>
       3 │ Victoria              │         1 │   1 │ [{"name": "Western Districts", "children": null}]
      10 │ North Island          │         6 │   1 │ [{"name": "Eastern Bay of Plenty", "children": null}]
       9 │ South Island          │         6 │   1 │ <NULL>
       2 │ South Australia       │         1 │   1 │ [{"name": "Riverland", "children": null}, {"name": "Barossa Valley", "children": null}, {"name": "South-East…
         │                       │           │     │…", "children": null}]
       6 │ New Zealand           │    <NULL> │   0 │ [{"name": "South Island", "children": null}, {"name": "North Island", "children": [{"name": "Eastern Bay of …
         │                       │           │     │…Plenty", "children": null}]}]
       1 │ Australia             │    <NULL> │   0 │ [{"name": "South Australia", "children": [{"name": "Riverland", "children": null}, {"name": "Barossa Valley"…
         │                       │           │     │…, "children": null}, {"name": "South-East", "children": null}]}, {"name": "Victoria", "children": [{"name": …
         │                       │           │     │…"Western Districts", "children": null}]}]
(11 rows)

So our final query, using the new jsonb_pretty function:

WITH RECURSIVE location_with_level AS (
  SELECT *,
         0 AS lvl
    FROM location
   WHERE parent_id IS NULL

  UNION ALL

  SELECT child.*,
         parent.lvl + 1
    FROM location child
    JOIN location_with_level parent ON parent.node_id = child.parent_id
),
maxlvl AS (
  SELECT max(lvl) maxlvl FROM location_with_level
),
c_tree AS (
  SELECT location_with_level.*,
         NULL::JSONB children
    FROM location_with_level, maxlvl
   WHERE lvl = maxlvl

   UNION

   (
     SELECT (branch_parent).*,
            jsonb_agg(branch_child)
       FROM (
         SELECT branch_parent,
                to_jsonb(branch_child) - 'lvl' - 'parent_id' - 'node_id' AS branch_child
           FROM location_with_level branch_parent
           JOIN c_tree branch_child ON branch_child.parent_id = branch_parent.node_id
       ) branch
       GROUP BY branch.branch_parent

       UNION

       SELECT c.*,
              NULL::JSONB
       FROM location_with_level c
       WHERE NOT EXISTS (SELECT 1
                           FROM location_with_level hypothetical_child
                          WHERE hypothetical_child.parent_id = c.node_id)
   )
)

SELECT jsonb_pretty(
         array_to_json(
           array_agg(
             row_to_json(c_tree)::JSONB - 'lvl' - 'parent_id' - 'node_id'
           )
         )::JSONB
       ) AS tree
  FROM c_tree
  WHERE lvl=0;

And our results:

                           tree
 ──────────────────────────────────────────────────────────
  [
      {
          "name": "New Zealand",
          "children": [
              {
                  "name": "South Island",
                  "children": null
              },
              {
                  "name": "North Island",
                  "children": [
                      {
                          "name": "Eastern Bay of Plenty",
                          "children": null
                      }
                  ]
              }
          ]
      },
      {
          "name": "Australia",
          "children": [
              {
                  "name": "South Australia",
                  "children": [
                      {
                          "name": "Riverland",
                          "children": null
                      },
                      {
                          "name": "Barossa Valley",
                          "children": null
                      },
                      {
                          "name": "South-East",
                          "children": null
                      }
                  ]
              },
              {
                  "name": "Victoria",
                  "children": [
                      {
                          "name": "Western Districts",
                          "children": null
                      }
                  ]
              }
          ]
      }
  ]
 (1 row)

Oh, that is rather neat.

This query is mostly cribbed from a fantastic Stack Overflow answer by David Guillot.

Django bulk_update without upsert

Postgres 9.5 brings a fantastic feature that I’ve really been looking forward to: INSERT ... ON CONFLICT, also known as “upsert”. However, I’m not on 9.5 in production yet, and I had a situation that would really have benefitted from being able to use it.
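
For reference, this is the kind of statement that 9.5 allows (my_table, foo and bar are the same hypothetical table and columns used in the examples below):

INSERT INTO my_table (id, foo, bar)
VALUES (1, 'x', 'y')
ON CONFLICT (id)
DO UPDATE SET foo = EXCLUDED.foo,
              bar = EXCLUDED.bar;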

I have to insert lots of objects, but if there is already an object in a given “slot”, then I need to instead update that existing object.

Doing this using the Django ORM can be done on a “one by one” basis: iterate through the objects, find which existing object (if any) matches the criteria, update that, or create a new one if there wasn’t a match.

However, this is really slow, as it does two queries for each object.

Instead, it would be great to:

  • fetch all of the instances that could possibly overlap (keyed by the matching criteria)
  • iterate through the new data, looking for a match
    • modify the instance if an existing match is made, and stash into pile “update”
    • create a new instance if no match is found, and stash into the pile “create”
  • bulk_update all of the “update” objects
  • bulk_create all of the “create” objects

Those familiar with Django may recognise that there is only one step here that cannot be done as of “now”.

So, how can we do a bulk update?

There are two ways I can think of doing it (at least with Postgres):

  • create a temporary table (cloning the structure of the table)
  • insert all of the data into this table
  • update the rows in the original table from the temporary table, based on pk column

and:

  • come up with some mechanism of using the UPDATE the_table SET ... FROM () sq WHERE sq.pk = the_table.pk syntax

It’s possible to use some of the really nice features of Postgres to create a temporary table that clones an existing table’s structure, and will automatically be dropped at the end of the transaction:

BEGIN;

CREATE TEMPORARY TABLE upsert_source (LIKE my_table INCLUDING ALL) ON COMMIT DROP;

-- Bulk insert into upsert_source

UPDATE my_table
   SET foo = upsert_source.foo,
       bar = upsert_source.bar
  FROM upsert_source
 WHERE my_table.id = upsert_source.id;

The drawback of this is that it does two extra queries, but it is possible to implement fairly simply:

from django.db import transaction, connection

@transaction.atomic
def bulk_update(model, instances, *fields):
    cursor = connection.cursor()
    db_table = model._meta.db_table

    try:
        cursor.execute(
            'CREATE TEMPORARY TABLE update_{0} (LIKE {0} INCLUDING ALL) ON COMMIT DROP'.format(db_table)
        )

        model._meta.db_table = 'update_{}'.format(db_table)
        model.objects.bulk_create(instances)

        query = ' '.join([
            'UPDATE {table} SET ',
            ', '.join(
                ('%(field)s=update_{table}.%(field)s' % {'field': field})
                for field in fields
            ),
            'FROM update_{table}',
            'WHERE {table}.{pk}=update_{table}.{pk}'
        ]).format(
            table=db_table,
            pk=model._meta.pk.get_attname_column()[1]
        )
        cursor.execute(query)
    finally:
        model._meta.db_table = db_table

The advantage of this is that it mostly just uses the ORM. There’s limited scope for SQL injection (although you’d probably want to validate the field names).

It’s also possible to do the update directly from a subquery, but without the nice column names:

UPDATE my_table
   SET foo = upsert_source.column2,
       bar = upsert_source.column3
  FROM (
    VALUES (...), (...)
  ) AS upsert_source
 WHERE upsert_source.column1 = my_table.id;

Note that you must make sure your values are in the correct order (with the primary key first).

To guard against some likely SQL injection vectors, we build up the fixed parts of the query (and the parts that are controlled by the Django model, like the table and field names), and then pass the values in as query parameters.

from django.db import connection

def bulk_update(model, instances, *fields):
    set_fields = ', '.join(
        ('%(field)s=update_{table}.column%(i)s' % {'field': field, 'i': i + 2})
        for i, field in enumerate(fields)
    )
    value_placeholder = '({})'.format(', '.join(['%s'] * (len(fields) + 1)))
    values = ','.join([value_placeholder] * len(instances))
    query = ' '.join([
        'UPDATE {table} SET ',
        set_fields,
        'FROM (VALUES ', values, ') update_{table}',
        'WHERE {table}.{pk} = update_{table}.column1'
    ]).format(table=model._meta.db_table, pk=model._meta.pk.get_attname_column()[1])
    params = []
    for instance in instances:
        params.append(instance.pk)
        for field in fields:
            params.append(getattr(instance, field))

    connection.cursor().execute(query, params)

This feels like a reasonable first draft; however, I’d probably want to go look at how the query for bulk_create is built, and modify that. There’s a fair bit going on there that I haven’t followed as yet though. Note that this does not need the @transaction.atomic decorator, as it is only a single statement.
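
Usage would look something like this (Thing being a hypothetical model with a foo field):

things = list(Thing.objects.all())
for thing in things:
    thing.foo = thing.foo.upper()

# One UPDATE statement, regardless of how many instances there are.
bulk_update(Thing, things, 'foo')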

From here, we can build an upsert that assumes all objects with a PK need to be updated, and those without need to be inserted:

from django.utils.functional import partition
from django.db import transaction

@transaction.atomic
def bulk_upsert(model, instances, *fields):
    update, create = partition(lambda obj: obj.pk is None, instances)
    if update:
        bulk_update(model, update, *fields)
    if create:
        model.objects.bulk_create(create)

Versioning complex database migrations

Recently, I’ve been writing lots of raw SQL code that is either a complex VIEW, or a FUNCTION. Much of the time these will be used as the “source” for a Django model, but not always. Sometimes, there are complex functions that need to be run as a trigger in Postgres, or even a rule to execute when attempting a write operation on a view.

Anyway, these definitions are all code, and should be stored within the project they belong to. Using Django’s migrations you can apply them at the appropriate time, using a RunSQL statement.

However, you don’t really want to have the raw SQL in the migration file. Depending upon the text editor, it may not syntax-highlight correctly, and finding the correct definition can be difficult.

Similarly, you don’t want to just have a single file, because to recreate the database migration sequence, it needs to apply the correct version at the correct time (otherwise, other migrations may fail to apply).

Some time ago, I adopted a policy of manually versioning these files. I have a naming pattern that seemed to be working well:

special_app/
  migrations/
    __init__.py
    0001_initial.py
    0002_update_functions.py
  sql/
    foo.function.0001.sql
    foo.function.0002.sql
    foo.trigger.0001.sql
    bar.view.0001.sql

The contents of the SQL files are irrelevant, and the migrations mostly so. There is a custom migration operation I wrote that loads the content from a file:

    LoadSQLScript('special_app', 'foo.function', 1)

The mechanics of how it works are not important.
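
(If you are curious anyway, a minimal sketch might look something like this - the real version has more error handling:)

import os

from django.apps import apps
from django.db import migrations


def LoadSQLScript(app_label, name, version):
    # Read the versioned script from the app's sql/ directory, and wrap
    # it in a standard RunSQL operation.
    path = os.path.join(
        apps.get_app_config(app_label).path,
        'sql',
        '{}.{:04d}.sql'.format(name, version),
    )
    with open(path) as fp:
        return migrations.RunSQL(fp.read())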

So, this had been working well for several months, but I had a nagging feeling that the workflow was not ideal. This came to a head in my mind when I recognised that doing code review on database function/view changes was next to impossible.

See, the problem is that there is a completely new file each time you create a new version of a definition.

Instead, I’ve come up with a similar, but different solution. You still need to have versioned scripts for the purpose of historical migrations, but the key difference is that you don’t actually write these. Instead, you have something that notices that the “current” version of a definition is different to the latest version snapshot. You then also have a tool that copies the current version to a new snapshot, and creates a migration.

You can still modify a snapshot (for instance, if you’ve already created one, but it’s only in your current development tree), but mostly you don’t need to.

$ ./manage.py check
System check identified some issues:

WARNINGS:
?: (sql_helpers.W002) The versioned migration file for core: iso8601.function.sql is out of date,
and needs to be updated or a new version created.

Oh, thanks for that. Checking the file, I see that it does indeed need a new version:

$ ./manage.py make_sql_migrations core
...
Copied <project>/core/sql/iso8601.function.sql to version 0002

You still need to ensure that any dependencies between SQL files are dealt with appropriately (for instance, a function that relies on a newly added column to a view needs to have that view’s definition updated before the function can be updated). But this is a much smaller problem, and something that your database should complain about when you try to apply the migrations.

I haven’t packaged this up yet, it’s currently only an internal app, as I want to use it a bit for the next little while and see what else shakes out. I was pretty sure the way I was doing it before was “the best way” just after I thought that up, so we’ll see what happens.

On Fences and Functions

I grew up on a farm.

We had fences on the farm.

Whilst the jobs associated with fences and fencing are less than fun, the fences themselves are extremely important. They keep the livestock in the correct location. When you have a damaged or incomplete fence, even if it is only damaged in a small way, it can cost significant amounts of money, even human lives. This can vary between keeping Rams from a flock of Ewes that you don’t want them to mate with (because you need to know which Ram mated with which Ewes in order to track progeny), to livestock escaping onto a public road and causing accidents.

Fences are a good thing.


My first career was as a Design and Technology Teacher.

We use fences in woodwork. They are attachments to fixed power tools, such as drill presses and circular saws. They allow us to work safely and to get accurate, easily repeatable results. For instance, we can use a fence to cut sheets of MDF to exactly the same width, ensuring the bookcase we are making is square. Without a fence, it can still be done, but it will certainly be much harder.

Fences are a good thing.


I’d heard people describe Postgres’s CTEs (Common Table Expressions) as an “optimisation fence”. Given my previous uses of the word “fence”, I assumed that this was widely regarded as a good thing.

However, after spending some time writing really complex queries (that are most easily described using a CTE), I happened to read PostgreSQL’s CTEs are optimisation fences. Throughout my work with Postgres it had become plain to me that each term in a CTE is materialised in full (if it is referenced at all): filtering that occurs later in the query is not pushed down into the CTE term, even when doing so would mean processing far fewer rows. Postgres is pretty good about pushing these conditions down into a plain sub-query, so a CTE can perform worse, as it might have to do more work. This article points this out in some detail, and it occurred to me that perhaps some people see fences (in general) as an obstacle. Perhaps fencing something in has negative connotations?

I’m not sure that that’s exactly what the author meant (I wonder if it was sarcasm, perhaps), but it did get me thinking about how different backgrounds could result in opposite interpretations of the same terms.


I do want to veer back into more technical territory, though, and discuss how I have been working around the fact that filtering cannot be pushed down into a CTE.

Largely, the issue exists in my code because I have lots of complex queries (as I just mentioned) that are far easier to write, reason about and debug when written using a CTE. I would like to write them as a VIEW, and then stick Django models in front of them, and I’d be able to query them using the ORM, and just have the view as the db_table of the model. It would be really nice.

But this doesn’t work, because some of the queries require aggregating data across models of which there are millions of rows, and some of the database tables are less than optimal. For instance, I have several tables that store an effective_from field, and in the case of superseding, the same set of other fields (person, for instance) means we can know which one applies on a given date. However, to query this, we end up writing a more complex query (instead of being able to do a date <@ daterange query, if the valid period was stored in the table). I’ve learned from this in newer models, but some stuff is too deeply ingrained to be able to be changed just yet.

So, I have a VIEW that turns this into data that actually contains dateranges, and I can query against that. But, if I use this in a CTE, then it can materialise the whole lot, which can be slow. So, I needed to come up with a way to filter the data earlier.

Functions.

I’ve been writing SQL functions that take parameters, and then filter as early as possible. This then means that it’s a real possibility that we can get <100ms queries for stuff that is really, really complicated (and joins a couple of dozen or more tables in really funky ways). It does mean I can’t query using the Django ORM, but that’s okay: the data I’m getting back doesn’t necessarily map onto a model anyway, and we need to use it as a dict.
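
As a sketch of the shape these functions take (the table and column names here are hypothetical), the key point is that the parameter is applied inside the CTE term itself, so unrelated rows are never materialised:

CREATE OR REPLACE FUNCTION enrolment_counts(p_school INTEGER)
RETURNS TABLE (student_id INTEGER, enrolments BIGINT) AS $$

  WITH relevant_students AS (
    -- Filter as early as possible: only this school's students are
    -- materialised by the CTE.
    SELECT student.id
      FROM school_student student
     WHERE student.school_id = p_school
  )
  SELECT enrolment.student_id, count(*)
    FROM school_enrolment enrolment
   INNER JOIN relevant_students ON (relevant_students.id = enrolment.student_id)
   GROUP BY enrolment.student_id;

$$ LANGUAGE SQL STABLE;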

More recently, I’ve extended this so that the function (with the relevant parameters, extracted out of the queryset WHERE clauses) can be used as the db_table for a Model. It’s still somewhat hacky, but is very interesting, nonetheless.

Django Trees via Closure View

After writing up a method of using a Postgres View that generates a materialised path within the context of a Django model, I came across some queries of my data that were getting rather troublesome to write. It occurred to me that having a closure table would be useful. Specifically, I needed all of the descendants of a given set of nodes.

I couldn’t find an existing Postgres extension that will manage the closure table, and didn’t feel like writing my own implementation using triggers just yet. However, it occurred to me that I could use a similar trick to the recursive materialised path view. Thus, we have a Closure View.

We will start with the Django models:

from django.contrib.postgres.fields import ArrayField


class Node(models.Model):
    node_id = models.AutoField(primary_key=True)
    parent = models.ForeignKey('tree.Node', related_name='children', null=True, blank=True, on_delete=models.CASCADE)

    descendants = models.ManyToManyField('tree.Node', related_name='ancestors', through='tree.Closure')

    class Meta:
        app_label = 'tree'


class Closure(models.Model):
    path = ArrayField(base_field=models.IntegerField(), primary_key=True)
    ancestor = models.ForeignKey('tree.Node', related_name='+', on_delete=models.DO_NOTHING)
    descendant = models.ForeignKey('tree.Node', related_name='+', on_delete=models.DO_NOTHING)
    depth = models.IntegerField()

    class Meta:
        app_label = 'tree'
        managed = False

You may notice I have a path column. I’m using this for the primary key, and it may turn out to be useful later.

Let’s have a look at the View:

CREATE RECURSIVE VIEW tree_closure(path, ancestor_id, descendant_id, depth) AS

SELECT ARRAY[node_id], node_id, node_id, 0 FROM tree_node

UNION ALL

SELECT parent_id || path, parent_id, descendant_id, depth + 1
FROM tree_node INNER JOIN tree_closure ON (ancestor_id = node_id)
WHERE parent_id IS NOT NULL;

This uses a recursive query. The first part builds the self-reference relations, and the second part is the recursive term: for each closure row already generated (in this or a previous iteration), it adds a row connecting that row’s ancestor’s parent to the same descendant, one level deeper.
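
With the view in place, fetching (say) all descendants of node 2 becomes a simple lookup (depth > 0 excludes the self-reference row):

SELECT descendant_id, depth
  FROM tree_closure
 WHERE ancestor_id = 2 AND depth > 0;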

Now, because we are using the in-built Django Many to Many features, we have some nice queries ready to go:

  • node.ancestors.all() : All ancestors of a given Node instance.
  • node.descendants.all() : All descendants of a given Node instance.
  • Node.objects.filter(ancestors=queryset) : All descendants of all nodes in a queryset.
  • Node.objects.filter(descendants=queryset) : All ancestors of all nodes in a queryset.

Of particular note are the bottom two: these are rather cumbersome to write in older versions of Django.

Adjacency Lists in Django with Postgres

Today, I’m going to walk through modelling a tree in Django, using an Adjacency List, and a Postgres View that dynamically creates the materialised path of ancestors for each node.

With this, we will be able to query the tree for a range of operations using the Django ORM.

We will start with our model:

class Node(models.Model):
    node_id = models.AutoField(primary_key=True)
    parent = models.ForeignKey('tree.Node', related_name='children', null=True, blank=True, on_delete=models.CASCADE)

    class Meta:
        app_label = 'tree'

We will also build an unmanaged model that will be backed by our view.

from django.contrib.postgres.fields import ArrayField

class Tree(models.Model):
    # View-backed model: there is nothing for Django to delete.
    root = models.ForeignKey(Node, related_name='+', on_delete=models.DO_NOTHING)
    node = models.OneToOneField(Node, related_name='tree_node', primary_key=True, on_delete=models.DO_NOTHING)
    ancestors = ArrayField(base_field=models.IntegerField())

    class Meta:
        app_label = 'tree'
        managed = False

You’ll notice I’ve included a root relation. This could be obtained by using ancestors[0] if ancestors else node_id, but that’s a bit cumbersome.

So, on to the View:

CREATE RECURSIVE VIEW tree_tree(root_id, node_id, ancestors) AS

SELECT node_id, node_id, ARRAY[]::INTEGER[]
FROM tree_node WHERE parent_id IS NULL

UNION ALL

SELECT tree.root_id, node.node_id, tree.ancestors || node.parent_id
FROM tree_node node INNER JOIN tree_tree tree ON (node.parent_id = tree.node_id);

I’ve written this view before, so I won’t go into any detail.

We can create a tree. Normally I wouldn’t specify the primary key, but since we want to talk about those values shortly, I will. It also means you can delete them, and recreate with this code, and not worry about the sequence values.

from tree.models import Node

Node.objects.bulk_create([
  Node(pk=1),
  Node(pk=2, parent_id=1),
  Node(pk=3, parent_id=1),
  Node(pk=4, parent_id=2),
  Node(pk=5, parent_id=2),
  Node(pk=6, parent_id=3),
  Node(pk=7, parent_id=3),
  Node(pk=8, parent_id=4),
  Node(pk=9, parent_id=8),
  Node(pk=10),
  Node(pk=11, parent_id=10),
  Node(pk=12, parent_id=11),
  Node(pk=13, parent_id=11),
  Node(pk=14, parent_id=12),
  Node(pk=15, parent_id=12),
  Node(pk=16, parent_id=12),
])
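
Given this data, the view row for node 15 should look like this:

SELECT * FROM tree_tree WHERE node_id = 15;

 root_id │ node_id │ ancestors
─────────┼─────────┼────────────
      10 │      15 │ {10,11,12}
(1 row)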

Okay, let’s start looking at how we might perform some operations on it.

We’ve already seen how to create a node, either root or leaf nodes. No worries there.

What about inserting an intermediate node, say between 11 and its children?

node = Node.objects.create(parent_id=11)
node.parent.children.exclude(pk=node.pk).update(parent=node)

I’m not sure if it is possible to do it in a single statement.

Okay, let’s jump to some tree-based statements. We’ll start by finding a sub-tree.

Node.objects.filter(tree_node__ancestors__contains=[2])

Oh, that’s pretty nice. It’s not necessarily sorted, but it will do for now.

We can also query directly for a root:

Node.objects.filter(tree_node__root=10)

We could spell that one as tree_node__ancestors__0=10, but I think this is more explicit. Also, that one will not include the root node itself.

Deletions are also simple: if we can build a queryset, we can delete it. Thus, deleting a full tree could be done by following any of these querysets with a .delete().

Fetching a node’s ancestors is a little trickier: because we only have an array of node ids, it takes two queries.

Node.objects.filter(pk__in=Node.objects.get(pk=15).tree_node.ancestors)

The count of ancestors doesn’t require the second query:

len(Node.objects.get(pk=15).tree_node.ancestors)

Getting ancestors to a given depth is also simple, although it still requires two queries:

Node.objects.filter(pk__in=Node.objects.get(pk=15).tree_node.ancestors[-2:])

This is a fairly simple way to enable relatively performance-aware queries of tree data. There are still places where it’s not perfect, and in reality, you’d probably look at building up queryset or model methods for wrapping common operations.

Postgres Tree Shootout part 3: Adjacency List using Views

It’s been a while, but I’ve finally gotten off my arse and found some time to revisit this series. As promised last time, I’m going to rewrite the queries from the Adjacency List “solutions” using a View. Indeed, there will be two versions of the view - one of which is a MATERIALIZED VIEW. There will also be discussion of when the two different types of view might be best to use.

One of the reasons this post took so long to write was that I was sidetracked by writing an SVG generator that would allow for graphically seeing what the different operations discussed in this series look like in terms of an actual tree. That didn’t eventuate.

We will start by defining what our tree view will actually look like. You’ll notice it is rather like the CTE that we saw in the previous post.

CREATE TABLE nodes (
  node_id SERIAL PRIMARY KEY,
  parent_id INTEGER REFERENCES nodes(node_id)
);

CREATE RECURSIVE VIEW tree (node_id, ancestors) AS (
  SELECT node_id, ARRAY[]::integer[] AS ancestors
  FROM nodes WHERE parent_id IS NULL

  UNION ALL

  SELECT nodes.node_id, tree.ancestors || nodes.parent_id
  FROM nodes, tree
  WHERE nodes.parent_id = tree.node_id
);

INSERT INTO nodes VALUES
  (1, NULL),
  (2, 1),
  (3, 1),
  (4, 2),
  (5, 2),
  (6, 3),
  (7, 3),
  (8, 4),
  (9, 8),
  (10, NULL),
  (11, 10),
  (12, 11),
  (13, 11),
  (14, 12),
  (15, 12),
  (16, 12);

Insertions

All of the insertions do not require access to the tree view, since the beauty of an Adjacency List model is that you only ever need to operate on the immediate parent-child relationship.

Removals

Similarly, we will skip over the simple operations: those don’t require access to any more of the tree than just the parent-child relationship. It’s not until we need to remove a subtree that it becomes interesting.

DELETE FROM nodes
WHERE node_id IN (
  SELECT node_id FROM tree WHERE 2 = ANY(ancestors)
) OR node_id = 2;

If you are paying attention, you will notice that this is virtually identical to the CTE version, except that we no longer need to redeclare the CTE each time. The full tree deletion is the same, as is removing all descendants:

DELETE FROM nodes
WHERE node_id IN (
  SELECT node_id FROM tree WHERE 2 = ANY(ancestors)
);

Moves

Again, the operations that don’t require the actual tree are unchanged: this is where the Adjacency List really shines.

Fetches

Since we are starting with the “full” tree, we should be able to use it for all of the queries. It is possible that these queries (unlike those we have seen before) may be slightly slower than the CTE version (specifically, those where the CTE is customised for that operation).

Descendants

Let’s get all of node 10’s descendants:

SELECT node_id FROM tree WHERE 10 = ANY(ancestors);

This query is far less complicated than the CTE version, as expected. However, when dealing with very large datasets, it performs far worse. I have a data set with 221000 nodes, in 1001 different trees. Performing this query takes around 5 seconds, but the customised CTE version takes about 750ms.

Turning this view into a materialised view:

CREATE MATERIALIZED VIEW tree_mat AS
SELECT node_id, ancestors FROM tree;

and then querying that turns this into around 75ms.

Limiting the query to nodes within a given depth (here, no more than two levels below node 10) requires slightly more work:

SELECT node_id, ancestors FROM tree
WHERE ARRAY_POSITION(ancestors, 10) > ARRAY_LENGTH(ancestors, 1) - 2;

Ancestors

Fetching ancestors of a node is again trivial:

SELECT unnest(ancestors) FROM tree WHERE node_id = 15;

And the count of ancestors:

SELECT ARRAY_LENGTH(ancestors, 1) FROM tree WHERE node_id=15;

Getting a set of ancestors to a given depth is actually a little tricky: because we can’t just change which end of the ancestors array the parent node is appended to, we can’t use that trick here. We’ll have to enumerate the rows, and then extract those we care about. You can’t use OFFSET with a variable, otherwise that would be a nice trick.

WITH ancestors AS (
  SELECT unnest(ancestors) AS node_id
  FROM tree
  WHERE node_id = 15
), enumerated AS (
  SELECT
    row_number() OVER () AS row,
    count(*) OVER () AS ancestor_count,
    node_id
  FROM ancestors
)
SELECT node_id
FROM enumerated
WHERE "row" > ancestor_count - 2;

Ugh. That’s way worse than the CTE version.

Special queries

None of the special queries access the tree in any way, so can be omitted for now.

Discussion

So how does using a view stack up to the ad-hoc CTE queries?

Mostly pretty well. In the case where you have only small data sets, then the view that builds up the complete tree each time is not that much of a problem (I ran some tests with tens of thousands of items, and it still performed relatively well). When it moves up to hundreds of thousands, then the ad-hoc CTE versions can greatly outperform the full tree view.

However, using a materialised view changes everything. It now becomes just as fast as querying a table: indeed, that’s just what it is. You could have triggers based on changes to the nodes table causing a REFRESH MATERIALIZED VIEW, but it is worth keeping in mind that this will take some time: in my case, a full refresh of 221000 rows took upwards of 4.5 seconds.
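
A statement-level trigger to perform that refresh might look something like this (a sketch):

CREATE FUNCTION refresh_tree_mat() RETURNS TRIGGER AS $$
BEGIN
  REFRESH MATERIALIZED VIEW tree_mat;
  RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER refresh_tree_mat
AFTER INSERT OR UPDATE OR DELETE ON nodes
FOR EACH STATEMENT EXECUTE PROCEDURE refresh_tree_mat();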

Using a materialised view gets us most of the way to (and leads nicely into) the next method: storing a materialised path. The similarity of the names here should be a trigger, but now I’m just making foreshadowing jokes.

slugify() for postgres (almost)

A recent discussion in #django suggested “what we need is a PG slugify function”.

The actual algorithm in Django for this is fairly simple, and easy to follow. Shouldn’t be too hard to write it in SQL.

Function slugify(value, allow_unicode=False).

  • Convert to ASCII if allow_unicode is false
  • Remove characters that aren’t alphanum, underscores, hyphens
  • Strip leading/trailing whitespace
  • Convert to lowercase
  • Convert spaces to hyphens
  • Remove repeated hyphens

(As an aside, the comment in the django function is slightly misleading: if you followed the algorithm there, you’d get a different result with respect to leading/trailing whitespace. I shall submit a PR.)

We can write an SQL function that uses the Postgres unaccent extension to get pretty close:

CREATE OR REPLACE FUNCTION slugify("value" TEXT, "allow_unicode" BOOLEAN)
RETURNS TEXT AS $$

  WITH "normalized" AS (
    SELECT CASE
      WHEN "allow_unicode" THEN "value"
      ELSE unaccent("value")
    END AS "value"
  ),
  "remove_chars" AS (
    SELECT regexp_replace("value", E'[^\w\s-]', '', 'gi') AS "value"
    FROM "normalized"
  ),
  "lowercase" AS (
    SELECT lower("value") AS "value"
    FROM "remove_chars"
  ),
  "trimmed" AS (
    SELECT trim("value") AS "value"
    FROM "lowercase"
  ),
  "hyphenated" AS (
    SELECT regexp_replace("value", E'[-\s]+', '-', 'gi') AS "value"
    FROM "trimmed"
  )
  SELECT "value" FROM "hyphenated";

$$ LANGUAGE SQL STRICT IMMUTABLE;

I’ve used a CTE to get each step as a separate query: you can do it with just two levels if you don’t mind looking at nested function calls:

CREATE OR REPLACE FUNCTION slugify("value" TEXT, "allow_unicode" BOOLEAN)
RETURNS TEXT AS $$

  WITH "normalized" AS (
    SELECT CASE
      WHEN "allow_unicode" THEN "value"
      ELSE unaccent("value")
    END AS "value"
  )
  SELECT regexp_replace(
    trim(
      lower(
        regexp_replace(
          "value",
          '[^\w\s-]',
          '',
          'gi'
        )
      )
    ),
    '[-\s]+', '-', 'gi'
  ) FROM "normalized";

$$ LANGUAGE SQL STRICT IMMUTABLE;

To get the default value for the second argument, we can have an overloaded version with only a single argument:

CREATE OR REPLACE FUNCTION slugify(TEXT)
RETURNS TEXT AS 'SELECT slugify($1, false)' LANGUAGE SQL IMMUTABLE STRICT;
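
A quick sanity check of the single-argument version:

SELECT slugify('Hello, World!');

   slugify
─────────────
 hello-world
(1 row)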

Now for some tests. I’ve been using pgTAP lately, so here’s some tests using that:

BEGIN;

SELECT plan(7);

SELECT is(slugify('Hello, World!', false), 'hello-world');
SELECT is(slugify('Héllø, Wørld!', false), 'hello-world');
SELECT is(slugify('spam & eggs', false), 'spam-eggs');
SELECT is(slugify('spam & ıçüş', true), 'spam-ıçüş');
SELECT is(slugify('foo ıç bar', true), 'foo-ıç-bar');
SELECT is(slugify('    foo ıç bar', true), 'foo-ıç-bar');
SELECT is(slugify('你好', true), '你好');

SELECT * FROM finish();

ROLLBACK;

And we get one failing test:

=# SELECT is(slugify('你好', true), '你好');

          is
──────────────────────
 not ok 7            ↵
 # Failed test 7     ↵
 #         have:     ↵
 #         want: 你好
(1 row)

Time: 2.004 ms

It seems there is no way to get the equivalent to the python re.U flag on a postgres regular expression function, so that is as close as we can get.