Horizontal Partitioning in Postgres

It never surprises me when I find another neat feature of Postgres that makes doing a potentially difficult task simple. Today, I discovered that since 9.0, Postgres has supported a really powerful way to horizontally partition data into separate tables.

For those who haven’t heard of the concept before, horizontal partitioning is where different rows are stored in different tables, depending upon something about the data within the row.

For instance, we could partition audit data into tables based upon the timestamp of the action. Thus, all entries created during October, 2014 would be stored in a table called audit_2014_10, and entries created during March 2011 would be stored in a table called audit_2011_03. And so on. Alternatively, we could have a single table per year, or however we want to partition.

This is called “Range” partitioning.

There are a couple of ways you could think about doing this in a DBMS. You could have a writable view that redirects the writes to the correct table. The problem then is that when you add a new child table, you need to rewrite your view.
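For the sake of comparison, a rough sketch of that view-based approach might look like this (data_combined is just an illustrative name; the per-year child tables are the ones we’ll define below):

CREATE VIEW data_combined AS
  SELECT * FROM data_2014
  UNION ALL
  SELECT * FROM data_2015;

-- Writes would then be redirected by an INSTEAD OF INSERT trigger (or rules)
-- on data_combined, and the view body edited every time a new child is added.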

Instead, we can use Postgres’ neat table inheritance to handle all of this for us. Indeed, it is discussed in the Postgres documentation.

If we inherit one table from another, and do a query on the parent table, we will also get the rows that match the query from all child tables. That obviates the need for a view that uses UNION ALL or similar to fetch the data.

I’m going to use a toy example here that just contains a single column.

CREATE TABLE "data" ("value" TIMESTAMPTZ);

CREATE TABLE "data_2014" (
  CHECK (
    "value" >= '2014-01-01' AND "value" < '2015-01-01')
) INHERITS ("data");

CREATE TABLE "data_2015" (
  CHECK (
    "value" >= '2015-01-01' AND "value" < '2016-01-01')
) INHERITS ("data");

This, however, is only part of the picture. Any data that is added to either of the child tables (or indeed the parent table, but we won’t be doing that), will be returned when we query the parent table.

But we want to ensure that data is partitioned out nicely. For that, we can use a trigger.

A naïve trigger may look something like:

CREATE OR REPLACE FUNCTION data_insert_trigger()
RETURNS TRIGGER AS $$

BEGIN

  IF (NEW.value >= '2014-01-01' AND NEW.value < '2015-01-01') THEN
    INSERT INTO data_2014 VALUES (NEW.*);
  ELSIF (NEW.value >= '2015-01-01' AND NEW.value < '2016-01-01') THEN
    INSERT INTO data_2015 VALUES (NEW.*);
  ELSE
    RAISE EXCEPTION 'Date out of range. Please fix the data_insert_trigger() function.';
  END IF;

  RETURN NULL;
END;

$$ LANGUAGE plpgsql;

As you can see by the ELSE clause, we will actually need to do maintenance on this function as we start to get data that falls outside of our existing ranges. We will also need to create a new table for those rows.

It would be nice if we could automatically create tables that are missing, and handle any arbitrary values.

CREATE OR REPLACE FUNCTION data_insert_trigger()
RETURNS TRIGGER AS $$

DECLARE
  table_name text;
  year integer;
  start text;
  finish text;

BEGIN
  year := date_part('year', NEW.value);
  table_name := 'data_' || year;
  start := year || '-01-01';
  finish := (year + 1) || '-01-01';

  PERFORM 1 FROM pg_tables WHERE tablename = table_name;

  IF NOT FOUND THEN
    EXECUTE
      'CREATE TABLE '
      || quote_ident(table_name)
      || ' (CHECK ("value" >= '
      || quote_literal(start)
      || ' AND "value" < '
      || quote_literal(finish)
      || ')) INHERITS (data)';
  END IF;

  EXECUTE
    'INSERT INTO '
    || quote_ident(table_name)
    || ' VALUES ($1.*)'
  USING NEW;

  RETURN NULL;
END;

$$ LANGUAGE plpgsql;

CREATE TRIGGER data_insert_trigger
BEFORE INSERT ON data
FOR EACH ROW EXECUTE PROCEDURE data_insert_trigger();

You would also want to create any indexes on the child tables, as this is where they need to be, rather than the parent table.

This function is pretty neat: it first works out what the table name should be, along with the two bounds for that table (start and finish). Then, it checks whether that table exists, and creates it if not. Finally, it inserts the values into the correct child table. I’m not sure I’d recommend using it as-is: it’s quite possibly subject to a race condition if two records that both need a new child table come in at the same time.
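As a quick sanity check (on an otherwise empty data table, with the trigger above in place):

INSERT INTO data VALUES ('2011-03-02 10:00:00'), ('2017-08-21 17:30:00');

SELECT count(*) FROM ONLY data;   -- 0: the parent table itself stays empty
SELECT count(*) FROM data_2011;   -- 1: this child table was created on demand
SELECT count(*) FROM data_2017;   -- 1: as was this one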

The one thing that was concerning me was that DDL changes to the parent table would not propagate to the child tables; however, this turned out not to be an issue at all. Since I mostly use Django, I want as little of the hard stuff that would require custom migration operations as possible.

ALTER TABLE data ADD COLUMN confirmed BOOLEAN DEFAULT false;
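We can verify that the new column has propagated to whichever child tables exist so far:

SELECT table_name, column_name
FROM information_schema.columns
WHERE column_name = 'confirmed';

-- One row for "data", plus one row for each existing child table.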

The other thing worth noting is that Postgres will do a really good job of limiting the tables that are accessed to those that contain the relevant data:

INSERT INTO data VALUES
  ('2014-01-06 09:00:00'),
  ('2015-01-09 12:00:00'),
  ('2016-02-22 15:39:00');

EXPLAIN
SELECT * FROM data
WHERE value > '2014-01-01'
AND value < '2014-07-01';
                                      QUERY PLAN
---------------------------------------------------------------------------------------
 Append  (cost=0.00..42.10 rows=12 width=8)
   ->  Seq Scan on data  (cost=0.00..0.00 rows=1 width=8)
         Filter: ((value > '2014-01-01 00:00:00'::timestamp with time zone) AND
                  (value < '2014-07-01 00:00:00'::timestamp with time zone))
   ->  Seq Scan on data_2014  (cost=0.00..42.10 rows=11 width=8)
         Filter: ((value > '2014-01-01 00:00:00'::timestamp with time zone) AND
                  (value < '2014-07-01 00:00:00'::timestamp with time zone))
 Planning time: 0.292 ms
(6 rows)

We see that only the data_2014 table is hit by the query. If your constraints do something like cast to DATE, then this may not happen. This was causing me some concern earlier, but letting Postgres coerce the data types fixed it.

However, you can’t use a tstzrange to query if you want these constraints to help the query planner:

-- Actually hits every table.
SELECT * FROM data WHERE value <@ '[2014-01-01, 2014-07-01)'::tstzrange;

It’s worth noting that if you update a value such that the row would now belong in a different partition, the update will fail: the trigger only fires on inserts into the parent table, so the child table’s CHECK constraint rejects the new value.
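For example, given the rows inserted above, an update that would move a row from data_2014 into data_2015 is rejected by the child table’s CHECK constraint:

UPDATE data SET value = '2015-06-01' WHERE value = '2014-01-06 09:00:00';
-- ERROR:  new row for relation "data_2014" violates check constraint ...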

There are moves afoot to have this feature more tightly integrated into Postgres, perhaps using a syntax like:

CREATE TABLE data (value TIMESTAMPTZ)
PARTITION BY RANGE (value)
(PARTITION data_2014 VALUES LESS THAN '2015-01-01');

CREATE PARTITION data_2015 ON data VALUES LESS THAN '2016-01-01';

It’s not clear to me how you actually define the range. It also seems counter-productive to have to manually create the partition tables.

There are also tools that might be useful to handle the heavy lifting, like pg_partman. I haven’t used this, but it looks interesting.

Boiled Chocolate Cake

Boil together:

  • 125g butter
  • 1 cup water
  • 1 1/2 cups sugar
  • 2 tablespoons cocoa
  • 1/2 teaspoon bicarb soda

Simmer 5 mins.

Add:

  • 2 eggs
  • 1 1/2 cups self raising flour

Bake about 1/2 an hour at 180°C in 2 log tins or a round tin.

Don’t eat it all at once.

iOS 8 knows what you want to say

Had a mind-blowing moment when messaging using iOS 8 tonight.

Me: Can you get some more coffee beans please?

Her: Ok, caf or decaf?

This is where it gets awesome. The three choices I had in my autocomplete bar were:

  1. Ok
  2. Caf
  3. Decaf

So, it apparently parses messages, and sets the choices accordingly.

I only wish I had a screenshot. That would have meant less explaining.

Read-only data from your database in Django

I had the need to create a column in some database tables that are completely controlled by the database, but the value of which is sometimes needed by the Django object.

It should never be presented in a Form, and never, ever be written to by the Django infrastructure.

So, we need a way to fetch the data from the database, but, even if the value is changed, and the object saved, is not written back.

The detail of how this data is set in the database is irrelevant: it’s a column that gets its value from a sequence (and, incidentally, this sequence is shared across multiple tables).

But, we need a way to get this data.

A nice technique is to leverage two parts of Django: the QuerySet.extra(select={}) method to actually add this field to the query, and Manager.get_query_set() (get_queryset() in newer versions of Django) to make this apply to every fetch of the objects.

Our extra column will be called sequence_number

from django.db import models


class SequenceNumberManager(models.Manager):
    def get_query_set(self):
        return super(SequenceNumberManager, self).get_query_set().extra(select={
            'sequence_number': '"%s"."sequence_number"' % self.model._meta.db_table
        })

class Thing(models.Model):
    # Column definitions. Do not define sequence_number!

    objects = SequenceNumberManager()

That’s it.

Now, Thing.objects.all()[0].sequence_number will give you the value from the database. Because it’s not a Django field, it will never be written back to the database.

The SequenceNumberManager, as it stands, could be applied to multiple models, as it dynamically looks for the database table that should be used for the field lookup. You need to define the table name, as otherwise if you join to another table with the same column name, your database will (correctly) reject the query, as it is ambiguous to which table it refers.

I’ve done a similar thing, but using django-model-utils’ PassThroughManager in conjunction with a QuerySet subclass, since I’m a big fan of the approach described at Building a higher-level query API: the right way to use Django’s ORM, which I link to at least once a week in IRC.

I’d be interested to hear of alternative methods of achieving this, or if there are any possible drawbacks. I think you could do it with a custom field (that never wrote to the database), but I think that would actually be a whole lot more code. There is also the benefit in this case that you could have the .extra() call as a queryset method, so you only add it when you need it, although the performance gains there would be insignificant, I suspect.
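As a rough sketch of that queryset-method variant (SequenceNumberQuerySet and with_sequence_number are just illustrative names):

from django.db.models.query import QuerySet

class SequenceNumberQuerySet(QuerySet):
    def with_sequence_number(self):
        # Only add the extra column when it is explicitly asked for.
        return self.extra(select={
            'sequence_number': '"%s"."sequence_number"' % self.model._meta.db_table
        })

Hooked up via something like django-model-utils’ PassThroughManager, you would then opt in with Thing.objects.all().with_sequence_number().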

Adding JSON(B) operators to PostgreSQL

This is a followup/replacement post for this older post. In it, I discussed a way to use PL/python to perform functions (and in turn operations) on JSON data. Don’t do that, it’s way too slow.

It’s actually possible, using raw SQL (and some handy functions) to perform some of the operations that are “missing” from the JSON(B) datatypes in Postgres.

In all cases, I’ll work with just JSONB as the input/output formats. In practice, when I first wrote these, I wrote up to four versions of each (JSON+JSON, JSON+JSONB, JSONB+JSON, JSONB+JSONB). I believe it’s possible to write polymorphic versions of these functions, but I’m not that familiar with them just yet.

We’ll start with concatenation: basically joining two JSONB objects into one. This is a good one to start with, as operands must be either JSON or JSONB, so all forms of this function are the same, just with different functions or casting of operands.

Before I begin, I’ll mention a post I saw on Michael Paquier’s excellent blog: Manipulating jsonb data by abusing of key uniqueness. In it, Michael uses the json_object_agg function to build up a JSON object from a query:

CREATE FUNCTION "json_append" (jsonb, jsonb) RETURNS jsonb AS $$

WITH json_union AS
(
  SELECT * FROM jsonb_each($1)
  UNION ALL
  SELECT * FROM jsonb_each($2)
)
SELECT json_object_agg(key, value)::jsonb FROM json_union;

$$ LANGUAGE SQL;

This seems like a good idea, however it actually performs around two orders of magnitude slower than just iterating over the objects and building them up using string_agg and ||:

CREATE FUNCTION "json_concatenate" (jsonb, jsonb) RETURNS jsonb AS $$

SELECT ('{' || string_add(to_json("key")::text || ':' ||"value", ',') || '}')::jsonb
FROM (
  SELECT * FROM jsonb_each($1) UNION ALL SELECT * jsonb_each($2)
);

$$ LANGUAGE SQL;

CREATE OPERATOR || (
  LEFTARG = jsonb,
  RIGHTARG = jsonb,
  PROCEDURE = jsonb_concatenate
);
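With the operator in place, usage mirrors the hstore equivalent; since jsonb discards all but the last of any duplicate keys, the right-hand operand’s values should win:

SELECT '{"a": 1, "b": 2}'::jsonb || '{"a": 3, "c": 4}'::jsonb;
-- {"a": 3, "b": 2, "c": 4}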

It seems that this is not just because of the Common Table Expression (although, using a CTE does make the second function perform just as poorly).

=# SELECT * FROM BENCHMARK(10000,
  'jsonb_concatenate(''{"a": 1, "b":2}''::jsonb, ''{"a":2}''::jsonb)',
  'jsonb_append(''{"a": 1, "b":2}''::jsonb, ''{"a":2}''::jsonb)'
  );
           code              |  runtime   | corrected
-----------------------------+------------+------------
 [Control]                   | 0.00550699 |          0
 jsonb_concatenate(...)      | 0.00652981 | 0.00102282
 jsonb_append(...)           |   0.484099 |   0.478592

My attempt at reimplementing the json_object_agg aggregate function in SQL proved even slower. Not at all surprising.

The next one we will tackle is the - operator from Hstore.

hstore - text     : delete key from left operand
hstore - text[]   : delete keys from left operand
hstore - hstore   : delete matching pairs from left operand

We can reimplement these for JSON: first as functions, and then create operators using those if we want. What is interesting is that we use the same construction pattern for our output JSONB object; however, I haven’t been able to figure out how to extract this out into a separate function.

CREATE OR REPLACE FUNCTION "jsonb_subtract"(
  "json" jsonb,
  "remove" TEXT
)
  RETURNS jsonb
  LANGUAGE sql
  IMMUTABLE
  STRICT
AS $function$
SELECT CASE WHEN "json" ? "remove" THEN COALESCE(
  (SELECT ('{' || string_agg(to_json("key")::text || ':' || "value", ',') || '}')
     FROM jsonb_each("json") -- Until this function is added!
    WHERE "key" <> "remove"),
  '{}'
)::jsonb
ELSE "json"
END
$function$;

CREATE OPERATOR - (
  LEFTARG = jsonb,
  RIGHTARG = text,
  PROCEDURE = jsonb_subtract
);

You’ll notice that there’s a test for if the key to remove is in the object first: this should be much faster in the situation where it doesn’t appear, as we then don’t need to recreate the object.
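A couple of quick examples of the operator we just defined (the second one takes the short-circuit path described above):

SELECT '{"a": 1, "b": 2}'::jsonb - 'a';
-- {"b": 2}

SELECT '{"a": 1, "b": 2}'::jsonb - 'c';
-- {"a": 1, "b": 2}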

The other forms are quite similar, but the WHERE clause varies, and the initial test varies or is removed:

CREATE OR REPLACE FUNCTION "jsonb_subtract"(
  "json" jsonb,
  "keys" TEXT[]
)
  RETURNS jsonb
  LANGUAGE sql
  IMMUTABLE
  STRICT
AS $function$
SELECT CASE WHEN "json" ?| "keys" THEN COALESCE(
  (SELECT ('{' || string_agg(to_json("key")::text || ':' || "value", ',') || '}')
     FROM jsonb_each("json")
    WHERE "key" <> ALL ("keys")),
  '{}'
)::jsonb
ELSE "json"
END
$function$;

CREATE OPERATOR - (
  LEFTARG = jsonb,
  RIGHTARG = text[],
  PROCEDURE = jsonb_subtract
);

CREATE OR REPLACE FUNCTION "jsonb_subtract"(
  "json" jsonb,
  "remove" jsonb
)
  RETURNS jsonb
  LANGUAGE sql
  IMMUTABLE
  STRICT
AS $function$
SELECT COALESCE(
  (SELECT ('{' || string_agg(to_json("key")::text || ':' || "value", ',') || '}')
     FROM jsonb_each("json")
    WHERE "json"->>"key" <> ("remove"->>"key")),
  '{}'
)::jsonb
$function$;

CREATE OPERATOR - (
  LEFTARG = jsonb,
  RIGHTARG = jsonb,
  PROCEDURE = jsonb_subtract
);

There is also the #= operator for hstore: but this seems to be record #= hstore, rather than hstore #= hstore. I’m not sure how to implement this, but I can implement a jsonb #= jsonb function:

CREATE OR REPLACE FUNCTION "jsonb_update_only_if_present"(
  "json" jsonb,
  "other" jsonb
)
  RETURNS jsonb
  LANGUAGE sql
  IMMUTABLE
  STRICT
AS $function$

SELECT COALESCE(
  (SELECT ('{' || string_agg(to_json("key")::text || ':' || "value", ',') || '}')
     FROM (SELECT * FROM jsonb_each("json") UNION ALL SELECT * FROM jsonb_each("other")) AS a
     WHERE "json" ? "key"::text
  ),
  '{}'
)::jsonb
$function$;

CREATE OPERATOR #= (
  LEFTARG = jsonb,
  RIGHTARG = jsonb,
  PROCEDURE = jsonb_update_only_if_present
);
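So, keys that don’t already exist in the left operand should be ignored, while existing keys are updated (again leaning on jsonb keeping only the last of any duplicate keys):

SELECT '{"a": 1, "b": 2}'::jsonb #= '{"a": 5, "c": 9}'::jsonb;
-- {"a": 5, "b": 2}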

Finally, there are operators %% (convert to array of alternating key/value pairs), and %# (convert to 2-dimensional array of keys, values).

I haven’t figured out a way to create these yet either.


So, that’s what I’ve got so far. Obviously, these will be slower than pure C implementations; however, we can run some benchmarks against the hstore operators for comparison.

=# SELECT * FROM BENCHMARK(100000,
   ' ''{"a": 1, "b":2}''::jsonb || ''{"a":2}''::jsonb',
   ' ''a=>1, b=>2''::hstore || ''a=>2''::hstore'
   );
 
                     code                      |  runtime  |  corrected
-----------------------------------------------+-----------+-------------
 [Control]                                     | 0.0837061 |           0
  '{"a": 1, "b":2}'::jsonb || '{"a":2}'::jsonb | 0.0842681 | 0.000561953
  'a=>1, b=>2'::hstore || 'a=>2'::hstore       | 0.0843148 | 0.000608683
(3 rows)

Whoa. That’s actually pretty good!

And similar for subtract:

=# SELECT * FROM BENCHMARK(100000,
  ' ''{"a": 1, "b":2}''::jsonb - ''a''::text ',
  ' ''a=>1, b=>2''::hstore - ''a''::text '
);
                  code                  |  runtime  | corrected
----------------------------------------+-----------+------------
 [Control]                              | 0.0818689 |          0
  '{"a": 1, "b":2}'::jsonb - 'a'::text  |  0.083431 | 0.00156212
  'a=>1, b=>2'::hstore - 'a'::text      |  0.083159 | 0.00129008
(3 rows)

I might have to run these with some more complicated queries, and compare the results.

Using Postgres Composite Types in Django

Note: this post turned out to be far more complicated than I had hoped. I may write another one that deals with a less complicated type!

Postgres comes with a pretty large range of column types, and the ability to use these types in an ARRAY. There’s also JSON(B) and Hstore, which are useful for storing structured (but possibly varying) data. Additionally, there are also a range of, well, range types.

However, sometimes you actually want to store data in a strict column, but that isn’t a simple scalar type, or one of the standard range types. Postgres allows you to define your own composite types.

There is a command CREATE TYPE that can be used to create an arbitrary type. There are four forms: for now we will just look at Composite Types.

We will create a Composite type that represents the opening hours for a store, or more specifically, the default opening hours. For instance, a store may have the following default opening hours:

+------------+--------+---------+
|    Day     |  Open  |  Close  |
+------------+--------+---------+
|  Monday    |  9 am  |  5 pm   |
|  Tuesday   |  9 am  |  5 pm   |
|  Wednesday |  9 am  |  5 pm   |
|  Thursday  |  9 am  |  9 pm   |
|  Friday    |  9 am  |  5 pm   |
|  Saturday  | 10 am  |  5 pm   |
|  Sunday    | 11 am  |  5 pm   |
+------------+--------+---------+

During the Christmas season this store may be open longer (perhaps even 24 hours). There may also be differences at Easter time, or other public holidays, where the store is closed, or closes early.

It would be nice to be able to store the default opening hours for a store, and then, when creating a week, use these to create concrete (TIMESTAMP) values for each day, which could be overridden on any given day.

There are a few ways we could model this. Postgres has no timerange type, so that’s out. We could create a RANGE type, or we could use (start-time, finish-time). But what about when a store is open after midnight, or for 24 hours? Storing this data implicitly is a real pain, because you need to always check to see if the finish time is less than (or equal to) the start time whenever doing anything. Trust me, this is not the best approach.

An alternative I’ve been toying with is (start-time, interval). You could limit it so that the interval’s maximum is '1 day', but not (from what I can tell) when you define the type. Anyway, the syntax for creating this type is:

CREATE TYPE opening_hours AS (
  start time,
  length interval
);

As an aside, every table in the database also has an associated type (of the same name as the table).
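Values of the new type can be written with a row constructor and a cast (the same form we’ll use in the INSERT below), and individual fields are accessed with the (value).field syntax:

SELECT ('09:00', '8 hours')::opening_hours;
-- (09:00:00,08:00:00)

SELECT (('09:00', '8 hours')::opening_hours).length;
-- 08:00:00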

Now, we have our type: we can use it in a table:

CREATE TABLE store (
  store_id SERIAL PRIMARY KEY,
  name TEXT
);

CREATE TABLE default_opening_hours (
  store_id INTEGER REFERENCES store (store_id),
  monday opening_hours,
  tuesday opening_hours,
  wednesday opening_hours,
  thursday opening_hours,
  friday opening_hours,
  saturday opening_hours,
  sunday opening_hours
);

An alternative way of storing this information might be to use an array of opening_hours, directly on the store model. We’ll use this one instead, as it’s a little neater (and means we will look at how to use opening_hours[] later too).

-- Replace the two tables created above with this single-table version.
DROP TABLE default_opening_hours;
DROP TABLE store;

CREATE TABLE store (
  store_id SERIAL PRIMARY KEY,
  name TEXT,
  default_opening_hours opening_hours[7]
);

Now, we can put data in there:

INSERT INTO store (name, default_opening_hours) VALUES
(
  'John Martins',
  ARRAY[
    ('09:00', '08:00')::opening_hours,
    ('09:00', '08:00')::opening_hours,
    ('09:00', '08:00')::opening_hours,
    ('09:00', '12:00')::opening_hours,
    ('09:00', '08:00')::opening_hours,
    ('10:00', '07:00')::opening_hours,
    ('11:00', '06:00')::opening_hours
  ]
);

Note how we need to cast all of the values from record to opening_hours.


In practice, we would probably also want to have some type of restriction ensuring that the opening time of one day, plus that day’s open hours, is less than or equal to the starting time of the next day. I’m still not sure of the best way to do this in Postgres, but it is possible to do it in Django.


Speaking of Django, we want to be able to access this data type there. We can leverage a really nice feature of Psycopg2 to have these values automatically turned into a Python namedtuple. We do this by registering the type within Psycopg2, using the Django cursor.

from django.db import connection
from psycopg2.extras import register_composite

register_composite('opening_hours', connection.cursor().cursor)

But, this is only half of the pattern. We also need to register an adapter so that values going back the other way are also automatically cast into opening_hours.

from django.db import connection
from psycopg2.extras import register_composite
from psycopg2.extensions import register_adapter, adapt, AsIs

# Get a reference to the namedtuple class
OpeningHours = register_composite(
  'opening_hours',
  connection.cursor().cursor,
  globally=True
).type

def adapt_opening_hours(value):
  return AsIs("(%s, %s)::opening_hours" % (
    adapt(value.start).getquoted(),
    adapt(value.length).getquoted()
  ))

register_adapter(OpeningHours, adapt_opening_hours)

Now, we can fetch data from the database, and know that we will get OpeningHours instances, and, when passing an OpeningHours instance back to the database, know it will be converted into the correct type.
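As a rough illustration of the round trip (this assumes the store row inserted earlier, and uses the Django connection directly):

from django.db import connection

cursor = connection.cursor()
cursor.execute("SELECT default_opening_hours[1] FROM store")
monday = cursor.fetchone()[0]   # an OpeningHours namedtuple: (start, length)

# Going the other way, the adapter renders the value as
# (start, length)::opening_hours, so it can be passed like any other parameter.
cursor.execute("SELECT %s", [monday])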

Obviously, in order to do this, the type must exist in the database. We did that manually in this case. In a real situation you would want to do that as a database migration. And that is where things get tricky. You can’t run the register_adapter function until the type exists in the database. I did come up with a relatively neat workaround for this when writing a framework for generic Composite fields, where the registration of the composite type attempts to execute, and if it fails, it stores the data for later registration, and then the actual migration operation fires off a signal, which is handled by a listener that actually performs the registration.

The final piece of the puzzle is the Django Field subclass, which is actually not that complicated. In essence, we are relying on Psycopg to handle the adaptation in both directions, so it can be a bare field (perhaps with a formfield method to get a custom form field). In practice, I wrote the generic CompositeField subclass, which uses some metaclass magic to handle the late registration:

from django.db.models import fields
from django.db import connection
from django.dispatch import receiver, Signal

from psycopg2.extras import register_composite
from psycopg2.extensions import register_adapter, adapt, AsIs
from psycopg2 import ProgrammingError


_missing_types = {}

class CompositeMeta(type):
    def __init__(cls, name, bases, clsdict):
        super(CompositeMeta, cls).__init__(name, bases, clsdict)
        cls.register_composite()

    def register_composite(cls):
        db_type = cls().db_type(connection)
        if db_type:
            try:
                cls.python_type = register_composite(
                    db_type,
                    connection.cursor().cursor,
                    globally=True
                ).type
            except ProgrammingError:
                _missing_types[db_type] = cls
            else:
                def adapt_composite(composite):
                    return AsIs("(%s)::%s" % (
                        ", ".join([
                            adapt(getattr(composite, field)).getquoted() for field in composite._fields
                        ]), db_type
                    ))

                register_adapter(cls.python_type, adapt_composite)


class CompositeField(fields.Field):
    """
    A handy base class for defining your own composite fields.

    It registers the composite type.
    """
    __metaclass__ = CompositeMeta


composite_type_created = Signal(providing_args=['db_type'])

@receiver(composite_type_created)
def register_composite_late(sender, db_type, **kwargs):
    _missing_types.pop(db_type).register_composite()

We also want to have a custom migration operation:

from django.db.migrations.operations.base import Operation

# Or wherever the code above is located.
from .fields.composite import composite_type_created


class CreateCompositeType(Operation):
    def __init__(self, name=None, fields=None):
        self.name = name
        self.fields = fields

    @property
    def reversible(self):
        return True

    def state_forwards(self, app_label, state):
        pass

    def database_forwards(self, app_label, schema_editor, from_state, to_state):
        schema_editor.execute('CREATE TYPE %s AS (%s)' % (
            self.name, ", ".join(["%s %s" % field for field in self.fields])
        ))
        composite_type_created.send(sender=self.__class__, db_type=self.name)

    def state_backwards(self, app_label, state):
        pass

    def database_backwards(self, app_label, schema_editor, from_state, to_state):
        schema_editor.execute('DROP TYPE %s' % self.name)

This is a bit manual, however. You need to create your own migration that creates the composite type, and then begin to use the field.

# migrations/XXXX_create_opening_hours.py
from django.db import migrations
# ...plus an import of CreateCompositeType from wherever the operation above lives.

class Migration(migrations.Migration):
    dependencies = []

    operations = [
        CreateCompositeType(
            name='opening_hours',
            fields=[
                ('start', 'time'),
                ('length', 'interval')
            ],
        ),
    ]

The place this pattern falls down is that this migration must be manually created: we don’t have any way to automatically create the migration from the Field subclass, which just looks like:

class OpeningHoursField(CompositeField):

    def db_type(self, connection):
        return 'opening_hours'

    def formfield(self, **kwargs):
        defaults = {
            'form_class': OpeningHoursFormField
        }
        defaults.update(**kwargs)
        return super(OpeningHoursField, self).formfield(**defaults)

I think in the future I’ll attempt to use further metaclass magic to allow defining the fields of the Composite type. This could then be used to automatically create a form field (which is a subclass of forms.MultiValueField).

class OpeningHoursField(CompositeField):
    start = models.DateField()
    length = IntervalField()

    def db_type(self, connection):
        return 'opening_hours'

However, in the meantime, we can still get by. I’m not sure it’s going to be possible to inject extra operations into the migration based upon the field types anyway.

Finally, we can use this in a model:

class Store(models.Model):
    store_id = models.AutoField(primary_key=True)
    name = models.CharField(max_length=128)
    default_opening_hours = ArrayField(
        base_field=OpeningHoursField(null=True, blank=True),
        size=7
    )

I’ve used the ArrayField from django.contrib.postgres, purely for illustration purposes.

The CompositeField and associated operation are part of my django-postgres project: once I have worked out some more kinks, I may submit a pull request to django.contrib.postgres, unless someone else beats me to it.

Oh, and a juicy little extra. Above I mentioned something about preventing overlaps. The logic I use in my form is:

import datetime

from django import forms
from django.utils.translation import string_concat, ugettext_lazy as _

import postgres.forms

from .fields import OpeningHoursFormField
from .models import Store


def finish(obj):
    "Given an OpeningHours value, get the finish time"
    date = datetime.date(1, 1, 1)
    return (datetime.datetime.combine(date, obj.start) + obj.duration).time()


class StoreForm(forms.ModelForm):
    OVERLAPS_PREVIOUS = _('Open hours overlap previous day.')

    default_opening_hours = postgres.forms.SplitArrayField(
        base_field=OpeningHoursFormField(required=False),
        size=7,
    )

    class Meta:
        model = Store

    def clean_default_opening_hours(self):
        opening_hours = self.cleaned_data['default_opening_hours']
        field = self.fields['default_opening_hours']

        # Ensure consecutive days do not overlap.
        errors = []

        for i in range(7):
            today = opening_hours[i]
            if today.start is None or today.duration is None:
                continue

            yesterday = opening_hours[(i + 6) % 7]

            if yesterday.start is None or yesterday.duration is None:
                continue

            if finish(yesterday) <= yesterday.start:
                if today.start < finish(yesterday):
                    errors.append(forms.ValidationError(
                        string_concat(
                          field.error_messages['item_invalid'],
                          self.OVERLAPS_PREVIOUS
                        ),
                        code='item_invalid',
                        params={'nth': i}
                    ))

        if errors:
            raise forms.ValidationError(errors)

        return opening_hours

I’m currently not displaying the duration/length: I dynamically calculate it based on the entered start/finish pair, but that’s getting quite complicated.

Performance testing Adjacency List recursive queries

Yesterday, I wrote up some ideas about doing recursive queries on Adjacency Lists using Postgres. Today, I wrote up some code that allows me to run some tests on larger data sets. It’s worth noting that this is still somewhat “toy” data, but I did see comparable results with a real query.

Firstly, our data structure:

CREATE TABLE node (
  node_id SERIAL PRIMARY KEY,
  parent_id INTEGER,
  FOREIGN KEY (parent_id) REFERENCES node(node_id)
);

Now, we want to be able to populate it with test data. This function will allow you to populate any number of records, with a 10% chance that any given record will be a root (have no parent). If it has a parent, it will be randomly selected from all existing rows. This means that earlier rows have a much higher chance of being a parent, and the first row is overwhelmingly likely to have the most descendants (there is a 90% chance that row 2 will have row 1 as its parent, and any descendants of row 2 will in turn also be descendants of row 1…)

CREATE OR REPLACE FUNCTION populate_nodes(count integer) RETURNS void AS $$
BEGIN
  FOR i IN 2..count LOOP
    IF ((SELECT count(*) FROM node) = 0) or (random() < 0.1) THEN
      INSERT INTO node (parent_id) SELECT NULL;
    ELSE
      INSERT INTO node (parent_id) SELECT node_id FROM node OFFSET random() * (SELECT count(*) FROM node) LIMIT 1;
    END IF;
  END LOOP;
END;
$$ LANGUAGE plpgsql;

-- Let's stick 10k records in there
SELECT populate_nodes(10000);

For now, we want to find all descendants of node 1.

I can think of eleven ways we could write this query:

  1. INNER JOIN with a RECURSIVE VIEW
  2. Implicit CROSS JOIN with RECURSIVE VIEW, filtered using a WHERE clause
  3. Sub-query with a RECURSIVE VIEW
  4. INNER JOIN with a MATERIALIZED VIEW based on the RECURSIVE VIEW
  5. Implicit CROSS JOIN (filtered) with MATERIALIZED VIEW based on RECURSIVE VIEW
  6. Sub-query with MATERIALIZED VIEW based on RECURSIVE VIEW
  7. RECURSIVE CTE, using an INNER JOIN
  8. RECURSIVE CTE, using an implicit CROSS JOIN (filtered)
  9. INNER JOIN with RECURSIVE CTE
  10. Implicit CROSS JOIN (filtered) with RECURSIVE CTE
  11. Subquery that is a RECURSIVE CTE

(Whilst some of these seem similar, we’ll see below how they differ).

In all cases, the actual query used for the tree will be:

SELECT node_id, ARRAY[]::integer[], FALSE FROM node WHERE parent_id IS NULL
UNION ALL
SELECT n.node_id, t.ancestors || n.parent_id, n.parent_id = ANY(t.ancestors)
FROM node n, node_tree t WHERE n.parent_id = t.node_id AND NOT cycle

It is extremely likely that the MATERIALIZED VIEW versions will be fastest: it is worth noting that in a very write-heavy environment (where you still need 100% up-to-date data), there would be an extra cost with the REFRESH MATERIALIZED VIEW.

As for which other ones will be fast (or fast enough), I would expect the CTE and VIEW versions to be roughly equivalent, as they appear to do the same amount of work. I’m not sure if the last three will perform as well as the others, as it seems that a “root” CTE would perform better than one later down the track.

So, let’s get underway. I wasn’t able to easily use the benchmark function I wanted to use, so I repeated each query five times and took the average.

We need our views:

CREATE RECURSIVE VIEW node_tree (node_id, ancestors, cycle) AS (
  SELECT node_id, ARRAY[]::integer[], FALSE FROM node WHERE parent_id IS NULL
  UNION ALL
  SELECT n.node_id, t.ancestors || n.parent_id, n.parent_id = ANY(t.ancestors)
  FROM node n, node_tree t WHERE n.parent_id = t.node_id AND NOT cycle
);

CREATE MATERIALIZED VIEW node_tree_mat AS (SELECT * FROM node_tree);

Results

So, some results. I’ll show the query, and then the timing (from EXPLAIN ANALYZE).

#1

SELECT * FROM node INNER JOIN node_tree USING (node_id) WHERE 1 = ANY(ancestors);

Average Time: 54.9ms (stddev 1.99)

#2

SELECT * FROM node n, node_tree t WHERE n.node_id = t.node_id AND 1 = ANY(ancestors);

Average Time: 57.2ms (stddev 2.98)

#3

SELECT * FROM node WHERE node_id IN
  (SELECT node_id FROM node_tree WHERE 1 = ANY(ancestors));

Average Time: 58.5ms (stddev 3.67)

#4

SELECT * FROM node INNER JOIN node_tree_mat USING (node_id) WHERE 1 = ANY(ancestors);

Average Time: 12.2ms (stddev 0.80)

#5

SELECT * FROM node n, node_tree_mat t WHERE n.node_id = t.node_id AND 1 = ANY(ancestors);

Average Time: 11.7ms (stddev 0.90)

This is the fastest query, but not significantly more so than #4.

#6

SELECT * FROM node WHERE node_id IN
  (SELECT node_id FROM node_tree_mat WHERE 1 = ANY(ancestors));

Average Time: 24.0ms (stddev 0.41)

Interestingly, this is much slower than using a JOIN.

#7

WITH RECURSIVE node_tree_cte(node_id, ancestors, cycle) AS (
    SELECT node_id, ARRAY[]::integer[], FALSE FROM node WHERE parent_id IS NULL
  UNION ALL
    SELECT n.node_id, t.ancestors || n.parent_id, n.parent_id = ANY(t.ancestors)
    FROM node n, node_tree t WHERE n.parent_id = t.node_id AND NOT cycle
) SELECT n.* FROM node n INNER JOIN node_tree_cte t USING (node_id)
WHERE 1 = ANY(t.ancestors);

Average Time: 97.0ms (stddev 1.87)

Immediately, we see that using a CTE has a performance hit over using a VIEW. Unexpected.

#8

WITH RECURSIVE node_tree_cte(node_id, ancestors, cycle) AS (
    SELECT node_id, ARRAY[]::integer[], FALSE FROM node WHERE parent_id IS NULL
  UNION ALL
    SELECT n.node_id, t.ancestors || n.parent_id, n.parent_id = ANY(t.ancestors)
    FROM node n, node_tree t WHERE n.parent_id = t.node_id AND NOT cycle
) SELECT n.* FROM node n, node_tree_cte t
WHERE n.node_id = t.node_id AND 1 = ANY(t.ancestors);

Average Time: 96.1ms (stddev 1.16)

#9

SELECT * FROM node INNER JOIN (
  WITH RECURSIVE node_tree_cte(node_id, ancestors, cycle) AS (
    SELECT node_id, ARRAY[]::integer[], FALSE FROM node WHERE parent_id IS NULL
    UNION ALL
    SELECT n.node_id, t.ancestors || n.parent_id, n.parent_id = ANY(t.ancestors)
    FROM node n, node_tree t WHERE n.parent_id = t.node_id AND NOT cycle
  ) SELECT node_id FROM node_tree_cte WHERE 1 = ANY(ancestors)
) node_tree_cte USING (node_id);

Average Time: 114.3ms (stddev 4.64)

This is the slowest (but again, only just, and not significantly more than #10).

#10

SELECT * FROM node, (
  WITH RECURSIVE node_tree_cte(node_id, ancestors, cycle) AS (
    SELECT node_id, ARRAY[]::integer[], FALSE FROM node WHERE parent_id IS NULL
    UNION ALL
    SELECT n.node_id, t.ancestors || n.parent_id, n.parent_id = ANY(t.ancestors)
    FROM node n, node_tree t WHERE n.parent_id = t.node_id AND NOT cycle
  ) SELECT node_id FROM node_tree_cte WHERE 1 = ANY(ancestors)
) node_tree_cte WHERE node.node_id = node_tree_cte.node_id;

Average Time: 110.0ms (stddev 1.05)

#11

SELECT * FROM node WHERE node_id IN (
  WITH RECURSIVE node_tree_cte(node_id, ancestors, cycle) AS (
    SELECT node_id, ARRAY[]::integer[], FALSE FROM node WHERE parent_id IS NULL
    UNION ALL
    SELECT n.node_id, t.ancestors || n.parent_id, n.parent_id = ANY(t.ancestors)
    FROM node n, node_tree t WHERE n.parent_id = t.node_id AND NOT cycle
  ) SELECT node_id FROM node_tree_cte WHERE 1 = ANY(ancestors)
);

Average Time: 96.1ms (stddev 5.50)

Discussion

So, it appears that Common Table Expressions are nearly twice as slow as using a RECURSIVE VIEW. I didn’t expect that at all, as I thought they were equivalent. Unsurprisingly, MATERIALIZED VIEW is much faster.

This has some implications for the stuff I was working on: I was using a query of the #11 form (a sub-query that is a WITH RECURSIVE statement), which, as it turns out is just as fast as doing a “root” CTE. However, it’s still far slower than doing a JOIN with a VIEW.

The problem I have now is that there is no way to have the addition of a Field to a Model in Django cause an extra migration operation to be added. One solution would be to manually add a RunSQL operation, but that is messy. I’ll also have to investigate the costs of REFRESH MATERIALIZED VIEW.

Generating dummy ABNs in Python

I had the need to generate a fake ABN the other day.

Here’s a python function to do it:

#! /usr/bin/python
import sys
import random

weighting = [10, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19]

def validate(abn):
    """
    Validate that the provided number is indeed an ABN.
    """
    values = [int(digit) for digit in abn]
    values[0] -= 1
    total = sum([x * w for (x, w) in zip(values, weighting)])
    return total % 89 == 0

def abn():
    """
    Generate a random ABN
    """
    value = ''.join([str(int(random.random() * 10)) for i in range(9)])
    temp = list('00%s' % value)
    total = sum([w * x for (w,x) in zip(weighting, map(int, temp))])
    remainder = total % 89
    prefix = 10 + (89 - remainder)
    abn = '%s%s' % (prefix, value)
    assert validate(abn), '%s is not a valid ABN' % abn
    return abn

sys.stdout.write(abn())
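If you want to convince yourself the check-digit arithmetic holds, a quick loop like this (run in a shell alongside the functions above) should never trip the assertion:

for _ in range(1000):
    candidate = abn()
    assert len(candidate) == 11 and validate(candidate)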

For extra goodness, assign it to :abn in TextExpander.

It’s just a shame Xero implemented their ABN field validation correctly, as it prevents you from typing non-digits into it.

Long Live Adjacency Lists

I recently wrote about the excellent book SQL Antipatterns, and in it briefly discussed the tree structures. I’ve been thinking about trees in Postgres a fair bit lately, and a discussion on #django gave me further incentive to revisit this topic.

The book discusses four methods of storing a tree in a database.

Adjacency Lists, apart from the inability to grab a full or partial tree easily, are the simplest to understand. The child object stores a reference to its parent. Because this is a foreign key, it always maintains referential integrity. Fetching a parent is simple, as is fetching all children, or siblings. It’s only when you need to fetch an arbitrary depth that things become problematic, unless you use a recursive query. More on that later.

Postgres has an extension called ltree, which provides an implementation of a Path Enumeration, but one thing that really bothers me about this type of structure is the lack of referential integrity. In practice, I’m not sure what having this ltree structure would give you over simply storing the keys in an ARRAY type. Indeed, if Postgres ever gets Foreign Key constraints for ARRAY elements (which there is a patch floating around for), this becomes even more compelling. It also seems to me that restructuring a tree becomes a bit more challenging in a Path Enumeration than an Adjacency List.

Nested Sets are also interesting, and maintain FK integrity, but require potentially rewriting lots of data when any change is made to the tree. They aren’t that appealing to me: perhaps I fail to see any big advantages of this structure.

Finally, Closure Tables are perhaps the most interesting. This stores all ancestor-descendant relationships, rather than just parent-child, which again requires more work when adding or removing. Again, Referential Integrity is preserved, but it seems like there is lots of work to maintain them.

From all of these, there are some significant advantages, in my mind, to using a simple Adjacency List.

  1. Adding a new row never requires you to alter any other rows in the database.
  2. Moving a subtree to a different location only requires a change to one row in the database.
  3. It’s never possible to end up with Referential Integrity errors: the database will prevent you from deleting a parent row whilst it still has children (or, you may set it to CASCADE or SET NULL the children automatically).
  4. It’s conceptually very simple. Everyone understands the parent-child relationship (and all of the relationships that follow, like grand-parents). It’s a similar mental model to how we think about our own families, except that in a tree each node has exactly one parent.

There are really only two things that are hard to do:

  1. Given a node, select all descendants of that node.
  2. Given a node, select all ancestors of that node.

But, as we shall see shortly, it is possible to do these in Postgres using some nice recursive features.

There is another advantage to using an Adjacency List, this time from the perspective of Django. We can do it without needing to install a new package, or subclass or mix-in a new Model:

class Node(models.Model):
    node_id = models.AutoField(primary_key=True)
    parent = models.ForeignKey('self', null=True, blank=True, related_name='children')

That’s it.
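The simple relationships then come straight from the ORM (assuming a Node with pk=1 exists):

node = Node.objects.get(pk=1)

node.parent           # the parent Node (or None, for a root)
node.children.all()   # direct children, via the related_name
# Siblings (assuming node is not a root):
node.parent.children.exclude(pk=node.pk)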

Now, using Postgres, it’s possible to build a recursive VIEW that contains the whole tree:

CREATE RECURSIVE VIEW tree (node_id, ancestors) AS (
    SELECT node_id, '{}'::integer[]
    FROM nodes WHERE parent_id IS NULL
  UNION ALL
    SELECT n.node_id, t.ancestors || n.parent_id
    FROM nodes n, tree t
    WHERE n.parent_id = t.node_id
);

We can then query this (replacing %s with the parent node id):

SELECT node_id
FROM nodes INNER JOIN tree USING (node_id)
WHERE %s = ANY(ancestors);

Or, if you want to select for multiple parents:

SELECT node_id
FROM nodes INNER JOIN tree USING (node_id)
WHERE ARRAY[%s, %s] && ancestors;

This actually performs relatively well, and, if it doesn’t do well enough, we could create a MATERIALIZED VIEW based on the recursive view, and query that instead (refreshing it whenever we need to, perhaps using a trigger).

CREATE MATERIALIZED VIEW tree_m AS (SELECT * FROM tree);

CREATE FUNCTION refresh_tree_m() RETURNS trigger AS $$
  BEGIN
    REFRESH MATERIALIZED VIEW tree_m;
    RETURN NULL;
  END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trig_refresh_tree_m AFTER TRUNCATE OR INSERT OR UPDATE OR DELETE
ON nodes FOR EACH STATEMENT
EXECUTE PROCEDURE refresh_tree_m();

This view is still not perfect though. We can improve it to allow us to limit depth of ancestry:

CREATE RECURSIVE VIEW tree (node_id, ancestors, depth) AS (
    SELECT node_id, '{}'::integer[], 0
    FROM nodes WHERE parent_id IS NULL
  UNION ALL
    SELECT n.node_id, t.ancestors || n.parent_id, t.depth + 1
    FROM nodes n, tree t
    WHERE n.parent_id = t.node_id
);

SELECT node_id FROM nodes INNER JOIN tree USING (node_id)
WHERE %s = ANY(ancestors) AND depth < %s;

This is pretty good now, but if we have cycles in our tree (yes, this makes it technically no longer a tree, but a graph, of which a tree is a restricted kind), this query will run forever. There’s a pretty neat trick to prevent cycles:

CREATE RECURSIVE VIEW tree (node_id, ancestors, depth, cycle) AS (
    SELECT node_id, '{}'::integer[], 0, FALSE
    FROM nodes WHERE parent_id IS NULL
  UNION ALL
    SELECT
      n.node_id, t.ancestors || n.parent_id, t.depth + 1,
      n.parent_id = ANY(t.ancestors)
    FROM nodes n, tree t
    WHERE n.parent_id = t.node_id
    AND NOT t.cycle
);

You don’t need to use the cycle column outside of the view.

The query used for the view can be repurposed into a Common Table Expression, which is basically a way of defining a view that only exists for the query we are executing (but will itself only be executed once, even if it’s referred to lots of times):

WITH RECURSIVE tree (node_id, ancestors, depth, cycle) AS (
    SELECT node_id, '{}'::integer[], 0, FALSE
    FROM nodes WHERE parent_id IS NULL
  UNION ALL
    SELECT
      n.node_id, t.ancestors || n.parent_id, t.depth + 1,
      n.parent_id = ANY(t.ancestors)
    FROM nodes n, tree t
    WHERE n.parent_id = t.node_id
    AND NOT t.cycle
) SELECT n.* FROM nodes n INNER JOIN tree USING (node_id)
WHERE %s = ANY(ancestors);

You can see that this syntax basically defines the view before running the real query.


Looking at it from the perspective of Django, we would like to be able to spell a query something like:

Node.objects.filter(parent__recursive=node)
Node.objects.filter(parent__recursive__in=nodes)
Node.objects.filter(children__recursive__contains=node)

The problem we have with using the CTE immediately above is that we don’t have access to the full query at the time we are dealing with the filter. We could define the view prior to running the query (perhaps in a migration), but this means it’s more than just a simple field: although with the new migrations framework, we could make it so that makemigrations automatically adds a migration operation to create the recursive view.

The other solution is to still use a recursive CTE, but use it as a subquery. I’m still investigating if this will have poor performance characteristics.

Here is an implementation of doing just that:

from django.db import models

SQL = """
WITH RECURSIVE "tree" ("{pk}", "related", "cycle") AS (
    SELECT "{pk}", ARRAY[]::integer[], FALSE
    FROM "{table}" WHERE "{fk}" IS NULL
  UNION ALL
    SELECT a."{pk}", b."related" || a."{fk}", a."{fk}" = ANY(b."related")
    FROM "tree" b, "{table}" a
    WHERE a."{fk}" = b."{pk}" AND NOT b."cycle"
) {query}
"""


class RecursiveRelation(models.ForeignKey):
    def __init__(self, *args, **kwargs):
        super(RecursiveRelation, self).__init__('self', *args, **kwargs)

    def get_lookup_constraint(self, constraint_class, alias, targets, sources, lookups,
                              raw_value):
        if lookups[0] == 'recursive':
            # With a recursive query, we want to build up a subquery that creates
            # the simplest possible tree we can deal with.
            data = {
                'fk': self.get_attname(),
                'pk': self.related_fields[0][1].get_attname(),
                'table': self.model._meta.db_table
            }
            if lookups[-1] == 'in':
                if targets[0] == self:
                    raw_value = ForeignKeyRecursiveInLookup(raw_value, **data)
                else:
                    raw_value = ForeignKeyRecursiveReverseInLookup(raw_value, **data)
            else:
                if targets[0] == self:
                    raw_value = ForeignKeyRecursiveLookup(raw_value, **data)
                else:
                    raw_value = ForeignKeyRecursiveReverseLookup(raw_value, **data)

            # Rewrite some variables so we get correct behaviour.

            # This makes the query based on the original table, not the joined version,
            # which was skipping a level of relation. It still joins the table, however,
            # which can't be great for performance
            alias = self.model._meta.db_table
            # This sets the correct lookup type, removing the recursive bit.
            lookups = lookups[1:] or ['exact']

        return super(RecursiveRelation, self).get_lookup_constraint(
            constraint_class, alias, targets, sources, lookups, raw_value
        )


class ForeignKeyRecursiveLookup(object):
    query = 'SELECT "{pk}" FROM "tree" WHERE %s = ANY("related")'

    def __init__(self, value, **kwargs):
        self.value = value
        self.data = kwargs

    def get_compiler(self, *args, **kwargs):
        return self

    def as_subquery_condition(self, alias, columns, qn):
        sql = SQL.format(
            query=self.query.format(**self.data),
            **self.data
        )
        return '%s.%s IN (%s)' % (qn(alias), qn(self.data['pk']), sql), [self.value]


class ForeignKeyRecursiveInLookup(ForeignKeyRecursiveLookup):
    query = 'SELECT "{pk}" FROM "tree" WHERE %s && "related"'


class ForeignKeyRecursiveReverseLookup(ForeignKeyRecursiveLookup):
    query = 'SELECT unnest("related") FROM "tree" WHERE "{pk}" = %s'


class ForeignKeyRecursiveReverseInLookup(ForeignKeyRecursiveLookup):
    query = 'SELECT unnest("related") FROM "tree" WHERE "{pk}" IN %s'

If we were to use an existing view (created using a migration), then the structure would be largely the same: only the SQL constant would change, becoming much simpler:

SQL = 'SELECT {pk} FROM "{table}_{fk}_tree" WHERE {where}'

But then we would need some sort of name mangling for the view: I’ve suggested <tablename>_<fk-name>_tree.

I went into this exercise thinking it would be simple: just write a Lookup (or Transform), but it seems that Foreign Keys in django have a fair bit of special casing. There’s also a bit of lax code around the names of lookups: I may polish it up at some stage.

For now, though, you use it as:

class Node(models.Model):
    node_id = models.AutoField(primary_key=True)
    parent = RecursiveRelation(null=True, blank=True, related_name='children')

Postgres VIEW meet Django Model

Postgres VIEWs are a nice way to store a subset of a table in a way that can itself be queried, or perhaps slightly or radically changing the shape of your table. It has a fairly simple syntax:

CREATE VIEW "foo" AS
SELECT "bar", "baz", "qux"
FROM "corge"
WHERE "grault" IS NULL;

You may use any valid SELECT query as the source of a VIEW, including one that contains UNION or UNION ALL. You can use this form to create a view that takes two similarly formatted tables and combines them into one logical table. Note that for a UNION to work, the columns (and column types) must be identical between the two parts of the query. A UNION will do extra work to ensure all rows are unique: UNION ALL may perform better, especially if you know your rows will be unique (or you need duplicates).
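For example, a UNION ALL view combining two similarly shaped tables might look like this (the table and column names are purely illustrative):

CREATE VIEW "all_readings" AS
SELECT "id", "recorded_at", "reading" FROM "current_readings"
UNION ALL
SELECT "id", "recorded_at", "reading" FROM "archived_readings";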

By default, a Postgres VIEW is dynamic, and read-only. With the use of the CREATE MATERIALIZED VIEW form, it’s possible to have a cached copy stored on disk, which requires a REFRESH MATERIALIZED VIEW "viewname" in order to cause an update.

It’s also possible to create a writeable VIEW, but I’m not going to discuss those now.


There is a feature of Django that makes it really simple to use a VIEW as the read-only source of a Django Model subclass: managed = False.

Given the VIEW defined above, we can write a Model that will allow us to query it:

from django.db import models

class Foo(models.Model):
    bar = models.CharField()
    baz = models.CharField()
    qux = models.CharField()

    class Meta:
      managed = False

Psycopg2 also has the ability to automatically convert values as it fetches them, so you don’t even really need to set the fields as the correct type: but you will probably want to where possible, as an aid to code readability.

In my case, I was returning a two-dimensional ARRAY of TIMESTAMPTZ, but didn’t want to have to include the full code for an ArrayField. So, I just defined it as a CharField, and psycopg2 just gave me the type of object I actually wanted anyway.

There is one little catch, and the code above will not quite work.

Django requires a primary key, even though in this case it makes no sense. You could define any field as a primary key, include a relevant key field from the parent model, or even a dummy value that is the same on every row. Relying on the same psycopg2 trick as above, you could use <tablename>-<id> so as to ensure uniqueness, even though that is not normally a valid value for a Django AutoField.
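One way to do the latter is to synthesise the value in the view itself; this sketch assumes the source table has an "id" column:

CREATE VIEW "foo" AS
SELECT 'corge-' || "id" AS "id", "bar", "baz", "qux"
FROM "corge"
WHERE "grault" IS NULL;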

You probably need to be a little careful here, as if you are doing comparisons, Django will test __class__ and pk for testing equality, so you could hurt yourself if you aren’t careful.

You may also want to prevent write access at the Django level. Overriding save() and delete() on the Model class would be a good start, as well as writing a custom Manager/QuerySet that does the same. You could raise an exception that makes sense, like NotImplementedError, or just leave it as a database error.
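A minimal sketch of that, building on the model from above:

class Foo(models.Model):
    # ... fields as before ...

    class Meta:
        managed = False

    def save(self, *args, **kwargs):
        raise NotImplementedError("Foo is backed by a read-only database view")

    def delete(self, *args, **kwargs):
        raise NotImplementedError("Foo is backed by a read-only database view")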