Postgres Tree Shootout part 1: Introduction.

I’ve written before about using Adjacency Lists, and even done some performance tests on querying them. Whilst reading a post today, it occurred to me that it might be worthwhile to do a comparison of the various methods of storing hierarchical data in Postgres, and the costs of the same operations on each of those.

This post is just an introduction, with an outline of what I plan to do to run these tests. Please feel free to suggest things that I have missed, or that might be an oversight at my end.

Tree Models

There are four methods of storing the relationships that might form a tree. This analysis will be limited to actual tree, rather than graph structures (no cycles). I plan to detail the data structures in a series of posts, one per method. In each case, where there are multiple ways to store the data, I will attempt to examine each of these.

Adjacency Lists, being the simplest to understand (and the ones I have spent more time on recently), will be discussed first.

Path Enumerations will be next, with a comparison of storing the data using the ltree extension, and using an ARRAY column.

Following this, I’ll make an attempt at using the Closure Table model: where each ancestor-descendant relationship is stored, rather than just the parent-child relationship.

Finally, I’ll have a crack at the Nested Set model. I’m not solidly behind this model for the types of data I’ve had to deal with, but it is a valid mechanism for storing and retrieving this data. Besides, it will be an interesting exercise to implement.

My plan to handle all of these is that all tree manipulation should be “automatic”, that is, adding a node (or removing one, or whatever) should not require explicit updating of the various metadata. This should all be handled by trigger functions on the tables themselves. Whether this turns out to be reasonable we shall see.


I plan to perform the same set of operations (on the same data, rather than randomly generated data) in all models, and compare the complexity and run-time of the various queries. I’m hoping to cover all of the operations that might be performed on a tree structure, so please add any more to the comments.

The data stored in the table will contain more than one tree: this means we can perform operations which add/remove root nodes/whole trees.


  • Insert a single leaf node
  • Insert a single root node (all existing root nodes will then point to this)
  • Insert a single node partway through the tree, with the “replaced” node becoming a child of this (and this one keeps it’s children)


  • Remove a single leaf node
  • Remove a single root node (all children of this are promoted to root nodes)
  • Remove a single node partway through the tree: all children of this node then have their grand-parent as their parent.
  • Remove a subtree (a single non-root node and it’s descendants)
  • Remove a whole tree (a single root node and all descendants)


  • Move a subtree from one parent to another
  • Move a single leaf node to a different parent
  • Move a root node into a tree


  • Fetch all descendants of a given node
  • Fetch the number of descendants of a given node
  • Fetch descendants of a given node to a given depth
  • Fetch the number of descendants of a given node to a given depth
  • Fetch all ancestors of a given node
  • Fetch the number of ancestors of a given node
  • Fetch ancestors of a given node to a given depth
  • Fetch the number of ancestors of a given node to a given depth
  • Fetch all leaf nodes
  • Fetch the number of leaf nodes
  • Fetch all root nodes
  • Fetch the number of root nodes
  • Fetch all non-leaf nodes
  • Fetch the number of non-leaf nodes
  • Fetch all non-root nodes
  • Fetch the number of non-root nodes
  • Fetch all non-root, non-leaf nodes
  • Fetch the number of non-root, non-leaf nodes

Aggregating ranges in Python

This is a bit of a follow-up to Aggregating Ranges in Postgres.

Since we don’t have a nice range type in Python, we will just use a tuple that contains a lower and upper bound. We will assume that this is a canonical form: the lower bound is inclusive, the upper bound is non-inclusive. We will also assume (for simplicity) that the 0-th item is always less than the 1-th item in the tuple.

Given a list of these range-tuples, we want to do the two things we saw in the previous post.

  • Aggregate them into the simplest possible array of range-tuples, using a union operation.
  • Determine the parts that are missing.

We will use in in a similar way:

>>> data =[(2,3),(3,4),(11,12),(5,12),(4,5)]
>>> aggregate_union(data)
(2, 12)
>>> missing_ranges(data)
[(None, 2), (12, None)]

Luckily, None sorts before any integer in python, so we will just be able to use the normal sort.

def aggregate_union(data):
    if not data:
      return None

    sorted_data = sorted(data)
    result = sorted_data[0]

    for lower, upper in sorted_data[1:]:
        # If we ever find result[1] is None, we know it covers any
        # other possible values, so we can stop at that point.
        if result[1] is None:

        if lower > result[1]:
            return None  # Or raise an exception, if you want.

        result = (result[0], max(upper, result[1]))

    return result

The alternative, missing_ranges(data) takes cues from the SQL version too.

def missing_ranges(data):
    if not data:
      return (None, None)

    result = []
    # We do a little fancy stuff here: append an extra item that
    # mimics what we were able to use lead for, but in a different
    # way so we can use [i + 1] later.
    sorted_data = sorted(data) + [(None, None)]

    if sorted_data[0][0] is not None:
        # Note: the upper bound here is not quite correct!
        result.append((None, sorted_data[0][1]))

    for i, (lower, upper) in enumerate(sorted_data):
        # Grab the next item in our sorted list. Normally, we would check
        # to see if we are at the end, but we will rely on the break later
        # on: we can always stop processing when the upper bound of our
        # next item is `None`
        next = sorted_data[i + 1]

        if upper < next[0] or (next[0] is None and upper is not None):
            result.append((upper, next[0]))

        # Now, exit before we ever get to a bounds error on getting the next item.
        if next[1] is None:

    return result

However, there is a problem here that is nicely solved by the Range types within postgres:

>>> missing_ranges(data)
[(None, 3), (12, None)]

We need to subtract from the first object’s upper bound. Which is easy with integer tuple-ranges (basically, any discrete range type), but not so much with continuous ranges.

Aggregating Ranges in Postgres

Postgres’ range types are very cool. You can use them to store, as you may guess, a value that is a range. There are several included range types, and it’s possible to create your own range type. For now, we’ll just look at using a simple int4range: although everything in principle could be applied to any range type.

Firstly, a quick discussion of how the range types work.

There is a literal form, and then a functional form:

SELECT '(2,17]'::int4range;

 (1 row)

SELECT int4range(2, 17, '(]');

(1 row)

The first thing you’ll notice is that postgres converts them to canonical form. We can provide the bound-types: inclusive or exclusive upper and lower values. This is the same notation you might see if you have done some mathematics.

You can also omit the upper and/or lower bounds to get a range object that goes to ±Infinity. Finally, you can also have empty ranges.

There are several operations that can be performed on a range. The most interesting one from the perspective of this article is the UNION operation.

SELECT '[3,17)'::int4range + '[10,20)';

(1 row)

You’ll get an error if your union does not result in a contiguous range. There is no way to store a discontinuous range in postgres (but you could store them in an array of int4range[], for instance).

What about if we want to get the aggregate range from a set of ranges?

SELECT value FROM range_test ;
(4 rows)

What we’d like to be able to do is something like:

SELECT aggregate_union(value) FROM range_test;

Obviously, this would be subject to the same limitation on union: we would get an error (or a NULL) if the aggregate range was not continuous.

Perhaps even more useful might be a function that would show us what ranges are missing from a set of ranges. With the same data above, we might see:

SELECT missing_ranges(value) FROM range_test ;                                   missing_ranges
(1 row)

Indeed, the latter is probably more useful, and, as it turns out, is simpler to perform.

We’ll start with the aggregate_union though, because it’s fun. It’s also the way I worked out the nicer solution for the last problem.

We need to create a postgres AGGREGATE to, well, aggregate the data from columns in a table. A naïve solution might be:

CREATE FUNCTION _aggregate_union(int4range, int4range)
RETURNS int4range AS $$

  $1 + $2, $1, $2


CREATE AGGREGATE aggregate_union (int4range) (
  sfunc = _aggregate_union,
  stype = int4range

This actually works to some degree: but only if the ranges in the query are already sorted, as each iteration of the aggregation function must result in a valid range:

SELECT aggregate_union(value) FROM (SELECT value FROM range_test ORDER BY value) x;
(1 row)

However, if the data is not sorted, it will fail.

Instead, we have to either collect all of the items in an array, sort this, and then attempt to aggregate, or, at each step, aggregate as much as possible, and add the current element to the array if we cannot perform a union.

The simpler of these (but will take more memory) is to stick them all into an array, and then sort and apply the union. We only need to define one new function to do it this way:

CREATE OR REPLACE FUNCTION _aggregate_union(int4range[])
RETURNS int4range AS $$

  _range int4range;
  _current int4range;


  _current := (SELECT x FROM unnest($1) x ORDER BY x LIMIT 1);

  FOR _range IN SELECT unnest($1) x ORDER BY x LOOP
    IF _range && _current OR _range -|- _current THEN
      _current := _current + _range;
    END IF;

  RETURN _current;

$$ LANGUAGE plpgsql;

CREATE AGGREGATE aggregate_union (int4range) (
  stype = int4range[],
  sfunc = array_append,
  finalfunc = _aggregate_union

Lets test it out:

SELECT aggregate_union(value) FROM range_test;
(1 row)


But what about the other case? What if we care more about what data is missing?

After spending way too many hours on playing around with this, I hit on the idea of using a window function, lead to get the data from the “next” row in the query.

  lead(lower(value)) OVER (ORDER BY lower(value) NULLS FIRST)
ORDER BY value;

This gets us most of the way. We can see the ones with upper >= lead indicate there is no gap between the ranges, so we can filter. However, we need to do this with a subquery to be able to get access to the columns correctly:

SELECT upper, lead FROM (
    lead(lower(value)) OVER (ORDER BY lower(value) NULLS FIRST)
  ORDER BY value
) x WHERE upper < lead OR (lead IS NULL AND upper IS NOT NULL);

We can aggregate these into an array, and then prefix with an element if our first object is not infinite-lower-bounded.

CREATE OR REPLACE FUNCTION missing_ranges(int4range[])
RETURNS int4range[] AS $$

  _range int4range;
  _missing int4range[];

  _missing := (SELECT
    array_agg(int4range(upper, lead, '[)'))
    FROM (
      SELECT lower(x), upper(x), lead(lower(x)) OVER (ORDER BY lower(x) NULLS FIRST)
      FROM unnest($1) x ORDER BY lower NULLS FIRST
    ) x
    WHERE upper < lead OR (lead IS NULL AND upper IS NOT NULL)

  _range := (SELECT x FROM unnest($1) x ORDER BY x LIMIT 1);

  IF NOT lower_inf(_range) THEN
    _missing := array_prepend(int4range(NULL, lower(_range), '[)'), _missing);

  RETURN _missing;

$$ LANGUAGE plpgsql;

CREATE AGGREGATE missing_ranges (int4range) (
  sfunc = array_append,
  stype = int4range[],
  finalfunc = missing_ranges

All too easy.

It is possible to rewrite this function just using SQL, but it’s not pretty:

SELECT array_agg(value)
  (SELECT value
     (SELECT int4range(UPPER, lead, '[)') AS value
                NULL::integer AS UPPER,
                LOWER(a.value) AS lead
         FROM range_test a
         ORDER BY a.value LIMIT 1) w
      WHERE lead IS NOT NULL
      UNION SELECT int4range(UPPER, lead, '[)')
        (SELECT LOWER(b.value),
                lead(LOWER(b.value)) OVER (ORDER BY LOWER(b.value) NULLS FIRST)
         FROM range_test b
         ORDER BY b.value) x
        AND (LEAD IS NULL OR UPPER < lead)
      ) y
   ORDER BY value) z;

I haven’t tried to see which is faster.

Misfit Shine

iOS 8’s HealthKit is a really nice feature. Prior to this, I was using Azumio’s Argus app for “life-tracking”, but it was really not providing me with much. It showed my my daily steps, included my exercise from my Garmin Forerunner 610, and helped me track my water, coffee and alcohol intake. It integrated really well with a Heart Rate recording app that they also make, but there was no real way to do anything else with this data. I was able to take photographs of my food, but that’s really it.

Along comes HealthKit.

All of a sudden, I’m no longer tied to a single application, or suite of applications that are engineered to work together. Instead, I can use Strava for my run tracking (still via my Garmin, or directly on the phone if I left it at home). These workouts, including calorie estimations, are available in HealthKit.

Similarly, step counts directly from my iPhone are also available, as are Heart Rate readings, either from the app I was using before, or an alternative. I can also add in extra values, for instance my max HR from a workout. As an aside, it would be really nice if Strava could push this value, removing the need for me to manually do it.

Even better for me would be the ability for Strava to send step counts for a workout: I usually run with my FootPod, which gives cadence information. From this, it’s possible to infer the number of steps I took in that workout. Because of this missing feature, I wound up taking my iPhone when I run, a less than optimal situation.

A couple of weeks ago, a co-worker introduced me to myfitnesspal. This is a calorie tracking app and website. Since I have (had?) about 10kg to lose, and something that makes getting relatively accurate calorie counts on the food I’m eating (although lots of the data in there is really, really bad: I’m forever looking at the nutritional breakdown of a suggested match and in some cases adding my own versions), in conjuction with the calorie usage of my excercise has become simple. Even fun.

I’m less concerned about trying out different apps on the iPhone, simply because the data (or most of it) just goes into HealthKit, which means if a better alternative comes along, I can just switch. I’m unsure if HealthKit data is restored from a backup - that’s something that I’ll find out eventually when I upgrade my phone, I guess.

Using my iPhone for step tracking means I need to have it in my pocket all of the time. This is actually really bad, as I find myself taking it out and using it more than I (or my family) would like. So, last night, I bought a Misfit Shine.

To be honest, the main reason is so that I get my phone out less, but I’m also looking forward to not taking my phone running. However, I’ve become quite attached to listening to podcasts whilst running. May have to find the little iPod nano (the little square one) I have floating around: but I’ll want to investigate the podcast syncing, since the other place I listen to them is in the car (via Bluetooth).

Having had the Misfit Shine for less than 24 hours, it’s still too early to discuss if it’s going to work for me. I do have some early comments though.

  • It’s pretty. The nice matt grey finish is unobtrusive and classy.

  • The magnet is nice and strong. I haven’t clipped it on anywhere yet, but it feels like it may stay attached while running. I’ve only used the watchband style fastener. I

  • Syncing is not automatic, even with “Automatic Sync” enabled. Indeed, I have found syncing somewhat intermittent. I’m not sure, but I think I need to tap the Shine as well as the iPhone to get it to sync. That’s a bit shit. Even if the syncing was done through a Notification Area widget, that would be an improvement. If you are only using the Shine itself for your tracking, it would not be so necessary, but since I use that data within myfitnesspal (and other HealthKit related apps), I want it updated as frequently as possible. Every minute would be great, hourly would be acceptable.

    (It seems that you need to put the Shine on the phone to sync. That would explain my issues with it, but that’s even shitter than I thought. I’m not really prepared to take a device off to sync it. Hopefully this is something that improves, although I doubt it.)

  • Sleep tracking seems reasonable, but it should be possible to trim sleep without removing the “sleep quality” data. I was using an app-based sleep tracker, but that made my phone really, really hot. And it kept falling off the bed.

    Since then, I’ve been using “Lark”, which really drives me batty with it’s cute dialog, but seems to be the only thing that will use device motion to guess when you have been sleeping (assuming you put your phone down immediately before sleep, and move it as soon as you wake). Unfortunately, I’ll need to keep that going, or manually transfer data from Misfit to HealthKit.

  • Step count data is pushed to HealthKit. No other HealthKit integration is present. It really needs Sleep data to be pushed; Weight and Dietary Calorie ideally should be fetched from HealthKit too. There is integration with myfitnesspal, but I’d like to be able to switch Dietary Calorie tracker if necessary in the future. I suspect that HealthKit is something that really could be improved in software, fairly easily.

  • The “clock” feature is kind-of cool. Except the “hour” light does not change when the time is after half-past. I kept thinking the time was quarter to nine, when it was quarter to ten. Luckily, it’s Saturday.

  • It’s also waterproof, which means I don’t have to take it off to give the boys a bath, which is nice.

There’s a fairly reasonable review from over a year ago at A weekend with Misfit’s Shine: this makes me suspect that the syncing may not improve much.

I’m not unhappy with the device so far. I’d really like to get more integration with HealthKit, so hopefully that will be forthcoming.

Horizontal Partitioning in Postgres

It never surprises me when I find another neat feature of Postgres that makes doing a potentially difficult task simple. Today, I discovered that since 9.0, Postgres has supported a really powerful way to horizontally partition data into separate tables.

For those who haven’t heard of the concept before, horizontal partitioning is where different rows are stored in different tables, depending upon something about the data within the row.

For instance, we could partition audit data into tables based upon the timestamp of the action. Thus, all entries created during October, 2014 would be stored in a table called audit_2014_10, and entries created during March 2011 would be stored in a table called audit_2011_03. And so on. Alternatively, we could have a single table per year, or however we want to partition.

This is called “Range” partitioning.

There are a couple of ways you could think about doing this in a DBMS. You could have a writable view that redirects the writes to the correct table. The problem then is that when you add a new child table, you need to rewrite your view.

Instead, we can use Postgres’ neat table inheritance to handle all of this for you. Indeed, it is discussed in the Postgres documentation.

If we inherit one table from another, and do a query on the parent table, we will also get the rows that match the query from all child tables. That obviates the need for a view that uses UNION ALL or similar to fetch the data.

I’m going to use a toy example here, that just contains a single column.


CREATE TABLE "data_2014" (
    "value" >= '2014-01-01' AND "value" < '2015-01-01')
) INHERITS ("data");

CREATE TABLE "data_2015" (
    "value" >= '2015-01-01' AND "value" < '2016-01-01')
) INHERITS ("data");

This, however, is only part of the picture. Any data that is added to either of the child tables (or indeed the parent table, but we won’t be doing that), will be returned when we query the parent table.

But we want to ensure that data is partitioned out nicely. For that, we can use a trigger.

A naïve trigger may look something like:

CREATE OR REPLACE FUNCTION data_insert_trigger()


  IF (NEW.value >= '2014-01-01' AND NEW.value < '2015-01-01') THEN
    INSERT INTO data_2014 VALUES (NEW.*);
  ELSIF (NEW.value >= '2015-01-01' AND NEW.value < '2016-01-01') THEN
    INSERT INTO data_2015 VALUES (NEW.*);
    RAISE EXCEPTION 'Date out of range. Please fix the data_insert_trigger() function.';


$$ LANGUAGE plpgsql;

As you can see by the ELSE clause, we will actually need to do maintainence on this function as we start to get data that falls outside of our existing ranges. We will also need to create a new table for those rows.

It would be nice if we could automatically create tables that are missing, and handle any arbitrary values.

CREATE OR REPLACE FUNCTION data_insert_trigger()

  table_name text;
  year integer;
  start text;
  finish text;

  year := date_part('year', NEW.value);
  table_name := 'data_' || year;
  start := year || '-01-01';
  finish := (year + 1) || '-01-01';

  PERFORM 1 FROM pg_tables WHERE tablename = table_name;

      || quote_ident(table_name)
      || ' (CHECK ("value" >= '
      || quote_literal(start)
      || ' AND "value" < '
      || quote_literal(finish)
      || ')) INHERITS (data)';

    || quote_ident(table_name)
    || ' VALUES ($1.*)'


$$ LANGUAGE plpgsql;

CREATE TRIGGER data_insert_trigger
FOR EACH ROW EXECUTE PROCEDURE data_insert_trigger();

You would also want to create any indexes on the child tables, as this is where they need to be, rather than the parent table.

This function is pretty neat: it first stores what the table name should be in a variable, as well as the two bounds for this table (start and finish). Then, we see if that table exists, and if not, create it. Finally, we then insert the values into the correct child table. I’m not sure I’d recommend using it as-is: it’s quite possibly subject to a race condition if two new records came in at the same time.

The one thing that was concerning me was that DDL changes to the parent table would not propagate to the child tables: however this turned out to not be an issue at all. Since I mostly use Django, I want as little hard stuff that would require custom migration operations.


The other thing worth noting is that Postgres will do a really good job of limiting the tables that are accessed to those that contain the relevant data:

  ('2014-01-06 09:00:00'),
  ('2015-01-09 12:00:00'),
  ('2016-02-22 15:39:00');

WHERE value > '2014-01-01'
AND value < '2014-07-01';
                                      QUERY PLAN
 Append  (cost=0.00..42.10 rows=12 width=8)
   ->  Seq Scan on data  (cost=0.00..0.00 rows=1 width=8)
         Filter: ((value > '2014-01-01 00:00:00'::timestamp with time zone) AND
                  (value < '2014-07-01 00:00:00'::timestamp with time zone))
   ->  Seq Scan on data_2014  (cost=0.00..42.10 rows=11 width=8)
         Filter: ((value > '2014-01-01 00:00:00'::timestamp with time zone) AND
                  (value < '2014-07-01 00:00:00'::timestamp with time zone))
 Planning time: 0.292 ms
(6 rows)

We see that only the data_2014 table is hit by the query. If your constraints do something like cast to DATE, then this may not happen. This was causing me some concern earlier, but letting Postgres coerce the data types fixed it.

However, you can’t use a tstzrange to query if you want these constraints to help the query planner:

-- Actually hits every table.
SELECT * FROM data WHERE value <@ '[2014-01-01, 2014-07-01)'::tstzrange;

It’s worth noting that if you change a value that would cause that row to belong in a different partition, this will fail.

There are moves afoot to have this feature more tightly integrated into Postgres, perhaps using a syntax like:

(PARTITION data_2014 VALUES LESS THAN '2015-01-01');

CREATE PARTITION data_2015 ON data VALUES LESS THAN '2016-01-01';

It’s not clear to me how you actually define the range. It also seems counter-productive to have to manually create the partition tables.

There are also tools that might be useful to handle the heavy lifting, like pg_partman. I haven’t used this, but it looks interesting.

Boiled Chocolate Cake

Boil together:

  • 125g butter
  • 1 cup water
  • 1 1/2 cups sugar
  • 2 tablespoons cocoa
  • 1/2 teaspon bicarb soda

Simmer 5 mins.


  • 2 eggs
  • 1 1/2 cups self raising flour

Bake about 1/2 an hour at 180°C in 2 log tins or a round tin.

Don’t eat it all at once.

iOS 8 knows what you want to say

Had a mind-blowing moment when messaging using iOS 8 tonight.

Me: Can you get some more coffee beans please?

Her: Ok, caf or decaf?

This is where it gets awesome. The three choices I had in my autocomplete bar were:

  1. Ok
  2. Caf
  3. Decaf

So, it apparently parses messages, and sets the choices accordingly.

I only wish I had a screenshot. That would have been less explaining.

Read-only data from your database in Django

I had the need to create a column in some database tables that are completely controlled by the database, but the value of which is sometimes needed by the Django object.

It should never be presented in a Form, and never, ever be written to by the Django infrastructure.

So, we need a way to fetch the data from the database, but, even if the value is changed, and the object saved, is not written back.

The detail of how this data is set in the database is irrelevant: it’s a column that gets it’s value from a sequence (and, incidentally, this sequence is shared across multiple tables).

But, we need a way to get this data.

A nice technique is to leverage two parts of Django: the QuerySet.extra(select={}) method to actually add this field to the query, and Manager.get_query_set() (get_queryset() in older versions of Django) to make this apply to every fetch of the objects.

Our extra column will be called sequence_number

class SequenceNumberManager(models.managers.Manager):
    def get_query_set(self):
      return super(SequenceNumberManager, self).get_query_set().extra(select={
        'sequence_number': '"%s"."sequence_number"' % self.model._meta.db_table

class Thing(models.Model):
    # Column definitions. Do not define sequence_number!

    objects = SequenceNumberManager()

That’s it.

Now, Thing.objects.all()[0].sequence_number will give you the value from the database. Because it’s not a Django field, it will never be written back to the database.

The SequenceNumberManager, as it stands, could be applied to multiple models, as it dynamically looks for the database table that should be used for the field lookup. You need to define the table name, as otherwise if you join to another table with the same column name, your database will (correctly) reject the query, as it is ambiguous to which table it refers.

I’ve done a similar thing, but using django-model-utilsPassThroughManager in conjunction with a QuerySet subclass, since I’m a big fan of the approach described at Building a higher-level query API: the right way to use Django’s ORM, which I link to at least once a week in IRC.

I’d be interested to hear of alternative methods of achieving this, or if there are any possible drawbacks. I think you could do it with a custom field (that never wrote to the database), but I think that would actually be a whole lot more code. There is also the benefit in this case that you could have the .extra() call as a queryset method, so you only add it when you need it, although the performance gains there would be insignificant, I suspect.

Adding JSON(B) operators to PostgreSQL

This is a followup/replacement post for this older post. In it, I discussed a way to use PL/python to perform function (and in turn operations) on JSON data. Don’t do that, it’s way too slow.

It’s actually possible, using raw SQL (and some handy functions) to perform some of the operations that are “missing” from the JSON(B) datatypes in Postgres.

In all cases, I’ll work with just JSONB as the input/output formats. In practice, when I first wrote these, I wrote up to four versions of each (JSON+JSON, JSON+JSONB, JSONB+JSON, JSONB+JSONB). I believe it’s possible to write polymorphic versions of these functions, but I’m not that familiar with them just yet.

We’ll start with concatenation: basically joining two JSONB objects into one. This is a good one to start with, as operands must be either JSON or JSONB, so all forms of this function are the same, just with different functions or casting of operands.

Before I begin, I’ll mention a post I saw on Michael Paquier’s excellent blog: Manipulating jsonb data by abusing of key uniqueness. In it, Michael uses the json_object_agg function to build up a JSON object from a query:

CREATE FUNCTION "json_append" (jsonb, jsonb) RETURNS jsonb AS $$

WITH json_union AS
  SELECT * FROM jsonb_each($1)
  SELECT * FROM jsonb_each($2)
SELECT json_object_agg(key, value) FROM json_union;


This seems like a good idea, however it actually performs around two orders of magnitude slower than just iterating over the objects and building them up using string_agg and ||:

CREATE FUNCTION "json_concatenate" (jsonb, jsonb) RETURNS jsonb AS $$

SELECT ('{' || string_add(to_json("key")::text || ':' ||"value", ',') || '}')::jsonb
  SELECT * FROM jsonb_each($1) UNION ALL SELECT * jsonb_each($2)


  LEFTARG = jsonb,
  RIGHTARG = jsonb,
  PROCEDURE = jsonb_concatenate

It seems that this is not just because of the Common Table Expression (although, using a CTE does make the second function perform just as poorly).

  'jsonb_concatenate(''{"a": 1, "b":2}''::jsonb, ''{"a":2}''::jsonb)',
  'jsonb_append(''{"a": 1, "b":2}''::jsonb, ''{"a":2}''::jsonb)'
           code              |  runtime   | corrected
 [Control]                   | 0.00550699 |          0
 jsonb_concatenate(...)      | 0.00652981 | 0.00102282
 jsonb_append(...)           |   0.484099 |   0.478592

My attempt at reimplementing the json_object_agg aggregate function in SQL proved even slower. Not at all surprising.

The next one we will tackle is the - operator from Hstore.

hstore - text     : delete key from left operand
hstore - text[]   : delete keys from left operand
hstore - hstore   : delete matching pairs from left operand

We can reimplement these for JSON: first as functions, and then create operators using those if we want. What is interesting is that we use the same construction pattern for our output JSONB object, however, I don’t seem to be able to figure out how to extract this out into another function.

  "json" jsonb,
  "remove" TEXT
  RETURNS jsonb
AS $function$
  (SELECT ('{' || string_agg(to_json("key")::text || ':' || "value", ',') || '}')
     FROM jsonb_each("json") -- Until this function is added!
    WHERE "key" <> "remove"),
ELSE "json"

  LEFTARG = jsonb,
  RIGHTARG = text,
  PROCEDURE = jsonb_subtract

You’ll notice that there’s a test for if the key to remove is in the object first: this should be much faster in the situation where it doesn’t appear, as we then don’t need to recreate the object.

The other forms are quite similar, but the WHERE clause varies, and the initial test varies or is removed:

  "json" jsonb,
  "keys" TEXT[]
  RETURNS jsonb
AS $function$
  (SELECT ('{' || string_agg(to_json("key")::text || ':' || "value", ',') || '}')
     FROM jsonb_each("json")
    WHERE "key" <> ALL ("keys")),
ELSE "json"

  LEFTARG = jsonb,
  RIGHTARG = text[],
  PROCEDURE = jsonb_subtract

  "json" jsonb,
  "remove" jsonb
  RETURNS jsonb
AS $function$
  (SELECT ('{' || string_agg(to_json("key")::text || ':' || "value", ',') || '}')
     FROM jsonb_each("json")
    WHERE "json"->>"key" <> ("remove"->>"key")),

  LEFTARG = jsonb,
  RIGHTARG = jsonb,
  PROCEDURE = jsonb_subtract

There is also the #= operator for hstore: but this seems to be record #= hstore, rather than hstore #= hstore. I’m not sure how to implement this, but I can implement a jsonb #= jsonb function:

CREATE OR REPLACE FUNCTION "jsonb_update_only_if_present"(
  "json" jsonb,
  "other" jsonb
  RETURNS jsonb
AS $function$

  (SELECT ('{' || string_agg(to_json("key")::text || ':' || "value", ',') || '}')
     FROM (SELECT * FROM jsonb_each("json") UNION ALL SELECT * FROM jsonb_each("other")) AS a
     WHERE "json" ? "key"::text

  LEFTARG = jsonb,
  RIGHTARG = jsonb,
  PROCEDURE = jsonb_update_only_if_present

Finally, there are operators %% (convert to array of alternating key/value pairs), and %# (convert to 2-dimensional array of keys, values).

I haven’t figured out a way to create these yet either.

So, that’s what I’ve got so far. Obviously, these will be slower than pure C implementations however, we can run some benchmarks against the hstore operators for comparisons.

   ' ''{"a": 1, "b":2}''::jsonb || ''{"a":2}''::jsonb',
   ' ''a=>1, b=>2''::hstore || ''a=>2''::hstore'
                     code                      |  runtime  |  corrected
 [Control]                                     | 0.0837061 |           0
  '{"a": 1, "b":2}'::jsonb || '{"a":2}'::jsonb | 0.0842681 | 0.000561953
  'a=>1, b=>2'::hstore || 'a=>2'::hstore       | 0.0843148 | 0.000608683
(3 rows)

Whoa. That’s actually pretty good!

And similar for subtract:

  ' ''{"a": 1, "b":2}''::jsonb - ''a''::text ',
  ' ''a=>1, b=>2''::hstore - ''a''::text '
                  code                  |  runtime  | corrected
 [Control]                              | 0.0818689 |          0
  '{"a": 1, "b":2}'::jsonb - 'a'::text  |  0.083431 | 0.00156212
  'a=>1, b=>2'::hstore - 'a'::text      |  0.083159 | 0.00129008
(3 rows)

I might have to run these with some more complicated queries, and compare the results.

Using Postgres Composite Types in Django

Note: this post turned out to be far more complicated than I had hoped. I may write another one that deals with a less complicated type!

Postgres comes with a pretty large range of column types, and the ability to use these types in an ARRAY. There’s also JSON(B) and Hstore, which are useful for storing structured (but possibly varying) data. Additionally, there are also a range of, well, range types.

However, sometimes you actually want to store data in a strict column, but that isn’t a simple scalar type, or one of the standard range types. Postgres allows you to define your own composite types.

There is a command CREATE TYPE that can be used to create an arbitrary type. There are four forms: for now we will just look at Composite Types.

We will create a Composite type that represents the opening hours for a store, or more specifically, the default opening hours. For instance, a store may have the following default opening hours:

|    Day     |  Open  |  Close  |
|  Monday    |  9 am  |  5 pm   |
|  Tuesday   |  9 am  |  5 pm   |
|  Wednesday |  9 am  |  5 pm   |
|  Thursday  |  9 am  |  9 pm   |
|  Friday    |  9 am  |  5 pm   |
|  Saturday  | 10 am  |  5 pm   |
|  Sunday    | 11 am  |  5 pm   |

During the Christmas season this store may be open longer (perhaps even 24 hours). There may also be differences at Easter time, or other public holidays, where the store is closed, or closes early.

It would be nice to be able to store the default opening hours for a store, and then, when creating a week, use these to create concrete (TIMESTAMP) values for each day, which could be overridden on any given day.

There are a few ways we could model this. Postgres has no timerange type, so that’s out. We could create a RANGE type, or we could use (start-time, finish-time). But what about when a store is open after midnight, or for 24 hours? Storing this data implicitly is a real pain, because you need to always check to see if the finish time is less than (or equal to) the start time whenever doing anything. Trust me, this is not the best approach.

An alternative I’ve been toying with is (start-time, interval). You could limit it so that the interval’s maximum is '1 day', but not (from what I can tell) when you define the type. Anyway, the syntax for creating this type is:

CREATE TYPE opening_hours AS (
  start time,
  length interval

As an aside, every table in the database also has an associated type (of the same name as the table).

Now, we have our type: we can use it in a table:

  name TEXT

CREATE TABLE default_opening_hours (
  store_id INTEGER REFERENCES store (store_id),
  monday opening_hours,
  tuesday opening_hours,
  wednesday opening_hours,
  thursday opening_hours,
  friday opening_hours,
  saturday opening_hours,
  sunday opening_hours

An alternative way of storing this information might be to use an array of opening_hours, directly on the store model. We’ll use this one instead, as it’s a little neater (and means we will look at how to use opening_hours[] later too).

  name TEXT,
  default_opening_hours opening_hours[7]

Now, we can put data in there:

INSERT INTO store (name, default_opening_hours) VALUES
  'John Martins',
    ('09:00', '08:00')::opening_hours,
    ('09:00', '08:00')::opening_hours,
    ('09:00', '08:00')::opening_hours,
    ('09:00', '12:00')::opening_hours,
    ('09:00', '08:00')::opening_hours,
    ('10:00', '07:00')::opening_hours,
    ('11:00', '06:00')::opening_hours

Note how we need to cast all of the values from record to opening_hours.

In practice, we would probably also want to have some type of restriction where the opening time from one day, plus the default open hours is less than or equal to the starting time on the next day. I’m still not sure of the best way to do this in Postgres, but it is possible to do it in Django.

Speaking of Django, we want to be able to access this data type there. We can leverage a really nice feature of Psycopg2 to have these values automatically turned into a Python namedtuple. We do this by registering the type within Psycopg2, using the Django cursor.

from django.db import connection
from psycopg2.extras import register_composite

register_composite('opening_hours', connection.cursor().cursor)

But, this is only half of the pattern. We also need to register an adapter so that values going back the other way are also automatically cast into opening_hours.

from django.db import connection
from psycopg2.extras import register_composite
from psycopg2.extensions import register_adapter, adapt, AsIs

# Get a reference to the namedtuple class
OpeningHours = register_composite(

def adapt_opening_hours(value):
  return AsIs("(%s, %s)::opening_hours" % (

register_adapter(OpeningHours, adapt_opening_hours)

Now, we can fetch data from the database, and know that we will get OpeningHours instances, and, when passing an OpeningHours instance back to the database, know it will be converted into the correct type.

Obviously, in order to do this, the type must exist in the database. We did that manually in this case. In a real situation you would want to do that as a database migration. And that is where things get tricky. You can’t run the register_adapter function until the type exists in the database. I did come up with a relatively neat workaround for this when writing a framework for generic Composite fields, where the registration of the composite type attempts to execute, and if it fails, it stores the data for later registration, and then the actual migration operation fires off a signal, which is handled by a listener that actually performs the registration.

The final piece of the puzzle is the Django Field subclass, which is actually not that complicated. In essence, we are relying on Psycopg to handle the adaptation in both directions, so it can be a bare field (perhaps with a formfield method to get a custom form field). In practice, I wrote the generic CompositeField subclass, which uses some metaclass magic to handle the late registration:

from django.db.models import fields
from django.db import connection
from django.dispatch import receiver, Signal

from psycopg2.extras import register_composite
from psycopg2.extensions import register_adapter, adapt, AsIs
from psycopg2 import ProgrammingError

_missing_types = {}

class CompositeMeta(type):
    def __init__(cls, name, bases, clsdict):
        super(CompositeMeta, cls).__init__(name, bases, clsdict)

    def register_composite(cls):
        db_type = cls().db_type(connection)
        if db_type:
                cls.python_type = register_composite(
            except ProgrammingError:
                _missing_types[db_type] = cls
                def adapt_composite(composite):
                    return AsIs("(%s)::%s" % (
                        ", ".join([
                            adapt(getattr(composite, field)).getquoted() for field in composite._fields
                        ]), db_type

                register_adapter(cls.python_type, adapt_composite)

class CompositeField(fields.Field):
    __metaclass__ = CompositeMeta
    A handy base class for defining your own composite fields.

    It registers the composite type.

composite_type_created = Signal(providing_args=['name'])

def register_composite_late(sender, db_type, **kwargs):

We also want to have a custom migration operation:

from django.db.migrations.operations.base import Operation

# Or wherever the code above is located.
from .fields.composite import composite_type_created

class CreateCompositeType(Operation):
    def __init__(self, name=None, fields=None): = name
        self.fields = fields

    def reversible(self):
        return True

    def state_forwards(self, app_label, state):

    def database_forwards(self, app_label, schema_editor, from_state, to_state):
        schema_editor.execute('CREATE TYPE %s AS (%s)' % (
  , ", ".join(["%s %s" % field for field in self.fields])

    def state_backwards(self, app_label, state):

    def database_backwards(self, app_label, schema_editor, from_state, to_state):
        schema_editor.execute('DROP TYPE %s' %

This is a bit manual, however. You need to create your own migration that creates the composite type, and then begin to use the field.

# migrations/

class Migration(migrations.Migration):
    dependencies = []

    operations = [
                ('start', 'time'),
                ('length', 'interval')

The place this pattern falls down is that this migration must be manually created: we don’t have any way to automatically create the migration from the Field subclass, which just looks like:

class OpeningHoursField(CompositeField):

    def db_type(self, connection):
        return 'opening_hours'

    def formfield(self, **kwargs):
        defaults = {
            'form_class': OpeningHoursFormField
        return super(OpeningHoursField, self).formfield(**defaults)

I think in the future I’ll attempt to use further metaclass magic to allow defining the fields of the Composite type. This could then be used to automatically create a form field (which is a subclass of forms.MultiValueField).

class OpeningHoursField(CompositeField):
    start = models.DateField()
    length = IntervalField()

    def db_type(self, connection):
        return 'opening_hours'

However, in the meantime, we can still get by. I’m not sure it’s going to be possible to inject extra operations into the migration based upon the field types anyway.

Finally, we can use this in a model:

class Store(models.Model):
    store_id = models.AutoField(primary_key=True)
    name = models.CharField(max_length=128)
    default_opening_hours = ArrayField(
        base_field=OpeningHoursField(null=True, blank=True),

I’ve used the ArrayField from django.contrib.postgres, purely for illustration purposes.

The CompositeField and associated operation are part of my django-postgres project: once I have worked out some more kinks, I may submit a pull request to django.contrib.postgres, unless someone else beats me to it.

Oh, and a juicy little extra. Above I mentioned something about preventing overlaps. The logic I use in my form is:

from django import forms
from django.utils.translation import string_concat, ugettext_lazy as _

import postgres.forms

from .fields import OpeningHoursFormField
from .models import Store

def finish(obj):
    "Given an OpeningHours value, get the finish time"
    date =, 1, 1)
    return (datetime.datetime.combine(date, obj.start) + obj.duration).time()

class StoreForm(forms.ModelForm):
    OVERLAPS_PREVIOUS = _('Open hours overlap previous day.')

    default_opening_hours = postgres.forms.SplitArrayField(

    class Meta:
        model = Store

    def clean_default_opening_hours(self):
        opening_hours = self.cleaned_data['default_opening_hours']
        field = self.fields['default_opening_hours']

        # Ensure consecutive days do not overlap.
        errors = []

        for i in range(7):
            today = opening_hours[i]
            if today.start is None or today.duration is None:

            yesterday = opening_hours[(i + 6) % 7]

            if yesterday.start is None or yesterday.duration is None:

            if finish(yesterday) <= yesterday.start:
                if today.start < finish(yesterday):
                        params={'nth': i}

        if errors:
            raise forms.ValidationError(errors)

        return opening_hours

I’m currently not displaying the duration/length: I dynamically calculate it based on the entered start/finish pair, but that’s getting quite complicated.