Bowled over by Postgres

There’s a nice article at Bowled Over by SQL Window Functions by a chap called Dwain Camps. It’s written from the perspective of T-SQL, which has some differences to Postgres’s DDL and querying. I’ve reworked his stuff into what feels nice for me from a Postgressy perspective.

I’ll recap a couple of things he mentions, but you’ll probably want to head there and read that first.

  • Strike: all pins knocked down on the first ball of a frame. Scores 10 (from this frame), plus the total of whatever the next two balls knock down.
  • Spare: all pins knocked down on the second ball of a frame. Scores 10 (from this frame), plus how ever many pins you knock down on the next ball.
  • Open: at least one pin remains standing at the end of the frame. Score is how ever many pins you knocked down.

By convention, only the running tally is shown on the scoresheet. I’ve kept the frame score as the score for this frame only, and the total will contain the running total.


The first thing I do a bit differently is to use Postgres’ DOMAIN structures, which enables us to remove some of the check constraints, and simplify some others:

CREATE SCHEMA bowling;

CREATE DOMAIN bowling.frame_number AS integer
  CHECK ('[1,10]'::int4range @> VALUE)
  NOT NULL;

CREATE DOMAIN bowling.ball AS integer
  CHECK ('[0,10]'::int4range @> VALUE);

So, now we have two integer domains: the frame number may be between 1 and 10, and the ball pin count may be null, or between 0 and 10.

We’ll start by recreating the table structure: initially without constraints:

CREATE TABLE bowling.frame
(
  game_id INTEGER NOT NULL,
  player_id INTEGER NOT NULL,
  frame bowling.frame_number NOT NULL,
  ball1 bowling.ball NOT NULL,
  ball2 bowling.ball NULL,
  ball3 bowling.ball NULL,
  PRIMARY KEY (game_id, player_id, frame)
);

Not much different to the original, other than using those fresh new domain types.

The other approach I’ve used is to use more constraints, but make them simpler. I’m also relying on the fact that X + NULL => NULL, which means we can leave off a heap of the constraint clauses.

We’ll start by preventing the (non-final) frames from exceeding the number of available pins. In the case of final frame, we still only allow a spare unless we have a strike already.

ALTER TABLE bowling.frame
ADD CONSTRAINT max_spare_unless_frame_10_strike CHECK
(
  ball1 + ball2 <= 10 OR (frame = 10 AND ball1 = 10)
);

This is as simple as it can be. Because ball 2 may be null but ball 1 may not, and each ball must be no greater than 10, this is enough to encapsulate the requirement. There is one slightly incorrect circumstance: a value of (10, 0) would be valid, which is strictly incorrect (ball 2 was never bowled). In the case of the calculations it’s all correct, but if a 0 was bowled immediately after a strike, it may be possible to insert that as ball 2, which would be misleading.

ALTER TABLE bowling.frame
ADD CONSTRAINT ball_2_never_bowled_after_strike CHECK
(
  ball2 IS NULL OR ball1 < 10 OR frame = 10
);

We can now prevent ball 3 from being set unless we are on frame 10.

ALTER TABLE bowling.frame
ADD CONSTRAINT ball_3_only_in_frame_10 CHECK
(
  ball3 IS NULL OR frame = 10
);

A follow-up to the previous constraint: we only get to bowl ball 3 if we have bowled ball 2, and scored a strike or spare already. Note, we may have a strike on the first ball, which means we might have more than 10.

ALTER TABLE bowling.frame
ADD CONSTRAINT ball3_only_if_eligible CHECK
(
  ball3 IS NULL OR (ball2 IS NOT NULL AND ball1 + ball2 >= 10)
);

Finally, we have some specific allowable conditions for that last ball. We already know that we must have scored a strike or spare with the first two balls, but we need to know how many pins are available to us.

If we scored a spare or two strikes, then we may have any number up to 10 to score now. Otherwise, we have however many pins were left by ball 2 (which means ball 1 must have been a strike).

ALTER TABLE bowling.frame
ADD CONSTRAINT ball3_max_spare_or_strike CHECK
(
  ball2 + ball3 <= 10
  OR
  ball1 + ball2 = 20
  OR
  ball1 + ball2 = 10
);

I find those constraints much easier to read that the original ones.


I’ve written a view, that uses a couple of Common Table Expressions (CTEs), as well as the window functions Dwain discussed.

CREATE OR REPLACE VIEW bowling.frame_score AS (
  WITH pin_counts AS (
    SELECT
      game_id,
      player_id,
      frame,
      ball1, ball2, ball3,
      -- Get the first ball from the next frame.
      -- Used by strike and spare.
      LEAD(ball1, 1) OVER (
        PARTITION BY game_id, player_id
        ORDER BY frame
      ) AS next_ball_1,
      -- Get the second ball from the next frame.
      -- Used by strike.
      LEAD(ball2, 1) OVER (
        PARTITION BY game_id, player_id
        ORDER BY frame
      ) AS next_ball_2,
      -- Get the first ball from the next next frame.
      -- Used by double-strike.
      LEAD(ball1, 2) OVER (
        PARTITION BY game_id, player_id
        ORDER BY frame
      ) AS next_next_ball_1
    FROM bowling.frame
  ),
  frame_counts AS (
    SELECT
      game_id,
      player_id,
      frame,
      ball1, ball2, ball3,
      CASE
      -- We will start with frame 10: when we have a strike
      -- or spare, we get all three balls.
      WHEN frame = 10 AND ball1 + ball2 >= 10 THEN
        ball1 + ball2 + ball3
      -- On a strike, we get the next two balls. This could
      -- be from the next frame, or include the first ball
      -- of the frame after that. Note that in frame 9, we will
      -- also look at the second ball from fram 10, rather than
      -- looking for a non-existent frame 11.
      WHEN ball1 = 10 THEN
        ball1 + next_ball_1 + (
          CASE WHEN next_ball_1 = 10 AND frame < 9 THEN
            next_next_ball_1
          ELSE
            next_ball_2
          END
        )
      -- In the case of a spare, grab the next ball. We already
      -- handled a spare on frame 10 above.
      WHEN ball1 + ball2 = 10 THEN
        ball1 + ball2 + next_ball_1
      -- Otherwise, it's just the two balls we bowled in this frame.
      ELSE
        ball1 + ball2
      END AS score
    FROM pin_counts
  )

  -- We have everything we need in the previous CTE, except that
  -- shows us the frame score, rather than the running tally.
  -- We need to do that in another window function here, but
  -- only calculate a value when the frame's wave function has
  -- collapsed (ie, it's score is known).
  SELECT
    frame_counts.*,
    CASE WHEN score IS NOT NULL THEN
      SUM(score) OVER (
        PARTITION BY game_id, player_id
        ORDER BY frame
        ROWS UNBOUNDED PRECEDING
      )
    ELSE NULL END AS total
  FROM frame_counts
);

Again, I think this is a simpler query, and easier to read. But, I guess I wrote it.

We can insert the same data as used there, and look at our results:

-- Game 1
INSERT INTO bowling.frame VALUES
  (1, 1, 1, 7, 2, NULL),
  (1, 1, 2, 3, 7, NULL),
  (1, 1, 3, 6, 4, NULL),
  (1, 1, 4, 10, NULL, NULL),
  (1, 1, 5, 10, NULL, NULL),
  (1, 1, 6, 10, NULL, NULL),
  (1, 1, 7, 9, 1, NULL),
  (1, 1, 8, 10, NULL, NULL),
  (1, 1, 9, 8, 1, NULL),
  (1, 1, 10, 6, 3, NULL);

-- Game 2
INSERT INTO bowling.frame VALUES
  (2, 1, 1, 10, NULL, NULL),
  (2, 1, 2, 3, 7, NULL),
  (2, 1, 3, 10, NULL, NULL),
  (2, 1, 4, 6, 4, NULL),
  (2, 1, 5, 10, NULL, NULL),
  (2, 1, 6, 9, 1, NULL),
  (2, 1, 7, 10, NULL, NULL),
  (2, 1, 8, 8, 2, NULL),
  (2, 1, 9, 10, NULL, NULL),
  (2, 1, 10, 7, 3, 10);

-- Game 3
INSERT INTO bowling.frame VALUES
  (3, 1, 1, 10, NULL, NULL),
  (3, 1, 2, 10, NULL, NULL),
  (3, 1, 3, 10, NULL, NULL),
  (3, 1, 4, 10, NULL, NULL),
  (3, 1, 5, 10, NULL, NULL),
  (3, 1, 6, 10, NULL, NULL),
  (3, 1, 7, 10, NULL, NULL),
  (3, 1, 8, 10, NULL, NULL),
  (3, 1, 9, 10, NULL, NULL),
  (3, 1, 10, 10, 10, 10);
$ SELECT * FROM frame_score;

 game_id | player_id | frame | ball1 | ball2  | ball3  | score | total
---------+-----------+-------+-------+--------+--------+-------+-------
       1 |         1 |     1 |     7 |      2 | <NULL> |     9 |     9
       1 |         1 |     2 |     3 |      7 | <NULL> |    16 |    25
       1 |         1 |     3 |     6 |      4 | <NULL> |    20 |    45
       1 |         1 |     4 |    10 | <NULL> | <NULL> |    30 |    75
       1 |         1 |     5 |    10 | <NULL> | <NULL> |    29 |   104
       1 |         1 |     6 |    10 | <NULL> | <NULL> |    20 |   124
       1 |         1 |     7 |     9 |      1 | <NULL> |    20 |   144
       1 |         1 |     8 |    10 | <NULL> | <NULL> |    19 |   163
       1 |         1 |     9 |     8 |      1 | <NULL> |     9 |   172
       1 |         1 |    10 |     6 |      3 | <NULL> |     9 |   181
       2 |         1 |     1 |    10 | <NULL> | <NULL> |    20 |    20
       2 |         1 |     2 |     3 |      7 | <NULL> |    20 |    40
       2 |         1 |     3 |    10 | <NULL> | <NULL> |    20 |    60
       2 |         1 |     4 |     6 |      4 | <NULL> |    20 |    80
       2 |         1 |     5 |    10 | <NULL> | <NULL> |    20 |   100
       2 |         1 |     6 |     9 |      1 | <NULL> |    20 |   120
       2 |         1 |     7 |    10 | <NULL> | <NULL> |    20 |   140
       2 |         1 |     8 |     8 |      2 | <NULL> |    20 |   160
       2 |         1 |     9 |    10 | <NULL> | <NULL> |    20 |   180
       2 |         1 |    10 |     7 |      3 |     10 |    20 |   200
       3 |         1 |     1 |    10 | <NULL> | <NULL> |    30 |    30
       3 |         1 |     2 |    10 | <NULL> | <NULL> |    30 |    60
       3 |         1 |     3 |    10 | <NULL> | <NULL> |    30 |    90
       3 |         1 |     4 |    10 | <NULL> | <NULL> |    30 |   120
       3 |         1 |     5 |    10 | <NULL> | <NULL> |    30 |   150
       3 |         1 |     6 |    10 | <NULL> | <NULL> |    30 |   180
       3 |         1 |     7 |    10 | <NULL> | <NULL> |    30 |   210
       3 |         1 |     8 |    10 | <NULL> | <NULL> |    30 |   240
       3 |         1 |     9 |    10 | <NULL> | <NULL> |    30 |   270
       3 |         1 |    10 |    10 |     10 |     10 |    30 |   300
(30 rows)

Building the average scores for a player is likewise similar. Because I’m using a VIEW, I can jut reference that.

SELECT
  player_id,
  AVG(total) as average
FROM frame_score
WHERE frame=10
GROUP BY player_id;
 player_id |       average
-----------+----------------------
         1 | 227.0000000000000000
(1 row)

I’m fairly sure I’ve rewritten the constraints correctly, but may have missed some. Here are some of the condition tests that show invalid constraints:

$ INSERT INTO bowling.frame VALUES(1, 2, 0, 9, NULL, NULL);
ERROR:  value for domain frame_number violates check constraint "frame_number_check"
Time: 0.405 ms
$ INSERT INTO bowling.frame VALUES(1, 2, 1, 11, NULL, NULL);
ERROR:  value for domain ball violates check constraint "ball_check"
Time: 0.215 ms
$ INSERT INTO bowling.frame VALUES(1, 2, 1, -1, NULL, NULL);
ERROR:  value for domain ball violates check constraint "ball_check"
Time: 0.218 ms
$ INSERT INTO bowling.frame VALUES(1, 2, 1, 8, 3, NULL);
ERROR:  new row for relation "frame" violates check constraint "max_spare_unless_frame_10_strike"
DETAIL:  Failing row contains (1, 2, 1, 8, 3, null).
Time: 0.332 ms
$ INSERT INTO bowling.frame VALUES(1, 2, 1, 8, 1, 1);
ERROR:  new row for relation "frame" violates check constraint "ball3_only_if_eligible"
DETAIL:  Failing row contains (1, 2, 1, 8, 1, 1).
Time: 0.392 ms
$ INSERT INTO bowling.frame VALUES(1, 2, 1, 8, 2, 1);
ERROR:  new row for relation "frame" violates check constraint "ball_3_only_in_frame_10"
DETAIL:  Failing row contains (1, 2, 1, 8, 2, 1).
Time: 0.327 ms
$ INSERT INTO bowling.frame VALUES(1, 2, 10, 8, 3, 1);
ERROR:  new row for relation "frame" violates check constraint "max_spare_unless_frame_10_strike"
DETAIL:  Failing row contains (1, 2, 10, 8, 3, 1).
Time: 0.340 ms
$ INSERT INTO bowling.frame VALUES(1, 2, 10, 8, 2, 11);
ERROR:  value for domain ball violates check constraint "ball_check"
Time: 0.200 ms
$ INSERT INTO bowling.frame VALUES(1, 2, 10, 10, NULL, 10);
ERROR:  new row for relation "frame" violates check constraint "ball3_only_if_eligible"
DETAIL:  Failing row contains (1, 2, 10, 10, null, 10).
Time: 0.316 ms
$ INSERT INTO bowling.frame VALUES(1, 2, 5, 10, 0, NULL);
ERROR:  new row for relation "frame" violates check constraint "ball_2_never_bowled_after_strike"
DETAIL:  Failing row contains (1, 2, 5, 10, 0, null).
Time: 0.307 ms
$ INSERT INTO bowling.frame VALUES(1, 2, 10, 10, 2, 9);
ERROR:  new row for relation "frame" violates check constraint "ball3_max_spare_or_strike"
DETAIL:  Failing row contains (1, 2, 10, 10, 2, 9).
Time: 0.323 ms

Finally, I rewrote the pretty printer. It’s not quite perfect (I don’t like how I get the plus signs at the newline character), but it is workable:

WITH symbols AS (
  SELECT
    game_id, player_id, frame,
    CASE WHEN ball1 = 10 THEN 'X' ELSE ball1::text END as ball1,
    CASE WHEN ball2 IS NULL THEN ' '
         WHEN ball1 + ball2 = 10 THEN '/'
         WHEN ball1 = 10 AND ball2 = 10 THEN 'X'
         ELSE ball2::text
         END as ball2,
    CASE WHEN ball3 IS NULL THEN ' '
    WHEN ball3 = 10 THEN 'X'
    WHEN ball3 + ball2 = 10 THEN '/'
    ELSE ball3::text
    END as ball3,
    lpad(total::text, 5, ' ') as total
  FROM
    frame_score
  ORDER BY game_id, player_id, frame
), grouped_data AS (
  SELECT
    game_id,
    player_id,
    array_agg(ball1) ball1,
    array_agg(ball2) ball2,
    array_agg(ball3) ball3,
    array_agg(total) total
  FROM
    symbols
  GROUP BY
    game_id, player_id
)
SELECT
  game_id,
  player_id,
  ball1[1] || ' | ' || ball2[1] || ' ' || chr(10) || total[1] AS "1",
  ball1[2] || ' | ' || ball2[2] || ' ' || chr(10) || total[2] AS "2",
  ball1[3] || ' | ' || ball2[3] || ' ' || chr(10) || total[3] AS "3",
  ball1[4] || ' | ' || ball2[4] || ' ' || chr(10) || total[4] AS "4",
  ball1[5] || ' | ' || ball2[5] || ' ' || chr(10) || total[5] AS "5",
  ball1[6] || ' | ' || ball2[6] || ' ' || chr(10) || total[6] AS "6",
  ball1[7] || ' | ' || ball2[7] || ' ' || chr(10) || total[7] AS "7",
  ball1[8] || ' | ' || ball2[8] || ' ' || chr(10) || total[8] AS "8",
  ball1[9] || ' | ' || ball2[9] || ' ' || chr(10) || total[9] AS "9",
  ball1[10] || ' | ' || ball2[10] || ' | ' || ball3[10] || ' ' || chr(10) || lpad(total[10], 9, ' ') AS "10"
FROM grouped_data;
 game_id | player_id |   1    |   2    |   3    |   4    |   5    |   6    |   7    |   8    |   9    |     10
---------+-----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+------------
       1 |         1 | 7 | 2 +| 3 | / +| 6 | / +| X |   +| X |   +| X |   +| 9 | / +| X |   +| 8 | 1 +| 6 | 3 |   +
         |           |     9  |    25  |    45  |    75  |   104  |   124  |   144  |   163  |   172  |       181
       2 |         1 | X |   +| 3 | / +| X |   +| 6 | / +| X |   +| 9 | / +| X |   +| 8 | / +| X |   +| 7 | / | X +
         |           |    20  |    40  |    60  |    80  |   100  |   120  |   140  |   160  |   180  |       200
       3 |         1 | X |   +| X |   +| X |   +| X |   +| X |   +| X |   +| X |   +| X |   +| X |   +| X | X | X +
         |           |    30  |    60  |    90  |   120  |   150  |   180  |   210  |   240  |   270  |       300
(3 rows)

That will do for now. Corrections welcome!

slugify() for postgres (almost)

A recent discussion in #django suggested “what we need is a PG slugify function”.

The actual algorithm in Django for this is fairly simple, and easy to follow. Shouldn’t be too hard to write it in SQL.

Function slugify(value, allow_unicode=False).

  • Convert to ASCII if allow_unicode is false
  • Remove characters that aren’t alphanum, underscores, hyphens
  • Strip leading/trailing whitespace
  • Convert to lowercase
  • Convert spaces to hyphens
  • Remove repeated hyphens

(As an aside, the comment in the django function is slightly misleading: if you followed the algorithm there, you’d get a different result with respect to leading trailing whitespace. I shall submit a PR).

We can write an SQL function that uses the Postgres unaccent extension to get pretty close:

CREATE OR REPLACE FUNCTION slugify("value" TEXT, "allow_unicode" BOOLEAN)
RETURNS TEXT AS $$

  WITH "normalized" AS (
    SELECT CASE
      WHEN "allow_unicode" THEN "value"
      ELSE unaccent("value")
    END AS "value"
  ),
  "remove_chars" AS (
    SELECT regexp_replace("value", E'[^\w\s-]', '', 'gi') AS "value"
    FROM "normalized"
  ),
  "lowercase" AS (
    SELECT lower("value") AS "value"
    FROM "remove_chars"
  ),
  "trimmed" AS (
    SELECT trim("value") AS "value"
    FROM "lowercase"
  ),
  "hyphenated" AS (
    SELECT regexp_replace("value", E'[-\s]+', '-', 'gi') AS "value"
    FROM "trimmed"
  )
  SELECT "value" FROM "hyphenated";

$$ LANGUAGE SQL STRICT IMMUTABLE;

I’ve used a CTE to get each step as a seperate query: you can do it with just two levels if you don’t mind looking at nested function calls:

CREATE OR REPLACE FUNCTION slugify("value" TEXT, "allow_unicode" BOOLEAN)
RETURNS TEXT AS $$

  WITH "normalized" AS (
    SELECT CASE
      WHEN "allow_unicode" THEN "value"
      ELSE unaccent("value")
    END AS "value"
  )
  SELECT regexp_replace(
    trim(
      lower(
        regexp_replace(
          "value",
          E'[^\w\s-]',
          '',
          'gi'
        )
      )
    ),
    E'[-\s]+', '-', 'gi'
  ) FROM "normalized";

$$ LANGUAGE SQL STRICT IMMUTABLE;

To get the default value for the second argument, we can have an overloaded version with only a single argument:

CREATE OR REPLACE FUNCTION slugify(TEXT)
RETURNS TEXT AS 'SELECT slugify($1, false)' LANGUAGE SQL IMMUTABLE STRICT;

Now for some tests. I’ve been using pgTAP lately, so here’s some tests using that:

BEGIN;

SELECT plan(7);

SELECT is(slugify('Hello, World!', false), 'hello-world');
SELECT is(slugify('Héllø, Wørld!', false), 'hello-world');
SELECT is(slugify('spam & eggs', false), 'spam-eggs');
SELECT is(slugify('spam & ıçüş', true), 'spam-ıçüş');
SELECT is(slugify('foo ıç bar', true), 'foo-ıç-bar');
SELECT is(slugify('    foo ıç bar', true), 'foo-ıç-bar');
SELECT is(slugify('你好', true), '你好');

SELECT * FROM finish();

ROLLBACK;

And we get one failing test:

=# SELECT is(slugify('你好', true), '你好');

          is
──────────────────────
 not ok 7            ↵
 # Failed test 7     ↵
 #         have:     ↵
 #         want: 你好
(1 row)

Time: 2.004 ms

It seems there is no way to get the equivalent to the python re.U flag on a postgres regular expression function, so that is as close as we can get.

Adding JSON operators to PostgreSQL

Don’t use the code in this post. Instead, read: Adding JSON(B) operators to PostgreSQL

Notably, the new JSONB data type in postgres is missing some of the features that the hstore data type has had for some time.

hstore - text     : delete key from left operand
hstore - text[]   : delete keys from left operand
hstore - hstore   : delete matching pairs from left operand

However, using two awesome features of Postgres, it’s possible to add these operators in.

Firstly, python as a language in postgres:

matt=# CREATE LANGUAGE plpythonu;

Then, you can write functions in python, that import standard system libraries, like json.

CREATE OR REPLACE FUNCTION json_subtract(json_object json, key text) RETURNS json AS $body$
import json
data = json.loads(json_object)
data.pop(key, None)
return json.dumps(data)
$body$ LANGUAGE plpythonu;

Finally, you can now overload the operator to get the syntactic sugar you want:

CREATE OPERATOR - (
  LEFTARG = json,
  RIGHTARG = text,
  PROCEDURE = json_subtract
);

Now, you can use the same syntax as for hstore:

matt=# SELECT '{"a":1, "b":2}'::json - 'a'::text;
 ?column?
----------
 {"b": 2}
(1 row)

It’s possible to repeat these for the other subtraction operators:

CREATE OR REPLACE FUNCTION json_subtract(json_object json, keys text[]) RETURNS json AS $body$
import json
data = json.loads(json_object)
for key in keys:
    data.pop(key, None)
return json.dumps(data)
$body$ LANGUAGE plpythonu;

CREATE OPERATOR - (
  LEFTARG = json,
  RIGHTARG = text[],
  PROCEDURE = json_subtract
);

CREATE OR REPLACE FUNCTION json_subtract(json_object json, pairs json) RETURNS json AS $body$
import json
data = json.loads(json_object)
pairs_data = json.loads(pairs)
for key,value in pairs_data.items():
  if key in data and data[key] == value:
    data.pop(key)
return json.dumps(data)
$body$ LANGUAGE plpythonu;

CREATE OPERATOR - (
  LEFTARG = json,
  RIGHTARG = json,
  PROCEDURE = json_subtract
);

I’ll leave it as an exercise to write functions for the other operators.

Postgres Tree Shootout part 2: Adjacency List using CTEs

This is the second post in an ongoing series dealing with storing Hierarchical or Tree data structures in Postgres. You should read the Introduction if you haven’t already.

This post contains the queries that illustrate how an Adjacency List model can be used to represent a Hierarchical set of data, including data definitions, and the various operations that have been defined in the aforementioned introduction.

I’ve discussed Adjacency Lists in the past, but I’ll quickly recap why I think they are good.

  • They are conceptually simple to understand
  • They enforce referential integrity
  • They can be modelled with most ORMs without any extra infrastructure
  • Many of the operations are non-complex
  • Recursive queries allow us to perform the complex queries in reasonable time

To help build suspense (but more because I haven’t yet come up with a way to generate a nice reproducible, yet complex tree), this post may discuss the complexity of the queries, but will not contain any measurements.

Initial tree

Before each operation, our data will look like this (where parents point to children):

2014-11-26 11:27ZCanvas 1Layer 112345678910111213141516

We will assume reversion back to this structure after each operation: we could do this using a TRUNCATE followed by an INSERT; or we could run the operation in a transaction and rollback.

There may be a post which shows the effects of each of the queries below in a graphical form.

Table structure

Adjacency Lists are dead simple. Each node simply contains a reference to it’s parent node.

CREATE TABLE nodes (
  node_id SERIAL PRIMARY KEY,
  parent_id INTEGER REFERENCES nodes(node_id)
);

We can insert our data using a single statement:

INSERT INTO nodes VALUES
  (1, NULL),
  (2, 1),
  (3, 1),
  (4, 2),
  (5, 2),
  (6, 3),
  (7, 3),
  (8, 4),
  (9, 8),
  (10, NULL),
  (11, 10),
  (12, 11),
  (13, 11),
  (14, 12),
  (15, 12),
  (16, 12);

Insertions

Inserting a single leaf node is simple. To insert a single node as a child of node 13, for example:

INSERT INTO nodes (parent_id) VALUES (13);

Inserting a new root node is slightly more complicated. In most ORMs, we would probably do it in two queries: one to create the new node, and a second to update the other root nodes to point to this one. We’ll see that below.

We can do slightly better in raw SQL: UPDATE our table with the result of an INSERT ... RETURNING that occurs inside a Common Table Expression (CTE, or WITH query).

WITH parent AS (
  INSERT INTO nodes(parent_id)
  VALUES (NULL)
  RETURNING node_id
)
UPDATE nodes
SET parent_id = parent.node_id
FROM parent
WHERE parent_id IS NULL;

We can use the same pattern to insert an intermediate node in our tree. For instance, inserting a new node between nodes 11 and 12:

WITH created_node AS (
  INSERT INTO nodes(parent_id)
  VALUES (11)
  RETURNING node_id
)
UPDATE nodes
SET parent_id = created_node.node_id
FROM created_node
WHERE nodes.node_id = 12;

And our last insert, adding a child node, that gets it’s siblings as children. For instance, adding a new node under node 12, which gets all of node 12’s children as it’s children.

WITH created_node AS (
  INSERT INTO nodes(parent_id)
  VALUES (12)
  RETURNING node_id
)
UPDATE nodes
SET parent_id = created_node.node_id
FROM created_node
WHERE nodes.parent_id = 12;

All of these queries should perform relatively well: the CTE will be very simple (as it is no different to the single leaf insert), and the UPDATE should likewise be fairly simple: it needs to filter out which existing nodes do not need to be updated; and then it needs to update the remainder of the rows with the value pulled from the CTE.

This theoretically is only marginally more complex than just a simple UPDATE foo SET bar=baz WHERE quux IS NULL style query.

If we were using an ORM, we might need to do this in two queries: something like this (in Django):

# Insert new root node: all other root nodes now have this as a parent
new_node = Node.objects.create()
Node.objects.filter(parent=None).exclude(pk=new_node.pk).update(parent=new_node)
# Could possibly do as:
Node.objects.filter(parent=None).update(parent=Node.objects.create().pk)

# Insert new node, with a single child as it's child (and that child's previous
# parent as it's parent)
new_node = Node.objects.create(parent=old_node.parent)
old_node.parent = new_node
old_node.save()

# Insert new node, with children of that node's parent now children of the node.
new_node = parent_node.children.create()
parent_node.children.exclude(pk=new_node.pk).update(parent=new_node)
# Again, may be able to do:
parent_node.children.update(parent=parent_node.children.create().pk)

Note the required exclusion of the newly created node: we don’t have to do this in the CTE versions, as that doesn’t “exist” at the time the other part of the query runs.

Removals

Removing a single leaf node is no different than removing a row from a normal table. Removing node 9, for instance:

DELETE FROM nodes WHERE node_id = 9;

Because the information about parent-child relationships is stored in the child, we do not need to do anything else to maintain the tree.

To remove a single root node (in this case, node 1), and promote all children to root nodes themselves, we can do two queries:

UPDATE nodes SET parent_id = NULL WHERE parent_id = 1;
DELETE FROM nodes WHERE node_id = 1;

It may be possible to do this in a single query, similar to the CTE queries above, but I’m not sure of the benefit.

WITH deleted AS (
  DELETE FROM nodes
  WHERE node_id = 1
)
UPDATE nodes SET parent_id = NULL WHERE parent_id = 1;

The same pattern can be used for removing a node, and attaching it’s children to it’s parent. Here, we will remove node 2, and attach it’s children (4 and 5) as children of it’s parent, node 1:

UPDATE nodes
SET parent_id = (SELECT parent_id FROM nodes WHERE node_id = 2)
WHERE parent_id = 2;

DELETE from nodes WHERE node_id = 2;

This is a place where using a CTE might make things clearer - especially if we have the node-to-be-deleted’s id, but not it’s parent:

WITH deleted AS (
  DELETE FROM nodes
  WHERE node_id = 2
  RETURNING node_id, parent_id
)
UPDATE nodes
SET parent_id = deleted.parent_id
FROM deleted
WHERE nodes.parent_id = deleted.node_id;

Righto, now we are up to the traditionally “hard” things for an Adjacency List to perform. Dealing with removing an arbitrary depth of (sub)tree.

We’ll need to create a recursive CTE, and delete according to that. Let’s just select from that initially, so we can see what the CTE data will look like:

WITH RECURSIVE tree AS (
  SELECT node_id, ARRAY[]::integer[] AS ancestors
  FROM nodes WHERE parent_id IS NULL

  UNION ALL

  SELECT nodes.node_id, tree.ancestors || nodes.parent_id
  FROM nodes, tree
  WHERE nodes.parent_id = tree.node_id
)
SELECT * FROM tree;
 node_id | ancestors
---------+------------
       1 | {}
      10 | {}
       2 | {1}
       3 | {1}
      11 | {10}
       4 | {1,2}
       5 | {1,2}
       6 | {1,3}
       7 | {1,3}
      12 | {10,11}
      13 | {10,11}
       8 | {1,2,4}
      14 | {10,11,12}
      15 | {10,11,12}
      16 | {10,11,12}
       9 | {1,2,4,8}
(16 rows)

Coolio. So, on to our operations. Let’s remove the subtree starting with node 2. I’ll hide the CTE, since it will be the same for quite a few of these operations:

WITH RECURSIVE tree AS (...)
DELETE FROM nodes
WHERE node_id IN (
  SELECT node_id FROM tree WHERE 2 = ANY(tree.ancestors)
) OR node_id = 2;

The query is identical for a full tree (node 1 and descendants):

WITH RECURSIVE tree AS (...)
DELETE FROM nodes
WHERE node_id IN (
  SELECT node_id FROM tree WHERE 1 = ANY(tree.ancestors)
) OR node_id = 1;

And it’s nearly identical for just the descendants of a given node. Here, for all of node 2’s descendants, but not that node itself:

WITH RECURSIVE tree AS (...)
DELETE FROM nodes
WHERE node_id IN (
  SELECT node_id FROM tree WHERE 2 = ANY(tree.ancestors)
);

Moves

Because the relationship is stored purely on the child element, moving around trees and subtrees is very easy. We can start with moving subtree starting with 3 to underneath node 4:

UPDATE nodes
SET parent_id = 4 WHERE node_id = 3;

Nothing surprising there. Similarly, the query is identical for moving a leaf to a different parent, a root node into a tree, and turning a subtree into a full tree (making that node a root node).

UPDATE nodes SET parent_id = 6 WHERE node_id = 5;
UPDATE nodes SET parent_id = 8 WHERE node_id = 10;
UPDATE nodes SET parent_id = NULL WHERE node_id = 2;

The final move: all of node’s children to a different node is almost as simple:

UPDATE nodes SET parent_id = 5 WHERE parent_id = 12;

This seems to be a situation where Adjacency Lists are really good. None of these queries are any more complex than the simplest UPDATE you could think of.

Fetches

Using the same CTE, we can perform our fetches. We may need to extend it to deal with depths, but since the ancestors column contains ancestors starting with the root node, we could count stuff in there. Let’s see how it goes.

Descendants

Fetching all descendants of a given node just means we want to see if the node occurs at all in each row’s ancestors. To get all of node 10’s descendants:

WITH RECURSIVE tree AS (
  SELECT node_id, ARRAY[]::integer[] AS ancestors
  FROM nodes WHERE parent_id IS NULL

  UNION ALL

  SELECT nodes.node_id, tree.ancestors || nodes.parent_id
  FROM nodes, tree
  WHERE nodes.parent_id = tree.node_id
)
SELECT node_id FROM tree WHERE 10 = ANY(tree.ancestors);

However, we could improve this by starting with just the node we care about, or more specifically, it’s children:

WITH RECURSIVE tree AS (
  SELECT node_id, ARRAY[10]::integer[] AS ancestors FROM nodes WHERE parent_id = 10

  UNION ALL

  SELECT nodes.node_id, tree.ancestors || nodes.parent_id
  FROM nodes, tree
  WHERE nodes.parent_id = tree.node_id
)
SELECT node_id FROM tree;

Obviously, this is far less generic, but it is also significantly less complex. For starters, it only builds up the part of the tree we care about, and then just returns the node ids, rather than building up the whole tree, and then discarding the parts that are not required.

The same code can be used for determining the number of descendants, but with a COUNT(node_id) in the final query.

To get our depth-limited query, we can approach from two directions. To get the subtree to depth 2 from above:

WITH RECURSIVE tree AS (
  SELECT node_id, ARRAY[10]::integer[] AS ancestors FROM nodes WHERE parent_id = 10

  UNION ALL

  SELECT nodes.node_id, tree.ancestors || nodes.parent_id
  FROM nodes, tree
  WHERE nodes.parent_id = tree.node_id
  AND cardinality(tree.ancestors) < 2
)
SELECT node_id FROM tree;

To do the same in the more generic form have to look at how close the desired node is to the end of the ancestors array:

WITH RECURSIVE tree AS (
  SELECT node_id, ARRAY[]::integer[] AS ancestors
  FROM nodes WHERE parent_id IS NULL

  UNION ALL

  SELECT nodes.node_id, tree.ancestors || nodes.parent_id
  FROM nodes, tree
  WHERE nodes.parent_id = tree.node_id
)
SELECT node_id
FROM tree
WHERE ARRAY_POSITION(ancestors, 10) < ARRAY_LENGTH(ancestors, 1) - 2;

(Note that this is a bug fix on the original version of this post).

Ancestors

Fetching ancestors from our generic CTE is a bit simpler, because that data is already part of the query:

WITH RECURSIVE tree AS (
  SELECT node_id, ARRAY[]::integer[] AS ancestors
  FROM nodes WHERE parent_id IS NULL

  UNION ALL

  SELECT nodes.node_id, tree.ancestors || nodes.parent_id
  FROM nodes, tree
  WHERE nodes.parent_id = tree.node_id
) SELECT unnest(ancestors) FROM tree WHERE node_id = 15;

To do the equivalent of the hand-built CTE, we would need to start with the node, and build back the other way. It’s getting late here, so I can’t think of a way to do this right now that doesn’t get stuck doing infinite recursion.

The count query is an interesting one: we can just remove the need to unnest, and take the cardinality:

WITH RECURSIVE tree AS (
  SELECT node_id, ARRAY[]::integer[] AS ancestors
  FROM nodes WHERE parent_id IS NULL

  UNION ALL

  SELECT nodes.node_id, tree.ancestors || nodes.parent_id
  FROM nodes, tree
  WHERE nodes.parent_id = tree.node_id
) SELECT cardinality(ancestors) FROM tree WHERE node_id = 15;

The depth query is a little trickier. We want to know the ancestors of node 15, up to a depth of 2. If our ancestors array was in the reverse order, we should be able to unnest and limit.

WITH RECURSIVE tree AS (
  SELECT node_id, ARRAY[]::integer[] AS ancestors
  FROM nodes WHERE parent_id IS NULL

  UNION ALL

  SELECT nodes.node_id, nodes.parent_id || tree.ancestors
  FROM nodes, tree
  WHERE nodes.parent_id = tree.node_id
) SELECT unnest(ancestors) FROM tree WHERE node_id = 15 LIMIT 2;

We can do this because a node only has one parent: so limiting the number of ancestors (when sorted nearest ancestor first) is the same as limiting the depth.

Special queries

Fetching all leaf nodes is just a matter of excluding those that have a relationship to another node as it’s parent:

SELECT node_id FROM nodes
WHERE node_id NOT IN (
  SELECT parent_id FROM nodes WHERE parent_id IS NOT NULL
);

Trick for new players: if you leave off the WHERE clause in that sub-query, you won’t get any matches!

Fetching the number of leaf nodes is trivial.

Fetching root nodes (or the number of) is simpler than leaf nodes:

SELECT node_id FROM nodes
WHERE parent_id IS NULL;

Fetching non-leaf nodes, and non-root nodes is just negations of the two queries above:

SELECT node_id FROM nodes WHERE node_id IN (
  SELECT parent_id FROM nodes WHERE parent_id IS NOT NULL
);

SELECT node_id FROM nodes WHERE parent_id IS NOT NULL;

And the non-leaf, non-root nodes just combines these two queries:

SELECT node_id FROM nodes WHERE node_id IN (
  SELECT parent_id FROM nodes WHERE parent_id IS NOT NULL
) AND parent_id IS NOT NULL;

As an aside: there is also the inverse of this: nodes which are isolated (root AND leaf nodes):

SELECT node_id FROM nodes
WHERE parent_id IS NULL
AND node_id NOT IN (
  SELECT parent_id FROM nodes WHERE parent_id IS NOT NULL
);

Well, that’s our operations. Most of them are really simple. For anything that requires us to fetch or delete a whole subtree, we needed to revert to recursive CTEs, and for some of the other operations, using a CTE makes it simpler (and easier to understand).

Next, we will look at an alternative to the CTE operations, using a recursive view. From there, we should be able to look at a trigger-based approach that materializes our tree (node, ancestors) data in a table, and keeps it up to date. That’s, as I hinted, getting close to a Materialized Path approach, but keeps the conceptual simplicity of the Adjacency List, and hopefully prevents possible issues relating to referential integrity.

Avoiding SQL Antipatterns using Django (and Postgres)

The book SQL Antipatterns is one of my favourite books. I took the opportunity to reread it on a trip to Xerocon in Sydney, and as usual it enlightened me to thing I am probably doing in my database interactions.

So, I’m going to look at these Antipatterns, and discuss how you can avoid them when using Django. This post is intended to be read with each chapter of the book. I’ve used the section headings, but instead of the chapter headings, I’ve used the Antipattern headings. They are still in the same order, though.

It seems the printed version of this book is on sale now: I’m tempted to buy a few extra copies for gifts. Ahem, cow-orkers.

Logical Database Design Antipatterns

Format Comma-Separated Lists

This one is pretty simple: use a relation instead of a Comma Separated field. In the cases described in the book, a ManyToManyField is in fact simpler than a Comma Separated field. Django gets a gold star here, both in ease of use, but also in documentation about relations.

However, there may be times when a relation is overkill, and a real array is better. For instance, when storing data related to which days of the week are affected by a certain condition, it may make sense to store it in this way.

But we can do better than a simple Comma Separated field. Storing the data in a Postgres Array means we can rely on the database to validate the data, and allows searching. Similarly, we could store it in JSON, too.

I’ve maintained a JSONField for Django, although it’s not easily queryable. However, an ArrayField is coming in Django 1.8. There are alternatives already available if you need to use one now. I’ve got a project to mostly backport the django.contrib.postgres features to 1.7: django-postgres.

Things like JSON, Array and Hstore are a better solution than storing other-delimitered values in a straight text column too. With Django 1.7, it became possible to have lookups, which can leverage the DBMS’ ability to query these datatypes.

Always Depend on One’s Parent

Read chapter online.

Straight into a trickier one! And, Django’s documentation points out how to create his type of relation, but does not call out the possible issues. This book is worth it for this section alone.

So, how do we deal with trees in Django?

We can use django-mptt. This gives us (from what I can see) the “Nested Sets” pattern outlined in the book, but under the name “Modified Preorder Tree Traversal”.

I’m quite interested in the idea of using a Closure Table, and there are a couple of projects with quite different approaches to this:

  • django-ctt: uses a Model class you inherit from.
  • django-ct: better documented, but uses an unusual pattern of a pseudo-manager-thing.

Knowing me, I’m probably going to spend some time building a not-complete implementation at some point.

Update: Whilst I haven’t built an implementation of a Closure Table, I did implement recursive queries for an Adjacency List.

One Size Fits All

Using a field id for all tables by default is probably one of the biggest mistakes I think Django makes. And, as we shall see, we can’t yet avoid them, for at least a subset of situations.

Indeed, Django can use any single column for the primary key, and doesn’t require the use of a key column of name id. So, in my mind, it would have been better to use the <tablename>_id, as suggested in the book. Especially since you may also access the primary key attribute using the pk shortcut.

class Foo(models.Model):
    foo_id = models.AutoField(primary_key=True)

However, it’s not currently possible to do composite primary keys (but may be soon), which makes doing the best thing for a plain ManyToManyField possible: indeed, you don’t control that table anyway, and if you remove the id column (and create a proper primary key), things don’t work. In practice, you can just ignore this issue, since you (mostly) don’t deal with this table, or the objects from it.

So, assuming we are changing the id column into the name suggested in the book, what does that give us?

Nothing, until we actually need to write raw SQL code, and specifically code that joins multiple tables.

Then, we are able to use a slightly less verbose way of defining the join, and not worry about duplicate columns named id:

SELECT * FROM foo_foo JOIN foo_bar USING (foo_id);

I’m still not sure if it’s actually worthwhile doing this or not. I’m going to start doing it, just to see whether there are any drawbacks (already found one in some of my own code, that hard-coded an id field), or any great benefits.

Leave out the Constraints

Within Django, it’s more work to create relations without the relevant constraints, and it’s not possible to create a table without a primary key, so we can just pass this one by with a big:

smile and wave, boys

Use a Generic Attribute Table

Again, it’s possible to create this type of a monstrosity in Django, but not easy. A better solution, if your table’s requirements change is to use migrations (included in Django 1.7), or a more flexible store, like JSON or Hstore. This also has the added advantage of being a column, rather than a related table, which means you can fetch it in one go, simply. Similarly, with Postgres 9.3, you can do all sorts of querying, and even more in 9.4.

Document or key stores are no substitute for proper attributes, but they do have their uses.

The other solution is to use Model inheritance, which Django does well. You can choose either abstract or concrete table inheritance, and with something like django-model-utils, even get some nice features like fetching only the subtypes when fetching a queryset of superclass models.

Use Dual-Purpose Foreign Key

Unfortunately, Django comes with a built-in way to do this: so-called Generic Relations.

Using this, it’s possible to have an association from a given model instance to any other object of any other model class.

“You may find that this antipattern is unavoidable if you use an object-relational programming framework […]. Such a framework may mitigate the risks introduced by Polymorphic Associations by encapsulating application logic to maintain referential integrity. If you choose a mature and reputable framework, then you have some confidence that its designers have written the code to implement the association without error.”

I guess we’ll just have to rely on the fact Django is a mature and reputable framework.

In all reality, I’ve used this type of relation once: for notifications that need to be able to refer to any given object. It’s also possible to use, say, a tagging app that had generic relations. But, I’m struggling to think of too many situations where it would be better than a proper relation.

I’ve also come across it in django-reversion, and running queries against objects from it is a pain in the arse.

Create Multiple Columns

Interestingly, the example for this Antipattern is the example I just used above: tags. And, this type of situation should be done in a better way: a proper relation, or perhaps an Array type. It all depends how good your database is at querying arrays. django.contrib.postgres makes this rather easy:

class Post(models.Model):
    name = models.CharField(...)
    tags = ArrayField(models.CharField(...), blank=True)

Post.objects.filter(tags__contains=['foo'])

What may not be so easy is getting all of the tags in use. This may be possible: I just haven’t thought of a way to do this yet. A nice syntax might be:

Post.objects.aggregate(All('tags'))

The SQL you might be able to use to get this could look like:

SELECT
  array_agg(distinct t) AS tags
FROM (
  SELECT unnest(tags) FROM posts
) t;

I’m not sure if there’s a better way to get this data.

Clone Tables or Columns

I can’t actually see that doing this in Django would be easy, or likely. It’s gotten me interested in some method of seamlessly doing Horizontal Partitioning as a method of archiving old data, and perhaps moving it to a different database. Specifically, moving old audit data into a separate store may become necessary at some point.

Partitioning using a multi-tenancy approach using Postgres’ schemata is another of my interests, and I’ve been working on a django-specific way to do this: django-boardinghouse. Note, this is a partial-segmentation approach, where some tables are shared, but others are per-schema.

Physical Database Design Antipatterns

Use FLOAT Data Type

Just don’t.

There’s a DecimalField, and no reason not to use it.

Specify Values in the Column Definition

The example the book uses is to define check constraints on a given question. Django’s approach is a bit different: the valid choices are defined in the column definition, but can be changed in code at any time. Any existing values that are no longer valid are fine, but any attempt to save an object will require it to have one of the newly valid choices.

This is both better and worse than the problem described in the book. There’s no way (short of a migration) to change the existing data, but maybe that’s actually just better.

Again, the best solution is just to use a related field, but in some cases this is indeed overkill: specifically if values are unlikely to change.

Assume You Must Use Files

I’m still 50-50 on this one. Basically, storing binary files in your database (a) makes the database much bigger, which means it takes longer to back it up (and restore it), and (b) means that it’s harder to do things like use the web server, rather than the application server, to serve static files (even those user-supplied, that must be authenticated).

The main disadvantage, of not having backups, is purely an operations issue.

The secondary disadvantage: the lack of transactionality is also easily solved: don’t delete files (unless necessary), and don’t overwrite them. If you really must, then use a Postres NOTIFY delete-file <filepath> or similar, and have a listener that handles that.

The other disadvantage, about SQL privilidges is mostly moot under Django anyway, as you are always running as the one database user.

Using Indexes Without a Plan

Indexes are fairly tangiential to an ORM: I’m going to pass over this one without too much comment. I’ve been doing a fair bit of index-level optimisations on my production database lately, in an effort to improve performance. Mostly, it’s better to optimise the query, as the likely targets for indexes probably already have them.

Query Antipatterns

Use Null as an Ordinary Value, or Vice Versa.

Python has it’s own None type/value, and using it in queries basically converts it into NULL. Django is a little annoying how at times it stores empty strings instead of NULL in string fields. I was playing around with making these into proper NULLs, but it seemed to create other problems.

At least there is no established pattern to use other values instead of NULL.

Reference Non-grouped Columns

Since I’m dealing with Postgres, I understand this one is not much of an issue. Your query will fail if you build it wrong. Which should be the way databases work.

Sort Data Randomly

Read this chapter online.

The problem of how to fetch a single random instance from a Model comes up every now and then on IRC, indeed, it did again last weekend. Unsurprisingly, I provided a link to this chapter.

One solution that is presented in the book is to select a single row, using a random offset:

import random
# Note: the initial version of this would fail since queryset.count()
# is the number of elements, randint(a, b) includes the value 'b',
# and queryset[b] would be out of range.
index = random.randint(0, queryset.count() - 1)
instance = queryset.all()[index]

This, converts to the query:

SELECT * FROM "table" LIMIT 1 OFFSET %s;

However, without an ordering, I believe this will still do a complete table seek. Instead, you want to order on a column with an index. Like the primary key:

instance = queryset.order_by('pk')[index]

It does take two queries, but sometimes two queries is better than one. Obviously, if your table was always going to be small, it may be better to do the random ordering:

instance = queryset.order_by('?')[0]

Pattern Matching Predicates

I’m sorry to say Django makes it far too easy to do this:

queryset.filter(foo__contains='bar')

Becomes something like:

SELECT * FROM "table" WHERE "table"."foo" LIKE '%bar%';

In many cases, this will be fine, but as you can imagine, you may get surprising matches, or performance may really suck.

Using Postgres’s full-text search is relatively simple: you can quite easily make a custom field that handles this, and with Django 1.7 or later, you can even create your own lookups:

from django.db import models


class TSVectorField(models.Field):
    def db_type(self, connection):
        return 'tsvector'


class TSVectorMatches(models.lookups.BuiltinLookup):
    lookup_name = 'matches'
    def process_lhs(self, qn, connection, lhs=None):
        lhs = lhs or self.lhs
        return qn.compile(lhs)

    def get_rgs_op(self, connection, rhs):
        return '@@ to_tsquery(%s)' % rhs

TSVectorField.register_lookup(TSVectorMatches)

Then, you are able, on a correctly defined field, able to do:

queryset.filter(foo__matches='bar')

Which roughly translates to:

SELECT * FROM "table" WHERE (foo @@ to_tsquery('bar'));

It’s actually a little more complicated than that, but I have a working prototype at https://bitbucket.org/schinckel/django-postgres/. There is a field class, but also an example within the search sub-app.

Clearly, you’ll want to be creating the right indexes.

Solve a Complex Problem in One Step

By their very nature, ORMs tend to make this a little less easy to do. Because you don’t normally write custom code, this scenario is less common than you might see in a normal SQL access.

However, with Django, it is possible to write over-complicated queries, but also to use things like .raw(), and .extra() to write “Spaghetti Queries”.

However, it is worth noting that with judicious use of these features, you can indeed write queries that perform exceptionally well, indeed, far better than the ORM is able to generate for you. It’s also worth noting that you can write really, really bad queries that take a very long time, just using the ORM (without even doing things like N+1 queries for related objects).

Indeed, the “how to recognize” section of this chapter shows the biggest red flag I have noticed lately: “Just stick another DISTINCT in there”.

I’ve seen, first-hand how a .distinct() can cause a query to take a very long period of time. Removing the need for a distinct by removing the join, and instead using subqueries, caused a query that was taking around 17 seconds with a given data set to suddenly take less than 200ms.

That alone has forced me to reconsider each and every time I use .distinct() in my code (and probably explains why our code that runs queries against django-reversion) performs so horribly.

A Shortcut That Gets You Lost

I’ve used, in my SQL snippets in this post, the shortcut that is mentioned here: SELECT * FROM .... Luckily, Django doesn’t use this shortcut, and instead lists out every column it expects to see.

This has a really nice side-effect: if your database tables have not been migrated to add that new column, then whenever you try to run any queries against that table, you will have an error. Which is much more likely to happen immediately, rather than at 3am when that column is first actually used.

Application Development Antipatterns

Store Password in Plain Text

There is no, I repeat, no reason you should ever be doing this. It’s a cardinal sin, and Django has a great authentication and authorisation framework, that you can extend however you need it.

As noted in the legitimate uses section: if you are accessing a third-party system, you may need to store the password in a readable format. In this case, something like Oauth, if available, may make things a little safer.

Execute Unverified Input As Code

Read this chapter online.

Most of the risks of SQL Injection are mitigated when you use an ORM like Django’s. Of course, if you write .raw() or .extra() queries that don’t properly escape user-provided data, then you may still be at risk. .extra() in particular has arguments that allow you to pass an iterable of parameters, which will then be correctly escaped as they are added to the query.

Filling in the Corners

Educate your manager if (s)he thinks it’s a bad thing to have non-contiguous primary keys. Transaction rollbacks, deleted objects: there’s all sorts of reasons why there may be gaps.

Making Bricks Without Straw

It goes without saying that you should have error handling within your python code.

Make SQL a Second-Class Citizen

This is kind-of the point of an ORM: to remove from you the need to deal with creating complex queries in raw SQL.

Your Django models are the documentation of your table structure, or documentation can be generated from them. Your migrations files show the changes that have been made over time. Naturally, both of these will be stored in your Source Code Management system.

Clearly, as soon as you are doing anything in raw SQL, then you should follow the practices you do with the rest of your code.

Testing in-database is something I am a little bit interested in. As I move more code into the database (often for performance reasons, sometimes because it’s just fun), it would be nice to have tests for these functions. I have a long list of things in my Reading List about Postgres Unit Testing. Perhaps I’ll get around to them at some point. Integrating these with the Django test runner would be really neat.

The Model Is an Active Record

Django’s use of the Active Record is slightly different to Rails. In Rails, the column types in the database control what attributes are on the model, but in Django, the python object is the master. I think this is more meaningful, because it means that everything you need to know about an object is in the model definition: you don’t need to follow the migrations to see what attributes you have.

I do like the concept of a Domain Model: it’s an approach I’ve lightly tried in the past. Perhaps it is an avenue I’ll push down further at some point. In some ways, Django’s Form classes allow you to encapsulate this, but mostly business logic still lives on our Model classes.

Summary

So, how did Django do?

Pretty good, I’d say. The ones that were less successful either don’t really matter most of the time (primary key column is always called id, choices defined in the model), or you don’t really need to use them (Generic Relations, searching using LIKE %foo%, using raw SQL).

We do fall down a bit with files stored in the database, and fat models, but I would argue that those patterns work just fine, at least for me right now.

Postgres VIEW meet Django Model

Postgres VIEWs are a nice way to store a subset of a table in a way that can itself be queried, or perhaps slightly or radically changing the shape of your table. It has a fairly simple syntax:

CREATE VIEW "foo" AS
SELECT "bar", "baz", "qux"
FROM "corge"
WHERE "grault" IS NULL;

You may use any valid SELECT query as the source of a VIEW, including one that contains UNION or UNION ALL. You can use this form to create a view that takes two similarly formatted tables and combines them into one logical table. Note that for a UNION to work, the columns (and column types) must be identical between the two parts of the query. A UNION will do extra work to ensure all rows are unique: UNION ALL may perform better, especially if you know your rows will be unique (or you need duplicates).

By default, a Postgres VIEW is dynamic, and read-only. With the use of the CREATE MATERIALIZED VIEW form, it’s possible to have a cached copy stored on disk, which requires an UPDATE MATERIALIZED VIEW "viewname" in order to cause an update.

It’s also possible to create a writeable VIEW, but I’m not going to discuss those now.


There is a feature of Django that makes in really simple to use a VIEW as the read-only source of a Django Model subclass: managed = False.

Given the VIEW defined above, we can write a Model that will allow us to query it:

from django.db import models

class Foo(models.Model):
    bar = models.CharField()
    baz = models.CharField()
    qux = models.CharField()

    class Meta:
      managed = False

Psycopg2 also has the ability to automatically convert values as it fetches them, so you don’t even really need to set the fields as the correct type: but you will probably want to where possible, as an aid to code readability.

In my case, I was returning a two-dimensional ARRAY of TIMESTAMPTZ, but didn’t want to have to include the full code for an ArrayField. So, I just defined it as a CharField, and psycopg2 just gave me the type of object I actually wanted anyway.

There is one little catch, and the code above will not quite work.

Django requires a primary key, even though in this case it makes no sense. You could define any field as a primary key, include a relevant key field from the parent model, or even a dummy value that is the same on every row. Relying on the same psycopg2 trick as above, you could use <tablename>-<id> so as to ensure uniqueness, even though that is not normally a valid value for a Django AutoField.

You probably need to be a little careful here, as if you are doing comparisons, Django will test __class__ and pk for testing equality, so you could hurt yourself if you aren’t careful.

You may also want to prevent write access at the Django level. Overriding save() and delete() on the Model class would be a good start, as well as writing a custom Manager/QuerySet that does the same. You could raise an exception that makes sense, like NotImplemented, or just leave it as a database error.

Postgres Tree Shootout part 1: Introduction.

I’ve written before about using Adjacency Lists, and even done some performance tests on querying them. Whilst reading a post today, it occurred to me that it might be worthwhile to do a comparison of the various methods of storing hierarchical data in Postgres, and the costs of the same operations on each of those.

This post is just an introduction, with an outline of what I plan to do to run these tests. Please feel free to suggest things that I have missed, or that might be an oversight at my end.


Tree Models

There are four methods of storing the relationships that might form a tree. This analysis will be limited to actual tree, rather than graph structures (no cycles). I plan to detail the data structures in a series of posts, one per method. In each case, where there are multiple ways to store the data, I will attempt to examine each of these.

Adjacency Lists, being the simplest to understand (and the ones I have spent more time on recently), will be discussed first.

Path Enumerations will be next, with a comparison of storing the data using the ltree extension, and using an ARRAY column.

Following this, I’ll make an attempt at using the Closure Table model: where each ancestor-descendant relationship is stored, rather than just the parent-child relationship.

Finally, I’ll have a crack at the Nested Set model. I’m not solidly behind this model for the types of data I’ve had to deal with, but it is a valid mechanism for storing and retrieving this data. Besides, it will be an interesting exercise to implement.

My plan to handle all of these is that all tree manipulation should be “automatic”, that is, adding a node (or removing one, or whatever) should not require explicit updating of the various metadata. This should all be handled by trigger functions on the tables themselves. Whether this turns out to be reasonable we shall see.


Operations

I plan to perform the same set of operations (on the same data, rather than randomly generated data) in all models, and compare the complexity and run-time of the various queries. I’m hoping to cover all of the operations that might be performed on a tree structure, so please add any more to the comments.

The data stored in the table will contain more than one tree: this means we can perform operations which add/remove root nodes/whole trees.

Insertions

  • Insert a single leaf node
  • Insert a single root node (all existing root nodes will then point to this)
  • Insert a single node partway through the tree, with the “replaced” node becoming a child of this (and this one keeps it’s children)
  • Insert a single node partway through the tree, with this node’s parent’s existing children all becoming children of the new node.

Removals

  • Remove a single leaf node
  • Remove a single root node (all children of this are promoted to root nodes)
  • Remove a single node partway through the tree: all children of this node then have their grand-parent as their parent.
  • Remove a subtree (a single non-root node and it’s descendants)
  • Remove a whole tree (a single root node and all descendants)
  • Remove all descendants of a specific node (but not the node itself)

Moves

  • Move a subtree from one parent to another
  • Move a single leaf node to a different parent
  • Move a root node into a tree
  • Make a subtree into a tree (turn a node into a root node).
  • Move all children of a node to a different parent

Fetches

  • Fetch all descendants of a given node
  • Fetch the number of descendants of a given node
  • Fetch descendants of a given node to a given depth
  • Fetch the number of descendants of a given node to a given depth
  • Fetch all ancestors of a given node
  • Fetch the number of ancestors of a given node
  • Fetch ancestors of a given node to a given depth
  • Fetch the number of ancestors of a given node to a given depth

    I don’t think this makes any sense.

  • Fetch all leaf nodes
  • Fetch the number of leaf nodes
  • Fetch all root nodes
  • Fetch the number of root nodes
  • Fetch all non-leaf nodes
  • Fetch the number of non-leaf nodes
  • Fetch all non-root nodes
  • Fetch the number of non-root nodes
  • Fetch all non-root, non-leaf nodes
  • Fetch the number of non-root, non-leaf nodes

Long Live Adjacency Lists

I recently wrote about the excellent book SQL Antipatterns, and in it briefly discussed the tree structures. I’ve been thinking about trees in Postgres a fair bit lately, and a discussion on #django gave me further incentive to revisit this topic.

The book discusses four methods of storing a tree in a database.

Adjacency Lists, apart from the inability to grab a full or partial tree easily, are the simplest to understand. The child object stores a reference to it’s parent. Because this is a foreign key, then it always maintains referential integrity. Fetching a parent is simple, as is fetching all children, or siblings. It’s only when you need to fetch an arbitrary depth that things become problematic, unless you use a recursive query. More on that later.

Postgres has an extension called ltree, which provides an implementation of a Path Enumeration, but one thing that really bothers me about this type of structure is the lack of referential integrity. In practice, I’m not sure what having this ltree structure would give you over simply storing the keys in an ARRAY type. Indeed, if Postgres ever gets Foreign Key constraints for ARRAY elements (which there is a patch floating around for), this becomes even more compelling. It also seems to me that restructuring a tree becomes a bit more challenging in a Path Enumeration than an Adjacency List.

Nested Sets are also interesting, and maintain FK integrity, but require potentially rewriting lots of data when any change is made to the tree. They aren’t that appealing to me: perhaps I fail to see any big advantages of this structure.

Finally, Closure Tables are perhaps the most interesting. This stores all ancestor-descendant relationships, rather than just parent-child, which again requires more work when adding or removing. Again, Referential Integrity is preserved, but it seems like there is lots of work to maintain them.

From all of these, there are some significant advantages, in my mind, to using a simple Adjacency List.

  1. Adding a new row never requires you to alter any other rows in the database.
  2. Moving a subtree to a different location only requires a change to one now in the database.
  3. It’s never possible to end up with Referential Integrity errors: the database will prevent you from deleting a parent row whilst it still has children (or, you may set it to CASCADE or SET NULL the children automatically).
  4. It’s conceptually very simple. Everyone understands the parent-child relationship (and all of the relationships that follow, like grand-parents). It’s a similar mental model to how we think about our own families, except we don’t have exactly one parent.

There is really only two things that are hard to do:

  1. Given a node, select all descendants of that node.
  2. Given a node, select all ancestors of that node.

But, as we shall see shortly, it is possible to do these in Postgres using some nice recursive features.

There is another advantage to using an Adjacency List, this time from the perspective of Django. We can do it without needing to install a new package, or subclass or mix-in a new Model:

class Node(models.Model):
    node_id = models.AutoField(primary_key=True)
    parent = models.ForeignKey('self', null=True, blank=True, related_name='children')

That’s it.

Now, using Postgres, it’s possible to build a recursive VIEW that contains the whole tree:

CREATE RECURSIVE VIEW tree (node_id, ancestors) AS (
    SELECT node_id, '{}'::integer[]
    FROM nodes WHERE parent_id IS NULL
  UNION ALL
    SELECT n.node_id, t.ancestors || n.parent_id
    FROM nodes n, tree t
    WHERE n.parent_id = t.node_id
);

We can then query this (replacing %s with the parent node id):

SELECT node_id
FROM nodes INNER JOIN tree USING (node_id)
WHERE %s = ANY(ancestors);

Or, if you want to select for multiple parents:

SELECT node_id
FROM nodes INNER JOIN tree USING (node_id)
WHERE [%s, %s] && ancestors;

This actually performs relatively well, and, if it doesn’t do well enough, we could create a MATERIALIZED VIEW based on the recursive view, and query that instead (refreshing it whenever we need to, perhaps using a trigger).

CREATE MATERIALIZED VIEW tree_m AS (SELECT * FROM tree);

CREATE FUNCTION refresh_tree_m() RETURNS trigger AS $$
  BEGIN
  REFRESH MATERIALIZED VIEW tree_m;
  END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trig_refresh_tree_m AFTER TRUNCATE OR INSERT OR UPDATE OR DELETE
ON nodes FOR EACH STATEMENT
EXECUTE PROCEDURE refresh_tree_m();

This view is still not perfect though. We can improve it to allow us to limit depth of ancestry:

CREATE RECURSIVE VIEW tree (node_id, ancestors, depth) AS (
    SELECT node_id, '{}'::integer[], 0
    FROM nodes WHERE parent_id IS NULL
  UNION ALL
    SELECT n.node_id, t.ancestors || n.parent_id, t.depth + 1
    FROM nodes n, tree t
    WHERE n.parent_id = t.node_id
);

SELECT node_id FROM nodes INNER JOIN tree USING (node_id)
WHERE %s = ANY(ancestors) AND depth < %s;

This is pretty good now, but if we have cycles in our tree (yes, this makes it technically no longer a tree, but a graph, of which a tree is a restricted kind), this query will run forever. There’s a pretty neat trick to prevent cycles:

CREATE RECURSIVE VIEW tree (node_id, ancestors, depth, cycle) AS (
    SELECT node_id, '{}'::integer[], 0, FALSE
    FROM nodes WHERE parent_id IS NULL
  UNION ALL
    SELECT
      n.node_id, t.ancestors || n.parent_id, t.depth + 1,
      n.parent_id = ANY(t.ancestors)
    FROM nodes n, tree t
    WHERE n.parent_id = t.node_id
    AND NOT t.cycle
);

You don’t need to use the cycle column outside of the view.

The query used for the view can be repurposed into a Common Table Expression, which is basically a way of defining a view that only exists for the query we are executing (but will itself only be executed once, even if it’s referred to lots of times):

WITH RECURSIVE tree (node_id, ancestors, depth, cycle) AS (
    SELECT node_id, '{}'::integer[], 0, FALSE
    FROM nodes WHERE parent_id IS NULL
  UNION ALL
    SELECT
      n.node_id, t.ancestors || n.parent_id, t.depth + 1,
      n.parent_id = ANY(t.ancestors)
    FROM nodes n, tree t
    WHERE n.parent_id = t.node_id
    AND NOT t.cycle
) SELECT n.* FROM nodes n INNER JOIN tree USING (node_id)
WHERE %s = ANY(ancestors);

You can see that this syntax basically defines the view before running the real query.


Looking at it from the perspective of Django, we would like to be able to spell a query something like:

Node.objects.filter(parent__recursive=node)
Node.objects.filter(parent__recursive__in=nodes)
Node.objects.filter(children__recursive__contains=node)

The problem we have with using the CTE immediately above is that we don’t have access to the full query at the time we are dealing with the filter. We could define the view prior to running the query (perhaps in a migration), but this means it’s more than just a simple field: although with the new migrations framework, we could make it so that makemigrations automatically adds a migration operation to create the recursive view.

The other solution is to still use a recursive CTE, but use it as a subquery. I’m still investigating if this will have poor performance characteristics.

Here is an implementation of doing just that:

from django.db import models

SQL = """
WITH RECURSIVE "tree" ("{pk}", "related", "cycle") AS (
    SELECT "{pk}", ARRAY[]::integer[], FALSE
    FROM "{table}" WHERE "{fk}" IS NULL
  UNION ALL
    SELECT a."{pk}", b."related" || a."{fk}", a."{fk}" = ANY(b."related")
    FROM "tree" b, "{table}" a
    WHERE a."{fk}" = b."{pk}" AND NOT b."cycle"
) {query}
"""


class RecursiveRelation(models.ForeignKey):
    def __init__(self, *args, **kwargs):
        super(RecursiveRelation, self).__init__('self', *args, **kwargs)

    def get_lookup_constraint(self, constraint_class, alias, targets, sources, lookups,
                              raw_value):
        if lookups[0] == 'recursive':
            # With a recursive query, we want to build up a subquery that creates
            # the simplest possible tree we can deal with.
            data = {
                'fk': self.get_attname(),
                'pk': self.related_fields[0][1].get_attname(),
                'table': self.model._meta.db_table
            }
            if lookups[-1] == 'in':
                if targets[0] == self:
                    raw_value = ForeignKeyRecursiveInLookup(raw_value, **data)
                else:
                    raw_value = ForeignKeyRecursiveReverseInLookup(raw_value, **data)
            else:
                if targets[0] == self:
                    raw_value = ForeignKeyRecursiveLookup(raw_value, **data)
                else:
                    raw_value = ForeignKeyRecursiveReverseLookup(raw_value, **data)

            # Rewrite some variables so we get correct behaviour.

            # This makes the query based on the original table, not the joined version,
            # which was skipping a level of relation. It still joins the table, however,
            # which can't be great for performance
            alias = self.model._meta.db_table
            # This sets the correct lookup type, removing the recursive bit.
            lookups = lookups[1:] or ['exact']

        return super(RecursiveRelation, self).get_lookup_constraint(
            constraint_class, alias, targets, sources, lookups, raw_value
        )


class ForeignKeyRecursiveLookup(object):
    query = 'SELECT "{pk}" FROM "tree" WHERE %s = ANY("related")'

    def __init__(self, value, **kwargs):
        self.value = value
        self.data = kwargs

    def get_compiler(self, *args, **kwargs):
        return self

    def as_subquery_condition(self, alias, columns, qn):
        sql = SQL.format(
            query=self.query.format(**self.data),
            **self.data
        )
        return '%s.%s IN (%s)' % (qn(alias), qn(self.data['pk']), sql), [self.value]


class ForeignKeyRecursiveInLookup(ForeignKeyRecursiveLookup):
    query = 'SELECT "{pk}" FROM "tree" WHERE %s && "related"'


class ForeignKeyRecursiveReverseLookup(ForeignKeyRecursiveLookup):
    query = 'SELECT unnest("related") FROM "tree" WHERE "{pk}" = %s'


class ForeignKeyRecursiveReverseInLookup(ForeignKeyRecursiveLookup):
    query = 'SELECT unnest("related") FROM "tree" WHERE "{pk}" IN %s'

If we were to use an existing view (created using a migration), then the structure would be largely the same: simply the SQL constant would be simpler:

SQL = 'SELECT {pk} FROM "{table}_{fk}_tree" WHERE {where}'

But then we would need some sort of name mangling for the view: I’ve suggested <tablename>_<fk-name>-tree.

I went into this exercise thinking it would be simple: just write a Lookup (or Transform), but it seems that Foreign Keys in django have a fair bit of special casing. There’s also a bit of lax code around the names of lookups: I may polish it up at some stage.

For now, though, you use it as:

class Node(models.Model):
    node_id = models.AutoField(primary_key=True)
    parent = RecursiveRelation(null=True, blank=True, related_name='children')

In-Database Audit Trail

I’ve been thinking about audit trails, object versioning and the like a bit lately. We have situations where it’s actually important to be able to know what changes were made, when, and by whom.

The tool we are currently using in Django for this is django-reversion, but I’m not that happy with it. Part of the problem is that it adds a fair amount of overhead into each request, which we have noticed, and also that querying the data it produces is not simple, unless you are just looking for changes made to a specific object.

For our case, we want to know all the changes made to a set of objects (that may not all be of the same type), made within a given range, where there is (eventually) a foreign key relationship back to our Company model.

Part of the problem is just that: a custom solution could store the Company relationship explicitly, but also, we really care about what the changes were, rather than what the object looks like after the changes. Well, technically, we want both.

However, it has gotten me thinking about other ways to solve this problem.

In most DBMSs, it’s possible to get a trigger function to run when an update happens to each row, which makes it possible to get access to this information. Indeed, googling “Postgres audit trigger” pops up some hits that are highly relevant. I looked at Audit trigger 91plus, since it’s “improved”. It has some really nice features like storing the audit data in an hstore column, which means you can query it.

However, one place where this type of trigger breaks down is that it’s not generally possible to get the “application user” associated with a request, only the “database user”, which for our system (and most other web applications) is fixed for all access.

One way to get around this might be to, at the start of every database transaction, inject a call that creates a temporary table, with the extra bits of data that you want to log, and then use that in the trigger function.

CREATE TEMP TABLE IF NOT EXISTS
  "_app_user" (user_id integer, ip_address inet);

Then we need to add (or update) the one row that will contain our data. We must ensure that we only ever have one row in this table.

UPDATE _app_user SET user_id=%s, ip_address=%s;
INSERT INTO _app_user (user_id, ip_address)
  SELECT %s, %s WHERE NOT EXISTS (SELECT * FROM _app_user);

This code will ensure that the first statement (UPDATE) will affect all rows in the table (of which there will be at most one), and the second statement (INSERT ... SELECT ... WHERE NOT EXISTS ...) will only create a new row if there are no rows currently in the table.

It’s up to you to then pass the correct data to this. I’m currently looking at doing this using Django middleware, although I suspect this may fall down using the newer transaction handling, as otherwise we could have just ensured our middleware ran after the TransactionMiddleware. It may be possible to do it with a custom database backend, but it needs to somehow get access to the request object (which contains the user, and the ip address). Obviously, you could log other data about the request, too.

The final part of the puzzle is to inject this data into the row that will be used for the audit table entry. I modified the table definition so it included columns for the data I wanted: app_user_id and app_ip_address.

Then, inside the actual trigger function, after the audit_row object has been created, but before it is written to the table, we inject the data we want.

We need to be a little careful, as it’s possible the table does not exist:

BEGIN
  PERFORM 
    n.nspname, c.relname 
  FROM
    pg_catalog.pg_class c 
  LEFT JOIN 
    pg_catalog.pg_namespace n
  ON n.oid = c.relnamespace
  WHERE
    n.nspname like 'pg_temp_%' 
  AND
    c.relname = '_app_user';

  IF FOUND THEN
    FOR r IN SELECT * FROM _app_user LIMIT 1 LOOP
      audit_row.app_user_id = r.user_id;
      audit_row.app_ip_address = r.ip_address;
    END LOOP;
    END IF;
END;

This checks to see if the _app_user table exists in any of the valid temporary table namespaces, and if so, grabs the first (and only, from above) entry, using the values to update the row.

This function then works: if there is a temporary table with this name, it uses these fields when creating the audit, if not, it creates the audit row with empty values. This would mean that some audit statements may not contain proper user data, but in the case of Django, it’s possible to make changes outside of the request-response cycle. You could require that a user starting a shell session authenticates with a valid django username+password, but that still leaves management commands. I guess you could have a system account, but leaving these entries blank is like an explicit system user.

I haven’t got any production code using anything like this: I’d still want to test that it works as expected as part of the request, and would want to build up some method of querying it. There’s probably no reason you couldn’t do the table definition as a Django model (and indeed, have the function definition as a migration).

Adding JSON(B) operators to PostgreSQL

This is a followup/replacement post for this older post. In it, I discussed a way to use PL/python to perform function (and in turn operations) on JSON data. Don’t do that, it’s way too slow.

It’s actually possible, using raw SQL (and some handy functions) to perform some of the operations that are “missing” from the JSON(B) datatypes in Postgres.

In all cases, I’ll work with just JSONB as the input/output formats. In practice, when I first wrote these, I wrote up to four versions of each (JSON+JSON, JSON+JSONB, JSONB+JSON, JSONB+JSONB). I believe it’s possible to write polymorphic versions of these functions, but I’m not that familiar with them just yet.

We’ll start with concatenation: basically joining two JSONB objects into one. This is a good one to start with, as operands must be either JSON or JSONB, so all forms of this function are the same, just with different functions or casting of operands.

Before I begin, I’ll mention a post I saw on Michael Paquier’s excellent blog: Manipulating jsonb data by abusing of key uniqueness. In it, Michael uses the json_object_agg function to build up a JSON object from a query:

CREATE FUNCTION "json_append" (jsonb, jsonb) RETURNS jsonb AS $$

WITH json_union AS
(
  SELECT * FROM jsonb_each($1)
  UNION ALL
  SELECT * FROM jsonb_each($2)
)
SELECT json_object_agg(key, value) FROM json_union;

$$ LANGUAGE SQL;

This seems like a good idea, however it actually performs around two orders of magnitude slower than just iterating over the objects and building them up using string_agg and ||:

CREATE FUNCTION "json_concatenate" (jsonb, jsonb) RETURNS jsonb AS $$

SELECT ('{' || string_add(to_json("key")::text || ':' ||"value", ',') || '}')::jsonb
FROM (
  SELECT * FROM jsonb_each($1) UNION ALL SELECT * jsonb_each($2)
);

$$ LANGUAGE SQL;

CREATE OPERATOR || (
  LEFTARG = jsonb,
  RIGHTARG = jsonb,
  PROCEDURE = jsonb_concatenate
);

It seems that this is not just because of the Common Table Expression (although, using a CTE does make the second function perform just as poorly).

=# SELECT * FROM BENCHMARK(10000,
  'jsonb_concatenate(''{"a": 1, "b":2}''::jsonb, ''{"a":2}''::jsonb)',
  'jsonb_append(''{"a": 1, "b":2}''::jsonb, ''{"a":2}''::jsonb)'
  );
           code              |  runtime   | corrected
-----------------------------+------------+------------
 [Control]                   | 0.00550699 |          0
 jsonb_concatenate(...)      | 0.00652981 | 0.00102282
 jsonb_append(...)           |   0.484099 |   0.478592

My attempt at reimplementing the json_object_agg aggregate function in SQL proved even slower. Not at all surprising.

The next one we will tackle is the - operator from Hstore.

hstore - text     : delete key from left operand
hstore - text[]   : delete keys from left operand
hstore - hstore   : delete matching pairs from left operand

We can reimplement these for JSON: first as functions, and then create operators using those if we want. What is interesting is that we use the same construction pattern for our output JSONB object, however, I don’t seem to be able to figure out how to extract this out into another function.

CREATE OR REPLACE FUNCTION "jsonb_subtract"(
  "json" jsonb,
  "remove" TEXT
)
  RETURNS jsonb
  LANGUAGE sql
  IMMUTABLE
  STRICT
AS $function$
SELECT CASE WHEN "json" ? "remove" THEN COALESCE(
  (SELECT ('{' || string_agg(to_json("key")::text || ':' || "value", ',') || '}')
     FROM jsonb_each("json") -- Until this function is added!
    WHERE "key" <> "remove"),
  '{}'
)::jsonb
ELSE "json"
END
$function$;

CREATE OPERATOR - (
  LEFTARG = jsonb,
  RIGHTARG = text,
  PROCEDURE = jsonb_subtract
);

You’ll notice that there’s a test for if the key to remove is in the object first: this should be much faster in the situation where it doesn’t appear, as we then don’t need to recreate the object.

The other forms are quite similar, but the WHERE clause varies, and the initial test varies or is removed:

CREATE OR REPLACE FUNCTION "jsonb_subtract"(
  "json" jsonb,
  "keys" TEXT[]
)
  RETURNS jsonb
  LANGUAGE sql
  IMMUTABLE
  STRICT
AS $function$
SELECT CASE WHEN "json" ?| "keys" THEN COALESCE(
  (SELECT ('{' || string_agg(to_json("key")::text || ':' || "value", ',') || '}')
     FROM jsonb_each("json")
    WHERE "key" <> ALL ("keys")),
  '{}'
)::jsonb
ELSE "json"
END
$function$;

CREATE OPERATOR - (
  LEFTARG = jsonb,
  RIGHTARG = text[],
  PROCEDURE = jsonb_subtract
);

CREATE OR REPLACE FUNCTION "jsonb_subtract"(
  "json" jsonb,
  "remove" jsonb
)
  RETURNS jsonb
  LANGUAGE sql
  IMMUTABLE
  STRICT
AS $function$
SELECT COALESCE(
  (
    SELECT ('{' || string_agg(to_json("key")::text || ':' || "value", ',') || '}')
    FROM jsonb_each("json")
    WHERE NOT
      ('{' || to_json("key")::text || ':' || "value" || '}')::jsonb <@ "remove"
      -- Note: updated using code from http://8kb.co.uk/blog/2015/01/16/wanting-for-a-hstore-style-delete-operator-in-jsonb/
  ),
  '{}'
)::jsonb
$function$;

CREATE OPERATOR - (
  LEFTARG = jsonb,
  RIGHTARG = jsonb,
  PROCEDURE = jsonb_subtract
);

There is also the #= operator for hstore: but this seems to be record #= hstore, rather than hstore #= hstore. I’m not sure how to implement this, but I can implement a jsonb #= jsonb function:

CREATE OR REPLACE FUNCTION "jsonb_update_only_if_present"(
  "json" jsonb,
  "other" jsonb
)
  RETURNS jsonb
  LANGUAGE sql
  IMMUTABLE
  STRICT
AS $function$

SELECT COALESCE(
  (SELECT ('{' || string_agg(to_json("key")::text || ':' || "value", ',') || '}')
     FROM (SELECT * FROM jsonb_each("json") UNION ALL SELECT * FROM jsonb_each("other")) AS a
     WHERE "json" ? "key"::text
  ),
  '{}'
)::jsonb
$function$;

CREATE OPERATOR #= (
  LEFTARG = jsonb,
  RIGHTARG = jsonb,
  PROCEDURE = jsonb_update_only_if_present
);

Finally, there are operators %% (convert to array of alternating key/value pairs), and %# (convert to 2-dimensional array of keys, values).

I haven’t figured out a way to create these yet either.


So, that’s what I’ve got so far. Obviously, these will be slower than pure C implementations however, we can run some benchmarks against the hstore operators for comparisons.

=# SELECT * FROM BENCHMARK(100000,
   ' ''{"a": 1, "b":2}''::jsonb || ''{"a":2}''::jsonb',
   ' ''a=>1, b=>2''::hstore || ''a=>2''::hstore'
   );
 
                     code                      |  runtime  |  corrected
-----------------------------------------------+-----------+-------------
 [Control]                                     | 0.0837061 |           0
  '{"a": 1, "b":2}'::jsonb || '{"a":2}'::jsonb | 0.0842681 | 0.000561953
  'a=>1, b=>2'::hstore || 'a=>2'::hstore       | 0.0843148 | 0.000608683
(3 rows)

Whoa. That’s actually pretty good!

And similar for subtract:

=# SELECT * FROM BENCHMARK(100000,
  ' ''{"a": 1, "b":2}''::jsonb - ''a''::text ',
  ' ''a=>1, b=>2''::hstore - ''a''::text '
);
                  code                  |  runtime  | corrected
----------------------------------------+-----------+------------
 [Control]                              | 0.0818689 |          0
  '{"a": 1, "b":2}'::jsonb - 'a'::text  |  0.083431 | 0.00156212
  'a=>1, b=>2'::hstore - 'a'::text      |  0.083159 | 0.00129008
(3 rows)

I might have to run these with some more complicated queries, and compare the results.