Best match in hierarchy using Postgres

The other day, I had an issue where I needed to prevent duplicates in a subset of nullable columns for the purpose of modelling different regional breakdowns; I came up with quite a neat solution that uses JSONB functions.

What I didn’t realise until later was how this same function can be used to detect the “best match” of an object within these rows.

In this case, the best match is the row that has the most specificity (i.e., the highest number of non-null values), but does not have any values that differ from those in the target object.

For instance, given the data (and the function defined in the aforementioned post):

{"country": "AU"}
{"country": "AU", "state": "SA"}
{"country": "NZ"}

We would want an object that is within Australia (AU) to match the first row, unless it is also within South Australia (SA), in which case it should match the second row.

We can use the JSONB containment operator (<@) to query this match:

SELECT geo_unique_indexer(geo_table, 'id')
  FROM geo_table
 WHERE geo_unique_indexer(geo_table, 'id') <@ '{"country": "AU", "state": "SA"}';

Okay, that’s really close: it shows all “parent” data:

{"country": "AU"}
{"country": "AU", "state": "SA"}

So, we need to ensure that only the most specific one of these is actually returned:

SELECT geo_unique_indexer(geo_table, 'id')
  FROM geo_table
 WHERE geo_unique_indexer(geo_table, 'id') <@ '{"country": "AU", "state": "SA"}'
 ORDER BY (SELECT count(*) FROM jsonb_object_keys(geo_unique_indexer(geo_table, 'id'))) DESC
 LIMIT 1;

And, presto, we have our best match:

{"country": "AU", "state": "SA"}

Postgres multi-column unique index

Consider something that requires a “best match” facility, based on the geographic location. There may be a set of behaviours that apply when an object is in Australia, and a different set within the USA. If this is all you have, then it’s simple to just have those behaviours based on a single country column in your database.

But now consider that in some circumstances, it’s possible to have behaviours within a single state of Australia that should override the nationwide behaviours. For instance, Western Australia may just use the Australian rules, but South Australia has a custom set of rules. Likewise, California may use a custom set of rules, but Hawaii just uses the inherited USA rules.

To ensure that a consistent set of rules is applied, there could only be a single “Australia” rule set, but there could also be an “Australia+South Australia” set stored. This could also be modelled, in a slightly more complicated fashion, by having two columns, country and state, and requiring country to be unique when state is NULL, and country and state to be unique together.
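
In Postgres, that two-column version could be handled with a pair of unique indexes, one of them partial (a sketch: the rule_set table name is made up for illustration):

-- Only one nationwide (state IS NULL) row per country.
CREATE UNIQUE INDEX rule_set_country_only
    ON rule_set (country) WHERE state IS NULL;

-- Only one row per country/state combination.
CREATE UNIQUE INDEX rule_set_country_state
    ON rule_set (country, state);

The partial index is needed because a regular unique index treats NULLs as distinct, so on its own it wouldn’t stop two country-only rows for the same country.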

Now, consider that there could be a whole stack more levels, and some of these levels have different names (and different positions in the hierarchy) in different countries. For instance, New Zealand has regions, while Great Britain has nations (England, Northern Ireland, Scotland and Wales) as well as counties (or council areas in Scotland).

Django Localflavor has a whole bunch of these divisions, but there are only nine “names” for these:

  • nation
  • state
  • province
  • provincial district
  • prefecture
  • region
  • county
  • department
  • municipality
So, we can model our table with a column for each of these, plus the country:

CREATE TABLE geo_table (
  id SERIAL PRIMARY KEY,
  country VARCHAR(2) NOT NULL,
  county TEXT,
  department TEXT,
  municipality TEXT,
  nation TEXT,
  prefecture TEXT,
  province TEXT,
  provincial_district TEXT,
  region TEXT,
  state TEXT
);

We’ll also insert a couple of rows to play with:

INSERT INTO geo_table(country, state)
VALUES ('AU', NULL), ('AU', 'SA');

So, to prevent our database from containing data that would make it impossible to determine which rule set applies to a given object, we want the following:

A unique constraint that applies to all columns that are not NULL

It could be possible to model this with a number of different unique constraints, but that doesn’t seem like heaps of fun to deal with.

However, Postgres allows us to define “functional” indices: that is, indexes that apply a function to some columns from the row, and store the result as the indexed value.

We can write an expression that takes the whole row, turns it into JSON, removes the primary key column and any columns that are NULL, and uses the result as the indexed value.

SELECT jsonb_strip_nulls(
         to_jsonb(geo_table) - 'id'::TEXT
       )
  FROM geo_table;
  {"country": "AU"}
  {"country": "AU", "state": "SA"}

So, that looks pretty good, but what’s going on here?

In this case, we take the row, and turn it into a JSONB object, remove the primary key (in this case, called id), and then remove any keys that have a NULL value.

Now, we can’t just use this expression in an index, because to_jsonb (and row_to_json) are not IMMUTABLE: the value they produce for a TIMESTAMPTZ, for instance, depends on session settings that may change. However, our data will only contain strings, which are reliably turned into the same JSON value regardless of any settings, so we can wrap the expression in a function that we declare IMMUTABLE:

CREATE OR REPLACE FUNCTION geo_unique_indexer(anyelement, pk TEXT)
RETURNS jsonb AS $$

    SELECT jsonb_strip_nulls(to_jsonb($1) - pk);

$$
LANGUAGE SQL IMMUTABLE;

You’ll notice that I had to use anyelement, as it’s not possible to have an SQL function refer to a record type. I’ve also made the primary key column name a parameter, so it doesn’t depend on the column being called “id”.
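
As a quick sanity check, calling it directly gives the same values as the raw expression above:

SELECT geo_unique_indexer(geo_table, 'id') FROM geo_table;
  {"country": "AU"}
  {"country": "AU", "state": "SA"}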

Finally, we can create the index:

CREATE UNIQUE INDEX geo_unique_match ON geo_table(
  geo_unique_indexer(geo_table, 'id')
);

Okay, that looks good. Now we can try to insert some values that should not be permitted. We already have a row with just Australia:

INSERT INTO geo_table(country, state) VALUES ('AU', NULL);
ERROR:  duplicate key value violates unique constraint "geo_unique_match"
DETAIL:  Key (geo_unique_indexer(geo_table.*, 'id'::text))=({"country": "AU"}) already exists.

…and another row with both Australia and SA.

INSERT INTO geo_table(country, state) VALUES ('AU', 'SA');
ERROR:  duplicate key value violates unique constraint "geo_unique_match"
DETAIL:  Key (geo_unique_indexer(geo_table.*, 'id'::text))=({"country": "AU", "state": "SA"}) already exists.

Finally, this one should succeed:

INSERT INTO geo_table(country, state) VALUES ('AU', 'NSW');

Excellent.

More JSONB querying

Occasionally, I get emails from people regarding specific queries in Postgres, usually because I have blogged about JSONB querying before.

Today, I got one: rather than just reply, I thought I’d blog about how queries could be written to solve this problem.

For the purposes of this, our table can be a single column of JSONB data.

CREATE TABLE priority (data JSONB);

We also need a bit of data to query:

INSERT INTO priority (data) VALUES (
'{
  "id": "02e32a14-904c-4153-a32b-fe8d1f1bbbe1",
  "entity": "activity",
  "fields": {
    "subject": [
      {"val": "Subject", "priority": 7}
    ]
  },
  "recordStatusType": "active"
}'), (
'{
  "id": "b33498b2-32f6-4575-b2cd-9e9a1ae2059d",
  "entity": "activity",
  "fields": {
    "subject": [
      {"val": "Subject", "priority": 4}
    ]
  },
  "recordStatusType": "active"
}'), (
'{
  "id": "a2d327d2-7668-4dc0-ae1d-d6144130e3ec",
  "entity": "activity",
  "fields": {
    "object": [],
    "subject": [
      {"val": "Object", "priority": 1},
      {"val": "Target", "priority": 7}
    ]
  }
}'), (
'{
  "id": "3bc8b536-00af-4fc7-881e-b88b620ac436",
  "entity": "activity",
  "fields": {
    "object": [
      {"val": "Object", "priority": 9}
    ]
  }
}'
);

The problem requires selection of the data rows where priority is greater than 5.

I’ve extended the data provided: I’m not sure if there will be multiple “fields”, but I assume so. I also assume that a match for any priority within a subject field will be required.

Let’s start with a simpler version: get the records where the first fields->subject priority is greater than 5 (I’ll return just the id, to keep things simple):

SELECT data->'id'
  FROM priority
 WHERE (data#>>'{fields,subject,0,priority}')::INTEGER > 5;

 "02e32a14-904c-4153-a32b-fe8d1f1bbbe1"

This uses the #>> operator, which does a path lookup and returns a string value that we then cast to an integer for the comparison. Note that the path lookup differs from normal Postgres array indexing, in that it uses 0 as the first index, rather than 1.
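
A quick illustration of that difference (just for reference, not part of the solution):

-- JSONB path subscripts are zero-based:
SELECT '["first", "second"]'::jsonb #>> '{0}';  -- first
-- SQL array subscripts are one-based (by default):
SELECT (ARRAY['first', 'second'])[1];           -- first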

But, we want to query for all rows where any subject field has a priority greater than 5.

We’ll want to use jsonb_array_elements (which is the JSONB equivalent of unnest). We can use that to get the subject entries themselves:

SELECT jsonb_array_elements(data#>'{fields,subject}') FROM priority;

Note this uses the #> operator, because we still want JSONB data:

       jsonb_array_elements
──────────────────────────────────
 {"val": "Subject", "priority": 7}
 {"val": "Subject", "priority": 4}
 {"val": "Object", "priority": 1}
 {"val": "Target", "priority": 7}
(4 rows)

We can get a bit further too:

SELECT jsonb_array_elements(data#>'{fields,subject}')->'priority' FROM priority;

Indeed, we can get all the way to our boolean test:

SELECT (jsonb_array_elements(data#>'{fields,subject}')->>'priority')::INTEGER > 5 FROM priority;
 ?column?
─────────
 t
 f
 f
 t
(4 rows)

But we want the data rows themselves, not just the matching subject fields, so this on its own is not that useful. Instead, we can use the fact that jsonb_array_elements returns a set, and use it as a subquery in our WHERE clause, via the expression operator ANY (subquery) construct:

SELECT data->'id'
  FROM priority
 WHERE 5 < ANY(SELECT (jsonb_array_elements(data#>'{fields,subject}')->>'priority')::INTEGER);

This means that we want only the records where 5 is less than any of the priority values in subject fields.

                ?column?
────────────────────────────────────────
 "02e32a14-904c-4153-a32b-fe8d1f1bbbe1"
 "a2d327d2-7668-4dc0-ae1d-d6144130e3ec"

I hope this helps, Paulo!

Adding JSON(B) operators to PostgreSQL

This is a followup/replacement post for this older post. In it, I discussed a way to use PL/Python to implement functions (and, in turn, operators) on JSON data. Don’t do that: it’s way too slow.

It’s actually possible, using raw SQL (and some handy functions) to perform some of the operations that are “missing” from the JSON(B) datatypes in Postgres.

In all cases, I’ll work with just JSONB as the input/output formats. In practice, when I first wrote these, I wrote up to four versions of each (JSON+JSON, JSON+JSONB, JSONB+JSON, JSONB+JSONB). I believe it’s possible to write polymorphic versions of these functions, but I’m not that familiar with them just yet.

We’ll start with concatenation: basically joining two JSONB objects into one. This is a good one to start with, as operands must be either JSON or JSONB, so all forms of this function are the same, just with different functions or casting of operands.

Before I begin, I’ll mention a post I saw on Michael Paquier’s excellent blog: Manipulating jsonb data by abusing of key uniqueness. In it, Michael uses the json_object_agg function to build up a JSON object from a query:

CREATE FUNCTION "json_append" (jsonb, jsonb) RETURNS jsonb AS $$

WITH json_union AS
(
  SELECT * FROM jsonb_each($1)
  UNION ALL
  SELECT * FROM jsonb_each($2)
)
SELECT json_object_agg(key, value) FROM json_union;

$$ LANGUAGE SQL;

This seems like a good idea; however, it actually performs around two orders of magnitude slower than just iterating over the objects and building them up using string_agg and ||:

CREATE FUNCTION "json_concatenate" (jsonb, jsonb) RETURNS jsonb AS $$

SELECT ('{' || string_add(to_json("key")::text || ':' ||"value", ',') || '}')::jsonb
FROM (
  SELECT * FROM jsonb_each($1) UNION ALL SELECT * jsonb_each($2)
);

$$ LANGUAGE SQL;

CREATE OPERATOR || (
  LEFTARG = jsonb,
  RIGHTARG = jsonb,
  PROCEDURE = jsonb_concatenate
);

It seems that this is not just because of the Common Table Expression (although, using a CTE does make the second function perform just as poorly).

=# SELECT * FROM BENCHMARK(10000,
  'jsonb_concatenate(''{"a": 1, "b":2}''::jsonb, ''{"a":2}''::jsonb)',
  'jsonb_append(''{"a": 1, "b":2}''::jsonb, ''{"a":2}''::jsonb)'
  );
           code              |  runtime   | corrected
-----------------------------+------------+------------
 [Control]                   | 0.00550699 |          0
 jsonb_concatenate(...)      | 0.00652981 | 0.00102282
 jsonb_append(...)           |   0.484099 |   0.478592

My attempt at reimplementing the json_object_agg aggregate function in SQL proved even slower. Not at all surprising.
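
For reference, one way a pure-SQL reimplementation could look is as a custom aggregate (a sketch with made-up names: it accumulates the pairs as text, only parses the result into jsonb at the end, and doesn’t handle NULL values):

CREATE FUNCTION jsonb_object_agg_transition(state text, "key" text, "value" jsonb)
RETURNS text AS $$
  -- Append one key/value pair to the accumulated object body.
  SELECT CASE
    WHEN state IS NULL THEN to_json("key")::text || ':' || "value"
    ELSE state || ',' || to_json("key")::text || ':' || "value"
  END;
$$ LANGUAGE SQL IMMUTABLE;

CREATE FUNCTION jsonb_object_agg_final(state text)
RETURNS jsonb AS $$
  -- Wrap the accumulated pairs in braces and parse them as jsonb.
  SELECT ('{' || COALESCE(state, '') || '}')::jsonb;
$$ LANGUAGE SQL IMMUTABLE;

CREATE AGGREGATE jsonb_object_agg_sql(text, jsonb) (
  SFUNC = jsonb_object_agg_transition,
  STYPE = text,
  FINALFUNC = jsonb_object_agg_final
);

Used as, say, SELECT jsonb_object_agg_sql("key", "value") FROM jsonb_each('{"a": 1, "b": 2}'::jsonb). Every transition step builds a new text value, which goes some way towards explaining why it’s slow.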

The next one we will tackle is the - operator from Hstore.

hstore - text     : delete key from left operand
hstore - text[]   : delete keys from left operand
hstore - hstore   : delete matching pairs from left operand

We can reimplement these for JSON: first as functions, and then create operators using those if we want. What is interesting is that we use the same construction pattern for our output JSONB object each time; however, I can’t seem to figure out how to extract that pattern into another function.

CREATE OR REPLACE FUNCTION "jsonb_subtract"(
  "json" jsonb,
  "remove" TEXT
)
  RETURNS jsonb
  LANGUAGE sql
  IMMUTABLE
  STRICT
AS $function$
SELECT CASE WHEN "json" ? "remove" THEN COALESCE(
  (SELECT ('{' || string_agg(to_json("key")::text || ':' || "value", ',') || '}')
     FROM jsonb_each("json") -- Until this function is added!
    WHERE "key" <> "remove"),
  '{}'
)::jsonb
ELSE "json"
END
$function$;

CREATE OPERATOR - (
  LEFTARG = jsonb,
  RIGHTARG = text,
  PROCEDURE = jsonb_subtract
);

You’ll notice that there’s a test for whether the key to remove is present in the object first: this should be much faster in the situation where it doesn’t appear, as we then don’t need to recreate the object.
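
For reference, ? and its array form ?| (used in the next function) are the JSONB key-existence operators:

SELECT '{"a": 1, "b": 2}'::jsonb ? 'a';               -- true
SELECT '{"a": 1, "b": 2}'::jsonb ? 'c';               -- false
SELECT '{"a": 1, "b": 2}'::jsonb ?| ARRAY['c', 'b'];  -- true, if any key matches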

The other forms are quite similar, but the WHERE clause varies, and the initial test varies or is removed:

CREATE OR REPLACE FUNCTION "jsonb_subtract"(
  "json" jsonb,
  "keys" TEXT[]
)
  RETURNS jsonb
  LANGUAGE sql
  IMMUTABLE
  STRICT
AS $function$
SELECT CASE WHEN "json" ?| "keys" THEN COALESCE(
  (SELECT ('{' || string_agg(to_json("key")::text || ':' || "value", ',') || '}')
     FROM jsonb_each("json")
    WHERE "key" <> ALL ("keys")),
  '{}'
)::jsonb
ELSE "json"
END
$function$;

CREATE OPERATOR - (
  LEFTARG = jsonb,
  RIGHTARG = text[],
  PROCEDURE = jsonb_subtract
);

CREATE OR REPLACE FUNCTION "jsonb_subtract"(
  "json" jsonb,
  "remove" jsonb
)
  RETURNS jsonb
  LANGUAGE sql
  IMMUTABLE
  STRICT
AS $function$
SELECT COALESCE(
  (
    SELECT ('{' || string_agg(to_json("key")::text || ':' || "value", ',') || '}')
    FROM jsonb_each("json")
    WHERE NOT
      ('{' || to_json("key")::text || ':' || "value" || '}')::jsonb <@ "remove"
      -- Note: updated using code from http://8kb.co.uk/blog/2015/01/16/wanting-for-a-hstore-style-delete-operator-in-jsonb/
  ),
  '{}'
)::jsonb
$function$;

CREATE OPERATOR - (
  LEFTARG = jsonb,
  RIGHTARG = jsonb,
  PROCEDURE = jsonb_subtract
);

There is also the #= operator for hstore: but this seems to be record #= hstore, rather than hstore #= hstore. I’m not sure how to implement this, but I can implement a jsonb #= jsonb function:

CREATE OR REPLACE FUNCTION "jsonb_update_only_if_present"(
  "json" jsonb,
  "other" jsonb
)
  RETURNS jsonb
  LANGUAGE sql
  IMMUTABLE
  STRICT
AS $function$

SELECT COALESCE(
  (SELECT ('{' || string_agg(to_json("key")::text || ':' || "value", ',') || '}')
     FROM (SELECT * FROM jsonb_each("json") UNION ALL SELECT * FROM jsonb_each("other")) AS a
     WHERE "json" ? "key"::text
  ),
  '{}'
)::jsonb
$function$;

CREATE OPERATOR #= (
  LEFTARG = jsonb,
  RIGHTARG = jsonb,
  PROCEDURE = jsonb_update_only_if_present
);

Finally, there are operators %% (convert to array of alternating key/value pairs), and %# (convert to 2-dimensional array of keys, values).

I haven’t figured out a way to create these yet either.
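
For the %% case, though, something along these lines might work (a sketch: the jsonb_to_kv_array name is made up, and it assumes the pairs come back from jsonb_each_text in key-then-value order):

CREATE OR REPLACE FUNCTION "jsonb_to_kv_array"("json" jsonb)
  RETURNS text[]
  LANGUAGE sql
  IMMUTABLE
  STRICT
AS $function$
SELECT COALESCE(array_agg(item), '{}'::text[])
  FROM (
    -- Flatten each pair into two alternating array elements.
    SELECT unnest(ARRAY["key", "value"]) AS item
      FROM jsonb_each_text("json")
  ) AS pairs
$function$;

CREATE OPERATOR %% (
  RIGHTARG = jsonb,
  PROCEDURE = jsonb_to_kv_array
);

The %# version would need to build a two-dimensional array instead, which I haven’t attempted here.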


So, that’s what I’ve got so far. Obviously, these will be slower than pure C implementations; however, we can run some benchmarks against the hstore operators for comparison.

=# SELECT * FROM BENCHMARK(100000,
   ' ''{"a": 1, "b":2}''::jsonb || ''{"a":2}''::jsonb',
   ' ''a=>1, b=>2''::hstore || ''a=>2''::hstore'
   );
 
                     code                      |  runtime  |  corrected
-----------------------------------------------+-----------+-------------
 [Control]                                     | 0.0837061 |           0
  '{"a": 1, "b":2}'::jsonb || '{"a":2}'::jsonb | 0.0842681 | 0.000561953
  'a=>1, b=>2'::hstore || 'a=>2'::hstore       | 0.0843148 | 0.000608683
(3 rows)

Whoa. That’s actually pretty good!

And similar for subtract:

=# SELECT * FROM BENCHMARK(100000,
  ' ''{"a": 1, "b":2}''::jsonb - ''a''::text ',
  ' ''a=>1, b=>2''::hstore - ''a''::text '
);
                  code                  |  runtime  | corrected
----------------------------------------+-----------+------------
 [Control]                              | 0.0818689 |          0
  '{"a": 1, "b":2}'::jsonb - 'a'::text  |  0.083431 | 0.00156212
  'a=>1, b=>2'::hstore - 'a'::text      |  0.083159 | 0.00129008
(3 rows)

I might have to run these with some more complicated queries, and compare the results.

Python, postgres and jsonb

I maintain a json field for django, and was working today on getting the new (1.7+) lookup code to play nicely: in order for this to happen, you basically need to be running Postgres 9.4, and using a jsonb column. Otherwise, querying kind-of sucks.

After a significant amount of work, where I drifted backwards and forwards between having the old and the new code working, I had an idea.

Some time ago I discovered that psycopg2 has really nice support for some custom types. Indeed, it’s super-easy to get it to handle UUID and json data. But it seems that it hasn’t yet been made to work with jsonb.

However, the registration process for handling the data makes it possible, and indeed trivial, to do so, since the serialised form will be essentially identical for both:

psycopg2.extras.register_json(
    conn_or_curs=None,
    globally=False,
    loads=None,
    oid=None,
    array_oid=None
)

Note the last two arguments. We can trick psycopg2 into using jsonb instead of json.

In your database, execute:

SELECT oid, typarray FROM pg_type WHERE typname = 'jsonb';
-- oid      --> 3802
-- typarray --> 3807

Your values may vary (I’m really not sure), but you’ll simply need to call register_json with those two values as oid and array_oid:

register_json(oid=3802, array_oid=3807)

Now, assuming you have a jsonb column, when you fetch data from it, it will already be turned into python objects.

Python 2.7.5 (default, Mar  9 2014, 22:15:05) 
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import psycopg2
>>> conn = psycopg2.connect("")
>>> cur = conn.cursor()
>>> cur.execute("SELECT * FROM jsonb_test;")
>>> data = cur.fetchone()
>>> data
(1, '{}')
>>> from psycopg2.extras import register_json
>>> register_json(oid=3802, array_oid=3807)
(<psycopg2._psycopg.type 'JSON' at 0x101713418>, <psycopg2._psycopg.type 'JSONARRAY' at 0x101721208>)
>>> cur.execute("SELECT * FROM jsonb_test;")
>>> data = cur.fetchone()
>>> data
(1, {})