Adding JSON(B) operators to PostgreSQL

This is a followup/replacement post for this older post. In it, I discussed a way to use PL/Python to write functions (and, in turn, operators) for JSON data. Don’t do that: it’s way too slow.

It’s actually possible, using raw SQL (and some handy functions) to perform some of the operations that are “missing” from the JSON(B) datatypes in Postgres.

In all cases, I’ll work with just JSONB as the input/output formats. In practice, when I first wrote these, I wrote up to four versions of each (JSON+JSON, JSON+JSONB, JSONB+JSON, JSONB+JSONB). I believe it’s possible to write polymorphic versions of these functions, but I’m not that familiar with them just yet.

We’ll start with concatenation: basically, joining two JSONB objects into one. This is a good one to start with, as both operands must be either JSON or JSONB, so all forms of the function look the same, just with different inner functions or casts applied to the operands.

Before I begin, I’ll mention a post I saw on Michael Paquier’s excellent blog: Manipulating jsonb data by abusing of key uniqueness. In it, Michael uses the json_object_agg function to build up a JSON object from a query:

CREATE FUNCTION "json_append" (jsonb, jsonb) RETURNS jsonb AS $$

WITH json_union AS
(
  SELECT * FROM jsonb_each($1)
  UNION ALL
  SELECT * FROM jsonb_each($2)
)
SELECT json_object_agg(key, value)::jsonb FROM json_union;

$$ LANGUAGE SQL;

This seems like a good idea; however, it actually performs around two orders of magnitude slower than just iterating over the objects and building up the result using string_agg and ||:

CREATE FUNCTION "json_concatenate" (jsonb, jsonb) RETURNS jsonb AS $$

SELECT ('{' || string_add(to_json("key")::text || ':' ||"value", ',') || '}')::jsonb
FROM (
  SELECT * FROM jsonb_each($1) UNION ALL SELECT * jsonb_each($2)
);

$$ LANGUAGE SQL;

CREATE OPERATOR || (
  LEFTARG = jsonb,
  RIGHTARG = jsonb,
  PROCEDURE = jsonb_concatenate
);
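
With the operator in place, concatenation looks like this (for a duplicated key, the right-hand value wins, because jsonb keeps the last value seen for a repeated key):

SELECT '{"a": 1, "b": 2}'::jsonb || '{"a": 2}'::jsonb;   -- {"a": 2, "b": 2}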

It seems that this is not just because of the Common Table Expression (although, rewriting the second function to use a CTE does make it perform just as poorly).
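
(For reference, that CTE-based rewrite looks something like this; the name jsonb_concatenate_cte is just for illustration.)

CREATE FUNCTION "jsonb_concatenate_cte" (jsonb, jsonb) RETURNS jsonb AS $$

WITH fields AS
(
  SELECT * FROM jsonb_each($1)
  UNION ALL
  SELECT * FROM jsonb_each($2)
)
SELECT ('{' || string_agg(to_json("key")::text || ':' || "value", ',') || '}')::jsonb
FROM fields;

$$ LANGUAGE SQL;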

=# SELECT * FROM BENCHMARK(10000,
  'jsonb_concatenate(''{"a": 1, "b":2}''::jsonb, ''{"a":2}''::jsonb)',
  'jsonb_append(''{"a": 1, "b":2}''::jsonb, ''{"a":2}''::jsonb)'
  );
           code              |  runtime   | corrected
-----------------------------+------------+------------
 [Control]                   | 0.00550699 |          0
 jsonb_concatenate(...)      | 0.00652981 | 0.00102282
 jsonb_append(...)           |   0.484099 |   0.478592

My attempt at reimplementing the json_object_agg aggregate function in SQL proved even slower. Not at all surprising.
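
A hand-rolled SQL aggregate along those lines might look something like the following sketch (all names here are invented for illustration): accumulate the pairs as text, and only cast to jsonb in the final function.

CREATE FUNCTION "jsonb_object_agg2_state" (state text, "key" text, "value" jsonb)
RETURNS text LANGUAGE sql IMMUTABLE AS $$
  -- Append one key/value pair to the text being accumulated.
  SELECT CASE
    WHEN state IS NULL THEN to_json("key")::text || ':' || "value"
    ELSE state || ',' || to_json("key")::text || ':' || "value"
  END;
$$;

CREATE FUNCTION "jsonb_object_agg2_final" (state text)
RETURNS jsonb LANGUAGE sql IMMUTABLE AS $$
  -- Wrap the accumulated pairs in braces; an empty aggregate yields '{}'.
  SELECT ('{' || COALESCE(state, '') || '}')::jsonb;
$$;

CREATE AGGREGATE "jsonb_object_agg2" (text, jsonb) (
  SFUNC = jsonb_object_agg2_state,
  STYPE = text,
  FINALFUNC = jsonb_object_agg2_final
);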

The next one we will tackle is the - operator from Hstore.

hstore - text     : delete key from left operand
hstore - text[]   : delete keys from left operand
hstore - hstore   : delete matching pairs from left operand
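
For reference, this is the behaviour we are copying (assuming the hstore extension is installed):

SELECT 'a=>1, b=>2'::hstore - 'a'::text;               -- "b"=>"2"
SELECT 'a=>1, b=>2, c=>3'::hstore - ARRAY['a', 'b'];   -- "c"=>"3"
SELECT 'a=>1, b=>2'::hstore - 'a=>1, b=>3'::hstore;    -- "b"=>"2"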

We can reimplement these for JSON: first as functions, and then create operators using those if we want. What is interesting is that we use the same construction pattern for our output JSONB object each time; however, I haven’t been able to figure out how to extract that pattern out into a separate function.

CREATE OR REPLACE FUNCTION "jsonb_subtract"(
  "json" jsonb,
  "remove" TEXT
)
  RETURNS jsonb
  LANGUAGE sql
  IMMUTABLE
  STRICT
AS $function$
SELECT CASE WHEN "json" ? "remove" THEN COALESCE(
  (SELECT ('{' || string_agg(to_json("key")::text || ':' || "value", ',') || '}')
     FROM jsonb_each("json") -- Until this function is added!
    WHERE "key" <> "remove"),
  '{}'
)::jsonb
ELSE "json"
END
$function$;

CREATE OPERATOR - (
  LEFTARG = jsonb,
  RIGHTARG = text,
  PROCEDURE = jsonb_subtract
);

You’ll notice that there’s a test to see if the key to remove is present in the object first: this should be much faster in the situation where it doesn’t appear, as we then don’t need to rebuild the object.
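
For example, once the function and operator above are in place:

SELECT '{"a": 1, "b": 2}'::jsonb - 'a'::text;   -- {"b": 2}
SELECT '{"a": 1, "b": 2}'::jsonb - 'c'::text;   -- {"a": 1, "b": 2}  (unchanged, no rebuild)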

The other forms are quite similar, but the WHERE clause varies, and the initial test varies or is removed:

CREATE OR REPLACE FUNCTION "jsonb_subtract"(
  "json" jsonb,
  "keys" TEXT[]
)
  RETURNS jsonb
  LANGUAGE sql
  IMMUTABLE
  STRICT
AS $function$
SELECT CASE WHEN "json" ?| "keys" THEN COALESCE(
  (SELECT ('{' || string_agg(to_json("key")::text || ':' || "value", ',') || '}')
     FROM jsonb_each("json")
    WHERE "key" <> ALL ("keys")),
  '{}'
)::jsonb
ELSE "json"
END
$function$;

CREATE OPERATOR - (
  LEFTARG = jsonb,
  RIGHTARG = text[],
  PROCEDURE = jsonb_subtract
);

CREATE OR REPLACE FUNCTION "jsonb_subtract"(
  "json" jsonb,
  "remove" jsonb
)
  RETURNS jsonb
  LANGUAGE sql
  IMMUTABLE
  STRICT
AS $function$
SELECT COALESCE(
  (
    SELECT ('{' || string_agg(to_json("key")::text || ':' || "value", ',') || '}')
    FROM jsonb_each("json")
    WHERE NOT
      ('{' || to_json("key")::text || ':' || "value" || '}')::jsonb <@ "remove"
      -- Note: updated using code from http://8kb.co.uk/blog/2015/01/16/wanting-for-a-hstore-style-delete-operator-in-jsonb/
  ),
  '{}'
)::jsonb
$function$;

CREATE OPERATOR - (
  LEFTARG = jsonb,
  RIGHTARG = jsonb,
  PROCEDURE = jsonb_subtract
);
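
For example, with the array and jsonb forms in place:

SELECT '{"a": 1, "b": 2, "c": 3}'::jsonb - ARRAY['a', 'b'];    -- {"c": 3}
SELECT '{"a": 1, "b": 2}'::jsonb - '{"a": 1, "b": 3}'::jsonb;  -- {"b": 2}

In the second query, only the exactly-matching "a": 1 pair is deleted; "b" survives because its value differs.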

There is also the #= operator for hstore, but that one is record #= hstore, rather than hstore #= hstore. I’m not sure how to implement the record version, but I can implement a jsonb #= jsonb function:

CREATE OR REPLACE FUNCTION "jsonb_update_only_if_present"(
  "json" jsonb,
  "other" jsonb
)
  RETURNS jsonb
  LANGUAGE sql
  IMMUTABLE
  STRICT
AS $function$

SELECT COALESCE(
  (SELECT ('{' || string_agg(to_json("key")::text || ':' || "value", ',') || '}')
     FROM (SELECT * FROM jsonb_each("json") UNION ALL SELECT * FROM jsonb_each("other")) AS a
     WHERE "json" ? "key"::text
  ),
  '{}'
)::jsonb
$function$;

CREATE OPERATOR #= (
  LEFTARG = jsonb,
  RIGHTARG = jsonb,
  PROCEDURE = jsonb_update_only_if_present
);
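
Used like this (it relies on the right-hand pair winning when the duplicated key is cast to jsonb):

SELECT '{"a": 1, "b": 2}'::jsonb #= '{"a": 3, "c": 4}'::jsonb;
-- {"a": 3, "b": 2}  ("c" is ignored, because it is not present in the left operand)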

Finally, there are operators %% (convert to array of alternating key/value pairs), and %# (convert to 2-dimensional array of keys, values).

I haven’t figured out a way to create these yet either.


So, that’s what I’ve got so far. Obviously, these will be slower than pure C implementations; however, we can run some benchmarks against the hstore operators for comparison.

=# SELECT * FROM BENCHMARK(100000,
   ' ''{"a": 1, "b":2}''::jsonb || ''{"a":2}''::jsonb',
   ' ''a=>1, b=>2''::hstore || ''a=>2''::hstore'
   );
 
                     code                      |  runtime  |  corrected
-----------------------------------------------+-----------+-------------
 [Control]                                     | 0.0837061 |           0
  '{"a": 1, "b":2}'::jsonb || '{"a":2}'::jsonb | 0.0842681 | 0.000561953
  'a=>1, b=>2'::hstore || 'a=>2'::hstore       | 0.0843148 | 0.000608683
(3 rows)

Whoa. That’s actually pretty good!

And similar for subtract:

=# SELECT * FROM BENCHMARK(100000,
  ' ''{"a": 1, "b":2}''::jsonb - ''a''::text ',
  ' ''a=>1, b=>2''::hstore - ''a''::text '
);
                  code                  |  runtime  | corrected
----------------------------------------+-----------+------------
 [Control]                              | 0.0818689 |          0
  '{"a": 1, "b":2}'::jsonb - 'a'::text  |  0.083431 | 0.00156212
  'a=>1, b=>2'::hstore - 'a'::text      |  0.083159 | 0.00129008
(3 rows)

I might have to run these with some more complicated queries, and compare the results.

Querying JSON in Postgres

Yesterday, I discovered how you can enable jsonb in postgres/psycopg2.

Today, I experimented around with how to query the data in json columns. There is documentation, but it wasn’t initially clear to me how the different operations worked.

CREATE TABLE json_test (
  id serial primary key,
  data jsonb
);

INSERT INTO json_test (data) VALUES 
  ('{}'),
  ('{"a": 1}'),
  ('{"a": 2, "b": ["c", "d"]}'),
  ('{"a": 1, "b": {"c": "d", "e": true}}'),
  ('{"b": 2}');
  

So far, so good. Let’s see what’s in there, to check:

SELECT * FROM json_test;
 id |                 data                 
----+--------------------------------------
  1 | {}
  2 | {"a": 1}
  3 | {"a": 2, "b": ["c", "d"]}
  4 | {"a": 1, "b": {"c": "d", "e": true}}
  5 | {"b": 2}
(5 rows)  

Super. Let’s have a go at filtering those results. There are several operators that we can use (and we’ll soon see why we chose jsonb).

Equality

Only available for jsonb, we can test that two JSON objects are identical:

SELECT * FROM json_test WHERE data = '{"a":1}';
 id |   data   
----+----------
  2 | {"a": 1}
(1 row)

Containment

Again, jsonb only, we can see if one JSON object contains another. In this case, containment means “is a subset of”.

SELECT * FROM json_test WHERE data @> '{"a":1}';

Give me all objects that contain the key "a", with the value 1 associated with that key:

 id |                 data                 
----+--------------------------------------
  2 | {"a": 1}
  4 | {"a": 1, "b": {"c": "d", "e": true}}
(2 rows)

Containment goes both ways:

SELECT * FROM json_test WHERE data <@ '{"a":1}';

In this case, we can see that the query object is a superset of the empty object, as well as matching exactly to object 2.

 id |   data   
----+----------
  1 | {}
  2 | {"a": 1}
(2 rows)

Key/element existence

The last batch of jsonb only operators: we can test for the existence of a key (or an element of type string in an array, but we’ll get to those later).

SELECT * FROM json_test WHERE data ? 'a';

Give me all objects that have the key a.

 id |                 data                 
----+--------------------------------------
  2 | {"a": 1}
  3 | {"a": 2, "b": ["c", "d"]}
  4 | {"a": 1, "b": {"c": "d", "e": true}}
(3 rows)

We can also test for objects that have any of a list of keys:

SELECT * FROM json_test WHERE data ?| array['a', 'b'];
 id |                 data                 
----+--------------------------------------
  2 | {"a": 1}
  3 | {"a": 2, "b": ["c", "d"]}
  4 | {"a": 1, "b": {"c": "d", "e": true}}
  5 | {"b": 2}
(4 rows)

And, as you may expect, for objects that have all of the keys:

SELECT * FROM json_test WHERE data ?& array['a', 'b'];
 id |                 data                 
----+--------------------------------------
  3 | {"a": 2, "b": ["c", "d"]}
  4 | {"a": 1, "b": {"c": "d", "e": true}}
(2 rows)

Key-path traversal

We can also filter records that have a matching key-path. In simple cases, using the containment operators might be simpler, but in more complex situations, we would need to use these. These operations can also be used to extract a value, although at this stage I’m only interested in using them as part of a WHERE clause.
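
(To see what extraction looks like outside of a WHERE clause: -> returns json(b), while ->> returns text.)

SELECT id, data -> 'a' AS a_as_json, data ->> 'a' AS a_as_text
  FROM json_test WHERE data ? 'a';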

SELECT * FROM json_test WHERE data ->> 'a' > '1';

Give me all the records where the value of the element associated with key a is greater than 1. Notice the need to compare against a text value, rather than a number: ->> returns text, so this is a string comparison. I’m still investigating how this will play out.

 id |           data            
----+---------------------------
  3 | {"a": 2, "b": ["c", "d"]}
(1 row)

We can also do comparisons between primitives, objects and arrays:

SELECT * FROM json_test WHERE data -> 'b' > '1';
 id |                 data                 
----+--------------------------------------
  3 | {"a": 2, "b": ["c", "d"]}
  4 | {"a": 1, "b": {"c": "d", "e": true}}
  5 | {"b": 2}
(3 rows)

So, it seems that arrays and objects sort greater than numbers.
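
We can check that directly, comparing literal values:

SELECT '["c", "d"]'::jsonb > '1'::jsonb AS array_gt_number,
       '{"c": "d"}'::jsonb > '1'::jsonb AS object_gt_number;
-- t | t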

We can also look deeper down the path:

SELECT * FROM json_test WHERE data #> '{b,c}' = '"d"';

Give me objects where element b has a child object that has element c equal to the string "d". Neat.

 id |                 data                 
----+--------------------------------------
  4 | {"a": 1, "b": {"c": "d", "e": true}}
(1 row)

There are also versions of these operators that return text, rather than a json object. In the case of the last query, that means we don’t need to compare against a JSON object (in the case where we actually want a string).

SELECT * FROM json_test WHERE data #>> '{b,c}' = 'd';
 id |                 data                 
----+--------------------------------------
  4 | {"a": 1, "b": {"c": "d", "e": true}}
(1 row)

Don’t cross the streams…

So, all good so far. We can query stuff, and this same stuff can be used to index jsonb columns, too.

However, the more astute reader may have noticed that I’ve been dealing with json data that has an object as its root. This needn’t be the case: arrays are also valid json, as indeed are any of the allowable atoms:

SELECT 
  'null'::json, 
  'true'::json, 
  'false'::json, 
  '2'::json,
  '1.00001'::json,
  '"abc"'::json, 
  '1E7'::jsonb;

Note the last one is a jsonb, which converts to canonical form:

 json | json | json  | json |  json   | json  |  jsonb   
------+------+-------+------+---------+-------+----------
 null | true | false | 2    | 1.00001 | "abc" | 10000000
(1 row)

Note also that a json null is different to an SQL NULL.
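
That is:

SELECT 'null'::jsonb IS NULL;   -- f: a json null is still a (non-NULL) value
SELECT NULL::jsonb IS NULL;     -- t: an SQL NULL is the absence of a value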

So, what happens when we start storing objects of mixed “type” in a json column?

I’m glad you asked.

INSERT INTO json_test (data) 
VALUES ('[]'), ('[1,2,"a"]'), ('null'), ('1E7'), ('"abc"');

SELECT * FROM json_test;
 id |                 data                 
----+--------------------------------------
  1 | {}
  2 | {"a": 1}
  3 | {"a": 2, "b": ["c", "d"]}
  4 | {"a": 1, "b": {"c": "d", "e": true}}
  5 | {"b": 2}
  6 | []
  7 | [1, 2, "a"]
  8 | null
  9 | 10000000
 10 | "abc"
(10 rows)

So far, so good. We can store those objects. And query?

Equality testing works fine:

SELECT * FROM json_test WHERE data = '{"a":1}';
SELECT * FROM json_test WHERE data = 'null';

Containment, too, works as expected.

SELECT * FROM json_test WHERE data @> '{"a":1}';
SELECT * FROM json_test WHERE data <@ '{"a":1}';

Key and element existence perform reliably: perhaps surprisingly, the same query will match string elements in an array, as well as keys in an object.

SELECT * FROM json_test WHERE data ? 'a';
 id |                 data                 
----+--------------------------------------
  2 | {"a": 1}
  3 | {"a": 2, "b": ["c", "d"]}
  4 | {"a": 1, "b": {"c": "d", "e": true}}
  7 | [1, 2, "a"]
(4 rows)
SELECT * FROM json_test WHERE data ?| array['a', 'b'];
 id |                 data                 
----+--------------------------------------
  2 | {"a": 1}
  3 | {"a": 2, "b": ["c", "d"]}
  4 | {"a": 1, "b": {"c": "d", "e": true}}
  5 | {"b": 2}
  7 | [1, 2, "a"]
(5 rows)
SELECT * FROM json_test WHERE data ?& array['a', 'b'];
 id |                 data                 
----+--------------------------------------
  3 | {"a": 2, "b": ["c", "d"]}
  4 | {"a": 1, "b": {"c": "d", "e": true}}
(2 rows)

But, as soon as we start doing key or element ‘get’, we hit a problem:

SELECT * FROM json_test WHERE data ->> 'a' > '1';

ERROR: cannot call jsonb_object_field_text 
       (jsonb ->> text operator) on an array

We can still use the key-path traversal, though, unless we have scalar values:

SELECT * FROM json_test WHERE data #> '{b,c}' = '"d"';
ERROR:  cannot call extract path from a scalar
SELECT * FROM json_test WHERE data #> '{b,c}' = '"d"' AND id < 8;
 id |                 data                 
----+--------------------------------------
  4 | {"a": 1, "b": {"c": "d", "e": true}}
(1 row)

Note the syntax for a key path: it only allows for strings (which json keys must be), or integers (which array indices are).
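
For example, mixing a key and an array index in the one path:

SELECT '{"b": ["c", "d"]}'::jsonb #> '{b,1}';   -- "d"
SELECT '[1, 2, "a"]'::jsonb #> '{2}';           -- "a"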

This seems like a pretty severe limitation. I’m not sure how things like MongoDB handle this, but in hindsight, if you are storing both array-based and object-based json data in the one column, you are probably going to be in a world of hurt anyway.

…or, maybe, do cross the streams

All is not lost, however: it’s possible to get just the object-based rows:

SELECT * FROM json_test WHERE data @> '{}';
 id |                 data                 
----+--------------------------------------
  1 | {}
  2 | {"a": 1}
  3 | {"a": 2, "b": ["c", "d"]}
  4 | {"a": 1, "b": {"c": "d", "e": true}}
  5 | {"b": 2}
(5 rows)

You could then combine this with a previously-forbidden query:

SELECT * FROM json_test WHERE data @> '{}' AND data ->> 'a' > '1';
 id |           data            
----+---------------------------
  3 | {"a": 2, "b": ["c", "d"]}
(1 row)

Indeed, postgres is so awesome you don’t even need to ensure the data @> '{}' bit comes first!

But what about limiting to just array-typed data? Turns out we can use the same trick:

SELECT * FROM json_test WHERE data @> '[]';
 id |    data     
----+-------------
  6 | []
  7 | [1, 2, "a"]
(2 rows)

And, again, combine with the other required operator:

SELECT * FROM json_test WHERE data @> '[]' AND data ->> 1 = '2';
 id |    data     
----+-------------
  7 | [1, 2, "a"]
(1 row)

Worth noting is that the @> operator is only available on jsonb columns, so you won’t be able to query mixed-form data in a regular json column.
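
If you try it on plain json values, you get an error along the lines of:

SELECT '{"a": 1}'::json @> '{}'::json;
-- ERROR:  operator does not exist: json @> json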

Wow! What’s next?

This foray into querying jsonb data in postgres was an aside to a project I’m working on to bring json(b) querying to django. With django 1.7’s new custom lookup features, it will be possible to write things like:

# Exact
MyModel.objects.filter(data={'a': 1})
MyModel.objects.exclude(data={})
# Key/element existence
MyModel.objects.filter(data__has='a')
MyModel.objects.filter(data__has_any=['a', 'b'])
MyModel.objects.filter(data__has_all=['a', 'b'])
# Sub/superset of key/value pair testing
MyModel.objects.filter(data__contains={'a': 1})
MyModel.objects.filter(data__in={'a': 1, 'b': 2})
# Get element/field (compare with json)
MyModel.objects.filter(data__get=(2, {'a': 1}))
# Get element/field (compare with scalar, including gt/lt comparisons)
MyModel.objects.filter(data__get=(2, 'a'))
MyModel.objects.filter(data__get__gt=('a', 1))
# key path traversal, compare with json or scalar.
MyModel.objects.filter(data__get=('{a,2}', {'foo': 'bar'}))
MyModel.objects.filter(data__get=('{a,2}', 2))
MyModel.objects.filter(data__get__lte=('{a,2}', 2))

I’m still not sure about the lookup names, especially the last set. The name “get” seems a little generic, and maybe we could use different lookup names for the input type, although only integer and string values are permitted.

Django and Collection+JSON

Recently, I have been reading (and re-reading) Building Hypermedia APIs with HTML5 and Node. There’s lots to like about this book, especially after reading (and mostly discarding) REST API Design Rulebook.

There is one thing that bugs me, and that is the way that templates are used to generate the JSON. I raised this with Mike Amundsen; his response was that he sometimes used JSON.stringify, at other times templates.

But it got me thinking. I have written lots of code that serialises Django models (or, more recently, forms) into JSON and other formats. Getting a nice Collection+JSON representation actually maps quite nicely onto these django constructs, as we often have the metadata that is required for the fields.

Consider the following (simple) django model:

class Article(models.Model):
    title = models.CharField('Title of Article', max_length=128)
    content = models.TextField('Content of Article')
    author = models.ForeignKey('auth.User', verbose_name='Author of Article')
    
    def get_absolute_url(self):
        return reverse('article_detail', kwargs={'pk': self.pk})

I don’t normally supply verbose_names, but I have in this case. We’ll see why in a minute.

Now, what I would declare is the obvious JSON representation of this is something like:

{
  "title": "Title goes here",
  "content": "Content goes here",
  "author": 1,
  "href": "…"
}

But, I’m quite interested in Collection+JSON. We might see something like:

{
  "collection": {
    "version": "1.0",
    "href": "…",
    "links": [
      {"href":"…", "rel":"…", "prompt":"…", "name":"…", "render":"string"}
    ],
    "items": [
      {
        "href": "…",
        "data": [
          {"name":"title", "value":"Title goes here", "prompt":"Title of Article"},
          {"name":"content", "value":"Content goes here", "prompt":"Content of Article"},
          {"name":"author", "value":"1", "prompt":"Author of Article"},
        ],
        "links": []
      }
    ]
  }
}

From a django ModelForm, we should be able to easily generate each member of items:

links = getattr(form, 'links', [])
return {
    "data": [
        {"name":f.name, "prompt":f.label, "value":f.value()} for f in form
    ],
    "href": ,
    "links": links
}

The only bit that we are missing from the form/field data is the data type, or more specifically in this case, the available choices that are permitted for the author field. Now, this is missing from the Collection+JSON spec, so I’m not sure how to handle that.

I think this is actually an important problem: if we have a discoverable/hypermedia API, how do we indicate to the client what are valid values that can be entered for a given field?

For those not familiar with django: the verbose_name on a model field is used for the default label on a form field. If you were not using a model, you could just supply a label in the form class.

The other parts that are a little hard to think about right now are the remaining attributes: href and links. Now, these may actually coalesce into one, as links.self.href should give us href. Perhaps we have to look on the form object for a links property. But, in django, it’s not really the domain of the form to contain information about that. For now, I’m going to have a links property on my forms, but that feels dirty too.