Set-returning and row-accepting functions in Django and Postgres
Postgres set-returning functions are an awesome thing. With them, you can do fun things like unnesting an array, which gives you a new row for each item in the array. For example, take this model:
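from django.conf import settings
from django.contrib.postgres.fields import ArrayField
from django.db import models

class Post(models.Model):
    author = models.ForeignKey(settings.AUTH_USER_MODEL, related_name='posts', on_delete=models.CASCADE)
    tags = ArrayField(base_field=models.TextField(), null=True, blank=True)
    created_at = models.DateTimeField()
    content = models.TextField()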
The equivalent SQL might be something like:
CREATE TABLE blog_post (
    id SERIAL NOT NULL PRIMARY KEY,
    author_id INTEGER NOT NULL REFERENCES auth_user (id),
    tags TEXT[],
    created_at TIMESTAMPTZ NOT NULL,
    content TEXT NOT NULL
);
We can “explode” the table so that we have one tag per row:
SELECT author_id, UNNEST(tags) AS tag, created_at, content
FROM blog_post;
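To make the row-multiplying behaviour concrete, here is a minimal pure-Python sketch of what UNNEST does to each row. It only illustrates the semantics and is not how Postgres or Django implement it; the explode helper and the sample rows are made up for this example.

```python
def explode(rows, array_key, item_key):
    """Yield one copy of each row per element of its array column."""
    for row in rows:
        # A NULL (None) or empty array contributes no output rows at all.
        for item in row[array_key] or []:
            new_row = {k: v for k, v in row.items() if k != array_key}
            new_row[item_key] = item
            yield new_row


# Made-up sample data standing in for blog_post rows.
posts = [
    {'author_id': 1, 'tags': ['django', 'postgres'], 'content': 'First'},
    {'author_id': 2, 'tags': None, 'content': 'Second'},
    {'author_id': 1, 'tags': ['sql'], 'content': 'Third'},
]

exploded = list(explode(posts, 'tags', 'tag'))
tags = [row['tag'] for row in exploded]
```

Note that the post with no tags disappears from the output entirely, which matches how a set-returning function in the select list behaves when it returns no rows.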
To do the same sort of thing in Django, we can use a Func:
from django.db.models import F, Func
Post.objects.annotate(tag=Func(F('tags'), function='UNNEST'))
In practice, just like in the Django docs, I’ll create a convenience subclass:
class Unnest(Func):
    function = 'UNNEST'

    @property
    def output_field(self):
        output_fields = [x.output_field for x in self.get_source_expressions()]
        if len(output_fields) == 1:
            return output_fields[0].base_field
        return super(Unnest, self).output_field
The opposite of this is aggregation: in the case of UNNEST, it’s almost ARRAY_AGG, although because of the handling of nested arrays, this doesn’t quite round-trip. We already know how to do aggregation in Django, so I will not discuss that here.
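As a rough illustration of why the round trip is lossy: UNNEST expands a multidimensional array all the way down to its individual elements, so aggregating them again yields a flat array. The sketch below mimics that in plain Python; the unnest and array_agg helpers are stand-ins written for this example, not Postgres or Django APIs.

```python
def unnest(value):
    """Yield every scalar element, flattening all levels of nesting."""
    for item in value:
        if isinstance(item, list):
            yield from unnest(item)
        else:
            yield item


def array_agg(items):
    """Collect the items back into a single flat list."""
    return list(items)


nested = [[1, 2], [3, 4]]
round_tripped = array_agg(unnest(nested))
```

Here round_tripped is a flat list of the four elements, so the original two-dimensional shape is lost.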
However, there is another related operation: what if you want to turn a row into something else? In my case, this was turning a row from a result into a JSON object.
SELECT id,
to_jsonb(myapp_mymodel) - 'id' AS "json"
FROM myapp_mymodel
This will get all of the columns except ‘id’, and put them into a new column called “json”.
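In Python terms, the effect is roughly that of building a dict from the row and dropping one key before serialising it. The following is only a sketch of the semantics, with a made-up to_json_without helper and sample row; it does not touch the database.

```python
import json


def to_json_without(row, *keys):
    """Serialise a row (a dict here) to JSON, leaving out the given keys."""
    remaining = {k: v for k, v in row.items() if k not in keys}
    return json.dumps(remaining, sort_keys=True)


# Hypothetical row from myapp_mymodel.
row = {'id': 7, 'name': 'example', 'active': True}
result = to_json_without(row, 'id')
```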
But how do we get Django to output SQL that will enable us to use a Model as the argument to a function? Ultimately, we want to get to the following:
class ToJSONB(Func):
    function = 'TO_JSONB'
    output_field = JSONField()

MyModel.objects.annotate(
    json=ToJSONB(MyModel) - Value('id')
).values('id', 'json')
Our first attempt could be to use RawSQL. However, this has a couple of problems. The first is that we are writing lots of raw SQL. The second is that it won’t work so well if the table is aliased by the ORM: if you use this in a join or subquery, where Django automatically assigns an alias to the table, then referring directly to the table name will not work.
MyModel.objects.annotate(json=RawSQL("to_jsonb(myapp_mymodel) - 'id'", [], output_field=JSONField()))
Instead, we need to dynamically find out what the current alias for the model is in this query, and use that. We’ll also want to figure out how to “subtract” the id key from the JSON object.
from django.db.models import Expression

class Table(Expression):
    def __init__(self, model, *args, **kwargs):
        self.model = model
        self.query = None
        super(Table, self).__init__(*args, **kwargs)

    def resolve_expression(self, query, *args, **kwargs):
        # Keep a reference to the query so we can look up the alias later.
        clone = super(Table, self).resolve_expression(query, *args, **kwargs)
        clone.query = query
        return clone

    def as_sql(self, compiler, connection, **kwargs):
        if not self.query:
            raise ValueError('Unresolved Table expression')
        # Use the first alias assigned to our table, else the table name itself.
        alias = self.query.table_map.get(self.model._meta.db_table, [self.model._meta.db_table])[0]
        return compiler.quote_name_unless_alias(alias), []
Okay, there’s a fair bit going on there. Let’s look through it. We’ll start with how we use it:
MyModel.objects.annotate(json=ToJSONB(Table(MyModel)))
We create a Table instance, which stores a reference to the model. Technically, all we need later on is the database table name that will be used, but we’ll keep the model for now.
When the ORM “resolves” the queryset, we grab the query object, and store a reference to that.
When the ORM asks us to generate some SQL, we look at the query object we have a reference to, and see if our model’s table name has an entry in the table_map dict: if so, we get the first entry from that, otherwise we just use the table name.
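The lookup itself is just a dict access with a fallback. Here is a standalone sketch of it, using plain dicts in place of the query’s table_map; the resolve_alias name and the sample alias values (such as U0) are assumptions for illustration only.

```python
def resolve_alias(table_map, db_table):
    """Return the first alias registered for db_table, or db_table itself."""
    # table_map maps a table name to the list of aliases in use for it.
    return table_map.get(db_table, [db_table])[0]


# Illustrative maps: an unaliased query, an aliased subquery, and a query
# that does not mention our table at all.
top_level = {'myapp_mymodel': ['myapp_mymodel']}
in_subquery = {'myapp_mymodel': ['U0']}
unrelated = {'other_table': ['other_table']}

aliases = [
    resolve_alias(top_level, 'myapp_mymodel'),
    resolve_alias(in_subquery, 'myapp_mymodel'),
    resolve_alias(unrelated, 'myapp_mymodel'),
]
```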
Okay, what about being able to remove the entry in the JSONB object for ‘id’?
We can’t just use the subtraction operator, because Postgres will try to convert the RHS value into JSONB first, and fail. So, we need to ensure it renders it as TEXT. We also need to wrap it in an ExpressionWrapper, so we can indicate what the output field type will be:
id_value = models.Func(models.Value('id'), template='%(expressions)s::TEXT')
MyModel.objects.annotate(
    json=ExpressionWrapper(
        ToJSONB(Table(MyModel)) - id_value, output_field=JSONField()
    )
)
I also often use a convenience Cast function that automatically does this based on the supplied output_field, but this is a little easier to use here. Note there is a possible use for ToJSONB in a different context, where it doesn’t take a table, but some other primitive.
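The template argument used for the cast works because a Func renders its SQL by filling a %-style template with the function name and the joined argument SQL. The sketch below is a simplified stand-in for that step, not Django’s actual code; the render helper and the sample argument strings are assumptions for this example.

```python
def render(template, function, expressions, arg_joiner=', '):
    """Fill a Func-style template with the function name and joined arguments."""
    data = {
        'function': function,
        'expressions': arg_joiner.join(expressions),
    }
    return template % data


# The usual function-call shape, FUNCTION(arguments).
default_sql = render('%(function)s(%(expressions)s)', 'TO_JSONB', ['"myapp_mymodel"'])

# The cast template ignores the function name and appends ::TEXT to the
# argument, which here is the query-parameter placeholder for the value 'id'.
cast_sql = render('%(expressions)s::TEXT', None, ['%s'])
```

The placeholder survives untouched in cast_sql, so the value is still passed as a query parameter and only its type is forced to TEXT.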
There’s one more way we can use this construct: the geo_unique_indexer function from a previous post needs a table name, but also the name of a field to omit from the index. So, we can wrap this up nicely:
class GeoMatch(models.Func):
    function = 'geo_unique_indexer'
    output_field = JSONField()

    def __init__(self, model, *args, **kwargs):
        table = Table(model)
        # Fall back to the field name when no explicit db_column is set.
        pk = models.Value(model._meta.pk.db_column or model._meta.pk.name)
        super(GeoMatch, self).__init__(table, pk, *args, **kwargs)
This is really tidy: it takes the model class (or maybe an instance, I didn’t try), builds a Table, and gets the primary key. These are just used as the arguments for the function, and then it all works.