Subquery and Subclasses

2019-07-30 @ 17:14:05

here.

Being able to use correlated subqueries in the Django ORM arrived in 1.11, and I also backported it to 1.8.

Quite commonly, I am asked questions about how to use these, so here is an attempt to document them further.

There are three classes that are supplied with Django, but it’s easy to write extensions using subclassing.

Let’s first look at an example of how you might want to use the included classes. We’ll consider a set of temperature sensors, each with a name and a code, both of which are unique. These sensors will log their current temperature at some sort of interval: maybe it’s regular, maybe it varies between devices. We want to keep every reading, but want to only allow one reading for a given sensor+timestamp.

class Sensor(models.Model):
    location = models.TextField(unique=True)
    code = models.TextField(unique=True)


class Reading(models.Model):
    sensor = models.ForeignKey(Sensor, related_name='readings')
    timestamp = models.DateTimeField()
    temperature = models.DecimalField(max_digits=6, decimal_places=3)

    class Meta:
        unique_together = (('sensor', 'timestamp'),)

Some of the things we might want to do for a given sensor:

Get the most recent temperature
Get the average temperature over a given period
Get the maximum temperature over a given period
Get the minimum temperature over a given period

If we start with a single sensor instance, we can do each of these without having to use Subquery and friends:

from django.db.models import Avg, Min, Max

most_recent_temperature = sensor.readings.order_by('-timestamp').first().temperature
period_readings = sensor.readings.filter(
    timestamp__gte=start,
    timestamp__lte=finish,
).aggregate(
    average=Avg('temperature'),
    minimum=Min('temperature'),
    maximum=Max('temperature'),
)

We could also get the minimum or maximum using ordering, like we did with the most_recent_temperature.

If we want to do the same for a set of sensors, mostly we can still achieve this (note how similar the code is to the block above):

sensor_readings = Reading.objects.filter(
  timestamp__gte=start,
  timestamp__lte=finish
).values('sensor').annotate(
  average=Avg('temperature'),
  minimum=Min('temperature'),
  maximum=Max('temperature'),
)

We might get something like:

[
    {
        'sensor': 1,
        'average': 17.5,
        'minimum': 11.3,
        'maximum': 25.9
    },
    {
        'sensor': 2,
        'average': 19.63,
        'minimum': 13.6,
        'maximum': 24.33
    },
]

However, it’s not obvious how we would get all of the sensors, and their current temperature in a single query.

Subquery to the rescue!

from django.db.models.expressions import Subquery, OuterRef

current_temperature = Reading.objects.filter(sensor=OuterRef('pk'))\
                                     .order_by('-timestamp')\
                                     .values('temperature')[:1]

Sensor.objects.annotate(
    current_temperature=Subquery(current_temperature)
)

What’s going on here as that we are filtering the Reading objects inside our subquery to only those associated with the sensor in the outer query. This uses the special OuterRef class, that will, when the queryset is “resolved”, build the association. It does mean that if we tried to inspect the current_temperature queryset, we would get an error that it is unresolved.

We then order the filtered readings by newest timestamp first; this, coupled with the slice at the end will limit us to a single row. This is required because the database will reject a query that results in multiple rows being returned for a subquery.

Additionally, we may only have a single column in our subquery: that’s achieved by the .values('temperature').

But maybe there is a problem here: we actually want to know when the reading was taken, as well as the temperature.

We can do that a couple of ways. The simplest is to use two Subqueries:

current_temperature = Reading.objects.filter(sensor=OuterRef('pk'))\
                                     .order_by('-timestamp')[:1]

Sensor.objects.annotate(
    current_temperature=Subquery(current_temperature.values('temperature')),
    last_reading_at=Subquery(current_temperature.values('timestamp')),
)

However, this will do two subqueries at the database level. Since these subqueries will be performed seperately for each row, each additional correlated subquery will result in more work for the database, with possible performance implications.

What about if we are using Postgres, and are okay with turning the temperature and timestamp pair into a JSONB object?

from django.db.models.expressions import Func, F, Value, OuterRef, Subquery
from django.contrib.postgres.fields import JSONField


class JsonBuildObject(Func):
    function = 'jsonb_build_object'
    output_field = JSONField()


last_temperature = Reading.objects.filter(sensor=OuterRef('pk'))\
                                  .order_by('-timestamp')\
                                  .annotate(
                                      json=JsonBuildObject(
                                          Value('timestamp'), F('timestamp'),
                                          Value('temperature'), F('temperature'),
                                      )
                                   ).values('json')[:1]

Sensor.objects.annotate(
    last_temperature=Subquery(last_temperature, output_field=JSONField())
)

Now, your Sensor instances would have an attribute last_temperature, which will be a dict with the timestamp and temperature of the last reading.

There is also a supplied Exists subquery that can be used to force the database to emit an EXISTS statement. This could be used to set a boolean field on our sensors to indicate they have data from within the last day:

recent_readings = Reading.objects.filter(
    sensor=OuterRef('pk'),
    timestamp__gte=datetime.datetime.utcnow() - datetime.timedelta(1)
)
Sensor.objects.annotate(
    has_recent_readings=Exists(recent_readings)
)

Sometimes we’ll have values from multiple rows that we will want to annotate on from the subquery. This can’t be done directly: you will need to aggregate those values in some way. Postgres has a neat feature where you can use an ARRAY() constructor and wrap a subquery in that:

SELECT foo,
       bar,
       ARRAY(SELECT baz
               FROM qux
              WHERE qux.bar = base.bar
              ORDER BY fizz
              LIMIT 5) AS baz
  FROM base

We can build this type of structure using a subclass of Subquery.

from django.contrib.postgres.fields import ArrayField
from django.core.exceptions import FieldError
from django.db.models.expressions import Subquery

class SubqueryArray(Subquery):
    template = 'ARRAY(%(subquery)s)'

    @property
    def output_field(self):
        output_fields = [x.output_field for x in self.get_source_expressions()]

        if len(output_fields) > 1:
            raise FieldError('More than one column detected')

        return ArrayField(base_field=output_fields[0])

And now we can use this where we’ve used a Subquery, but we no longer need to slice to a single row:

json_reading = JsonBuildObject(
    Value('timestamp'), F('timestamp'),
    Value('temperature'), F('temperature'),
)

last_five_readings = Reading.objects.filter(
    sensor=OuterRef('pk')
).order_by('-timestamp').annotate(
    json=json_reading
).values('json')[:5]

Sensor.objects.annotate(last_five_readings=SubqueryArray(last_five_readings))

Each sensor instance would now have up to 5 dicts in a list in it’s attribute last_five_readings.

We could get this data in a slightly different way: let’s say instead of an array, we want a dict keyed by a string representation of the timestamp:

sensor.last_five_readings = {
    '2019-01-01T09:12:35Z': 15.35,
    '2019-01-01T09:13:35Z': 14.33,
    '2019-01-01T09:14:35Z': 14.90,
    ...
}

There is a Postgres aggregate we can use there to do that, too:

class JsonObjectAgg(Subquery):
    template = '(SELECT json_object_agg("_j"."key", "_j"."value") FROM (%(subquery)s) "_j")'
    output_field = JSONField()


last_five_readings = Reading.objects.filter(
    sensor=OuterRef('pk')
).order_by('-timestamp').annotate(
    key=F('timestamp'),
    value=F('temperature'),
).values('key', 'value')[:5]

Sensor.objects.annotate(last_five_readings=JsonObjectAgg(last_five_readings))

Indeed, we can wrap any aggregate in a similar way: to get the number of values of a subquery:

class SubqueryCount(Subquery):
    template = '(SELECT count(*) FROM (%(subquery)s) _count)'
    output_field = models.IntegerField()

Since other aggregates need to operate on a single field, we’ll need something that ensures there is a single value in our .values(), and extract that out and use that in the query.

class SubquerySum(Subquery):
    template = '(SELECT SUM(%(field)s) FROM (%(subquery)s) _sum)'

    def as_sql(self, compiler, connection, template=None, **extra_context):
        if 'field' not in extra_context and 'field' not in self.extra:
            if len(self.queryset._fields) > 1:
                raise FieldError('You must provide the field name, or have a single column')
            extra_context['field'] = self.queryset._fields[0]
        return super(SubquerySum, self).as_sql(
          compiler, connection, template=template, **extra_context
        )

As I mentioned, it’s possible to write a subclass like that for any aggregate function, although it would be far nicer if there was a way to write that purely in the ORM. Maybe one day…