On Fences and Functions

I grew up on a farm.

We had fences on the farm.

Whilst the jobs associated with fences and fencing are less than fun, the fences themselves are extremely important. They keep the livestock in the correct location. A damaged or incomplete fence, even if it is only damaged in a small way, can cost significant amounts of money, or even human lives. The consequences range from Rams getting in with a flock of Ewes that you don’t want them to mate with (because you need to know which Ram mated with which Ewes in order to track progeny), to livestock escaping onto a public road and causing accidents.

Fences are a good thing.


My first career was as a Design and Technology Teacher.

We use fences in woodwork. They are attachments to fixed power tools, such as drill presses and circular saws. They allow us to work safely and to get accurate, easily repeatable results. For instance, we can use a fence to cut sheets of MDF to exactly the same width, ensuring the bookcase we are making is square. Without a fence it can still be done, but it will certainly be much harder.

Fences are a good thing.


I’d heard people describe Postgres’s CTEs (Common Table Expressions) as an “optimisation fence”. Given my previous uses of the word “fence”, I assumed that this was widely regarded as a good thing.

However, after spending some time writing really complex queries (which are most easily described using a CTE), I happened to read PostgreSQL’s CTEs are optimisation fences. Throughout my work with Postgres it had become plain to me that each term in a CTE, if it is referenced at all, is materialised in full: any filtering that happens later in the query cannot be applied earlier, inside the CTE term. Postgres is pretty good about pushing these conditions down into a plain sub-query, but it won’t do so for a CTE, which can mean a CTE performs worse, as it might have to do more work. That article points this out in some detail, and it occurred to me that perhaps some people see fences (in general) as an obstacle. Perhaps fencing something in has negative connotations?

I’m not sure that that’s exactly what the author meant (I wonder if it was sarcasm, perhaps), but it did get me thinking about how different backgrounds could result in opposite interpretations of the same terms.


I do want to veer back into a more technical matter, and discuss how I have been overcoming the fact that filtering from outside a CTE cannot be pushed down into it (that is, applied earlier).

Largely, the issue exists in my code because I have lots of complex queries (as I just mentioned) that are far easier to write, reason about and debug when written using a CTE. I would like to write them as a VIEW, stick Django models in front of them, and then query them using the ORM, with the view as the db_table of the model. It would be really nice.

But this doesn’t work, because some of the queries require aggregating data across models that have millions of rows, and some of the database tables are less than optimal. For instance, I have several tables that store an effective_from field; when a row is superseded, a new row with the same set of other fields (person, for instance) and a later effective_from is added, and that is how we know which one applies on a given date. However, querying this requires a more complex query than it should (instead of a simple date <@ daterange query, which would be possible if the valid period were stored in the table). I’ve learned from this in newer models, but some stuff is too deeply ingrained to be able to be changed just yet.

So, I have a VIEW that turns this into data that actually contains dateranges, and I can query against that. But if I use this in a CTE, then it can materialise the whole lot, which can be slow. So, I needed to come up with a way to filter the data earlier.

Functions.

I’ve been writing SQL functions that take parameters, and then filter as early as possible. This then means that it’s a real possibility that we can get <100ms queries for stuff that is really, really complicated (and joins a couple of dozen or more tables in really funky ways). It does mean I can’t query using the Django ORM, but that’s okay: the data I’m getting back doesn’t necessarily map onto a model anyway, and we need to use it as a dict.
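To give a flavour of what I mean, here’s a minimal sketch (the table, column and function names are made up for illustration, not my actual schema): the parameters are applied inside the function body, so the expensive parts only ever see the rows we care about.

CREATE OR REPLACE FUNCTION person_valid_periods(_person INTEGER)
RETURNS TABLE (person_id INTEGER, valid_period DATERANGE) AS $$

  -- Derive a daterange from the effective_from rows, but only for the
  -- requested person: the filter applies before the window function runs.
  SELECT person_conditions.person_id,
         DATERANGE(
           person_conditions.effective_from,
           LEAD(person_conditions.effective_from) OVER (
             ORDER BY person_conditions.effective_from
           )
         ) AS valid_period
    FROM person_conditions
   WHERE person_conditions.person_id = _person

$$ LANGUAGE SQL STABLE;

-- Which can then be queried like a table:
SELECT * FROM person_valid_periods(42) WHERE valid_period @> CURRENT_DATE;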

More recently, I’ve extended this so that the function (with the relevant parameters, extracted out of the queryset WHERE clauses) can be used as the db_table for a Model. It’s still somewhat hacky, but is very interesting, nonetheless.

Postgres Tree Shootout part 3: Adjacency List using Views

It’s been a while, but I’ve finally gotten off my arse and found some time to revisit this series. As promised last time, I’m going to rewrite the queries from the Adjacency List “solutions” using a View. Indeed, there will be two versions of the view - one which is a MATERIALIZED VIEW. There will also be discussion of when the two different types of view might be best to use.

One of the reasons this post took so long to write was that I was sidetracked by writing an SVG generator that would allow for graphically seeing what the different operations discussed in this series look like in terms of an actual tree. That didn’t eventuate.

We will start by defining what our tree view will actually look like. You’ll notice it is rather like the CTE that we saw in the previous post.

CREATE TABLE nodes (
  node_id SERIAL PRIMARY KEY,
  parent_id INTEGER REFERENCES nodes(node_id)
);

CREATE RECURSIVE VIEW tree (node_id, ancestors) AS (
  SELECT node_id, ARRAY[]::integer[] AS ancestors
  FROM nodes WHERE parent_id IS NULL

  UNION ALL

  SELECT nodes.node_id, tree.ancestors || nodes.parent_id
  FROM nodes, tree
  WHERE nodes.parent_id = tree.node_id
);

INSERT INTO nodes VALUES
  (1, NULL),
  (2, 1),
  (3, 1),
  (4, 2),
  (5, 2),
  (6, 3),
  (7, 3),
  (8, 4),
  (9, 8),
  (10, NULL),
  (11, 10),
  (12, 11),
  (13, 11),
  (14, 12),
  (15, 12),
  (16, 12);

Insertions

None of the insertions require access to the tree view, since the beauty of an Adjacency List model is that you only ever need to operate on the immediate parent-child relationship.

Removals

Similarly, we will skip over the simple operations: those don’t require access to any more of the tree than just the parent-child relationship. It’s not until we need to remove a subtree that it becomes interesting.

DELETE FROM nodes
WHERE node_id IN (
  SELECT node_id FROM tree WHERE 2 = ANY(ancestors)
) OR node_id = 2;

If you are paying attention, you will notice that this is virtually identical to the CTE version, except that we no longer need to redeclare the CTE each time. The full tree deletion is the same, as is removing all descendants:

DELETE FROM nodes
WHERE node_id IN (
  SELECT node_id FROM tree WHERE 2 = ANY(ancestors)
);

Moves

Again, the operations that don’t require the actual tree are unchanged: this is where the Adjacency List really shines.

Fetches

Since we are starting with the “full” tree, we should be able to use it for all of the queries. It is possible that these queries (unlike those we have seen before) may be slightly slower than the CTE version (specifically, those where the CTE is customised for that operation).

Descendants

Let’s get all of node 10’s descendants:

SELECT node_id FROM tree WHERE 10 = ANY(ancestors);

This query is far less complicated than the CTE version, as expected. However, when dealing with very large datasets, it performs far worse. I have a data set with 221000 nodes, in 1001 different trees. Performing this query takes around 5 seconds, but the customised CTE version takes about 750ms.

Turning this view into a materialised view:

CREATE MATERIALIZED VIEW tree_mat AS
SELECT node_id, ancestors FROM tree;

and then querying that turns this into around 75ms.
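As an aside (and a sketch, not something reflected in the timings above): because a materialised view stores its own copy of the data, it can also be indexed. A GIN index over the ancestors array will service containment queries, which means rewriting the descendant lookup to use @> rather than = ANY(...):

CREATE INDEX tree_mat_node_id ON tree_mat (node_id);
CREATE INDEX tree_mat_ancestors ON tree_mat USING GIN (ancestors);

-- The GIN index supports the containment operator:
SELECT node_id FROM tree_mat WHERE ancestors @> ARRAY[10];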

Limiting the query to nodes within a given depth requires slightly more work.

-- Descendants of node 10, no more than two levels below it.
SELECT node_id, ancestors FROM tree
WHERE ARRAY_POSITION(ancestors, 10) > ARRAY_LENGTH(ancestors, 1) - 2;

Ancestors

Fetching ancestors of a node is again trivial:

SELECT unnest(ancestors) FROM tree WHERE node_id = 15;

And the count of ancestors:

SELECT ARRAY_LENGTH(ancestors, 1) FROM tree WHERE node_id=15;

Getting a set of ancestors up to a given depth is actually a little tricky: since the view is fixed, we can’t swap which end of the ancestors array the parent node is appended to, so that trick isn’t available here. Instead, we’ll have to enumerate the rows, and then extract those we care about. You can’t use OFFSET with a variable, otherwise that would be a nice trick.

WITH ancestors AS (
  SELECT unnest(ancestors) AS node_id
  FROM tree
  WHERE node_id = 15
), enumerated AS (
  SELECT
    row_number() OVER () AS row,
    count(*) OVER () AS ancestor_count,
    node_id
  FROM ancestors
)
SELECT node_id
FROM enumerated
WHERE "row" > ancestor_count - 2;

Ugh. That’s way worse than the CTE version.

Special queries

None of the special queries access the tree in any way, so can be omitted for now.

Discussion

So how does using a view stack up to the ad-hoc CTE queries?

Mostly pretty well. In the case where you have only small data sets, then the view that builds up the complete tree each time is not that much of a problem (I ran some tests with tens of thousands of items, and it still performed relatively well). When it moves up to hundreds of thousands, then the ad-hoc CTE versions can greatly outperform the full tree view.

However, using a materialised view changes everything. It now becomes just as fast as querying a table: indeed, that’s just what it is. You could have triggers based on changes to the nodes table causing a REFRESH MATERIALIZED VIEW, but it is worth keeping in mind that this will take some time: in my case, a full refresh of 221000 rows took upwards of 4.5 seconds.
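As a sketch of what that trigger-based refresh might look like (a statement-level trigger keeps it to one refresh per statement, but every write still pays the full refresh cost):

CREATE OR REPLACE FUNCTION refresh_tree_mat()
RETURNS TRIGGER AS $$
BEGIN
  -- Rebuild the entire materialised view: several seconds at this data size.
  REFRESH MATERIALIZED VIEW tree_mat;
  RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER nodes_refresh_tree_mat
AFTER INSERT OR UPDATE OR DELETE ON nodes
FOR EACH STATEMENT
EXECUTE PROCEDURE refresh_tree_mat();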

Using a materialised view gets us most of the way to (and leads nicely into the next method), storing a materialised path. The similarity of the names here should be a trigger, but now I’m just making foreshadowing jokes.

Django ComputedField()

A very common pattern, at least in code that I’ve written (and read), is to annotate on a field that uses an expression based on one or more other fields. This can then be used to filter the objects, or used in some other way.

The usual method of doing this is:

from django.db import models
from django.db.models.expressions import F, Value
from django.db.models.functions import Concat


class PersonQuerySet(models.query.QuerySet):
    def with_name(self):
        return self.annotate(
            name=Concat(F('first_name'), Value(' '), F('last_name'), output_field=models.TextField()),
        )


class Person(models.Model):
    first_name = models.TextField()
    last_name = models.TextField()

    objects = PersonQuerySet.as_manager()

Yes, I’m aware of falsehoods programmers believe about names, but this is an easy-to-follow example.

In order to be able to access the name field, we must use the with_name() queryset method. This is usually okay, but if it is something that we almost always want, it can be a little tiresome. Alternatively, you could override the get_queryset() method of a custom manager, but that makes it somewhat surprising to a reader of the code. There are also some places where a custom manager will not automatically be used, or where it will be cumbersome to include the fields from a custom manager (select_related, for instance).
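For completeness, that custom-manager alternative looks something like this (a sketch of the approach I’m arguing against, using the same imports as above):

class PersonManager(models.Manager):
    def get_queryset(self):
        # Every queryset from Person.objects gets the annotation...
        return super().get_queryset().annotate(
            name=Concat(
                F('first_name'), Value(' '), F('last_name'),
                output_field=models.TextField(),
            ),
        )


class Person(models.Model):
    first_name = models.TextField()
    last_name = models.TextField()

    # ...but a reader now has to go looking in the manager to find out
    # where `name` comes from.
    objects = PersonManager()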

It would be much nicer if we could write the field declaratively, and have it use the normal django mechanism of defer and only to remove it from the query if required.

class Person(models.Model):
    first_name = models.TextField()
    last_name = models.TextField()
    name = ComputedField(Concat(F('first_name'), Value(' '), F('last_name'), output_field=models.TextField()))

I’ve spent some time digging around in the django source code, and have a fairly reasonable understanding of how fields work, and how queries are built up. But I did wonder how close to a working proof of concept of this type of field we could get without having to change any of the django source code. After all, I was able to backport the entire Subquery expression stuff to older versions of django after writing that. It would be nice to repeat the process here.

There are a few things you need to do to get this to work:

  • store the expression
  • prevent the field from creating a migration
  • ensure the field knows how to interpret data from the database
  • ensure the field adds the expression to its serialised version
  • prevent the field from writing data back to the database
  • inject the expression into the query instead of the field name
class ComputedField(models.Field):
    def __init__(self, expression, *args, **kwargs):
        self.expression = expression.copy()
        kwargs.update(editable=False)
        super().__init__(*args, **kwargs)

There is already a mechanism for a field to prevent a migration operation from being generated: it can return a db_type of None.

    def db_type(self, connection):
        return None

We can delegate the responsibility of interpreting the data from the database to the output field of the expression - that’s how it works in the normal operation of expressions.

    def from_db_value(self, value, expression, connection):
        return self.expression.output_field.from_db_value(value, expression, connection)

Storing the expression in the serialised version of a field is explained in the documentation on custom fields:

    def deconstruct(self):
        name, path, args, kwargs = super().deconstruct()
        return name, path, [self.expression] + args, kwargs

Preventing the field from being included in the data we write back to the database turned out to be fairly tricky. There are a couple of mechanisms that can be used, but ultimately the only one that worked in the way I needed was something that is used by the inheritance mechanism: we have to indicate that it is a “private” field. I’m not 100% sure of what the other implications of this might be, but the outcome of making this field private is that it no longer appears in the list of local fields. There is one drawback to this, which I’ll discuss below.

    def contribute_to_class(self, cls, name, private_only=False):
        return super().contribute_to_class(cls, name, True)

So, we only have one task to complete. How do we inject the expression into the query instead of the column?

When django evaluates a queryset, it looks at the annotations, and the expressions that are in these. It will then “resolve” these expressions (which means the expression gets told which “query” is being used to evaluate it, allowing it to do whatever it needs to do to make things work).

When a regular field is encountered, it is not resolved: instead it is turned into a Col. This happens in a few different places, but the problem is that a Col should not need to know which query it belongs to: at most it needs to know what the aliased table name is. So, we don’t have a query object we can pass to the resolve_expression method of our expression.

Instead, we’ll need to use Python’s introspection to look up the stack until we find a place that has a reference to this query.

    def get_col(self, alias, output_field=None):
        import inspect

        query = None

        for frame in inspect.stack():
            if frame.function in ['get_default_columns', 'get_order_by']:
                query = frame.frame.f_locals['self'].query
                break
            if frame.function in ['add_fields', 'build_filter']:
                query = frame.frame.f_locals['self']
                break
        else:
            # Aaargh! We don't handle this one yet!
            import pdb; pdb.set_trace()

        col = self.expression.resolve_expression(query=query)
        col.target = self
        return col

So, how does this code actually work? We go through each frame in the stack, and look for a function (or method, but they are really just functions in python) that matches one of the types we know about that have a reference to the query. Then, we grab that, stop iterating and resolve our expression. We have to set the “target” of our resolved expression to the original field, which is how the Col interface works.

This moves the resolve_expression into the get_col, which is where it needs to be. The (resolved) expression is used as the faked column, and it knows how to generate its own SQL, which will be put into the query in the correct location.

And this works, almost.

There is one more situation that needs to be taken into account: when we are referencing the field through a join (the x__y lookup syntax you often see in django filters).

Because F() expressions reference the local query, we need to first turn any of these that we find in our computed field’s expression (at any level) into a Col that refers to the correct model. We need to do this before the resolve_expression takes place.

    def get_col(self, alias, output_field=None):
        query = None

        for frame in inspect.stack():
            if frame.function in ['get_default_columns', 'get_order_by']:
                query = frame.frame.f_locals['self'].query
                break
            if frame.function in ['add_fields', 'build_filter']:
                query = frame.frame.f_locals['self']
                break
        else:
            # Aaargh! We don't handle this one yet!
            import pdb; pdb.set_trace()

        def resolve_f(expression):
            if hasattr(expression, 'get_source_expressions'):
                expression = expression.copy()
                expression.set_source_expressions([
                  resolve_f(expr) for expr in expression.get_source_expressions()
                ])
            if isinstance(expression, models.F):
                field = self.model._meta.get_field(expression.name)
                if hasattr(field, 'expression'):
                    return resolve_f(field.expression)
                return Col(alias, field)
            return expression

        col = resolve_f(self.expression).resolve_expression(query=query)
        col.target = self
        return col

There is a repo containing this, which has a bunch of tests showing how the different query types can use the computed field:

https://github.com/schinckel/django-computed-field
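Usage then ends up looking like any other field (a sketch only: the tests in that repo are the authoritative examples, and Membership here is a hypothetical related model with a ForeignKey `person` to Person):

Person.objects.filter(name='alice aardvark')
Person.objects.values_list('name', flat=True)

# The join case discussed above (the x__y lookup syntax):
Membership.objects.filter(person__name__startswith='alice')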


But wait, there is one more thing…

A very common requirement, especially if you are planning on using this column for filtering, would be to stick an index on there.

Unfortunately, that’s not currently possible: the mechanism for preventing the field name from being in the write queries, making it a private field, prevents using this field in an index. Anyway, function/expression indexes are not currently supported in Django.

It’s not all bad news though: Markus has a Pull Request that will enable this feature; from there we could (if db_index is set) automatically add an expression index to Model._meta.indexes in contribute_to_class, but it would also be great to be able to use index_together.

I suspect to get that, though, we’ll need another mechanism to prevent it being in the write queries, but still be a local field.

(Thanks to FunkyBob for suggestions, including suggesting the field at all).

Modelling a residential battery in Postgres

Having five-minute interval generation and consumption data from my PV system means that, in theory at least, I should be able to do some predictive modelling of what adding a battery to the system will do to my import and export levels, and more importantly, determine how much money having a battery of a specific capacity and efficiency would save me.

There are several approaches that could be used to model a battery. They all boil down to the same main idea:

  • The battery has a fixed capacity
  • The battery has a current level
  • If generating more power than consuming, this excess could be fed into the battery (assuming the battery is not full)
  • If the battery is full, any excess generation is exported
  • If we are consuming more power than generating, and we have battery level > 0, then we draw power from the battery
  • Any power we are drawing from the battery is divided by the round-trip efficiency of the battery
  • Any excess consumption that cannot be supplied by the battery results in imports

There are some assumptions that this model makes.

  • There is no current (as in Amperes, not “now”) limit to either exporting or importing
  • If we are both importing and exporting, the battery can handle both of these. In practice, most batteries take some time to switch between charging and discharging
  • The usable capacity is the same as the battery capacity

In all of these cases, the limitations we are not addressing would more than likely result in a poorer outcome than the model will predict.

So, on to the technical details of creating the model. In the first instance, I tried using a recursive view (using in this case the 13.5kWh/90% efficient Tesla Powerwall):

CREATE OR REPLACE RECURSIVE VIEW battery_status(
  timestamp,
  capacity,
  efficiency,
  level,
  import,
  export
) AS

SELECT '2019-03-07'::TIMESTAMP,
       13500 AS capacity,
       0.90 AS efficiency,
       0::NUMERIC AS level,
       0::NUMERIC AS import,
       0::NUMERIC AS export

 UNION ALL

SELECT battery.timestamp + INTERVAL '5 minutes',
       battery.capacity,
       battery.efficiency,
       GREATEST(
         0,
         battery.level + delta.amount_to_battery - delta.amount_from_battery
       )::NUMERIC(12, 3) AS level,
       (summary.import - delta.amount_from_battery)::NUMERIC(8, 3) AS import,
       (summary.export - delta.amount_to_battery)::NUMERIC(8, 3) AS export
  FROM battery_status battery
  LEFT OUTER JOIN summary ON (summary.timestamp = battery.timestamp + INTERVAL '5 minutes')
 INNER JOIN LATERAL (
   SELECT LEAST(summary.import, battery.level / battery.efficiency) AS amount_from_battery,
          LEAST(summary.export, battery.capacity - battery.level) AS amount_to_battery
 ) delta ON (true)

Whilst this sort-of worked, it was unsatisfactory in a couple of ways. Firstly, it took ages to run: way longer than I expected, even when I used materialised views, and changed the interval to hourly instead of five-minutely.

Additionally, because I have some gaps in my data (for instance, when my power was disconnected, or the inverter was off because I was cleaning the panels, or whatever), the view stopped recursing at that point, so I was never actually able to get a result that covered more than a couple of months. Even generating the missing values seemed to be insufficient, so at some point I gave up on this.

I even tried the same approach on a daily summary - this was never going to give me anything close to an accurate result, but at that point I was clutching at straws, so I tried a bunch of things.

It seemed unusual that it was taking as long as it did (dozens of seconds, and that was only to build a few months of data). It should only need to pass over the summary data once, and store…that gave me an idea. I could write a fairly simple Python generator function to perform the modelling.

def battery_status(capacity, efficiency):
    result = plpy.execute('SELECT * FROM summary')
    energy = 0

    for row in result:
        export_amount = row['export']
        if export_amount and energy < capacity:
            energy += export_amount
            if energy > capacity:
                export_amount = energy - capacity
                energy = capacity
            else:
                export_amount = 0
        import_amount = row['import']
        if import_amount and energy > 0:
            energy -= import_amount / efficiency
            if energy < 0:
                import_amount = -energy * efficiency
                energy = 0
            else:
                import_amount = 0
        yield (row['timestamp'], energy, import_amount, export_amount)

Following it through, we can see some benefits to this approach. We don’t need to pass the capacity and efficiency (or the current energy level in the battery) through to each iteration: because this is a generator, the state remains in the function, and those variables are just re-used over and over again.

The only thing that might be unusual there is the plpy.execute('SELECT * FROM summary'): this hints that this code is actually going inside a plpythonu function and living inside the database. We can use a composite type to make the return values easier to deal with:

CREATE TYPE battery_status AS (
    timestamp TIMESTAMP,
    stored NUMERIC,
    import NUMERIC,
    export NUMERIC
);

CREATE FUNCTION battery_status(capacity NUMERIC, efficiency NUMERIC)
RETURNS SETOF battery_status AS $$

result = plpy.execute('SELECT * FROM summary')
energy = 0

for row in result:
    export_amount = row['export']
    if export_amount and energy < capacity:
        energy += export_amount
        if energy > capacity:
            export_amount = energy - capacity
            energy = capacity
        else:
            export_amount = 0
    import_amount = row['import']
    if import_amount and energy > 0:
        energy -= import_amount / efficiency
        if energy < 0:
            import_amount = -energy * efficiency
            energy = 0
        else:
            import_amount = 0
    yield (row['timestamp'], energy, import_amount, export_amount)

$$ LANGUAGE plpythonu STRICT IMMUTABLE;
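We can then call it like any other set-returning function, with the column names coming from the composite type. For instance, with the Powerwall figures from above:

SELECT * FROM battery_status(13500::numeric, 0.9) LIMIT 10;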

Again, I’ll reiterate that in this context we apply both charging and discharging in the same period (which could happen), but assume that there is zero turnaround time required between charge and discharge, and that there is no limit to the charge and discharge rates. We could probably improve the model to take those two into account fairly easily though.

To revisit what the other data stores look like, let’s have a look at my summary view:

CREATE OR REPLACE VIEW summary AS

SELECT "timestamp",
       COALESCE(generation.energy, 0::double precision) AS generation,
       COALESCE(import.energy, 0) AS import,
       COALESCE(export.energy, 0) AS export,
       import.energy::double precision
           + generation.energy
           - export.energy::double precision AS consumption
  FROM generation
  JOIN import USING ("timestamp")
  JOIN export USING ("timestamp")

My generation table just contains the generation for each interval: the import and export are both views that use the cumulative import/export totals and calculate the value by subtracting the previous row:

CREATE OR REPLACE VIEW import AS

SELECT import_cumulative."timestamp",
       import_cumulative.energy
           - lag(import_cumulative.energy, 1) OVER (ORDER BY import_cumulative."timestamp")
           AS energy
  FROM import_cumulative
 ORDER BY import_cumulative."timestamp";

We can create a materialised view of the battery status per-day, since that’s all we will need for the purposes of cost calculations:

CREATE MATERIALIZED VIEW daily_summary_battery AS

SELECT battery_status."timestamp"::date AS "timestamp",
       sum(battery_status.import) AS import,
       sum(battery_status.export) AS export
  FROM battery_status(13500::numeric, 0.9)
       battery_status("timestamp", stored, import, export)
 GROUP BY (battery_status."timestamp"::date)
 ORDER BY (battery_status."timestamp"::date);

So, let’s have a look at what results we get.

Refreshing the materialised view takes me around 9 seconds (this is quite stable), and I currently have about 106000 records (370 days of data, 288 records per day if we got them all). This is nice: it’s not immediate, but not going to be a deal-breaker if I need to run a bunch of different models of different battery capacity or efficiencies.

I can use the same cost functions from my actual cost calculator to look at what this battery model will do to my costs:

CREATE VIEW daily_battery_cost AS
SELECT daily_summary_battery."timestamp" AS date,
       daily_summary_battery.import,
       daily_summary_battery.export,
       cost(retailer.*,
            daily_summary_battery.import::integer,
            daily_summary_battery.export::integer) AS cost,
       billing_periods.period
  FROM daily_summary_battery
  JOIN billing_periods ON daily_summary_battery."timestamp" <@ billing_periods.period
  JOIN retailer ON retailer.name = billing_periods.retailer
 ORDER BY daily_summary_battery."timestamp";

And for comparisons in my billing periods:

CREATE VIEW billing_period_battery_costs AS
SELECT daily_battery_cost.period,
       sum(daily_battery_cost.cost) AS cost
  FROM daily_battery_cost
 GROUP BY daily_battery_cost.period
 ORDER BY daily_battery_cost.period;

CREATE VIEW billing_period_battery_savings AS
SELECT costs.period,
       costs.cost - battery.cost AS savings
  FROM billing_period_costs costs
  JOIN billing_period_battery_costs battery USING (period)
 ORDER BY costs.period;

And for a per-day saving: it works out to around $1.05 per day.

Which is ~$380 per year.

A Tesla Powerwall 2 battery would cost me (excluding installation, but including the SA battery rebate), in excess of $5000. That’s still almost a 15 year payback time.


What’s quite amusing is that I distinctly remember calculating what I thought a battery would save me shortly after getting my PV system installed. I have such a strong recollection of it that I can remember exactly where I was: I was out running at lunch time, and was partway up the hill at the back of Flinders Uni.

And my in-head calculations were that it would save me $1 per day.


I’m actually more interested in a DC coupled battery system, partly because of the reduction in round-trip efficiency losses, but more so because I should be able to stick more panels in that would just feed the battery. I’m especially interested in the Fronius Gen24 inverters and a compatible battery, however there aren’t clear details on what that would cost. The benefit of this is that in SA there is currently a 10kW inverter limit; this includes the inverter in an AC coupled battery like the Tesla Powerwall.

These are likely to come midway through this year. I’ll be interested to see the pricing.

I think I’ll need to do a bit more on my calculations to be able to model this: my panel_factor calculations have a 5kW limit - this goes away, and instead the limit applies after taking the battery charging into account. But this can wait for another day.

Random, Fixed Ordering and Pagination

Consider the following situation:

We have a set of items that we want to show in a random order. The order, however, should remain fixed for a period. The display of items will need to be paginated, and we have over 200,000 items.

If the ordering is not truly random, then we could have an expression index, and allow ordering based on that. However, that doesn’t really help out with the pagination, and issues around LIMIT/OFFSET ordering of very large sets.

Instead, I came up with a solution this afternoon that uses Postgres Materialised Views.

Let’s start with a Django model:

class Post(models.Model):
    title = models.TextField()
    content = models.TextField()
    posted_at = models.DateTimeField()

We can build up a materialised view that associates each post with a position:

CREATE MATERIALIZED VIEW post_ordering AS

SELECT post.id AS post_id,
       row_number() OVER () AS position
  FROM (
    SELECT id FROM blog_post ORDER BY random()
  ) post;

CREATE INDEX post_ordering_id ON post_ordering(post_id);
CREATE INDEX post_ordering_position ON post_ordering(position);

Because a materialised view stores a copy of the data, we need to index it if we want to get performance benefits.

This materialised view is interesting from the pagination perspective because there are no gaps in the position values, even if there are missing post_id values. This means we can use filtering to paginate, instead of having to use the regular slicing notation.

Do note that the random ordering needs to be done inside a subquery: row_number() OVER () is evaluated before the outer ORDER BY, so if they were at the same query level the rows would be numbered in their original order, not the shuffled one.

We do need to stick a model in front of this to access it from Django:

class PostPosition(models.Model):
    post = models.OneToOneField(
        Post,
        primary_key=True,
        related_name='ordering',
        on_delete=models.DO_NOTHING,
    )
    position = models.IntegerField()

    class Meta:
        managed = False
        db_table = 'post_ordering'

Now, because we have a relationship defined, we may filter using this:

page_2 = Post.objects.filter(
    ordering__position__gte=100,
    ordering__position__lt=200
).order_by('ordering__position')

From here, we just need to create a custom paginator to use this instead of the normal slicing.

class PositionPaginator(django.core.paginator.Paginator):
    def page(self, number):
        number = self.validate_number(number)
        bottom = (number - 1) * self.per_page
        top = bottom + self.per_page
        if top + self.orphans >= self.count:
            top = self.count
        # Positions from row_number() start at 1, so use gt/lte here
        # rather than the gte/lt you would use for zero-based slicing.
        return self._get_page(self.object_list.filter(
            ordering__position__gt=bottom,
            ordering__position__lte=top
        ).order_by('ordering__position'), self)

The last piece of the puzzle is getting the refresh of the ordering. In postgres, you just need:

REFRESH MATERIALIZED VIEW post_ordering;
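If you don’t want readers blocking while the ordering is rebuilt, REFRESH MATERIALIZED VIEW … CONCURRENTLY may be worth a look; a sketch (it requires a unique index on the materialised view):

-- CONCURRENTLY needs at least one unique index on the materialised view.
CREATE UNIQUE INDEX post_ordering_post_id_uniq ON post_ordering (post_id);

REFRESH MATERIALIZED VIEW CONCURRENTLY post_ordering;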

Postgres Generated Columns

A little while ago, I wrote about creating a nice way to have a Django ComputedField. It is pretty neat, except it needs to do some black magic to sniff up the stack to work around a limitation in the way a Ref/Col works in Django.

The way it works is that you define the expression in Python, and it evaluates it in the database, allowing you to query based on this, and have it automatically annotated on.

What it doesn’t do, however, is actually store that value in the database. Indeed, if you are actually querying on this column, you’d probably want to have a functional index that uses the same expression, so that the database can do a reasonable job of improving query times on that column.
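For example, if we were filtering on the full name expression from earlier, a functional index over the same expression would be the thing to reach for (a sketch: blog_person is a hypothetical table name, and the expression must match exactly what Django generates, so check EXPLAIN):

-- An expression index, so filtering on the computed value can use an index.
CREATE INDEX person_name_idx
    ON blog_person ((first_name || ' ' || last_name));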

New in Postgres 12 is a feature that really piqued my interest: Generated Columns.

These are basically what the ComputedField does, but at the database level. And, instead of it being an expression that is evaluated at query time, it is instead an expression that is evaluated at write time, and stored in an actual column (that could then have an index applied to it).

Let’s have a look at an example:

CREATE TABLE person (
  person_id integer PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
  first_name TEXT,
  last_name TEXT,
  full_name TEXT GENERATED ALWAYS AS (
    COALESCE(first_name, '') || ' ' || COALESCE(last_name, '')
  ) STORED
);

Again, I’m aware I’m failing to note at least one of the falsehoods programmers believe about names.

Notes about this:

  • I’ve used the similar (preferred) syntax for generating the primary key.
  • You must have the keyword STORED at the end of the column definition: or more specifically, the syntax must be <column> <type> GENERATED ALWAYS AS (<expression>) STORED.
  • You may only refer to other columns within the same row: similar to how a functional index would work.
  • You may not refer to other generated columns: that would likely require parsing the expressions to determine which one to calculate first. I’d love to see postgres implement that at some point though!

So, let’s have a look at that with some data:

INSERT INTO person (first_name, last_name)
VALUES
    ('alice', 'aardvark'),
    ('bob', 'burger'),
    ('chuck', NULL),
    (NULL, 'darris');

And when we query it:

SELECT * FROM person;
 person_id │ first_name │ last_name │   full_name
───────────┼────────────┼───────────┼────────────────
         1 │ alice      │ aardvark  │ alice aardvark
         2 │ bob        │ burger    │ bob burger
         3 │ chuck      │ <NULL>    │ chuck
         4 │ <NULL>     │ darris    │  darris
(4 rows)

Oh, bother. We didn’t want the space before ‘darris’ (or the one you can’t see, after ‘chuck’). We’ll have to fix that in a sec.

So, what happens when we try to write to the full_name column?

UPDATE person SET first_name = 'dave', full_name='foo' WHERE first_name IS NULL;
ERROR:  column "full_name" can only be updated to DEFAULT
DETAIL:  Column "full_name" is a generated column.

Okay, that’s nice to know. If the error was ignored, we could have just used a custom django field and ignored the value, but we’ll need something similar to how ComputedField prevents writing values. I’ll have to investigate that further.

But, back onto the fact I forgot to trim any leading or trailing spaces. It turns out that there is no way to alter the expression that is being used in a generated column. Which, when you think a little more about it, sort-of makes sense: at the very least, changing the expression would mean writing a new value to every row where the new value differed from the old one.

Instead, you need to drop the column, and re-add it with the correct expression. You’ll almost certainly want to do this in a transaction:

BEGIN;
ALTER TABLE person DROP COLUMN full_name;
ALTER TABLE person ADD COLUMN full_name TEXT
      GENERATED ALWAYS AS (TRIM(
        COALESCE(first_name, '') || ' ' ||
        COALESCE(last_name, '')
      )) STORED;
COMMIT;

And now we can query our table again:

SELECT * FROM person;
 person_id │ first_name │ last_name │   full_name
───────────┼────────────┼───────────┼────────────────
         1 │ alice      │ aardvark  │ alice aardvark
         2 │ bob        │ burger    │ bob burger
         3 │ chuck      │ <NULL>    │ chuck
         4 │ <NULL>     │ darris    │ darris
(4 rows)

Sweet.

Opening Hours Redux

A few years ago, I wrote up some stuff about Postgres Composite Types in Django. Holy cow, that appears to be 5 years ago.

Anyway, it’s come up a bit recently on #postgresql on IRC, and I thought I might expand a little on how I’m currently using that concept, and some ideas that could be used to do more.

The composite type itself is quite straightforward: we store two values representing the opening time, and then the length of time that the business is open. This allows us to model things that go over midnight without having to worry about a bunch of checks about (start > finish), and whatever that means.

CREATE TYPE open_period AS (
  start TIME,
  length INTERVAL
);

We could have used a DOMAIN type to limit the length to less than or equal to 24 hours, however I’ll omit that for now.
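If we did want that, it might look something like this (a sketch only; the rest of this post sticks with a plain INTERVAL):

CREATE DOMAIN open_length AS INTERVAL
  CHECK (VALUE > INTERVAL '0' AND VALUE <= INTERVAL '24 hours');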

From there, we can use the new type wherever we would use any other type: including in an array.

CREATE TABLE stores (
  store_id SERIAL PRIMARY KEY,
  name TEXT,
  default_opening_hours open_period[7]
);

Nothing new here since the last post.

However, let’s look at coming up with a mechanism that prevents subsequent days from overlapping with one another. Since we have all of these in an array, we can write a single function that ensures the values are acceptable together. There are a couple of different approaches we could use. One would be to “materialise” the open periods, and then compare them to one another.

CREATE OR REPLACE FUNCTION materialise(open_period, DATE)
RETURNS TSRANGE AS $$

  SELECT TSRANGE(
    ($2 || 'T' || $1.start || 'Z')::TIMESTAMP,
    ($2 || 'T' || $1.start || 'Z')::TIMESTAMP + $1.length
  );

$$ LANGUAGE SQL STRICT IMMUTABLE;



CREATE OR REPLACE FUNCTION materialise(open_period)
RETURNS TSRANGE AS $$

  SELECT materialise($1, '1979-01-01'::DATE);

$$ LANGUAGE SQL STRICT IMMUTABLE;

We have a version there that takes a specific day, but also one that just uses the epoch date. That may be useful later…

…but right now we want to be able to apply subsequent days to each item in the array, and then look for overlaps.

WITH default_opening_hours AS (
  SELECT UNNEST(ARRAY[
    ('09:00', '08:00')::open_period,  -- Monday, but we won't really use that today.
    ('09:00', '08:00')::open_period,
    ('09:00', '08:00')::open_period,
    ('09:00', '12:00')::open_period,
    ('09:00', '08:00')::open_period,
    ('10:00', '07:00')::open_period,
    ('11:00', '06:00')::open_period
  ]) AS hours
), materialised_opening_hours AS (
  SELECT materialise(hours, (now() + INTERVAL '1 day' * row_number() OVER ())::DATE) AS hours
    FROM default_opening_hours
), overlapping_periods AS (
  SELECT hours && LEAD(hours, 1) OVER () AS overlap
    FROM materialised_opening_hours
)
SELECT * FROM overlapping_periods WHERE overlap;

We don’t (at this point in time) really mind whether the weekday that each open period refers to is actually the correct weekday: we just need to ensure that we have 7 consecutive days, with the sequence of open_periods materialised to the correct value based on the offset from the first day.

This is pretty close: it will find any overlaps between days, except for if the finish of the last day overlaps with the start of the next day. We can cheat a little to make that work:

WITH default_opening_hours AS (
  SELECT UNNEST(ARRAY[
    ('09:00', '08:00')::open_period,
    ('09:00', '08:00')::open_period,
    ('09:00', '08:00')::open_period,
    ('09:00', '12:00')::open_period,
    ('09:00', '08:00')::open_period,
    ('10:00', '07:00')::open_period,
    ('11:00', '06:00')::open_period
  ]) AS hours
), materialised_opening_hours AS (
  SELECT materialise(hours, (now() + INTERVAL '1 day' * row_number() OVER ())::DATE) AS hours
    FROM default_opening_hours

   UNION ALL

  SELECT materialise((SELECT hours FROM default_opening_hours LIMIT 1),
                     (now() + INTERVAL '8 days')::DATE
  )
), overlapping_periods AS (
  SELECT hours && LEAD(hours, 1) OVER () AS overlap
    FROM materialised_opening_hours
)
SELECT * FROM overlapping_periods WHERE overlap;

Let’s put a couple of values in there to see that the overlaps are detected:

WITH default_opening_hours AS (
  SELECT UNNEST(ARRAY[
    ('09:00', '08:00')::open_period,
    ('09:00', '08:00')::open_period,
    ('09:00', '28:00')::open_period,
    ('09:00', '12:00')::open_period,
    ('09:00', '08:00')::open_period,
    ('10:00', '07:00')::open_period,
    ('11:00', '24:00')::open_period
  ]) AS hours
), materialised_opening_hours AS (
  SELECT materialise(hours, (now() + INTERVAL '1 day' * row_number() OVER ())::DATE) AS hours
    FROM default_opening_hours

   UNION ALL

  SELECT materialise((SELECT hours FROM default_opening_hours LIMIT 1),
                     (now() + INTERVAL '8 days')::DATE)
), overlapping_periods AS (
  SELECT hours && LEAD(hours, 1) OVER () AS overlap
    FROM materialised_opening_hours
)
SELECT * FROM overlapping_periods WHERE overlap;
 overlap
─────────
 t
 t
(2 rows)

Now, we can bundle this up into a function that we can then use in a CHECK CONSTRAINT (as we cannot use a subquery directly in a check constraint):

CREATE OR REPLACE FUNCTION find_subsequent_day_overlaps(open_period[])
RETURNS BOOLEAN AS $$
  SELECT NOT EXISTS (
      WITH materialised_opening_hours AS (
        SELECT materialise(hours, (now() + INTERVAL '1 day' * row_number() OVER ())::DATE) AS hours
          FROM unnest($1) hours

         UNION ALL

        SELECT materialise($1[1], (now() + INTERVAL '8 days')::DATE)
      ), overlapping_periods AS (
        SELECT hours && LEAD(hours, 1) OVER () AS overlap FROM materialised_opening_hours
      )
      SELECT * FROM overlapping_periods WHERE overlap
    )
$$ LANGUAGE SQL STRICT IMMUTABLE;

ALTER TABLE stores
ADD CONSTRAINT prevent_default_opening_hours_overlap
CHECK (find_subsequent_day_overlaps(default_opening_hours));

And, now to check:

INSERT INTO stores (name, default_opening_hours) VALUES
(
  'John Martins',
  ARRAY[
    ('09:00', '08:00')::open_period,
    ('09:00', '08:00')::open_period,
    ('09:00', '08:00')::open_period,
    ('09:00', '12:00')::open_period,
    ('09:00', '08:00')::open_period,
    ('10:00', '07:00')::open_period,
    ('11:00', '06:00')::open_period
  ]
);

And with invalid data:

INSERT INTO stores (name, default_opening_hours) VALUES (
  'Foo',
  ARRAY[('09:00', '08:00')::open_period,
        ('09:00', '08:00')::open_period,
        ('09:00', '08:00')::open_period,
        ('09:00', '12:00')::open_period,
        ('09:00', '08:00')::open_period,
        ('10:00', '07:00')::open_period,
        ('11:00', '24:00')::open_period]);

…which throws an exception:

ERROR:  new row for relation "stores" violates check constraint "prevent_default_opening_hours_overlap"
DETAIL:  Failing row contains (2, Foo, {"(09:00:00,08:00:00)","(09:00:00,08:00:00)","(09:00:00,08:00:00...).

Righto, what other things might we want to do with these composite types?

Some businesses have a concept of “Day Parts”, for instance, within a single day we may want to look at a sub-set of that day. For instance, sales during Breakfast may have a different set of Key Performance Indicators than those during Lunch or Tea. So, we may want to store something like:

+------------+------------+-------------+
| Day Period | Start time | Finish time |
+============+============+=============+
| Breakfast  |    06:00   |     10:00   |
| Lunch      |    11:00   |     14:00   |
| Tea        |    16:00   |     21:00   |
+------------+------------+-------------+

Again, it might make sense to store these as an open_period instead, because they could go over midnight. We’ll also want the name to be unique per store, but that’s something we can do with a plain old unique index:

CREATE TABLE day_parts (
  day_part_id SERIAL PRIMARY KEY,
  store_id INTEGER REFERENCES stores(store_id),
  name TEXT,
  period OPEN_PERIOD
);
CREATE UNIQUE INDEX distinct_name_per_day_period ON day_parts (store_id, name);

We can use an exclusion constraint to prevent overlaps, however you may need to enable support first:

CREATE EXTENSION btree_gist;

Now, let’s see the exclusion constraint:

ALTER TABLE day_parts
ADD CONSTRAINT prevent_overlapping_day_parts
EXCLUDE USING gist(
  materialise(period) WITH &&,
  store_id WITH =
);

Turns out that is actually easier to implement than the values in the array!


The other thing we may want to do is annotate the Day Period onto an object of some sort. To do this we will need to materialise all of the day periods for the given day(s), and see which one of them our timestamp falls within. We will expand on a couple of things here: specifically, we now need a timezone within which our store is located. To make things easier to follow, we will write all of the DDL anew. This is partly because this example will not use the concept of default opening hours.

CREATE TABLE stores (
  store_id SERIAL PRIMARY KEY,
  name TEXT UNIQUE NOT NULL,
  timezone TEXT NOT NULL CHECK (now() AT TIME ZONE timezone IS NOT NULL)
  -- Note we validate that this column contains a valid timezone by
  -- attempting to coerce now() to that timezone: this will report
  -- back an error if the timezone name is not recognised.
);

CREATE TABLE day_parts (
  day_part_id SERIAL PRIMARY KEY,
  store_id INTEGER REFERENCES stores (store_id),
  name TEXT,
  period OPEN_PERIOD,
  CONSTRAINT prevent_overlapping_day_parts EXCLUDE USING gist(
    materialise(period) WITH &&,
    store_id WITH =
  )
);

CREATE UNIQUE INDEX distinct_name_per_day_period ON day_parts(store_id, name);

CREATE TABLE transactions (
  transaction_id SERIAL PRIMARY KEY,
  store_id INTEGER REFERENCES stores (store_id),
  timestamp TIMESTAMPTZ,
  amount NUMERIC
);

And now add some data:

INSERT INTO stores (name, timezone)
     VALUES ('John Martins', 'Australia/Adelaide');

INSERT INTO day_parts (store_id, name, period)
     VALUES (1, 'Morning',   ('09:00', '02:00')),
            (1, 'Lunch',     ('11:00', '03:00')),
            (1, 'Afternoon', ('14:00', '03:00')),
            (1, 'Evening',   ('17:00', '04:00'));


INSERT INTO transactions (store_id, timestamp, amount)
     VALUES (1, '2019-05-27T01:25:22', '33.77'),
            (1, '2019-05-27T04:33:47', '724.75'),
            (1, '2019-05-27T06:00:42', '47.48'),
            (1, '2019-05-27T08:33:12', '3.44');

The first thing we want to do is show each transaction at the local time in the store when it was completed:

SELECT transactions.*,
       transactions.timestamp AT TIME ZONE stores.timezone AS local_time
  FROM transactions
 INNER JOIN stores USING (store_id)
 transaction_id │ store_id │       timestamp        │ amount │     local_time
              1 │        1 │ 2019-05-27 01:25:22+00 │  33.77 │ 2019-05-27 10:55:22
              2 │        1 │ 2019-05-27 04:33:47+00 │ 724.75 │ 2019-05-27 14:03:47
              3 │        1 │ 2019-05-27 06:00:42+00 │  47.48 │ 2019-05-27 15:30:42
              4 │        1 │ 2019-05-27 08:33:12+00 │   3.44 │ 2019-05-27 18:03:12

Next, we want to annotate on which day part corresponds to that local time:

SELECT trans.*,
       day_part.name AS day_part
  FROM (
    SELECT transactions.*,
           transactions.timestamp AT TIME ZONE stores.timezone AS local_time
      FROM transactions
     INNER JOIN stores USING (store_id)
  ) trans
  LEFT OUTER JOIN LATERAL (
    SELECT materialise(day_parts.period, trans.local_time::DATE) AS day_part,
           day_parts.name
      FROM day_parts
     WHERE day_parts.store_id = trans.store_id
  ) day_part ON (day_part @> local_time)
 transaction_id │ store_id │       timestamp        │ amount │     local_time      │ day_part
────────────────┼──────────┼────────────────────────┼────────┼─────────────────────┼───────────
              1 │        1 │ 2019-05-27 01:25:22+00 │  33.77 │ 2019-05-27 10:55:22 │ Morning
              2 │        1 │ 2019-05-27 04:33:47+00 │ 724.75 │ 2019-05-27 14:03:47 │ Afternoon
              3 │        1 │ 2019-05-27 06:00:42+00 │  47.48 │ 2019-05-27 15:30:42 │ Afternoon
              4 │        1 │ 2019-05-27 08:33:12+00 │   3.44 │ 2019-05-27 18:03:12 │ Evening

From there, we could look at aggregation within day parts, or comparisons between different days, but only the same day part.


Those of you paying attention may notice that I used TSRANGE instead of TSTZRANGE in the materialise functions. Can we look at a version of these functions that accepts a timezone as well as a date (and open_period), and gives back a TSTZRANGE?

CREATE OR REPLACE FUNCTION materialise(open_period, DATE, timezone TEXT)
RETURNS TSTZRANGE AS $$

  SELECT TSTZRANGE(
    ($2 || 'T' || $1.start)::TIMESTAMP AT TIME ZONE timezone,
    (($2 || 'T' || $1.start)::TIMESTAMP + $1.length) AT TIME ZONE timezone
  );

$$ LANGUAGE SQL STRICT IMMUTABLE;

Now we can rewrite our last query:

SELECT transactions.*,
       day_part.name AS day_part
  FROM transactions
  LEFT OUTER JOIN LATERAL (
    SELECT materialise(day_parts.period, transactions.timestamp::DATE, stores.timezone) AS day_part,
           day_parts.name
      FROM day_parts
      INNER JOIN stores USING (store_id)
     WHERE day_parts.store_id = transactions.store_id
  ) day_part ON (day_part.day_part @> transactions.timestamp)
 transaction_id │ store_id │       timestamp        │ amount │ day_part
              1 │        1 │ 2019-05-27 01:25:22+00 │  33.77 │ Morning
              2 │        1 │ 2019-05-27 04:33:47+00 │ 724.75 │ Afternoon
              3 │        1 │ 2019-05-27 06:00:42+00 │  47.48 │ Afternoon
              4 │        1 │ 2019-05-27 08:33:12+00 │   3.44 │ Evening

I think this might make per-day aggregation a bit harder, though, because you’d still need to get the “local” timestamp to put transactions on the same day; then again, that’s actually part of materialising the store’s full open period anyway.

Too many rows!

We had an interesting problem at work today.

It seems that the sequence on one of our tables had exceeded 2³¹ (2147483648), and since the primary key was a SERIAL column, this was problematic. From Numeric Types, we can see that only 4 bytes were used. Not enough.

This was presenting some problems, but it was limited to two aspects of the system, neither of which meant that it was worth bringing down the rest of the system to fix it.

Since the obvious fix would have resulted in downtime of somewhere between 20 minutes and an hour, we discarded that:

ALTER TABLE big_problem_here
ALTER COLUMN id TYPE BIGINT;

We tried that on our staging database, which had far fewer rows. That took 20 minutes to rewrite the table, during which time the entire database was essentially out of order.

Instead, we came up with a different solution:

Create a new table, which is identical to the other table (including using the same sequence: this is very important), except has the bigger integer type:

CREATE TABLE big_problem_here_fixed (
  id BIGINT NOT NULL PRIMARY KEY DEFAULT nextval('big_problem_here_id_seq'::regclass),
  user_id INTEGER NOT NULL,
  ...
);

ALTER TABLE big_problem_here_fixed
ADD CONSTRAINT user_id_refs_id_6ccf0120
FOREIGN KEY (user_id) REFERENCES auth_user (id)
DEFERRABLE INITIALLY DEFERRED;

CREATE INDEX big_problem_here_fixed_user_id
ON big_problem_here_fixed(user_id);

Then, we can copy the data from the old table into the new one. This is safe, because we can’t have any new rows inserted into the old table at the moment anyway, as all writes to it occur in a transaction, and there are no cases (other than a celery task, which only runs late at night) where an update or delete is not accompanied by at least one new row.

If this happens to you: you would need to ensure that there are not any rows being updated or deleted whilst you are doing the copy, otherwise you would lose those changes.

INSERT INTO big_problem_here_fixed SELECT * FROM big_problem_here;

This part took about an hour. I’m not sure if it took longer than the staging rewrite because there is more to do in this case, or just because there is more data.
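As an aside, if you did need to guard against concurrent updates or deletes during the copy, taking an explicit lock in the same transaction as the INSERT would do it; a sketch (note this blocks writers for the duration of the copy, which is the downtime we were trying to avoid):

BEGIN;
  -- EXCLUSIVE mode still allows reads, but blocks all writes to the table.
  LOCK TABLE big_problem_here IN EXCLUSIVE MODE;
  INSERT INTO big_problem_here_fixed SELECT * FROM big_problem_here;
COMMIT;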

Finally, the last part. We can rename both tables in a single transaction, so there won’t be any errors from missing tables between when we rename the first and the second.

BEGIN;
  ALTER TABLE big_problem_here RENAME TO big_problem_here_replaced;
  ALTER TABLE big_problem_here_fixed RENAME TO big_problem_here;
COMMIT;

Orders please!

Things are better when they are ordered.

Why might we want things ordered?

From a user experience perspective, if things are not ordered the same way upon repeated viewings, they are liable to get confused, and think that things have changed, when really it’s only the order that has changed.

Some time ago, there were versions of Microsoft Office that would move menu items closer to the start of the menu if they had been used more recently and more frequently. This was not “most recent documents”, or anything like that: this was menu items like “Format” and other tools. There’s a pretty good rationale for why they stopped doing that!

Why might we need things ordered?

There are a few reasons why we must have things ordered. One good example is the reason Django will complain about a queryset without ordering when you use pagination: if you select the first 10 items when there is no ordering, and then select the next 10 items, there is no guarantee that those items will not have some of the same items as the first set. Without ordering, pagination is useless. There’s no way to page through items and ensure that you see all of them.

Another reason is where you need to process things in a specific order: either to ensure that the first (or last) matching item is “the winner”. We’ll come across one of those shortly.

Ordering by a specific value or set of values (or function over the same) is usually a sane way to order. This could be alphabetically, by date of insertion, or some numeric value, for instance.

How should we order things?

Sometimes, it is obvious how we should order a list of things. Sometimes the order doesn’t matter, and sometimes we want to allow an ordering that doesn’t rely at all on any attributes of the objects in question.

Let’s look at a list of condition names that might be used for a workplace agreement in Australia.

  • Regular Hours
  • Public Holiday
  • Saturday
  • Sunday
  • After 6pm
  • Regular Overtime (first 2 hours)
  • Extended Overtime (more than 2 hours)

For those outside of Australia: under many work circumstances, employees will get paid at a higher rate because of what are called “Penalty” conditions. This could include being paid, for example, a 15% loading on a Saturday, and a 25% loading on a Sunday. We can rewrite this list with the loading percentages:

+-------------------+---------+----------------------------------------------------------------+
| Condition         | Penalty | Rule                                                           |
+===================+=========+================================================================+
| Regular Hours     | 100 %   | When no other rules apply                                      |
| Public Holiday    | 200 %   | Any hours on a public holiday                                  |
| Saturday          | 125 %   | Any hours on a Saturday                                        |
| Sunday            | 150 %   | Any hours on a Sunday                                          |
| After 6pm         | 117 %   | Any hours between 6pm and midnight                             |
| Regular Overtime  | 150 %   | After 8 hours worked in a day or 38 hours in a week            |
| Extended Overtime | 200 %   | After 2 hours of overtime worked, or any overtime on a Sunday  |
+-------------------+---------+----------------------------------------------------------------+

So now we have two possible ways of ordering our list, by name:

  • After 6pm
  • Extended Overtime (more than 2 hours)
  • Public Holiday
  • Regular Hours
  • Regular Overtime (first 2 hours)
  • Saturday
  • Sunday

Or by penalty rate:

  • 100% Regular Hours
  • 117% After 6pm
  • 125% Saturday
  • 150% Regular Overtime (first 2 hours)
  • 150% Sunday
  • 200% Extended Overtime (more than 2 hours)
  • 200% Public Holiday

However, there’s actually a little more complexity than this. We could keep the rules in this order, apply them sequentially, and mostly get the right result for an employee’s wages. But, according to our rules, we should be applying Extended Overtime for any overtime hours on a Sunday, and that precedence doesn’t fall naturally out of ordering by name or by penalty rate. So we need some way of controlling the ordering manually.

Specifically, we want our conditions ordered in a way that lets us check each rule in turn, ensuring that if multiple rules match, the last matching rule is the one that applies (there’s a small sketch of this after the list below).

  • Regular hours
  • After 6pm
  • Saturday
  • Sunday
  • Regular Overtime
  • Extended Overtime
  • Public Holiday
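
By way of illustration, “check each rule in turn, and the last match wins” is just a loop (this helper and its matches() predicate are hypothetical, not from the original):

def penalty_for(conditions, hours):
    """Return the penalty of the last matching condition.

    `conditions` must already be in priority order; matches() is a
    hypothetical predicate deciding whether a rule applies to these hours.
    """
    penalty = None
    for condition in conditions:
        if condition.matches(hours):
            penalty = condition.penalty
    return penalty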

So, how do we actually store this custom ordering?

The naive solution is to store an integer for each item, and ensure that we only have one of each value. Let’s look at that as a Django model:

from django.db import models


class Condition(models.Model):
    name = models.TextField()
    # DecimalField needs max_digits/decimal_places; these are arbitrary choices.
    penalty = models.DecimalField(max_digits=5, decimal_places=2)
    position = models.IntegerField(unique=True)

Then, we can ensure we get our conditions in the correct order by applying an ordering to our queryset:

queryset = Condition.objects.order_by('position')

However, this solution is not without its drawbacks. Primarily, to insert an object between two others, we must renumber all of the subsequent items:

from decimal import Decimal
from django.db.models import F

condition = Condition(name='New condition', position=3, penalty=Decimal('1.20'))
Condition.objects.filter(position__gte=condition.position).update(position=F('position') + 1)
condition.save()

We must either ensure that our database can defer the unique constraint until our transaction is complete, or perform the updates in an order that never violates it. If we care about not having “holes” between our positions, then we also need to renumber the remaining objects when we remove one.
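
On Postgres, one way to get that “check at commit time” behaviour is a deferrable unique constraint rather than unique=True on the field. A sketch of how the model might look (this needs Django 3.1+ for deferrable constraints; the constraint name is made up):

from django.db import models
from django.db.models import Deferrable, UniqueConstraint


class Condition(models.Model):
    name = models.TextField()
    penalty = models.DecimalField(max_digits=5, decimal_places=2)
    position = models.IntegerField()

    class Meta:
        constraints = [
            # The uniqueness check only runs at COMMIT, so the renumbering
            # UPDATE above can pass through duplicate intermediate states
            # inside a transaction.
            UniqueConstraint(
                fields=['position'],
                name='condition_position_unique',
                deferrable=Deferrable.DEFERRED,
            ),
        ]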

Let’s have a look at how we could handle this, with some client and server code. We can use a model formset here, with ordering enabled, and then we just need the JS to make that happen:

from django.forms import modelformset_factory
from django.views.generic.edit import FormView

ConditionPositionFormSet = modelformset_factory(Condition, can_order=True, fields=())


class OrderConditionsView(FormView):
    form_class = ConditionPositionFormSet

This will result in rewriting all items one by one, which must happen in a transaction, or in the correct order.
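
Fleshing that out a little, the view’s form_valid might look something like this (a sketch only: the template name is hypothetical, and it assumes the unique constraint on position is deferred, or absent, since every row gets renumbered):

from django.db import transaction


class OrderConditionsView(FormView):
    form_class = ConditionPositionFormSet
    template_name = 'condition/order.html'  # hypothetical
    success_url = '.'

    def form_valid(self, formset):
        # can_order=True gives us ordered_forms once the formset validates.
        with transaction.atomic():
            for index, form in enumerate(formset.ordered_forms):
                if form.instance.pk:  # skip any blank "extra" forms
                    form.instance.position = index
                    form.instance.save(update_fields=['position'])
        return super().form_valid(formset)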


Alternatively, we could use a form that just indicates the new position of a given condition, and have that form handle the UPDATE queries: this will result in just two queries instead.

import json

from django import forms
from django.db.models import F
from django.http import HttpResponse
from django.views.generic.edit import UpdateView


class ConditionPositionForm(forms.ModelForm):
    class Meta:
        model = Condition
        fields = ('position',)


class OrderConditionsView(UpdateView):
    """
    Handle an AJAX request to move one condition to a new position.

    The condition in that position will be moved to the next position, and so on.
    """
    form_class = ConditionPositionForm

    def form_valid(self, form):
        position = form.cleaned_data['position']
        # Make room: shuffle every other condition at or after the target
        # position along by one, then save the moved condition into the gap.
        self.get_queryset().filter(position__gte=position)\
                           .exclude(pk=form.instance.pk)\
                           .update(position=F('position') + 1)
        form.save()
        return HttpResponse(json.dumps({'status': 'ok'}))

    def form_invalid(self, form):
        return HttpResponse(json.dumps({
            'status': 'error',
            'message': list(form['position'].errors),
        }), status=409)

    def get_queryset(self):
        return Condition.objects.order_by('position')

# urls.py: simplified

from django.urls import include, path

condition_patterns = ([
    path('<int:pk>/position/', OrderConditionsView.as_view(), name='position'),
    # ...
], 'condition')

urlpatterns = [
    path('condition/', include(condition_patterns, namespace='condition')),
    # ...
]

In this case, our HTML and JS can just send back the one that was moved, and its new position. We’ll want some mechanism for indicating whether the move was successful though, because it may not be permitted to move some objects to specific positions.

  <ul id="conditions">
    {% for condition in object_list %}
      <li data-position-url="{% url 'condition:position' condition.pk %}">
        {{ condition.name }} ({{ condition.penalty }}%)
      </li>
    {% endfor %}
  </ul>

  <script>
  // Uses SortableJS: https://github.com/SortableJS/Sortable

  require(['Sortable'], (Sortable) => {
    function save(evt) {
      const condition = evt.item;

      function revert() {
        const children = condition.parentElement.children;
        if (evt.oldIndex < evt.newIndex) {
          // The element now at the old index used to sit just after this one:
          // put the moved condition back in front of it.
          children[evt.oldIndex].before(condition);
        } else {
          // The element now at the old index used to sit just before this one:
          // put the moved condition back after it.
          children[evt.oldIndex].after(condition);
        }
      }

      // A plain object can't be used as a fetch() body: FormData gives us a form-
      // encoded POST Django can parse. (A real app would also send the CSRF token.)
      const body = new FormData();
      body.append('position', evt.newIndex);

      fetch(condition.dataset.positionUrl, {
        method: 'POST',
        body: body,
        credentials: 'same-origin'
      }).then(
        response => response.json()
      ).then(
        response => {
          if (response.status === 'error') {
            // We got an error: let's put the object back into the previous
            // position and then show that error.
            revert();
            // Maybe a nicer way than this though!
            alert(response.message);
          }
        }
      );
    }

    new Sortable(document.getElementById('conditions'), {onUpdate: save});

  });
  </script>

So, are there any other ways we could store the positions of our conditions, which would allow us to insert a new one, or move one around, without having to renumber a bunch of other items?

Why yes, yes there are.

Instead of storing each position as an integer, we can instead store two numbers, and use the ratio of these two numbers as our ordering:

from django.db import models

# Multiplying by 1.0 forces floating point division: with two integer columns,
# Postgres would otherwise perform integer division and truncate the result.
POSITION = models.ExpressionWrapper(
    models.F('position_a') * 1.0 / models.F('position_b'),
    output_field=models.FloatField(),
)


class PositionQuerySet(models.QuerySet):
    def with_position(self):
        return self.annotate(position=POSITION)

    def order_by_position(self):
        return self.order_by(POSITION)


class Condition(models.Model):
    name = models.TextField()
    penalty = models.DecimalField(max_digits=5, decimal_places=2)
    position_a = models.IntegerField()
    position_b = models.IntegerField()

    objects = PositionQuerySet.as_manager()

    class Meta:
        unique_together = (
            ('position_a', 'position_b'),  # In fact, this is not quite strict enough, more below.
        )

The logic behind this is that there is always a fraction that lies between any two other fractions. Our initial “first” item can be given the values (1, 1), which will evaluate to 1.0.

To place another item, we have three cases:

  • The new value is before the first item in the list.
  • The new value is between two existing items.
  • The new value is after the last item in the list.

Let’s look at how we can calculate the position for the newly added item:

  • Before the first item in the list:

    Increment the denominator (position_b), and use the same numerator (position_a).

    For instance, adding a new first value where our current first value is (1, 1) would give us (1, 2), which evaluates to 0.5, less than 1.0

  • After the last item in the list:

    Increment the numerator (position_a), and use the same denominator (position_b).

    For instance, adding a new value where the last value is (5, 1) would give us (6, 1), which evaluates to 6.0, greater than 5.0

  • Inserting an item between two existing items:

    Use the sum of the two numerators (position_a) as the numerator, and the sum of the two denominators (position_b) as the new denominator. Optionally, you can then simplify this fraction if there are common factors.

    For instance, if we have two values (1, 4) and (5, 1), we can make a new value (6, 5), which evaluates to 1.2, which is between 0.25 and 5.0.

The same operations can be performed on a move: we don’t actually care what the values are, just that the relative ordering stays consistent. Whenever we move an object, we just change its position_a and/or position_b to whatever makes sense for the new position.
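
The arithmetic is small enough to sketch in a few lines of Python (these helpers are illustrative rather than anything from the original; each position is a (numerator, denominator) pair):

def position_before(first):
    # Same numerator, bigger denominator: a / (b + 1) < a / b.
    a, b = first
    return (a, b + 1)


def position_after(last):
    # Bigger numerator, same denominator: (a + 1) / b > a / b.
    a, b = last
    return (a + 1, b)


def position_between(lower, upper):
    # The mediant of two fractions always lies strictly between them.
    return (lower[0] + upper[0], lower[1] + upper[1])


assert position_before((1, 1)) == (1, 2)            # 0.5 < 1.0
assert position_after((5, 1)) == (6, 1)             # 6.0 > 5.0
assert position_between((1, 4), (5, 1)) == (6, 5)   # 0.25 < 1.2 < 5.0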

<ul id="conditions">
  {% for condition in object_list %}
    <li class="condition"
        data-position-url="{% url 'condition:position' condition.pk %}"
        data-position-a="{{ condition.position_a }}"
        data-position-b="{{ condition.position_b }}">
      {{ condition.name }}
    </li>
  {% endfor %}
</ul>

<script>
  require(['Sortable'], (Sortable) => {
    function save(evt) {
      const item = evt.item;
      const prev = item.previousElementSibling;
      const next = item.nextElementSibling;

      // Note: data-position-a / data-position-b are exposed on the dataset
      // as positionA / positionB.
      if (prev == null && next == null) {
        // This is the only item in the list.
        item.dataset.positionA = 1;
        item.dataset.positionB = 1;
      } else if (prev == null) {
        // This is now the first item in the list.
        item.dataset.positionA = next.dataset.positionA;
        item.dataset.positionB = Number(next.dataset.positionB) + 1;
      } else if (next == null) {
        // This is now the last item in the list.
        item.dataset.positionA = Number(prev.dataset.positionA) + 1;
        item.dataset.positionB = prev.dataset.positionB;
      } else {
        // Neither the first nor the last: use the mediant of the neighbours.
        item.dataset.positionA = Number(prev.dataset.positionA) + Number(next.dataset.positionA);
        item.dataset.positionB = Number(prev.dataset.positionB) + Number(next.dataset.positionB);
      }

      function revert() {
        const children = item.parentElement.children;
        if (evt.oldIndex < evt.newIndex) {
          // The element now at the old index used to sit just after this one.
          children[evt.oldIndex].before(item);
        } else {
          // The element now at the old index used to sit just before this one.
          children[evt.oldIndex].after(item);
        }
      }

      // This could do with some error handling...
      const body = new FormData();
      body.append('position_a', item.dataset.positionA);
      body.append('position_b', item.dataset.positionB);

      fetch(item.dataset.positionUrl, {
        method: 'POST',
        body: body,
        credentials: 'same-origin'
      }).then(response => response.json()).then(
        response => {
          if (response.status === 'error') {
            revert();
          }
        }
      );
    }

    new Sortable(document.getElementById('conditions'), {onUpdate: save});
  });
</script>
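
For this variant, the server side needs to accept both numbers rather than a single position. A minimal sketch of what the form and view might look like (not from the original; it leans on the unique_together constraint to reject a colliding (a, b) pair as a validation error):

import json

from django import forms
from django.http import HttpResponse
from django.views.generic.edit import UpdateView


class ConditionPositionForm(forms.ModelForm):
    class Meta:
        model = Condition
        fields = ('position_a', 'position_b')


class OrderConditionsView(UpdateView):
    """Handle an AJAX request to store a condition's new (a, b) position."""
    form_class = ConditionPositionForm

    def get_queryset(self):
        return Condition.objects.order_by_position()

    def form_valid(self, form):
        form.save()
        return HttpResponse(json.dumps({'status': 'ok'}))

    def form_invalid(self, form):
        # Most likely the (position_a, position_b) pair collided with an
        # existing row, which unique_together turns into a form error.
        return HttpResponse(json.dumps({'status': 'error'}), status=409)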

Expression Exclusion Constraints

Today I was working with a junior developer, and was lucky enough to be able to explain exclusion constraints to them. I got partway through it before I realised that the Django model we were working on did not have a range field, but instead had a start and a finish.

class Leave(models.Model):
    person = models.ForeignKey(
        'person.Person',
        related_name='approved_leave',
        on_delete=models.CASCADE,
    )
    start = models.DateTimeField()
    finish = models.DateTimeField()

It turns out that this is not a problem. You can use any expression in a constraint (although note that using the = operator on a plain column within a GiST exclusion constraint requires the btree_gist extension):

ALTER TABLE leave_leave
ADD CONSTRAINT prevent_overlapping_leave
EXCLUDE USING gist(person_id WITH =, TSTZRANGE(start, finish) WITH &&);

Whilst we have application-level validation in place to prevent this, there is a code path that allows it (hence the desire to implement this). Because this is an exclusion constraint, we won’t be able to use the NOT VALID syntax, but will instead have to either fix the invalid data, or use a WHERE clause to only apply the constraint to “new” data.

ALTER TABLE leave_leave
ADD CONSTRAINT prevent_overlapping_leave
EXCLUDE USING gist(person_id WITH =, TSTZRANGE(start, finish) WITH &&)
WHERE start > '2019-07-19';

The other benefit of this is that it creates a GiST index over (person_id, TSTZRANGE(start, finish)), which could be used for querying; and because building the range raises an error when the bounds are reversed, it also ensures that start <= finish for every row.
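
For completeness, recent Django versions (3.0+) can express the same constraint on the model itself via django.contrib.postgres. A rough sketch of the equivalent of the first (un-filtered) version above; the TsTzRange helper follows the pattern in Django’s own documentation:

from django.contrib.postgres.constraints import ExclusionConstraint
from django.contrib.postgres.fields import DateTimeRangeField, RangeOperators
from django.db import models
from django.db.models import Func


class TsTzRange(Func):
    # Renders TSTZRANGE(start, finish) from the two timestamp columns.
    function = 'TSTZRANGE'
    output_field = DateTimeRangeField()


class Leave(models.Model):
    person = models.ForeignKey(
        'person.Person',
        related_name='approved_leave',
        on_delete=models.CASCADE,
    )
    start = models.DateTimeField()
    finish = models.DateTimeField()

    class Meta:
        constraints = [
            # As with the raw SQL, the = part needs the btree_gist extension.
            ExclusionConstraint(
                name='prevent_overlapping_leave',
                expressions=[
                    ('person', RangeOperators.EQUAL),
                    (TsTzRange('start', 'finish'), RangeOperators.OVERLAPS),
                ],
            ),
        ]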