Fallback values in Django

It’s not uncommon to have some type of cascading of values in a system. For instance, in our software, we allow a Brand to have some default settings, and then a Location may override some or all of these settings, or simply fall back to the brand settings. I’m going to have a look at how this type of thing can be implemented using Django, and at a way that it can be handled seamlessly.

We’ll start with our models:

from django.db import models
from django.utils.translation import gettext_lazy as _


class Brand(models.Model):
    brand_id = models.AutoField(primary_key=True)
    name = models.TextField()


class Location(models.Model):
    location_id = models.AutoField(primary_key=True)
    brand = models.ForeignKey(
        Brand,
        related_name='locations',
        on_delete=models.CASCADE,
    )
    name = models.TextField()


WEEKDAYS = [
  (1, _('Monday')),
  (2, _('Tuesday')),
  (3, _('Wednesday')),
  (4, _('Thursday')),
  (5, _('Friday')),
  (6, _('Saturday')),
  (7, _('Sunday')),
]


class BrandSettings(models.Model):
    brand = models.OneToOneField(
        Brand,
        primary_key=True,
        related_name='settings',
        on_delete=models.CASCADE,
    )
    opening_time = models.TimeField()
    closing_time = models.TimeField()
    start_day = models.IntegerField(choices=WEEKDAYS)


class LocationSettings(models.Model):
    location = models.OneToOneField(
        Location,
        primary_key=True,
        related_name='_raw_settings',
        on_delete=models.CASCADE,
    )
    opening_time = models.TimeField(null=True, blank=True)
    closing_time = models.TimeField(null=True, blank=True)
    start_day = models.IntegerField(choices=WEEKDAYS, null=True, blank=True)

We can’t use an abstract base model here, because the LocationSettings values are all optional, but the BrandSettings are not. We might have a look later at a way we can have a base model and inherit-and-change-null on the fields. In the place where we have used this, the relationship between Location and Brand is optional, which complicates things even further.

In practice, we’d have a bunch more settings, but this will make it much easier for us to follow what is going on.

To use these, we want to use a value from the LocationSettings object if it is set, else fall back to the BrandSettings value for that column.

from django.db.models.functions import Coalesce

Location.objects.annotate(
    opening_time=Coalesce('_raw_settings__opening_time', 'brand__settings__opening_time'),
    closing_time=Coalesce('_raw_settings__closing_time', 'brand__settings__closing_time'),
    start_day=Coalesce('_raw_settings__start_day', 'brand__settings__start_day'),
)

And this is fine, but we can make it easier to manage: we want to be able to use Location().settings.start_day and have that fall back automatically, but also build some niceness so that we can set values cleanly in the UI.

We can use a postgres view, and then have a model in front of that:

CREATE OR REPLACE VIEW location_actualsettings AS (
  SELECT location_id,
         COALESCE(location.opening_time, brand.opening_time) AS opening_time,
         COALESCE(location.closing_time, brand.closing_time) AS closing_time,
         COALESCE(location.start_day, brand.start_day) AS start_day
    FROM location_location
   INNER JOIN location_brandsettings brand USING (brand_id)
   INNER JOIN location_locationsettings location USING (location_id)
)

Notice that we have used INNER JOIN for both tables: we are making the assumption that there will always be a settings object for each brand and location.

Now, we want a model in front of this:

class ActualSettings(models.Model):
    location = models.OneToOneField(
        Location,
        primary_key=True,
        related_name='settings',
        on_delete=models.DO_NOTHING,
    )
    opening_time = models.TimeField(null=True, blank=True)
    closing_time = models.TimeField(null=True, blank=True)
    start_day = models.IntegerField(choices=WEEKDAYS, null=True, blank=True)

    class Meta:
        managed = False

We want to indicate that it should allow NULL values in the columns: when we go to update it, None will be taken to mean “use the brand default”.

As for the ability to write to this model, we have a couple of options. The first is to make sure that when we edit instances of the model, we actually use the Location()._raw_settings instance instead of the Location().settings. The other is to make the ActualSettings view have an update trigger:

CREATE OR REPLACE FUNCTION update_location_settings()
RETURNS TRIGGER AS $$

BEGIN

  IF (TG_OP = 'DELETE') THEN
    RAISE NOTICE 'DELETE FROM location_locationsettings WHERE location_id = %', OLD.location_id;
    DELETE FROM location_locationsettings WHERE location_id = OLD.location_id;
    RETURN OLD;
  ELSIF (TG_OP = 'UPDATE') THEN
    UPDATE location_locationsettings
       SET opening_time = NEW.opening_time,
           closing_time = NEW.closing_time,
           start_day = NEW.start_day
     WHERE location_locationsettings.location_id = NEW.location_id;
    RETURN NEW;
  ELSIF (TG_OP = 'INSERT') THEN
    INSERT INTO location_locationsettings (SELECT NEW.*);
    RETURN NEW;
  END IF;
  RETURN NEW;
END;

$$ LANGUAGE plpgsql VOLATILE;

CREATE TRIGGER update_location_settings
       INSTEAD OF INSERT OR UPDATE OR DELETE
       ON location_actualsettings
       FOR EACH ROW EXECUTE PROCEDURE update_location_settings();

And this works as expected; however, it is subject to a pretty significant drawback. If you add columns to the table/view, then you’ll need to update the function. Indeed, if you add columns to the tables, you’ll need to update the view too.

In many cases, this will be sufficient: those tables may not change much, and when they do, it’s just a matter of writing new migrations to update the view and function.
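
In Django terms, that just means a migration with a RunSQL operation. A rough sketch (the app label and dependency name here are made up):

from django.db import migrations

VIEW_SQL = '''
CREATE OR REPLACE VIEW location_actualsettings AS (
  SELECT location_id,
         COALESCE(location.opening_time, brand.opening_time) AS opening_time,
         COALESCE(location.closing_time, brand.closing_time) AS closing_time,
         COALESCE(location.start_day, brand.start_day) AS start_day
    FROM location_location
   INNER JOIN location_brandsettings brand USING (brand_id)
   INNER JOIN location_locationsettings location USING (location_id)
)
'''


class Migration(migrations.Migration):
    dependencies = [
        ('location', '0002_locationsettings'),
    ]

    operations = [
        migrations.RunSQL(VIEW_SQL, reverse_sql='DROP VIEW location_actualsettings'),
    ]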


In practice, having the writeable view is probably overkill. You can just use a regular view, with a model in front of it, and then use that model when you need to use the coalesced values, but use the raw model when you are setting values.
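
For example (hypothetical usage, given the models above):

location = Location.objects.get(pk=1)

# Reading: coalesced values, through the view-backed model.
location.settings.opening_time

# Writing: explicit (possibly None) values, through the raw model.
raw = location._raw_settings
raw.opening_time = None  # None means "fall back to the brand default".
raw.save()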

You can even make it so that as a UI affordance, you show what the brand fallback value is instead of the None value:

import datetime

from django import forms
from django.template import Context, Template
from django.utils.translation import gettext as _

from .models import LocationSettings


class SettingsForm(forms.ModelForm):
    class Meta:
        model = LocationSettings
        fields = (
            'opening_time',
            'closing_time',
            'start_day'
        )

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # We'll probably want to make sure we use a select_related() for this!
        brand = self.instance.location.brand
        brand_settings = brand.settings

        for name, field in self.fields.items():
            # See if the model knows how to display a nice value.
            display = 'get_{}_display'.format(name)
            if hasattr(brand_settings, display):
                brand_value = getattr(brand_settings, display)()
            else:
                brand_value = getattr(brand_settings, name)

            # If we have a time, then we want to format it nicely:
            if isinstance(brand_value, datetime.time):
                brand_value = Template('{{ value }}').render(Context({
                  'value': brand_value
                }))

            blank_label = _('Default for {brand}: {value}').format(
                brand=brand.name,
                value=brand_value,
            )

            # If we have a select that is _not_ a multiple select, then we
            # want to make it obvious that the brand default value can be
            # selected, or an explicit choice made.
            if hasattr(field, 'choices') and field.choices[0][0] == '':
                field.widget.choices = field.choices = [
                    (_('Brand default'), [('', blank_label)]),
                    (_('Choices'), list(field.choices[1:]))
                ]
            else:
                # On all other fields, set the placeholder, so that no value
                # entered will show the brand default label.
                field.widget.attrs['placeholder'] = blank_label

As mentioned in a comment: this uses a couple of lookups to get to the BrandSettings, so you’d want to make sure your view uses a .select_related():

class LocationSettingsView(UpdateView):
    form_class = SettingsForm

    def get_object(self):
        return LocationSettings.objects.select_related('location__brand__settings').get(
            location=self.kwargs['location']
        )

Again, this is all simplified when we have the requirement that there is always a Brand associated with a Location, and each of these always has a related settings object. It’s the latter part of this that is a little tricky. You can have objects automatically created in a signal handler, but in that case it would have to use default values.
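
A minimal sketch of that signal-handler approach (the default values here are placeholders, and have to come from somewhere):

import datetime

from django.db.models.signals import post_save
from django.dispatch import receiver


@receiver(post_save, sender=Brand)
def create_brand_settings(sender, instance, created, **kwargs):
    if created:
        BrandSettings.objects.create(
            brand=instance,
            opening_time=datetime.time(9),
            closing_time=datetime.time(17),
            start_day=1,
        )


@receiver(post_save, sender=Location)
def create_location_settings(sender, instance, created, **kwargs):
    if created:
        # An all-NULL row: fall back to the brand for everything.
        LocationSettings.objects.create(location=instance)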


Just from a DRY perspective, it would be great if you could have all three models inherit from the one base class, and have the view and trigger function update automatically.

In order to do that, we’ll need to do a bit of magic.

class SettingsBase(models.Model):
    opening_time = models.TimeField()
    closing_time = models.TimeField()
    start_day = models.IntegerField(choices=WEEKDAYS)

    class Meta:
        abstract = True

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        if getattr(cls, '_settings_optional', False):
            for field in cls._meta.fields:
                field.null = True
                field.blank = True


class BrandSettings(SettingsBase):
    brand = models.OneToOneField(
        Brand,
        primary_key=True,
        related_name='settings',
        on_delete=models.CASCADE,
    )


class LocationSettings(SettingsBase):
    location = models.OneToOneField(
        Location,
        primary_key=True,
        related_name='_raw_settings',
        on_delete=models.CASCADE,
    )
    _settings_optional = True


class ActualSettings(SettingsBase):
    location = models.OneToOneField(
        Location,
        primary_key=True,
        related_name='settings',
        on_delete=models.DO_NOTHING,
    )
    _settings_optional = True

    class Meta:
        managed = False

The magic is all clustered in the one spot, and the order in which Django does things makes this easy. By the time __init_subclass__ is evaluated, the subclass exists, and has all of the inherited fields, but none of the non-inherited fields. So, we can update those fields to not be required, if we find a class attribute _settings_optional that is true.
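
Assuming that works as described, we can verify it in a shell:

>>> BrandSettings._meta.get_field('opening_time').null
False
>>> LocationSettings._meta.get_field('opening_time').null
True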

Automatically creating or replacing the view is a bit more work.

class ActualSettings(SettingsBase):
    location = models.OneToOneField(
        Location,
        primary_key=True,
        related_name='settings',
        on_delete=models.DO_NOTHING,
    )
    _settings_optional = True

    class Meta:
        managed = False

    @classmethod
    def view_queryset(cls):
        settings = {
            attribute: Coalesce(
              '_raw_settings__{}'.format(attribute),
              'brand__settings__{}'.format(attribute)
            ) for attribute in (f.name for f in cls._meta.fields)
            if attribute != 'location'
        }
        return Location.objects.annotate(**settings).values('pk', *settings.keys())

This would then need some extra machinery to put that into a migration, and then, when running makemigrations, we’d want to automatically look at the last rendered version of that view, and see if what we have now differs. However, intercepting makemigrations, and changing the operations it creates is something I have not yet figured out how to achieve.
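
Generating the SQL itself is simple enough. A hypothetical helper (noting that str(queryset.query) interpolates parameters naively, so this is only safe for simple annotations like these):

def view_sql(view_name, queryset):
    return 'CREATE OR REPLACE VIEW {} AS ({})'.format(view_name, queryset.query)

# view_sql('location_actualsettings', ActualSettings.view_queryset())
# could then be dropped into a migrations.RunSQL() operation.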

Instead, for Versioning complex database migrations I wound up creating a new management command.

A nicer syntax might be to have some way of defining a postgres view by using a queryset.

ActualSettings = Location.objects.annotate(
    opening_time=Coalesce('_raw_settings__opening_time', 'brand__settings__opening_time'),
    closing_time=Coalesce('_raw_settings__closing_time', 'brand__settings__closing_time'),
    start_day=Coalesce('_raw_settings__start_day', 'brand__settings__start_day'),
).values('location_id', 'opening_time', 'closing_time', 'start_day').as_view()

The problem with this is that we can’t do that in a model definition, as the other models are not loaded at this point in time.

Another possible syntax could be:

class ActualSettings(View):
    location = models.F('location_id')
    opening_time = Coalesce('_raw_settings__opening_time', 'brand__settings__opening_time')
    closing_time = Coalesce('_raw_settings__closing_time', 'brand__settings__closing_time')
    start_day = Coalesce('_raw_settings__start_day', 'brand__settings__start_day')

    class Meta:
      queryset = Location.objects.all()

… but I’m starting to veer off into a different topic now.


Actually writing a trigger function that handles all columns seamlessly is something that we should be able to do. Be warned though, this one is a bit of a doozy:

CREATE OR REPLACE FUNCTION update_instead()
RETURNS TRIGGER AS $$
DECLARE
  primary_key TEXT;
  target_table TEXT;
  columns TEXT;

BEGIN
  -- You must pass as first parameter the name of the table to which writes should
  -- actually be made.
  target_table = TG_ARGV[0]::TEXT;

  -- We want to get the name of the primary key column for the target table,
  -- if that was not already supplied.
  IF (TG_ARGV[1] IS NULL) THEN
    primary_key = (SELECT column_name
                     FROM information_schema.table_constraints
               INNER JOIN information_schema.constraint_column_usage
                    USING (table_catalog, table_schema, table_name,
                           constraint_name, constraint_schema)
                    WHERE constraint_type = 'PRIMARY KEY'
                      AND table_schema = quote_ident(TG_TABLE_SCHEMA)
                      AND table_name = quote_ident(target_table));
  ELSE
    primary_key = TG_ARGV[1]::TEXT;
  END IF;

  -- We also need the names of all of the columns in the current view.
  columns = (SELECT STRING_AGG(quote_ident(column_name), ', ')
               FROM information_schema.columns
              WHERE table_schema = quote_ident(TG_TABLE_SCHEMA)
                AND table_name = quote_ident(TG_TABLE_NAME));

  IF (TG_OP = 'DELETE') THEN
    EXECUTE format(
      'DELETE FROM %1$I WHERE %2$I = ($1).%2$I',
      target_table, primary_key
    ) USING OLD;
    RETURN OLD;
  ELSIF (TG_OP = 'INSERT') THEN
    -- columns must be treated as a string, because we've already
    -- quoted the columns in the query above.
    EXECUTE format(
      'INSERT INTO %1$I (%2$s) (SELECT ($1).*)',
      target_table, columns
    ) USING NEW;
    RETURN NEW;
  ELSIF (TG_OP = 'UPDATE') THEN
    EXECUTE format(
      'UPDATE %1$I SET (%2$s) = (SELECT ($1).*) WHERE %3$I = ($1).%3$I',
      target_table, columns, primary_key
    ) USING NEW;
    RETURN NEW;
  END IF;

  RAISE EXCEPTION 'Unhandled.';
END;

$$ LANGUAGE plpgsql VOLATILE;

There are some things I learned about postgres when doing this: specifically, that you can use the EXECUTE format('SELECT ... ($1).%s', arg) USING NEW syntax. The format() function makes it much neater than string concatenation, and the EXECUTE '...($1).%s' USING ... form was the only way I was able to access the values from the NEW and OLD aliases within an EXECUTE. There’s also a bunch of work required to make sure that the columns line up correctly when updating or inserting into the target table.
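
Here is a stripped-down illustration of that pattern, outside of any trigger (using %I, which quotes the identifier for us):

CREATE TYPE demo_row AS (id INTEGER, label TEXT);

DO $$
DECLARE
  rec demo_row;
  col TEXT := 'label';
  result TEXT;
BEGIN
  SELECT 1, 'hello' INTO rec;
  -- format() builds the statement; USING passes the whole row in as $1.
  EXECUTE format('SELECT ($1).%I', col) INTO result USING rec;
  RAISE NOTICE 'value of %: %', col, result;  -- value of label: hello
END;
$$;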

We can then apply this to our view:

CREATE TRIGGER update_instead
INSTEAD OF UPDATE OR INSERT OR DELETE
ON location_actualsettings
FOR EACH ROW
EXECUTE PROCEDURE update_instead('location_locationsettings', 'location_id');

PV Solar panel analysis in postgres

Australian (and especially South Australian) electricity is expensive. That is probably part of the reason that SA is leading the world in residential rooftop solar.

But it is hard to negotiate the electricity retail market. Knowing your daily usage and feed-in enables you to determine which retailer you should choose, but only if you have the tools available to you to compare what your costs would have been with each retail plan.

And this does not even begin to address the question of if it is cost-effective to increase the size of your PV generation system: most supply companies (who, as it turns out, are different entities to the retailers) have per-phase limits of 5kW: that is, you may never feed back into the grid more than 5kW at any point in time.

Since I had my Fronius Smart Meter installed, I have been able to run reports and get my daily import and export values: these are also available through my retailer, and even through SA Power Networks, the distributor.

I have my data stored in three tables:

CREATE TABLE generation (
  "timestamp" TIMESTAMP PRIMARY KEY,
  "energy" FLOAT
);

CREATE TABLE import_cumulative (
  "timestamp" TIMESTAMP PRIMARY KEY,
  "energy" INTEGER
);

CREATE TABLE export_cumulative (
  "timestamp" TIMESTAMP PRIMARY KEY,
  "energy" INTEGER
);

This is really only necessary because I need to take the cumulative values in the import and export tables, and turn them into absolute values for that period:

CREATE OR REPLACE VIEW import AS (
  SELECT timestamp,
         energy - LAG(energy, 1) OVER (ORDER BY timestamp) AS energy
    FROM import_cumulative
   ORDER BY timestamp
);

CREATE OR REPLACE VIEW export AS (
  SELECT timestamp,
         energy - LAG(energy, 1) OVER (ORDER BY timestamp) AS energy
    FROM export_cumulative
   ORDER BY timestamp
);

Now, from these views, it is possible to do some modelling of what different retailers would cost, but retail plans generally apply their charges per-day, so we will eventually want daily aggregates. Since I will be doing stuff later that also needs consumption and generation, I will include those too:

CREATE OR REPLACE VIEW consumption AS
SELECT timestamp, 
       import.energy + generation.energy - export.energy AS energy
  FROM generation
 INNER JOIN import USING (timestamp)
 INNER JOIN export USING (timestamp);

And our summary (for each interval):

CREATE OR REPLACE VIEW summary AS (
    SELECT timestamp,
           generation.energy AS generation,
           import.energy AS import,
           export.energy AS export,
           import.energy + generation.energy - export.energy AS consumption
      FROM generation
     INNER JOIN import USING (timestamp)
     INNER JOIN export USING (timestamp)    
);

Getting close to having some usable data there now. What about daily summaries?

CREATE OR REPLACE VIEW daily_summary AS (
    SELECT timestamp::DATE AS date,
           SUM(generation) AS generation,
           SUM(import) AS import,
           SUM(export) AS export,
           SUM(consumption) AS consumption
      FROM summary
     GROUP BY timestamp::DATE
);

That gives us per-day summaries of our generation, import and export.

Here are some values from my system:

SELECT * FROM daily_summary LIMIT 10;
    date    |    generation    | import | export |   consumption
------------+------------------+--------+--------+------------------
 2019-03-07 | 29851.915        |   7587 |  24191 | 13247.915
 2019-03-08 | 14033.2047222222 |   7578 |   9156 | 12455.2047222222
 2019-03-09 | 28074.8316666667 |   5746 |  20961 | 12859.8316666667
 2019-03-10 | 15239.6936111111 |   7669 |   9144 | 13764.6936111111
 2019-03-11 | 21101.1441666667 |   6528 |  12657 | 14972.1441666667
 2019-03-12 | 28231.4280555556 |   5955 |  22382 | 11804.4280555556
 2019-03-13 | 22880.3077777778 |   6040 |  17411 | 11509.3077777778
 2019-03-14 | 30440.8319444444 |   6658 |  24791 | 12307.8319444444
 2019-03-15 | 30141.3888888889 |   6576 |  26227 | 10490.3888888889
 2019-03-16 | 29286.4413888889 |   7336 |  22224 | 14398.4413888889

(10 rows)

Right. So now we want to be able to look at costs for different retailers. There is a magnificent resource for getting this data: Energy Made Easy. The rates differ for different postcodes (WAT): I’m not sure if they are the same across a whole state or not. Probably not.

Anyway, it turns out that there are some retailers who have three different rates, depending upon your usage during a day. It’s not clear to me if this value is averaged across the whole month when you are on a smart meter: I know it is when you are on one of the old spinning disk meters. I assume it would be calculated on each day, as I think that works out as more expensive for the customer. Diamond Energy appears to have per-month limits, and four different rates, so I just pro-rated them to a 30 day month. I did not bother modelling the fourth rate, as I established fairly early on that they are not going to be close to being the cheapest. I may implement that later, just for completeness.

So, to be able to calculate costs for all retailers, we need to store up to three rates, and the two kWh thresholds that trigger those rates. Note that the second and third rates (and their thresholds) are optional:

CREATE TABLE retailer (
    name TEXT PRIMARY KEY,
    supply_charge NUMERIC NOT NULL,
    kWh_rate_1 NUMERIC NOT NULL,
    kWh_limit_1 NUMERIC,
    kWh_rate_2 NUMERIC,
    kWh_limit_2 NUMERIC,
    kWh_rate_3 NUMERIC,
    pay_on_time_discount NUMERIC,
    feed_in_tariff NUMERIC NOT NULL
);

ALTER TABLE retailer 
    ADD CONSTRAINT rate_2_requires_limit_1 
        CHECK (kWh_rate_2 IS NULL OR kWh_limit_1 IS NOT NULL),
    ADD CONSTRAINT rate_3_requires_limit_2 
        CHECK (kWh_rate_3 IS NULL OR kWh_limit_2 IS NOT NULL),
    ADD CONSTRAINT rate_3_requires_rate_2 
        CHECK (kWh_rate_3 IS NULL OR kWh_rate_2 IS NOT NULL);

The three check constraints just ensure that we do not accidentally insert a rate where there is no lower tier rate, or no limit in place to apply it from.
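
For instance, a second rate without a first threshold is rejected (the retailer here is made up):

INSERT INTO retailer (name, supply_charge, kWh_rate_1, kWh_rate_2, feed_in_tariff)
VALUES ('Bogus Energy', 0.90, 0.35, 0.40, 0.10);
-- ERROR:  new row for relation "retailer" violates check constraint "rate_2_requires_limit_1"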

Here is a query that inserts values for the retailers for my postcode:

INSERT INTO retailer (
  name, supply_charge, 
  kWh_rate_1, kWh_limit_1, 
  kWh_rate_2, kWh_limit_2, 
  kWh_rate_3, 
  feed_in_tariff, 
  pay_on_time_discount)
VALUES 
  ('AGL Essentials', 0.8855, 0.3575, NULL, NULL, NULL, NULL, 0.1630, NULL),
  ('Origin Energy', 0.9033, 0.4002, 4, 0.4272, NULL, NULL, 0.1800, 0.10),
  ('AGL Solar Savers', 0.8910, 0.4038, NULL, NULL, NULL, NULL, 0.1800, NULL),
  ('Click Banksia Solar', 0.9966, 0.3941, NULL, NULL, NULL, NULL, 0.1700, NULL),
  ('Power Direct Residential Saver', 0.8910, 0.4038, NULL, NULL, NULL, NULL, 0.1420, 0.13),
  ('SA Simply RAA VPP', 0.9863, 0.3600, NULL, NULL, NULL, NULL, 0.1500, NULL),
  ('Kogan', 0.9566, 0.3154, NULL, NULL, NULL, NULL, 0.0748, NULL),
  ('PowerShop Shopper Market', 1.1912, 0.3760, NULL, NULL, NULL, NULL, 0.1020, 0.15),
  ('AGL Residential Essentials', 0.7752, 0.3512, NULL, NULL, NULL, NULL, 0.1420, NULL),
  ('Diamond Energy Everyday Renewable Saver', 0.9565, 0.3659, 3.33, 0.3948, 11.1, 0.4285, 0.12, NULL),
  ('Red Energy Living Energy Saver', 0.9790, 0.3641, NULL, NULL, NULL, NULL, 0.1200, NULL),
  ('Blue NRG DMO - General Usage', 0.7810, 0.3736, 10.96, 0.5775, NULL, NULL, 0.0680, NULL),
  ('Energy Australia Total Plan Home', 0.8580, 0.4070, NULL, NULL, NULL, NULL, 0.15, NULL),
  ('Alinta Energy No Fuss (Single Rate)', 0.9130, 0.3317, 3.29, 0.3695, NULL, NULL, 0.0950, NULL),
  ('Lumo Basic', 1.0120, 0.3539, NULL, NULL, NULL, NULL, 0.15, NULL),
  ('Powerband Home Flat + Solar FIT', 1.0322, 0.3066, 10.95, 0.3139, NULL, NULL, 0.1150, NULL),
  ('Origin Energy Econnex', 0.9081, 0.4024, 10.96, 0.4294, NULL, NULL, 0.1000, 0.15)
  ON CONFLICT DO NOTHING
;

To calculate the cost for a day, we need to add the daily supply charge, and the amount of import that applies at each rate multiplied by that rate, minus the FIT (feed-in tariff) for the export amount. Because some retailers have a “pay on time discount” (which is really just a late-payment penalty, but that is another blog post waiting to happen), it’s important to note that this discount applies to the whole bill - meaning it needs to be applied after the feed-in tariff has been subtracted.

Here is a function that calculates a daily cost, given a retailer record and import and export values:

CREATE OR REPLACE FUNCTION cost(retailer RETAILER, import INTEGER, export INTEGER)
RETURNS NUMERIC AS $$

-- Note: import and export are in Wh, but the limits are stored in kWh,
-- so the limits are scaled for the comparisons.
SELECT COALESCE(1 - retailer.pay_on_time_discount, 1.0) * (retailer.supply_charge +
       -- rate 3: anything above limit 2
       CASE WHEN retailer.kWh_limit_2 IS NULL THEN 0
            WHEN import <= retailer.kWh_limit_2 * 1000 THEN 0
            ELSE import - retailer.kWh_limit_2 * 1000
       END / 1000.0 * COALESCE(retailer.kWh_rate_3, 0) +
       -- rate 2: usage between limit 1 and limit 2
       CASE WHEN retailer.kWh_limit_1 IS NULL THEN 0
            WHEN retailer.kWh_limit_2 IS NULL THEN GREATEST(import - retailer.kWh_limit_1 * 1000, 0)
            WHEN import < retailer.kWh_limit_1 * 1000 THEN 0
            ELSE LEAST(import, retailer.kWh_limit_2 * 1000) - retailer.kWh_limit_1 * 1000
       END / 1000.0 * COALESCE(retailer.kWh_rate_2, 0) +
       -- rate 1: up to limit 1
       CASE WHEN retailer.kWh_limit_1 IS NULL THEN import
            ELSE LEAST(import, retailer.kWh_limit_1 * 1000)
       END / 1000.0 * retailer.kWh_rate_1 -
       retailer.feed_in_tariff * export / 1000.0);

$$ LANGUAGE SQL STRICT;

It is a little hard to follow, and if the daily amount is negative, the discount shrinks that credit (instead of being ignored), but it will do for now. I would prefer not to choose a retailer that hides a penalty behind a so-called discount anyway, so they would have to be significantly cheaper to attract my custom.
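
As a sanity check, we can compute the cost for a single retailer and day, using the first row of the daily summary above (7587 Wh imported, 24191 Wh exported):

SELECT cost(retailer, 7587, 24191)
  FROM retailer
 WHERE name = 'AGL Essentials';
-- ≈ -0.35: about a 35 cent credit for that (very sunny) day.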

Using this function, we can now calculate the total costs for all retailers, and compare them:

SELECT retailer.name,
       '$ ' || lpad(
         TRUNC(SUM(cost(retailer, import::INTEGER, export::INTEGER)), 2)::TEXT,
         9
       ) AS cost
  FROM daily_summary, retailer
 GROUP BY retailer.name
 ORDER BY cost ASC;

There is a bunch of code in there to make the output prettier: it pads the values so that, in a monospaced display, the decimal points all line up.

                  name                   |    cost
-----------------------------------------+-------------
 AGL Essentials                          | $    897.87
 AGL Residential Essentials              | $    916.85
 Powerband Home Flat + Solar FIT         | $    955.59
 Lumo Basic                              | $    965.56
 Origin Energy                           | $    971.44
 SA Simply RAA VPP                       | $    978.81
 Power Direct Residential Saver          | $    979.84
 AGL Solar Savers                        | $    997.31
 Click Banksia Solar                     | $   1028.43
 PowerShop Shopper Market                | $   1065.24
 Kogan                                   | $   1075.74
 Red Energy Living Energy Saver          | $   1092.32
 Energy Australia Total Plan Home        | $   1100.55
 Origin Energy Econnex                   | $   1155.60
 Alinta Energy No Fuss (Single Rate)     | $   1176.65
 Diamond Energy Everyday Renewable Saver | $   1302.24
 Blue NRG DMO - General Usage            | $   1929.36

(17 rows)

Which brings me to my conclusion: my current retailer is the cheapest for my current generation and consumption patterns.


But we can go further. We can always go further.

We can look at different months, and see if it’s always the case that a given retailer is the best option. Perhaps it is worthwhile shopping around, and switching between retailers at different times of the year.

CREATE OR REPLACE VIEW monthly_cost_comparisons AS

SELECT retailer.name, 
       to_char(date, 'YYYY-MM') AS month,
       SUM(cost(retailer, import::INTEGER, export::INTEGER)) AS cost
  FROM daily_summary, retailer
  GROUP BY retailer.name, to_char(date, 'YYYY-MM');

That by itself is not that useful, but we can use the crosstab function (from the tablefunc extension) to view it in a more readable manner.

CREATE EXTENSION IF NOT EXISTS tablefunc;

SELECT * 
  FROM crosstab(
    $$SELECT name, month, '$ ' || lpad(cost::INTEGER::TEXT, 4)
       FROM monthly_cost_comparisons ORDER BY name$$, 
    'SELECT DISTINCT month FROM monthly_cost_comparisons ORDER BY month'
) AS (
    "retailer" TEXT, 
    "2019-03" TEXT,
    "2019-04" TEXT,
    "2019-05" TEXT,
    "2019-06" TEXT,
    "2019-07" TEXT,
    "2019-08" TEXT,
    "2019-09" TEXT,
    "2019-10" TEXT,
    "2019-11" TEXT,
    "2019-12" TEXT
);

Frustratingly, this function requires enumerating the columns.

                retailer                 | 2019-03 | 2019-04 | 2019-05 | 2019-06 | 2019-07 | 2019-08 | 2019-09 | 2019-10 | 2019-11 | 2019-12
-----------------------------------------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------
 AGL Essentials                          | $   14  | $   41  | $  128  | $  200  | $  209  | $  223  | $   90  | $    1  | $    1  | $  -10
 AGL Residential Essentials              | $   19  | $   45  | $  127  | $  196  | $  204  | $  219  | $   93  | $   10  | $    9  | $   -5
 AGL Solar Savers                        | $   15  | $   45  | $  143  | $  223  | $  233  | $  249  | $  100  | $    1  | $    0  | $  -11
 Alinta Energy No Fuss (Single Rate)     | $   46  | $   72  | $  147  | $  216  | $  225  | $  242  | $  122  | $   47  | $   46  | $   13
 Blue NRG DMO - General Usage            | $   91  | $  125  | $  226  | $  329  | $  341  | $  371  | $  203  | $  103  | $  102  | $   38
 Click Banksia Solar                     | $   20  | $   50  | $  144  | $  222  | $  232  | $  248  | $  104  | $    8  | $    8  | $   -7
 Diamond Energy Everyday Renewable Saver | $   47  | $   76  | $  166  | $  246  | $  256  | $  276  | $  135  | $   45  | $   45  | $   11
 Energy Australia Total Plan Home        | $   28  | $   57  | $  149  | $  228  | $  237  | $  255  | $  112  | $   19  | $   19  | $   -2
 Kogan                                   | $   47  | $   69  | $  132  | $  191  | $  198  | $  213  | $  112  | $   49  | $   49  | $   16
 Lumo Basic                              | $   22  | $   49  | $  134  | $  203  | $  212  | $  227  | $   98  | $   12  | $   12  | $   -4
 Origin Energy Econnex                   | $   46  | $   71  | $  143  | $  211  | $  219  | $  236  | $  120  | $   48  | $   47  | $   14
 Origin Energy                           | $   17  | $   45  | $  136  | $  213  | $  222  | $  238  | $   98  | $    5  | $    5  | $   -8
 Power Direct Residential Saver          | $   27  | $   52  | $  131  | $  198  | $  207  | $  222  | $  100  | $   21  | $   21  | $    1
 PowerShop Shopper Market                | $   44  | $   67  | $  133  | $  193  | $  201  | $  215  | $  111  | $   44  | $   44  | $   13
 Powerband Home Flat + Solar FIT         | $   31  | $   55  | $  126  | $  187  | $  195  | $  208  | $   98  | $   26  | $   26  | $    4
 Red Energy Living Energy Saver          | $   36  | $   63  | $  142  | $  212  | $  221  | $  237  | $  112  | $   32  | $   32  | $    6
 SA Simply RAA VPP                       | $   23  | $   50  | $  135  | $  206  | $  215  | $  230  | $   99  | $   13  | $   13  | $   -4

From this, I can see that nothing is really going to save me more than a few bucks a month: and this does not take into account the $200 “bonus” I get from AGL if I stick with them for a year (or something like that).

I’d also like to look at actual billing periods, because then I can see that everything matches up to what my actual bills are (and then I can add in the bonuses at the times they applied).

My billing periods went to monthly on a specific date, so from then on they are regular one-month periods, but prior to that they were a little all over the place:

CREATE OR REPLACE VIEW billing_periods AS 

SELECT 'AGL Essentials'::TEXT AS retailer,
       DATERANGE(start::DATE, (start + INTERVAL '1 month')::DATE) as period
  FROM generate_series('2019-06-24'::DATE, now()::DATE, INTERVAL '1 month') start
  
 UNION ALL
 
 VALUES ('AGL Essentials'::TEXT, '[2019-03-07,2019-06-01)'::DATERANGE),
        ('AGL Essentials', '[2019-06-01,2019-06-24)')
  
  ORDER BY period DESC;

We can model the retailer bonuses in a way that only applies them within the billing period they belong to, and only when we are calculating the costs for that retailer:

CREATE TABLE retailer_bonus (
  retailer TEXT, 
  billing_period DATERANGE, 
  bonus NUMERIC
);

INSERT INTO retailer_bonus(retailer, billing_period, bonus)
     VALUES ('AGL Essentials', '[2019-08-24,2019-09-24)', 25);

And now some views for calculating the actual costs for each day, and then aggregating within the billing periods (and a pretty version).

CREATE OR REPLACE VIEW actual_costs AS

SELECT retailer,
       billing_periods.period,
       daily_summary.date,
       cost(_ret, import::INTEGER, export::INTEGER),
       generation
  FROM billing_periods
 INNER JOIN retailer _ret ON (_ret.name = billing_periods.retailer)
 INNER JOIN daily_summary ON (daily_summary.date <@ billing_periods.period);
 
 
CREATE OR REPLACE VIEW billing_period_costs AS
 
SELECT retailer, 
       period, 
       SUM(cost) - COALESCE((SELECT SUM(bonus) 
                               FROM retailer_bonus 
                              WHERE billing_period = period
                                AND retailer_bonus.retailer = actual_costs.retailer), 0) AS cost,
       SUM(generation) AS generation
  FROM actual_costs
 GROUP BY retailer, period
 ORDER BY period DESC;
 
 
CREATE OR REPLACE VIEW billing_period_costs_pretty AS 
SELECT retailer, 
       period,
       '$ ' || LPAD(TRUNC(cost, 2)::TEXT, 7, ' ') AS cost,
       LPAD(TRUNC(generation::NUMERIC / 1000, 1)::TEXT, 6, ' ') || ' kWh' AS generation
  FROM billing_period_costs
 ORDER BY period DESC;
 

And now, we can look at the actual bills, with generation for that period:

 SELECT * FROM billing_period_costs_pretty;
 
    retailer    |         period          |   cost    | generation
----------------+-------------------------+-----------+-------------
 AGL Essentials | [2019-11-24,2019-12-24) | $   -9.50 |  595.4 kWh
 AGL Essentials | [2019-10-24,2019-11-24) | $   -1.20 |  879.0 kWh
 AGL Essentials | [2019-09-24,2019-10-24) | $   12.80 |  757.3 kWh
 AGL Essentials | [2019-08-24,2019-09-24) | $  104.30 |  610.6 kWh
 AGL Essentials | [2019-07-24,2019-08-24) | $  224.27 |  374.2 kWh
 AGL Essentials | [2019-06-24,2019-07-24) | $  210.24 |  302.8 kWh
 AGL Essentials | [2019-06-01,2019-06-24) | $  148.31 |  244.8 kWh
 AGL Essentials | [2019-03-07,2019-06-01) | $  183.65 | 1495.9 kWh

Note that the first row is the current month: it’s only partway through, so I will end up with even more of a credit than that, with any luck.

These values are all within a dollar or two of the amounts I was actually billed, so that is nice.


But wait, there is more!

We can use this, with just a couple more functions, to make fairly good estimates of what our outputs and costs would be if we had more panels.

But that one will have to wait for another day.

Merging Adjacent Ranges in Postgres

Previously, I detailed a solution to split/trim/replace overlapping items in a table. Subsequently, I decided I needed to merge all adjacent items that could be merged. In this case, that was with two other fields (only one of which was subject to the exclusion constraint) being identical in adjacent periods.

CREATE EXTENSION IF NOT EXISTS btree_gist;

CREATE TABLE team_membership (
  membership_id SERIAL,
  player_id INTEGER,
  team_id INTEGER,
  period DATERANGE,
  CONSTRAINT prevent_overlapping_memberships EXCLUDE USING gist(player_id WITH =, period WITH &&)
);

Before we can implement the plpgsql trigger function, we need to tell Postgres how to aggregate ranges:

CREATE AGGREGATE sum(anyrange) (
  stype = anyrange,
  sfunc = range_union
);

We should note at this point that range_union, or + (hence the reason I’ve called the aggregate SUM), will fail with an error if the two ranges being combined do not overlap or touch. We must make sure that in any query where we use it, all of the ranges will overlap (and I believe they must also be in order, so that as we perform the union on each range in a “reduce” manner, we never end up with non-contiguous ranges).
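
A quick demonstration of that behaviour:

-- Adjacent (or overlapping) ranges combine happily:
SELECT '[2019-01-01,2019-02-01)'::DATERANGE + '[2019-02-01,2019-03-01)'::DATERANGE;
-- → [2019-01-01,2019-03-01)

-- Non-contiguous ranges do not:
SELECT '[2019-01-01,2019-02-01)'::DATERANGE + '[2019-06-01,2019-07-01)'::DATERANGE;
-- ERROR:  result of range union would not be contiguous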

So, let’s look at the trigger function. Initially, I wrote this as two queries:

CREATE OR REPLACE FUNCTION merge_adjacent()
  RETURNS TRIGGER AS $$
  BEGIN
    NEW.period = (SELECT SUM(period) FROM (SELECT NEW.period UNION ALL ...));
    DELETE FROM team_membership ...;
    RETURN NEW;
  END;
  $$ LANGUAGE plpgsql STRICT;

This required me to duplicate the WHERE clauses, and was messy.

Then I remembered you can use the RETURNING clause, and use a CTE, with a SELECT INTO:

CREATE OR REPLACE FUNCTION merge_adjacent()
  RETURNS TRIGGER AS $$

  BEGIN
    WITH matching AS (
      DELETE FROM team_membership mem
            WHERE mem.player_id = NEW.player_id
              AND mem.team_id = NEW.team_id
              AND (mem.period -|- NEW.period OR mem.period && NEW.period)
              AND mem.membership_id <> NEW.membership_id
        RETURNING period
    )
    SELECT INTO NEW.period (
      SELECT SUM(period) FROM (
        SELECT NEW.period
         UNION ALL
        SELECT period FROM matching
        ORDER BY period
      ) _all
    );
    RETURN NEW;
  END;

  $$ LANGUAGE plpgsql STRICT;

CREATE TRIGGER merge_adjacent
BEFORE INSERT OR UPDATE ON team_membership
FOR EACH ROW EXECUTE PROCEDURE merge_adjacent();

The other thing to note about this construct is that it will only work on “already merged” data: if you had ranges:

[2019-01-01, 2019-01-04)
[2019-01-04, 2019-02-02)
# Note there is a gap here...
[2019-05-01, 2019-05-11)
[2019-05-11, 2020-01-01)

and you added in a value to the missing range:

INSERT INTO team_membership (player_id, team_id, period)
     VALUES (1, 1, '[2019-02-02,2019-05-01)');

You would not merge all of the ranges, only those immediately adjacent. That is, you would wind up with rows:

[2019-01-01, 2019-01-04)
[2019-01-04, 2019-05-11)
[2019-05-11, 2020-01-01)

However, if this trigger is active on the table you would never get to the stage where your data was adjacent but not merged.
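
For example, with the trigger in place (and starting from an empty table), inserting two adjacent memberships leaves a single merged row:

INSERT INTO team_membership (player_id, team_id, period)
     VALUES (1, 1, '[2019-01-01,2019-02-01)');

INSERT INTO team_membership (player_id, team_id, period)
     VALUES (1, 1, '[2019-02-01,2019-03-01)');

-- The second INSERT deletes the first row, and stores the merged range:
SELECT period FROM team_membership;
--          period
-- -------------------------
--  [2019-01-01,2019-03-01)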


This post was updated on 2021-01-04 to ensure that updates to a single row will not result in an error.

Infinite Recursion in Postgres

I’m not sure how useful it is, but it turns out it is possible to write a query (or a view) that generates an infinite number of rows in postgres:

WITH RECURSIVE year AS (
  SELECT 2000 AS year

  UNION ALL

  SELECT year + 1 FROM year
)
SELECT *
  FROM year;

As this stands it really isn’t useful, because the query will never complete, and most clients will never start displaying rows. However, if you don’t know how many rows you will be generating, you could do something like:

WITH RECURSIVE year AS (
  SELECT 2000 AS year

  UNION ALL

  SELECT year + 1 FROM year
)
SELECT *
  FROM year
 LIMIT %s;

Again, this may not seem to be that useful, as you could just use a generate_series(2000, 2000 + %s, 1). But I’m currently working on something that doesn’t always have a fixed count or interval (implementing RRULE repeats in SQL), and I think that maybe this might just be useful…

Graphs in Django and Postgres

I have written a bunch of posts about dealing with trees in Postgres and Django, and Funkybob used some of this to start the package django-closure-view.

Today, someone was looking for similar functionality, but for a graph. Specifically, a Directed Acyclic Graph. Now, not every graph, or even every DAG is a tree, but every tree is a DAG.

So, the difference between a tree and a graph in this context is that a given node may have an arbitrary number of parents. But, and this is worth noting now, none of a node’s ancestors may also be its descendants: that is what makes the graph acyclic.

The first part of this tells us that we can no longer just use a simple self-relation in our model to store the relationship: because there could be multiple parents. Instead, we will need to have a many-to-many relation to store that.

from django.db import models


class Node(models.Model):
    node_id = models.AutoField(primary_key=True)
    name = models.TextField(unique=True)
    parents = models.ManyToManyField(
        'self',
        related_name='children',
        symmetrical=False,
    )

We can put some meaningful data into this graph to make it a little more obvious if our queries are sane:

django, pytz, sqlparse, asgiref = Node.objects.bulk_create([
    Node(name='django'),
    Node(name='pytz'),
    Node(name='sqlparse'),
    Node(name='asgiref'),
])

django.parents.add(pytz, sqlparse, asgiref)

graph_demo, psycopg2 = Node.objects.bulk_create([
    Node(name='graph_demo'),
    Node(name='psycopg2')
])

graph_demo.parents.add(psycopg2, django)

Let’s have a bit of a look at some of the queries we might need to think about.

-- All root nodes
SELECT node_id, name
  FROM graph_node
  LEFT OUTER JOIN graph_node_parents ON (node_id = from_node_id)
 WHERE to_node_id IS NULL;

As expected, this gives us back all packages that have no dependencies (parents):

 node_id │   name
─────────┼──────────
       6 │ psycopg2
       2 │ pytz
       4 │ asgiref
       3 │ sqlparse
(4 rows)

And now, all packages which are not depended upon by any other packages (that is, nodes with no children):

SELECT node_id, name
  FROM graph_node
  LEFT OUTER JOIN graph_node_parents ON (node_id = to_node_id)
 WHERE from_node_id IS NULL;

We should only have one package here: graph_demo.

From each of these, we can build up a recursive query to get all descendants, or all ancestors of each root/leaf node.

WITH RECURSIVE ancestors AS (
  SELECT node_id, '{}'::INTEGER[] AS ancestors
    FROM graph_node
    LEFT OUTER JOIN graph_node_parents ON (node_id = from_node_id)
   WHERE to_node_id IS NULL

   UNION

  SELECT node.from_node_id,
         ancestors.ancestors || ancestors.node_id
    FROM ancestors
   INNER JOIN graph_node_parents node
           ON (ancestors.node_id = to_node_id)
) SELECT * FROM ancestors;

From here, we can annotate on the names to double check:

WITH RECURSIVE ancestors AS (
  SELECT node_id, '{}'::INTEGER[] AS ancestors
    FROM graph_node
    LEFT OUTER JOIN graph_node_parents ON (node_id = from_node_id)
   WHERE to_node_id IS NULL

   UNION

  SELECT node.from_node_id,
         ancestors.ancestors || ancestors.node_id
    FROM ancestors
   INNER JOIN graph_node_parents node
           ON (ancestors.node_id = to_node_id)
)
SELECT node_id,
       node.name,
       ancestors,
       ARRAY(SELECT name
               FROM unnest(ancestors) node_id
              INNER JOIN graph_node USING (node_id)
       ) AS ancestor_names
  FROM ancestors
  INNER JOIN graph_node node USING (node_id);

So that has given us all ancestor chains: but what about if we just want the closure table: all ancestor/descendant pairs?

WITH RECURSIVE closure_table AS (
  SELECT from_node_id AS descendant,
         to_node_id AS ancestor
    FROM graph_node_parents

   UNION

  SELECT descendant,
         to_node_id AS ancestor
    FROM closure_table
   INNER JOIN graph_node_parents ON (from_node_id = ancestor)
)
SELECT * FROM closure_table

Okay, that was even easier than the previous query.

Once we have our closure table query, then we can look at preventing cycles.

CREATE OR REPLACE RECURSIVE VIEW
graph_closure_table (descendant, ancestor) AS (

  SELECT from_node_id AS descendant,
         to_node_id AS ancestor
    FROM graph_node_parents

   UNION

  SELECT descendant,
         to_node_id AS ancestor
    FROM graph_closure_table
   INNER JOIN graph_node_parents ON (from_node_id = ancestor)
);

And we can now use this in a function to prevent cycles

CREATE OR REPLACE FUNCTION prevent_cycles()
RETURNS TRIGGER AS $$

BEGIN
  IF EXISTS(SELECT 1
              FROM graph_closure_table
             WHERE ancestor = NEW.from_node_id
               AND descendant = NEW.to_node_id
           ) THEN
    RAISE EXCEPTION 'cycle detected';
  END IF;
  RETURN NEW;
END;

$$ LANGUAGE plpgsql STRICT;

CREATE TRIGGER prevent_cycles
BEFORE UPDATE OR INSERT ON graph_node_parents
FOR EACH ROW EXECUTE PROCEDURE prevent_cycles();

And this will prevent us from being able to set an invalid dependency relationship: ie, one that would trigger a cycle:

>>> django.parents.add(graph_demo)
Traceback (most recent call last):
  File "...django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
psycopg2.errors.RaiseException: cycle detected
CONTEXT:  PL/pgSQL function prevent_cycles() line 9 at RAISE

It’s not totally ideal, but it does show how it protects against saving invalid relationships.


Interestingly, if we drop that constraint, we can still run the closure table query: it doesn’t give us an infinite loop, because the view uses a UNION instead of a UNION ALL. It drops any rows that are already in the output when it deals with each row, and since there is not an infinite number of ancestor/descendant combinations for a given set of dependencies, it will eventually return data.

So, where from here? I’m not sure. This was just something that I thought about while answering a question in IRC, and I felt like I needed to explore the idea.

Ratio Ordering in Postgres

A while ago, I had an idea about how to use a materialised view to handle ordering and pagination. Today in #postgresql on IRC, there was a discussion about using ratio ordering. Which reminded me that I also did some work on that.

It’s possible to use both of these techniques together to allow easy access to ordinal position, and more importantly, fast pagination over large data sets. We can also make use of a few other new postgres features to make things even nicer.

CREATE TABLE condition (
  condition_id INTEGER PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
  name TEXT UNIQUE NOT NULL,
  position_a INTEGER NOT NULL,
  position_b INTEGER NOT NULL,
  ratio_position NUMERIC UNIQUE GENERATED ALWAYS AS (position_a::NUMERIC / position_b::NUMERIC) STORED
);

We can add some data to make sure we have correct behaviour on those generated columns:

INSERT INTO condition (name, position_a, position_b)
VALUES ('One', 1, 1),
       ('Two', 2, 1);

We can also try adding in values that should be prevented:

-- Different position_a/position_b, but position_a / position_b clashes.
INSERT INTO condition (name, position_a, position_b)
VALUES ('Three', 2, 2);

-- Not able to divide by zero.
INSERT INTO condition (name, position_a, position_b)
VALUES ('Three', 2, 0);

Okay, it would be really neat if by default we just inserted a new value at the end of the list. However, this would require us to be able to have a DEFAULT on a pair of columns. Perhaps if we were storing these as a custom type, with a custom sequence, we could do that.
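
One workaround, sketched here (and untested), would be a BEFORE INSERT trigger that appends to the end of the list when no position is supplied:

CREATE OR REPLACE FUNCTION condition_default_position()
RETURNS TRIGGER AS $$
BEGIN
  IF NEW.position_a IS NULL OR NEW.position_b IS NULL THEN
    -- One past the largest whole-number position: (MAX + 1) / 1.
    SELECT COALESCE(MAX(CEIL(ratio_position)), 0) + 1, 1
      INTO NEW.position_a, NEW.position_b
      FROM condition;
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER condition_default_position
BEFORE INSERT ON condition
FOR EACH ROW EXECUTE PROCEDURE condition_default_position();

-- INSERT INTO condition (name) VALUES ('At the end');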

Anyway, onward. We can insert values between two others by doing some simple mathematics:

INSERT INTO condition (name, position_a, position_b)
VALUES (
  'One-point-five',
  1 + 2,  -- The sum of the previous and next items' position_a values
  1 + 1   -- The sum of the previous and next items' position_b values
);

And, we can see that they are now sortable by the correct field:

SELECT name FROM condition ORDER BY ratio_position;
      name
----------------
 One
 One-point-five
 Two

(3 rows)

So, what about getting the ordinal positions? We can do that using a row_number() window function, but that’s only useful if we are fetching all of them.

We can re-use the trick of creating a materialised view that contains all of them:

CREATE MATERIALIZED VIEW condition_position AS

SELECT condition_id,
       row_number() OVER () AS position
  FROM condition
  ORDER BY ratio_position;

CREATE UNIQUE INDEX condition_position_condition ON condition_position(condition_id);
CREATE UNIQUE INDEX condition_position_position ON condition_position(position);

And now we can use it:

SELECT *
  FROM condition
 INNER JOIN condition_position USING (condition_id)
 ORDER BY position;

We want this to recalculate whenever any positions change:

CREATE OR REPLACE FUNCTION refresh_positions()
RETURNS TRIGGER AS $$
BEGIN
  EXECUTE 'REFRESH MATERIALIZED VIEW ' || TG_ARGV[0];
  RETURN NEW;
END;
$$ LANGUAGE plpgsql STRICT;

CREATE TRIGGER condition_refresh_positions
AFTER INSERT OR UPDATE OR DELETE OR TRUNCATE
ON condition
FOR EACH STATEMENT
EXECUTE PROCEDURE refresh_positions('condition_position');

Now we can add more data:

INSERT INTO condition (name, position_a, position_b)
VALUES ('Three', 5, 1);

And fetch our data again:

SELECT *
  FROM condition
 INNER JOIN condition_position USING (condition_id)
 ORDER BY position;
 condition_id |      name      | position_a | position_b |     ratio_position     | position
--------------+----------------+------------+------------+------------------------+----------
            1 | One            |          1 |          1 | 1.00000000000000000000 |        1
            2 | Two            |          2 |          1 |     2.0000000000000000 |        2
            5 | One-point-five |          3 |          2 |     1.5000000000000000 |        3
            8 | Three          |          5 |          1 |     5.0000000000000000 |        4

(4 rows)

Trust your tools, or how django's ORM bested me

Within my system, there is a complicated set of rules for determining if a person is “inactive”.

They may have been explicitly marked as inactive, or their company may have been marked as inactive. These are simple to discover and filter to only get active people:

Person.objects.filter(active=True, company__active=True)

The other clause for inactive users is if they only work at locations that have been marked as inactive. This means we can disable a location (within a company that remains active), and not have to manually deactivate the staff who only work at that location; it also means when we reactivate a location, staff will automatically be restored to an active state.

I’ve written the code several times that determines the activity status, but have never really been that happy with it. It generally degenerates into something that uses N+1 queries to discover the activity status of N people, or requires using django’s queryset.extra() method to run queries within the database.

Now, I have a cause to fetch all active staff, from the entire system. Which I had written a query to do, but it was mistakenly including staff who are only active at inactive locations. I tried playing around with .extra(select={...}), but was not able to filter on the pseudo-fields that were generated.

Then, I had the idea to do the following:

active = Location.objects.active()
inactive = Location.objects.inactive()
Person.objects.filter(
  Q(locations__in=active) | ~Q(locations__in=inactive)
)

As long as the objects active and inactive are querysets, they will be lazily evaluated, and the SQL that is generated is relatively concise:

SELECT ... 
FROM "people" 
LEFT OUTER JOIN "people_locations" 
ON ("people"."id" = "people_locations"."person_id") 
WHERE (
  "people_locations"."location_id" IN (
    SELECT U0."id" FROM "location" U0 WHERE U0."status" = 0
  )
  OR NOT ((
    "people"."id" IN (
      SELECT U1."person_id" FROM "people_locations" U1 WHERE (
        U1."location_id" IN (
          SELECT U0."id" FROM "location" U0 WHERE U0."status" = 1
        )
        AND U1."person_id" IS NOT NULL
      )
    ) 
    AND "people"."id" IS NOT NULL)
  )
)
ORDER BY "..." ASC

This is much better than how I had previously done it, and has the bonus of being db-agnostic: whereas my previous solution used Postgres ARRAY types to aggregate the statuses of locations into a list.

The moral of the story: trust your high-level abstraction tools, and use them first. If you still have performance issues, then look at optimising.

Bowled over by Postgres

There’s a nice article at Bowled Over by SQL Window Functions by a chap called Dwain Camps. It’s written from the perspective of T-SQL, which has some differences to Postgres’s DDL and querying. I’ve reworked his stuff into what feels nice for me from a Postgressy perspective.

I’ll recap a couple of things he mentions, but you’ll probably want to head there and read that first.

  • Strike: all pins knocked down on the first ball of a frame. Scores 10 (from this frame), plus the total of whatever the next two balls knock down.
  • Spare: all pins knocked down on the second ball of a frame. Scores 10 (from this frame), plus how ever many pins you knock down on the next ball.
  • Open: at least one pin remains standing at the end of the frame. Score is how ever many pins you knocked down.

By convention, only the running tally is shown on the scoresheet. I’ve kept the frame score as the score for this frame only, and the total will contain the running total.


The first thing I do a bit differently is to use Postgres’ DOMAIN structures, which enables us to remove some of the check constraints, and simplify some others:

CREATE SCHEMA bowling;

CREATE DOMAIN bowling.frame_number AS integer
  CHECK ('[1,10]'::int4range @> VALUE)
  NOT NULL;

CREATE DOMAIN bowling.ball AS integer
  CHECK ('[0,10]'::int4range @> VALUE);

So, now we have two integer domains: the frame number may be between 1 and 10, and the ball pin count may be null, or between 0 and 10.

We’ll start by recreating the table structure: initially without constraints:

CREATE TABLE bowling.frame
(
  game_id INTEGER NOT NULL,
  player_id INTEGER NOT NULL,
  frame bowling.frame_number NOT NULL,
  ball1 bowling.ball NOT NULL,
  ball2 bowling.ball NULL,
  ball3 bowling.ball NULL,
  PRIMARY KEY (game_id, player_id, frame)
);

Not much different to the original, other than using those fresh new domain types.

The other approach I’ve used is to use more constraints, but make them simpler. I’m also relying on the fact that X + NULL => NULL, which means we can leave off a heap of the constraint clauses.
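
A one-line demonstration of why that helps (a CHECK constraint only fails when its expression is false, not when it is NULL):

SELECT 7 + NULL <= 10;  -- NULL, so a CHECK like "ball1 + ball2 <= 10"
                        -- passes when ball2 has not been bowled yet.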

We’ll start by preventing the (non-final) frames from exceeding the number of available pins. In the case of the final frame, we still only allow a spare unless we have a strike already.

ALTER TABLE bowling.frame
ADD CONSTRAINT max_spare_unless_frame_10_strike CHECK
(
  ball1 + ball2 <= 10 OR (frame = 10 AND ball1 = 10)
);

This is as simple as it can be. Because ball 2 may be null but ball 1 may not, and each ball must be no greater than 10, this is enough to encapsulate the requirement. There is one slightly incorrect circumstance: a value of (10, 0) would be valid, which is strictly incorrect (ball 2 was never bowled). In the case of the calculations it’s all correct, but if a 0 was bowled immediately after a strike, it may be possible to insert that as ball 2, which would be misleading.

ALTER TABLE bowling.frame
ADD CONSTRAINT ball_2_never_bowled_after_strike CHECK
(
  ball2 IS NULL OR ball1 < 10 OR frame = 10
);

We can now prevent ball 3 from being set unless we are on frame 10.

ALTER TABLE bowling.frame
ADD CONSTRAINT ball_3_only_in_frame_10 CHECK
(
  ball3 IS NULL OR frame = 10
);

A follow-up to the previous constraint: we only get to bowl ball 3 if we have bowled ball 2, and scored a strike or spare already. Note, we may have a strike on the first ball, which means we might have more than 10.

ALTER TABLE bowling.frame
ADD CONSTRAINT ball3_only_if_eligible CHECK
(
  ball3 IS NULL OR (ball2 IS NOT NULL AND ball1 + ball2 >= 10)
);

Finally, we have some specific allowable conditions for that last ball. We already know that we must have scored a strike or spare with the first two balls, but we need to know how many pins are available to us.

If we scored a spare or two strikes, then we may have any number up to 10 to score now. Otherwise, we have however many pins were left by ball 2 (which means ball 1 must have been a strike).

ALTER TABLE bowling.frame
ADD CONSTRAINT ball3_max_spare_or_strike CHECK
(
  ball2 + ball3 <= 10
  OR
  ball1 + ball2 = 20
  OR
  ball1 + ball2 = 10
);

I find those constraints much easier to read than the original ones.


I’ve written a view, that uses a couple of Common Table Expressions (CTEs), as well as the window functions Dwain discussed.

CREATE OR REPLACE VIEW bowling.frame_score AS (
  WITH pin_counts AS (
    SELECT
      game_id,
      player_id,
      frame,
      ball1, ball2, ball3,
      -- Get the first ball from the next frame.
      -- Used by strike and spare.
      LEAD(ball1, 1) OVER (
        PARTITION BY game_id, player_id
        ORDER BY frame
      ) AS next_ball_1,
      -- Get the second ball from the next frame.
      -- Used by strike.
      LEAD(ball2, 1) OVER (
        PARTITION BY game_id, player_id
        ORDER BY frame
      ) AS next_ball_2,
      -- Get the first ball from the next next frame.
      -- Used by double-strike.
      LEAD(ball1, 2) OVER (
        PARTITION BY game_id, player_id
        ORDER BY frame
      ) AS next_next_ball_1
    FROM bowling.frame
  ),
  frame_counts AS (
    SELECT
      game_id,
      player_id,
      frame,
      ball1, ball2, ball3,
      CASE
      -- We will start with frame 10: when we have a strike
      -- or spare, we get all three balls.
      WHEN frame = 10 AND ball1 + ball2 >= 10 THEN
        ball1 + ball2 + ball3
      -- On a strike, we get the next two balls. This could
      -- be from the next frame, or include the first ball
      -- of the frame after that. Note that in frame 9, we will
      -- also look at the second ball from frame 10, rather than
      -- looking for a non-existent frame 11.
      WHEN ball1 = 10 THEN
        ball1 + next_ball_1 + (
          CASE WHEN next_ball_1 = 10 AND frame < 9 THEN
            next_next_ball_1
          ELSE
            next_ball_2
          END
        )
      -- In the case of a spare, grab the next ball. We already
      -- handled a spare on frame 10 above.
      WHEN ball1 + ball2 = 10 THEN
        ball1 + ball2 + next_ball_1
      -- Otherwise, it's just the two balls we bowled in this frame.
      ELSE
        ball1 + ball2
      END AS score
    FROM pin_counts
  )

  -- We have everything we need in the previous CTE, except that
  -- shows us the frame score, rather than the running tally.
  -- We need to do that in another window function here, but
  -- only calculate a value when the frame's wave function has
  -- collapsed (ie, its score is known).
  SELECT
    frame_counts.*,
    CASE WHEN score IS NOT NULL THEN
      SUM(score) OVER (
        PARTITION BY game_id, player_id
        ORDER BY frame
        ROWS UNBOUNDED PRECEDING
      )
    ELSE NULL END AS total
  FROM frame_counts
);

Again, I think this is a simpler query, and easier to read. But, I guess I wrote it.

We can insert the same data as used there, and look at our results:

-- Game 1
INSERT INTO bowling.frame VALUES
  (1, 1, 1, 7, 2, NULL),
  (1, 1, 2, 3, 7, NULL),
  (1, 1, 3, 6, 4, NULL),
  (1, 1, 4, 10, NULL, NULL),
  (1, 1, 5, 10, NULL, NULL),
  (1, 1, 6, 10, NULL, NULL),
  (1, 1, 7, 9, 1, NULL),
  (1, 1, 8, 10, NULL, NULL),
  (1, 1, 9, 8, 1, NULL),
  (1, 1, 10, 6, 3, NULL);

-- Game 2
INSERT INTO bowling.frame VALUES
  (2, 1, 1, 10, NULL, NULL),
  (2, 1, 2, 3, 7, NULL),
  (2, 1, 3, 10, NULL, NULL),
  (2, 1, 4, 6, 4, NULL),
  (2, 1, 5, 10, NULL, NULL),
  (2, 1, 6, 9, 1, NULL),
  (2, 1, 7, 10, NULL, NULL),
  (2, 1, 8, 8, 2, NULL),
  (2, 1, 9, 10, NULL, NULL),
  (2, 1, 10, 7, 3, 10);

-- Game 3
INSERT INTO bowling.frame VALUES
  (3, 1, 1, 10, NULL, NULL),
  (3, 1, 2, 10, NULL, NULL),
  (3, 1, 3, 10, NULL, NULL),
  (3, 1, 4, 10, NULL, NULL),
  (3, 1, 5, 10, NULL, NULL),
  (3, 1, 6, 10, NULL, NULL),
  (3, 1, 7, 10, NULL, NULL),
  (3, 1, 8, 10, NULL, NULL),
  (3, 1, 9, 10, NULL, NULL),
  (3, 1, 10, 10, 10, 10);
$ SELECT * FROM bowling.frame_score;

 game_id | player_id | frame | ball1 | ball2  | ball3  | score | total
---------+-----------+-------+-------+--------+--------+-------+-------
       1 |         1 |     1 |     7 |      2 | <NULL> |     9 |     9
       1 |         1 |     2 |     3 |      7 | <NULL> |    16 |    25
       1 |         1 |     3 |     6 |      4 | <NULL> |    20 |    45
       1 |         1 |     4 |    10 | <NULL> | <NULL> |    30 |    75
       1 |         1 |     5 |    10 | <NULL> | <NULL> |    29 |   104
       1 |         1 |     6 |    10 | <NULL> | <NULL> |    20 |   124
       1 |         1 |     7 |     9 |      1 | <NULL> |    20 |   144
       1 |         1 |     8 |    10 | <NULL> | <NULL> |    19 |   163
       1 |         1 |     9 |     8 |      1 | <NULL> |     9 |   172
       1 |         1 |    10 |     6 |      3 | <NULL> |     9 |   181
       2 |         1 |     1 |    10 | <NULL> | <NULL> |    20 |    20
       2 |         1 |     2 |     3 |      7 | <NULL> |    20 |    40
       2 |         1 |     3 |    10 | <NULL> | <NULL> |    20 |    60
       2 |         1 |     4 |     6 |      4 | <NULL> |    20 |    80
       2 |         1 |     5 |    10 | <NULL> | <NULL> |    20 |   100
       2 |         1 |     6 |     9 |      1 | <NULL> |    20 |   120
       2 |         1 |     7 |    10 | <NULL> | <NULL> |    20 |   140
       2 |         1 |     8 |     8 |      2 | <NULL> |    20 |   160
       2 |         1 |     9 |    10 | <NULL> | <NULL> |    20 |   180
       2 |         1 |    10 |     7 |      3 |     10 |    20 |   200
       3 |         1 |     1 |    10 | <NULL> | <NULL> |    30 |    30
       3 |         1 |     2 |    10 | <NULL> | <NULL> |    30 |    60
       3 |         1 |     3 |    10 | <NULL> | <NULL> |    30 |    90
       3 |         1 |     4 |    10 | <NULL> | <NULL> |    30 |   120
       3 |         1 |     5 |    10 | <NULL> | <NULL> |    30 |   150
       3 |         1 |     6 |    10 | <NULL> | <NULL> |    30 |   180
       3 |         1 |     7 |    10 | <NULL> | <NULL> |    30 |   210
       3 |         1 |     8 |    10 | <NULL> | <NULL> |    30 |   240
       3 |         1 |     9 |    10 | <NULL> | <NULL> |    30 |   270
       3 |         1 |    10 |    10 |     10 |     10 |    30 |   300
(30 rows)
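
Since frame_score is just a view, pulling out a single scoresheet is an ordinary filtered query. For example, the perfect game (game 3) from the data above would be something like:

SELECT frame, ball1, ball2, ball3, score, total
FROM frame_score
WHERE game_id = 3 AND player_id = 1
ORDER BY frame;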

Building the average score for a player is similarly straightforward. Because I’m using a VIEW, I can just reference that.

SELECT
  player_id,
  AVG(total) as average
FROM frame_score
WHERE frame=10
GROUP BY player_id;
 player_id |       average
-----------+----------------------
         1 | 227.0000000000000000
(1 row)
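
Other aggregates fall out of the same view just as easily: for instance, a player’s best and worst completed games, sketched with the same frame=10 filter:

SELECT
  player_id,
  MAX(total) as best_game,
  MIN(total) as worst_game
FROM frame_score
WHERE frame=10
GROUP BY player_id;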

I’m fairly sure I’ve rewritten the constraints correctly, but I may have missed some. Here are some tests that show the constraints rejecting invalid rows:

$ INSERT INTO bowling.frame VALUES(1, 2, 0, 9, NULL, NULL);
ERROR:  value for domain frame_number violates check constraint "frame_number_check"
Time: 0.405 ms
$ INSERT INTO bowling.frame VALUES(1, 2, 1, 11, NULL, NULL);
ERROR:  value for domain ball violates check constraint "ball_check"
Time: 0.215 ms
$ INSERT INTO bowling.frame VALUES(1, 2, 1, -1, NULL, NULL);
ERROR:  value for domain ball violates check constraint "ball_check"
Time: 0.218 ms
$ INSERT INTO bowling.frame VALUES(1, 2, 1, 8, 3, NULL);
ERROR:  new row for relation "frame" violates check constraint "max_spare_unless_frame_10_strike"
DETAIL:  Failing row contains (1, 2, 1, 8, 3, null).
Time: 0.332 ms
$ INSERT INTO bowling.frame VALUES(1, 2, 1, 8, 1, 1);
ERROR:  new row for relation "frame" violates check constraint "ball3_only_if_eligible"
DETAIL:  Failing row contains (1, 2, 1, 8, 1, 1).
Time: 0.392 ms
$ INSERT INTO bowling.frame VALUES(1, 2, 1, 8, 2, 1);
ERROR:  new row for relation "frame" violates check constraint "ball_3_only_in_frame_10"
DETAIL:  Failing row contains (1, 2, 1, 8, 2, 1).
Time: 0.327 ms
$ INSERT INTO bowling.frame VALUES(1, 2, 10, 8, 3, 1);
ERROR:  new row for relation "frame" violates check constraint "max_spare_unless_frame_10_strike"
DETAIL:  Failing row contains (1, 2, 10, 8, 3, 1).
Time: 0.340 ms
$ INSERT INTO bowling.frame VALUES(1, 2, 10, 8, 2, 11);
ERROR:  value for domain ball violates check constraint "ball_check"
Time: 0.200 ms
$ INSERT INTO bowling.frame VALUES(1, 2, 10, 10, NULL, 10);
ERROR:  new row for relation "frame" violates check constraint "ball3_only_if_eligible"
DETAIL:  Failing row contains (1, 2, 10, 10, null, 10).
Time: 0.316 ms
$ INSERT INTO bowling.frame VALUES(1, 2, 5, 10, 0, NULL);
ERROR:  new row for relation "frame" violates check constraint "ball_2_never_bowled_after_strike"
DETAIL:  Failing row contains (1, 2, 5, 10, 0, null).
Time: 0.307 ms
$ INSERT INTO bowling.frame VALUES(1, 2, 10, 10, 2, 9);
ERROR:  new row for relation "frame" violates check constraint "ball3_max_spare_or_strike"
DETAIL:  Failing row contains (1, 2, 10, 10, 2, 9).
Time: 0.323 ms
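
If you’d rather have these as repeatable tests than a psql transcript, pgTAP’s throws_ok() can assert that a statement fails with a given SQLSTATE (23514 is check_violation, which covers both the table and domain constraints here). A sketch covering the first two cases:

BEGIN;

SELECT plan(2);

-- Frame numbers outside 1-10 violate the frame_number domain.
SELECT throws_ok(
  'INSERT INTO bowling.frame VALUES (1, 2, 0, 9, NULL, NULL)',
  '23514'
);

-- A single ball cannot knock down more than ten pins.
SELECT throws_ok(
  'INSERT INTO bowling.frame VALUES (1, 2, 1, 11, NULL, NULL)',
  '23514'
);

SELECT * FROM finish();

ROLLBACK;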

Finally, I rewrote the pretty printer. It’s not quite perfect (I don’t like the plus signs that psql adds where a value contains a newline), but it is workable:

WITH symbols AS (
  SELECT
    game_id, player_id, frame,
    CASE WHEN ball1 = 10 THEN 'X' ELSE ball1::text END as ball1,
    CASE WHEN ball2 IS NULL THEN ' '
         -- A spare: ten pins across two balls (but not a strike).
         WHEN ball1 < 10 AND ball1 + ball2 = 10 THEN '/'
         -- Only possible in frame 10, following a strike.
         WHEN ball2 = 10 THEN 'X'
         ELSE ball2::text
         END as ball2,
    CASE WHEN ball3 IS NULL THEN ' '
         -- Ball 3 only pairs with ball 2 (for a spare) when the
         -- frame opened with a strike.
         WHEN ball1 = 10 AND ball2 < 10 AND ball2 + ball3 = 10 THEN '/'
         WHEN ball3 = 10 THEN 'X'
         ELSE ball3::text
         END as ball3,
    lpad(total::text, 5, ' ') as total
  FROM
    frame_score
  ORDER BY game_id, player_id, frame
), grouped_data AS (
  SELECT
    game_id,
    player_id,
    array_agg(ball1) ball1,
    array_agg(ball2) ball2,
    array_agg(ball3) ball3,
    array_agg(total) total
  FROM
    symbols
  GROUP BY
    game_id, player_id
)
SELECT
  game_id,
  player_id,
  ball1[1] || ' | ' || ball2[1] || ' ' || chr(10) || total[1] AS "1",
  ball1[2] || ' | ' || ball2[2] || ' ' || chr(10) || total[2] AS "2",
  ball1[3] || ' | ' || ball2[3] || ' ' || chr(10) || total[3] AS "3",
  ball1[4] || ' | ' || ball2[4] || ' ' || chr(10) || total[4] AS "4",
  ball1[5] || ' | ' || ball2[5] || ' ' || chr(10) || total[5] AS "5",
  ball1[6] || ' | ' || ball2[6] || ' ' || chr(10) || total[6] AS "6",
  ball1[7] || ' | ' || ball2[7] || ' ' || chr(10) || total[7] AS "7",
  ball1[8] || ' | ' || ball2[8] || ' ' || chr(10) || total[8] AS "8",
  ball1[9] || ' | ' || ball2[9] || ' ' || chr(10) || total[9] AS "9",
  ball1[10] || ' | ' || ball2[10] || ' | ' || ball3[10] || ' ' || chr(10) || lpad(total[10], 9, ' ') AS "10"
FROM grouped_data;
 game_id | player_id |   1    |   2    |   3    |   4    |   5    |   6    |   7    |   8    |   9    |     10
---------+-----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+------------
       1 |         1 | 7 | 2 +| 3 | / +| 6 | / +| X |   +| X |   +| X |   +| 9 | / +| X |   +| 8 | 1 +| 6 | 3 |   +
         |           |     9  |    25  |    45  |    75  |   104  |   124  |   144  |   163  |   172  |       181
       2 |         1 | X |   +| 3 | / +| X |   +| 6 | / +| X |   +| 9 | / +| X |   +| 8 | / +| X |   +| 7 | / | X +
         |           |    20  |    40  |    60  |    80  |   100  |   120  |   140  |   160  |   180  |       200
       3 |         1 | X |   +| X |   +| X |   +| X |   +| X |   +| X |   +| X |   +| X |   +| X |   +| X | X | X +
         |           |    30  |    60  |    90  |   120  |   150  |   180  |   210  |   240  |   270  |       300
(3 rows)

That will do for now. Corrections welcome!

slugify() for postgres (almost)

A recent discussion in #django suggested “what we need is a PG slugify function”.

The actual algorithm in Django for this is fairly simple, and easy to follow. Shouldn’t be too hard to write it in SQL.

Function slugify(value, allow_unicode=False).

  • Convert to ASCII if allow_unicode is false
  • Remove characters that aren’t alphanum, underscores, hyphens
  • Strip leading/trailing whitespace
  • Convert to lowercase
  • Convert spaces to hyphens
  • Remove repeated hyphens

(As an aside, the comment in the django function is slightly misleading: if you followed the algorithm there, you’d get a different result with respect to leading/trailing whitespace. I shall submit a PR).

We can write an SQL function that uses the Postgres unaccent extension to get pretty close:

CREATE OR REPLACE FUNCTION slugify("value" TEXT, "allow_unicode" BOOLEAN)
RETURNS TEXT AS $$

  WITH "normalized" AS (
    SELECT CASE
      WHEN "allow_unicode" THEN "value"
      ELSE unaccent("value")
    END AS "value"
  ),
  "remove_chars" AS (
    SELECT regexp_replace("value", E'[^\w\s-]', '', 'gi') AS "value"
    FROM "normalized"
  ),
  "lowercase" AS (
    SELECT lower("value") AS "value"
    FROM "remove_chars"
  ),
  "trimmed" AS (
    SELECT trim("value") AS "value"
    FROM "lowercase"
  ),
  "hyphenated" AS (
    SELECT regexp_replace("value", E'[-\s]+', '-', 'gi') AS "value"
    FROM "trimmed"
  )
  SELECT "value" FROM "hyphenated";

$$ LANGUAGE SQL STRICT IMMUTABLE;

I’ve used a CTE to get each step as a separate query: you can do it with just two levels if you don’t mind looking at nested function calls:

CREATE OR REPLACE FUNCTION slugify("value" TEXT, "allow_unicode" BOOLEAN)
RETURNS TEXT AS $$

  WITH "normalized" AS (
    SELECT CASE
      WHEN "allow_unicode" THEN "value"
      ELSE unaccent("value")
    END AS "value"
  )
  SELECT regexp_replace(
    trim(
      lower(
        regexp_replace(
          "value",
          E'[^\\w\\s-]',
          '',
          'gi'
        )
      )
    ),
    E'[-\\s]+', '-', 'gi'
  ) FROM "normalized";

$$ LANGUAGE SQL STRICT IMMUTABLE;

To get the default value for the second argument, we can have an overloaded version with only a single argument:

CREATE OR REPLACE FUNCTION slugify(TEXT)
RETURNS TEXT AS 'SELECT slugify($1, false)' LANGUAGE SQL IMMUTABLE STRICT;
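
A quick sanity check of the single-argument version, which should give us hello-world:

=# SELECT slugify('Hello, World!');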

Now for some tests. I’ve been using pgTAP lately, so here’s some tests using that:

BEGIN;

SELECT plan(7);

SELECT is(slugify('Hello, World!', false), 'hello-world');
SELECT is(slugify('Héllø, Wørld!', false), 'hello-world');
SELECT is(slugify('spam & eggs', false), 'spam-eggs');
SELECT is(slugify('spam & ıçüş', true), 'spam-ıçüş');
SELECT is(slugify('foo ıç bar', true), 'foo-ıç-bar');
SELECT is(slugify('    foo ıç bar', true), 'foo-ıç-bar');
SELECT is(slugify('你好', true), '你好');

SELECT * FROM finish();

ROLLBACK;

And we get one failing test:

=# SELECT is(slugify('你好', true), '你好');

          is
──────────────────────
 not ok 7            ↵
 # Failed test 7     ↵
 #         have:     ↵
 #         want: 你好
(1 row)

Time: 2.004 ms

It seems there is no way to get the equivalent of the python re.U flag on a postgres regular expression function, so that is as close as we can get.

Adding JSON operators to PostgreSQL

Don’t use the code in this post. Instead, read: Adding JSON(B) operators to PostgreSQL

Notably, the new JSONB data type in postgres is missing some of the features that the hstore data type has had for some time.

hstore - text     : delete key from left operand
hstore - text[]   : delete keys from left operand
hstore - hstore   : delete matching pairs from left operand

However, using two awesome features of Postgres, it’s possible to add these operators in.

Firstly, python as a language in postgres:

matt=# CREATE LANGUAGE plpythonu;

Then, you can write functions in python that import standard library modules, like json.

CREATE OR REPLACE FUNCTION json_subtract(json_object json, key text) RETURNS json AS $body$
import json
# PL/Python hands json values to us as strings, so parse, mutate, re-dump.
data = json.loads(json_object)
data.pop(key, None)
return json.dumps(data)
$body$ LANGUAGE plpythonu;
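
This can already be called as a regular function:

matt=# SELECT json_subtract('{"a":1, "b":2}'::json, 'a');
 json_subtract
---------------
 {"b": 2}
(1 row)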

Finally, you can now overload the operator to get the syntactic sugar you want:

CREATE OPERATOR - (
  LEFTARG = json,
  RIGHTARG = text,
  PROCEDURE = json_subtract
);

Now, you can use the same syntax as for hstore:

matt=# SELECT '{"a":1, "b":2}'::json - 'a'::text;
 ?column?
----------
 {"b": 2}
(1 row)

It’s possible to repeat this pattern for the other subtraction operators:

CREATE OR REPLACE FUNCTION json_subtract(json_object json, keys text[]) RETURNS json AS $body$
import json
data = json.loads(json_object)
for key in keys:
    data.pop(key, None)
return json.dumps(data)
$body$ LANGUAGE plpythonu;

CREATE OPERATOR - (
  LEFTARG = json,
  RIGHTARG = text[],
  PROCEDURE = json_subtract
);

CREATE OR REPLACE FUNCTION json_subtract(json_object json, pairs json) RETURNS json AS $body$
import json
data = json.loads(json_object)
pairs_data = json.loads(pairs)
for key, value in pairs_data.items():
    # Only delete keys where the value also matches: hstore - hstore semantics.
    if key in data and data[key] == value:
        data.pop(key)
return json.dumps(data)
$body$ LANGUAGE plpythonu;

CREATE OPERATOR - (
  LEFTARG = json,
  RIGHTARG = json,
  PROCEDURE = json_subtract
);

I’ll leave it as an exercise to write functions for the other operators.
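
To get you started on that exercise: a concatenation operator, with the right-hand operand winning on duplicate keys (as hstore || hstore does), might look something like this sketch (the function and parameter names are my own choice):

CREATE OR REPLACE FUNCTION json_concatenate(json_left json, json_right json) RETURNS json AS $body$
import json
data = json.loads(json_left)
# Later keys win, matching the hstore concatenation semantics.
data.update(json.loads(json_right))
return json.dumps(data)
$body$ LANGUAGE plpythonu;

CREATE OPERATOR || (
  LEFTARG = json,
  RIGHTARG = json,
  PROCEDURE = json_concatenate
);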