Codeship PR checker

Continuous Integration (CI) is great: basically you have a system that ensures that every time a change is committed your full test suite is run, and any failures are reported back. In our case, we have tests that take around 40 minutes to run (because we have lots of tests that need to create data and then ensure that things work based on that data), so being able to have that happen while you continue work is really nice. On top of this, we use Codecov to check that coverage does not decrease on a given commit.

Our general workflow is that we create a branch (in Mercurial, branches are different to git branches: they are long-lived, and all code committed on a branch retains that reference, meaning it’s easy to associate code with Jira issues), and create a Pull Request when the code is ready for review. At any one time there may be several Pull Requests open, which may or may not affect the same files.

Tests are run by Codeship (our CI service) against every commit, so it’s easy to see if a given commit is valid for merging, but there’s no way to know if a commit will (a) merge cleanly, and (b) still pass tests after merging, at least until you merge it. (a) is handled by BitBucket (it shows us if a merge will apply cleanly), but (b) is still a problem.

Until yesterday.

I was able to leverage the “Test Pipelines” feature of Codeship to make a pipeline that checks if the current branch is default, and if it is not, then it attempts an automated merge, followed by running all of the tests. It doesn’t send results to Codecov, because the commit that would be covered does not exist yet, but it does report errors if it fails to merge, or fails to run tests correctly.

if [ "$CI_BRANCH" != "default" ] ; then hg merge -r default --tool internal:merge; detox; fi

Notes: we use Mercurial, so the branch name and the merge command are based on that. We also use tox, with detox to run the test environments in parallel.

This catches some big issues we had, where two migrations in one app were created in different branches, and the migrations framework is unable to deal with that. It could also pick up other issues, where a merge results in code that does not work, even though both commits contained fully passing code.

It’s not quite PR testing, because if the target revision changes (ie, a different PR is merged), then it should run the merge-checking pipeline again, but I’m not sure how to trigger this.

HomeKit Pairing Issues (HAP-python)

There were quite a few changes to HAP-python that I hadn’t kept up with in my MQTT to HomeKit bridge; at some point I must have updated the installed version of that package, because after restarting my computer all sorts of things stopped working.

I spent some time getting code to actually execute again, but had an issue where it was still not working. All of the code was running as expected, but HomeKit was just failing to recognise anything. So, I unpaired and attempted to re-pair.

It failed to pair.

Well, technically, it paired, but then unpaired immediately.

It turns out that if the JSON data that is sent to HomeKit is invalid (semantically, in this case: it was valid JSON data, just not quite valid HAP data), then it will unpair - if the device is already paired, it will just appear as unavailable.

I had some custom code that built up the Information Services slightly differently, but my method of bypassing the standard HAP-python code that adds them no longer worked, so my bridge, and all of my accessories, ended up with two Information Services.

Fixing this meant that I was able to pair correctly again.

MQTT HomeKit Bridge

Writing HomeKit devices is possible (and even simple) using tools like HAP-python. However, devices like the ESP8266 are slow to do the HomeKit handshake, and having to keep them awake to read temperature or other data on demand means you can’t use the deep sleep features.

These IoT devices can, however, quite easily handle publishing to an MQTT topic.

I’ve read most of the HomeKit Accessory Protocol spec (at least, the non-commercial one, but you’ll still need credentials to view that link), and I think I have a pretty good handle on it. And it occurred to me that it should be possible to bridge, in both directions, an MQTT broker and HomeKit.

Basically, you can then have a single bridge device (that you only need to register in HomeKit once), and have this connect to your MQTT broker. It can then perform two actions:

  • Listen for MQTT messages that meet certain criteria, and pass these through to HomeKit
  • Listen for HomeKit messages, and convert these into MQTT messages.

There’s a bit more to it than that: it keeps track of what devices are known, and will automatically add new devices when it detects one (via a matching MQTT topic). It could also remove devices that have not been seen for some time (or when a specific message indicates that device is no longer available).

I’ve chosen to make this as simple as possible - at this stage of my prototype there is no authentication in the MQTT broker, but that will have to change before I hook up anything other than temperature sensors. My Garage Door opener is still a standalone HomeKit device!

So, down to the nuts and bolts.

A message that matches the following pattern will be processed:

HomeKit/<device_id>/<service_type>/<characteristic_name>

For instance, I can currently see some messages that look like:

HomeKit/esp8266_12345678/TemperatureSensor/CurrentTemperature 20
HomeKit/esp8266_12345678/HumiditySensor/CurrentRelativeHumidity 58
HomeKit/123456789ABCDEF/TemperatureSensor/CurrentTemperature 20.125
HomeKit/TEST/Switch/On 1

The thing you might notice is that two of those messages have the same device id - the bridge knows this, and will add a second service to the accessory.
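
To make that concrete, here is a minimal sketch of the MQTT-to-HomeKit direction, using paho-mqtt. The devices dictionary stands in for the real HAP-python accessory objects - the structure here is an assumption for illustration, not the actual bridge code.

import paho.mqtt.client as mqtt

# Known devices, keyed by device id. In the real bridge each entry would be a
# HAP-python accessory; here a plain dict stands in for it.
devices = {}


def on_message(client, userdata, msg):
    # Topics look like HomeKit/<device_id>/<service_type>/<characteristic_name>
    parts = msg.topic.split('/')
    if len(parts) != 4 or parts[0] != 'HomeKit':
        return
    _, device_id, service_type, characteristic = parts

    # A second service on the same device id just extends the existing entry.
    services = devices.setdefault(device_id, {})
    services.setdefault(service_type, {})[characteristic] = msg.payload.decode()


client = mqtt.Client()
client.on_message = on_message
client.connect('localhost')
client.subscribe('HomeKit/#')
client.loop_forever()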

To be honest, this solution seems too simple, but it has been working really well for me for some time now. I have configured the sensors to send retain (persistent) messages, but I think I’m going to turn that off, except in the case of things like the switch device.

The other thing I haven’t totally nutted out yet is the authentication/authorisation stuff for MQTT. I have had some thoughts at this point though:

  • A device will generate a password when it first boots (and store it) - the device side of this is sketched below the list.
  • This password will be used with the device id to authenticate with the broker.
  • When the client attempts to connect, a check will be made to see if the user exists - if so, the password must match. If not, the user will be created.
  • Any user created in this manner will be able to read and write topics that match HomeKit/<user_id>/#
  • A special user (the HomeKit bridge user) must be able to read and write all HomeKit/# topics.
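
The first two points are the device’s responsibility. A sketch of what that might look like in MicroPython, assuming uos.urandom is available on the port, and picking an arbitrary file name for storage:

import machine
import ubinascii
import uos

# Device id derived from the chip id, matching the esp8266_... topics above.
DEVICE_ID = b'esp8266_' + ubinascii.hexlify(machine.unique_id())


def get_password():
    # Generate a password on first boot, and store it for subsequent boots.
    try:
        with open('mqtt_password.txt') as f:
            return f.read()
    except OSError:
        password = ubinascii.hexlify(uos.urandom(16)).decode()
        with open('mqtt_password.txt', 'w') as f:
            f.write(password)
        return password

The broker-side points (create the user on first connect, otherwise require a matching password, and restrict each user to its own HomeKit/<user_id>/# topics) would live in whatever authentication plugin the broker provides.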

Now that I’ve gotten OTA working with these devices, I need some mechanism for triggering this via MQTT.

OTA firmware updates with MicroPython/ESP8266

It’s been a hard couple of days, but I’ve finally managed to get OTA (over-the-air) firmware updates working on an ESP8266 running MicroPython.

Building and flashing OTA firmware

This assumes you have a working ESP toolchain (maybe see this comment if you don’t yet), and have been able to build and flash custom micropython firmware.

The bootloader/OTA software is available at yaota8266, but the instructions for how to actually build and flash the initial OTA server and regular app are not complete. Here’s what I ended up having to do to get it to work.

  • Before you build, you need to generate keys:
      $ cd ota-client
      $ . gen_keys.sh
      $ python -c "import rsa_sign; rsa_sign.dump_c(rsa_sign.load_key())"
  • The last step outputs some data from the key - specifically you need the mod value, and this needs to be used in a config.h file at the root of the project. Duplicate the config.h.example file, and replace the MODULUS value with your public key’s modulus.

  • I needed to make some other changes too: until they are merged, you can see them here. They are separate pull requests, because whilst they all need to be applied to get a build to complete, they are all independent.

  • Once these patches have been applied, run make at the root of the project.

  • Flash the generated yaota8266.bin file to location 0x0 of your device.

  • Flash the micropython OTA build to location 0x3c000 of your device.

You should now be able to apply OTA updates. I also included in my ota-server the patch from ulno that allows for triggering OTA from within micropython code. I added into my micropython firmware a file ota.py, that contains:

import machine

def start():
    machine.RTC().memory('yaotaota')
    machine.reset()

This means I can trigger an OTA state by doing:

import ota
ota.start()

Performing an OTA update

In order to do this, you must have built and flashed your own yaota8266, with your own keys. Otherwise you won’t be able to sign the updates, and the ota-server binary will reject them.

Trigger an OTA state (either using the process above, or by hitting a button in the 3 seconds after the bootloader is complete).

Then, in the ota-client directory, execute the following:

$ ota_client.py live-ota <../../micropython/ports/esp8266/build/firmware-ota.bin> -a <10.9.8.160>

This will send off the 2274-ish packets to the device, after which you should have your new firmware installed. Note that the firmware path and IP address above are placeholders: you’ll need to supply your own.


Notes that I came across during this process:

  • If you are able to flash both images correctly, but get errors from the micropython app, check that you aren’t flashing a non-ota build. In my case, the boot sequence was performing as expected, but when no OTA state was detected, and the micropython app was due to start, it was crashing with an error:
     Fatal exception (0):
     epc1=0x4020a8ac, epc2=0x00000000, e...
  • I was unable to get an OTA update to work on my ESP8266-01 device. This could be because it doesn’t have enough flash, or some other reason.

  • The sonoff device(s) I have require flashing using -fm dout mode: with other modes a write_flash will appear to succeed, but the binary will not be viable, and only garbage will be seen on the screen.

  • Some USB-serial converters do not provide enough current to power an ESP8266 running in anything other than flash-from-UART mode. That is, you will be able to flash them, but not boot into regular apps.

  • Having a USB-serial converter that has RTS/DTR means you can have esptool.py automatically trigger boot mode. This is nice. It’s a shame that the sonoff devices don’t expose enough pins to hook this up though.


As an aside, my micropython builds contain the mqtt client library, and automatically set up the WiFi credentials for the network they are installed in. ESP8266 devices retain their WiFi credentials and attempt to restore their connection state after restart, so that can simplify code somewhat, and you can just edit the configuration in the modules/inisetup.py module.
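
For what it’s worth, the station setup that gets baked into modules/inisetup.py is just standard MicroPython network code - something along these lines, with placeholder credentials:

import network


def connect(ssid='my-ssid', password='my-password'):
    # Connect as a station; the ESP8266 remembers these credentials afterwards.
    sta = network.WLAN(network.STA_IF)
    sta.active(True)
    if not sta.isconnected():
        sta.connect(ssid, password)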

Now, I just need to come up with some mechanism for triggering the OTA state in a device using MQTT, and for devices that are not continually awake (I have several sensors that use deep-sleep for some time, and then wake up, publish an MQTT message and then go back to sleep).

More JSONB querying

Occasionally, I get emails from people regarding specific queries in Postgres, usually because I have blogged about JSONB querying before.

Today, I got one: rather than just reply, I thought I’d blog about how queries could be written to solve this problem.

Our table can be a single column with JSONB data for the purposes of this.

CREATE TABLE priority (data JSONB);

We also need a bit of data to query:

INSERT INTO priority (data) VALUES (
'{
  "id": "02e32a14-904c-4153-a32b-fe8d1f1bbbe1",
  "entity": "activity",
  "fields": {
    "subject": [
      {"val": "Subject", "priority": 7}
    ]
  },
  "recordStatusType": "active"
}'), (
'{
  "id": "b33498b2-32f6-4575-b2cd-9e9a1ae2059d",
  "entity": "activity",
  "fields": {
    "subject": [
      {"val": "Subject", "priority": 4}
    ]
  },
  "recordStatusType": "active"
}'), (
'{
  "id": "a2d327d2-7668-4dc0-ae1d-d6144130e3ec",
  "entity": "activity",
  "fields": {
    "object": [],
    "subject": [
      {"val": "Object", "priority": 1},
      {"val": "Target", "priority": 7}
    ]
  }
}'), (
'{
  "id": "3bc8b536-00af-4fc7-881e-b88b620ac436",
  "entity": "activity",
  "fields": {
    "object": [
      {"val": "Object", "priority": 9}
    ]
  }
}'
);

The problem requires selection of the data rows where priority is greater than 5.

I’ve extended the data provided: I’m not sure if there will be multiple “fields”, but I assume so. I also assume that a match for any priority within a subject field will be required.

Let’s start with a simpler version: get the records where the first fields->subject priority is greater than 5 (I’ll return just the id, to make it simpler):

SELECT data->'id'
  FROM priority
 WHERE (data#>>'{fields,subject,0,priority}')::INTEGER > 5;

 "02e32a14-904c-4153-a32b-fe8d1f1bbbe1"

This uses the #>> operator - which does a path lookup, and returns a string value, that we then cast to an integer for the comparison. Note that the path lookup differs from normal Postgres array indexing, in that it uses 0 as the first index, rather than 1.

But, we want to query for all rows where any subject field has a priority greater than 5.

We’ll want to use jsonb_array_elements (the JSONB equivalent of unnest). We can use that to get the subject entries themselves:

SELECT jsonb_array_elements(data#>'{fields,subject}') FROM priority;

Note this uses the #> operator, because we still want JSONB data:

       jsonb_array_elements
──────────────────────────────────
 {"val": "Subject", "priority": 7}
 {"val": "Subject", "priority": 4}
 {"val": "Object", "priority": 1}
 {"val": "Target", "priority": 7}
(4 rows)

We can get a bit further too:

SELECT jsonb_array_elements(data#>'{fields,subject}')->'priority' FROM priority;

Indeed, we can get all the way to our boolean test:

SELECT (jsonb_array_elements(data#>'{fields,subject}')->>'priority')::INTEGER > 5 FROM priority;
 ?column?
─────────
 t
 f
 f
 t
(4 rows)

But we want the data rows themselves, not just the matching subject fields, so this on its own is not that useful. Instead, we can use the fact that jsonb_array_elements returns a set, and use it as a subquery in our WHERE clause, with Postgres’ expression operator ANY (subquery) construct:

SELECT data->'id'
  FROM priority
 WHERE 5 < ANY(SELECT (jsonb_array_elements(data#>'{fields,subject}')->>'priority')::INTEGER)

This means that we want only the records where 5 is less than any of the priority values in subject fields.

                ?column?
────────────────────────────────────────
 "02e32a14-904c-4153-a32b-fe8d1f1bbbe1"
 "a2d327d2-7668-4dc0-ae1d-d6144130e3ec"

I hope this helps, Paulo!

Django multitenancy using Postgres Row Level Security

Quite some time ago, I did some experiments in using Postgres Row Level Security (RLS) from within Django.

It occurred to me that this philosophy could be used to model a multi-tenant application.

The big problem with django-boardinghouse is that you have to apply migrations to multiple schemata. With many tenants, this can take a long time, and it’s not easy to do in a way that limits downtime.

On the other hand, RLS means that the database restricts which rows of specific tables need to be shown in a given circumstance. Normally, examples of RLS show this by using a different user, but this is not necessary.

In fact, in most modern web applications, a single database user is used for all connections. This has some big benefits (in that a connection to the database can belong to a pool, and be shared by different requests). Luckily, there are other ways to have RLS applied.

One method is to use Postgres’ session variables. This is outlined quite well in Application users vs. Row Level Security. I’m going to just use simple session variables, as the facility for doing this will be encapsulated, and based on a key in the Django session - which users cannot set directly. If someone has access to this (or access to setting a Postgres session variable directly), then they have enough access to do whatever they want.

There are some caveats: specifically, the Postgres user must not be a SUPERUSER, but that’s easy to sort out. We’ll be able to continue to use PgBouncer or similar, but only if we use session pooling (not transaction pooling).


Now, mirroring the previous post, we have a few things that need to happen:

  • We will need some middleware that sets the (postgres) session variable.
  • We may want to have some mechanism for switching tenants (unless a user is tied to a single tenant).
  • We must have a Tenant model of some sort (because we’ll be using foreign keys to this to indicate a given row belongs to a given tenant).
  • We’ll want to be able to enable/force/disable RLS for a given table.
  • We should be able to determine the USING clause (and WITH CHECK clause) for a given table.
  • We must allow the user to overwrite the USING/WITH CHECK clauses for a given table.

It turns out this is much simpler than all of the things that django-boardinghouse needs to do.

It also turns out that we can cascade the USING/WITH CHECK clauses for dependent tables, but we’ll get to that. I’m not sure how well that will perform, but it might be reasonable.


Since all good projects need a clever name, I’ve chosen django-occupation for this one (as a play on multi-tenancy). Thus, you may see the name occupation used in a few places. Also, this will be a strictly Django 2.0+ (and therefore Python3) app!

Let’s start with the easy bits:

# occupation/middleware.py
from django.db import connection

def ActivateTenant(get_response):
    def middleware(request):
        connection.cursor().execute(
            'SET occupation.active_tenant = %s',
            [request.session.get('active_tenant', '')]
        )
        return get_response(request)
    return middleware

This middleware will set a session variable. Importantly, it always sets this variable, because the rules we will be creating later rely on this being present: exceptions will result from a missing current_setting. Setting it to an empty string will mean that no rows will be returned when no tenant is selected, which is acceptable.

The code for switching tenants is a bit more complicated, and it probably needs to be. It will need some method of detecting if the given user is indeed permitted to switch to the target tenant, which could be dependent on a range of other things. For instance, in our multi-tenant application, an employee needs to be currently (or in the future) employed in order to get access, but some users may get access for other reasons (ie, a Payroll company).

We can use a view that specifically handles this, but with django-boardinghouse I also came up with a middleware that can handle this. There are, in that project, three mechanisms for switching tenants: a query parameter, an HTTP header, and a raw view. The rationale for this was that a URL (containing a query parameter) could be used to have a permanent link to an object (which works across tenants). The drawback is that it does leak some information (about the tenant id). In practice, having this as a UUID may be nice.

Having a view that switches tenant makes doing a switch (and getting a success code if it works) easy, and having a header might make it easier for an API to switch.
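
A minimal sketch of the view-based approach follows; the permission check here is hypothetical (as is the visible_tenants relation), standing in for the employment/payroll rules mentioned above - it is not the django-occupation implementation.

# occupation/views.py - a sketch only.
from django.http import HttpResponse, HttpResponseForbidden


def user_can_access_tenant(user, tenant_id):
    # Hypothetical check: current or future employment, payroll access, etc.
    return user.visible_tenants.filter(pk=tenant_id).exists()


def switch_tenant(request, tenant_id):
    if not user_can_access_tenant(request.user, tenant_id):
        return HttpResponseForbidden('Not a member of that tenant')
    # The middleware above picks this up on every subsequent request.
    request.session['active_tenant'] = tenant_id
    return HttpResponse('Tenant switched')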

Anyway, we can ignore this requirement for now.

I’ve used the same “swappable” concept in django-boardinghouse that django.contrib.auth uses for swappable user models. This has some nice side effects, but an understanding of how this works is not necessary for understanding what is about to happen next. Instead, let’s look at the definition of some models. Please keep in mind that this is a simplified example, and some parts have been omitted for clarity.

class School(models.Model):
    "This is our Tenant model."
    name = models.CharField(unique=True, max_length=128)

    def __str__(self):
        return self.name


class Student(models.Model):
    name = models.CharField(max_length=128)
    student_number = models.CharField(max_length=16)
    school = models.ForeignKey('School', related_name='students', on_delete=models.CASCADE)

    class Meta:
        unique_together = (
            ('school', 'student_number'),
        )

    def __str__(self):
        return self.name


class Subject(models.Model):
    name = models.CharField(unique=True, max_length=64)

    def __str__(self):
        return self.name

GRADES = [
  # ...
]


class Enrolment(models.Model):
    student = models.ForeignKey(Student, related_name='enrolments', on_delete=models.CASCADE)
    subject = models.ForeignKey(Subject, related_name='enrolments', on_delete=models.CASCADE)
    grade = models.CharField(choices=GRADES, max_length=3, null=True, blank=True)

    def __str__(self):
        if self.grade:
            return '{student} studied {subject}. Grade was {grade}.'.format(
                student=self.student.name,
                subject=self.subject.name,
                grade=self.get_grade_display(),
            )
        return '{student} is enrolled in {subject}.'.format(
            student=self.student.name,
            subject=self.subject.name,
        )

Okay, wall of code done. There are a few things to note about these models:

  • School is the tenant model.
  • Student has a direct relationship to the tenant model. This is a candidate for RLS.
  • Subject has no relationship to the tenant model. This is a non-tenant (ie, global) model. All instances will be visible to all users.
  • Enrolment has a chained relationship to the tenant model. Because of this, it’s likely that this will also be an RLS model (if the prior models in the chain have RLS restrictions).

Now a digression into some mechanics of RLS.

Enabling RLS for a given table is quite simple. We’ll also do a FORCE, because we are probably the table owner, and without FORCE, table owners can see all rows.

ALTER TABLE school_student ENABLE ROW LEVEL SECURITY;
ALTER TABLE school_student FORCE ROW LEVEL SECURITY;

In the case of a student, the user should only be able to view them if they are currently viewing the school the student belongs to:

CREATE POLICY access_tenant_data ON school_student
USING (school_id::TEXT = current_setting('occupation.active_tenant'))
WITH CHECK (school_id::TEXT = current_setting('occupation.active_tenant'))

Notice that we used the current_setting('occupation.active_tenant') that we configured before. As I mentioned, this policy will throw an exception if the setting is not set, so our middleware sets it to an empty string - which should not match any rows.

The other thing that may look out of place is that we are coercing the school_id to a TEXT. This is because current_setting() returns a text value, even if it was set using a number.

So, what does this actually do?

It restricts the query to only rows that match the USING clause (in the case of a SELECT, UPDATE or DELETE), and then ensures that any rows that are being written (in the case of UPDATE or INSERT) meet the same restriction. This prevents a user accidentally (or on purpose) writing a row that they could not currently view.

So, that’s the SQL. Can we generate this in a nice way from our Django models?

CREATE_POLICY = '''
CREATE POLICY access_tenant_data ON {table}
USING ({fk}::TEXT = current_setting('occupation.active_tenant'))
WITH CHECK ({fk}::TEXT = current_setting('occupation.active_tenant'))'''

def build_policy_clause(model):
    for field in model._meta.fields:
        if field.related_model is School:
            return CREATE_POLICY.format(fk=field.db_column, table=model._meta.db_table)

Again, this is simplified. It only works for a direct link, and naïvely assumes the db_column exists. In practice there’s more to it than that. But that will do for now.

So, given our knowledge of our models, we don’t need to enable RLS for our Subject model, but we want to enable it for our Enrolment model. In fact, we will need to - otherwise a user would be able to load up an Enrolment object, but not be able to see the related Student.

In fact, we use this relation (and the fact that the restriction is already applied to all queries) to make our policy for that table simpler:

CREATE POLICY access_tenant_data ON school_enrolment
USING (student_id IN (SELECT id FROM school_student))
WITH CHECK (student_id IN (SELECT id FROM school_student))

Notably, this sort of CHECK happens every time Postgres writes a FOREIGN KEY reference: we need to repeat it because FK references are not subject to RLS, but we basically want to make it so they are.

Interestingly, because of the cascading nature of this configuration, we don’t need to include the current_setting call at all, because that happens in the inner query.

However, it does concern me that this will result in more work in the database. I’ll have to run some tests on larger data sets to see how this performs.

Building up the SQL to use there is slightly more complicated: we need to look at every foreign key on the model and see which of them can trace a chain up to the tenant model. Then we’d need a clause in the USING/WITH CHECK for each of those foreign keys.

I do have some code that does this, but it’s not very pretty.
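
For the record, a rough version of that chain detection (a sketch, not the actual code) could use the _meta API to walk foreign keys until it reaches the tenant model:

def tenant_fk_chains(model, tenant_model, seen=frozenset()):
    # Return lists of foreign keys that link `model` back to the tenant model.
    chains = []
    for field in model._meta.fields:
        if not field.is_relation:
            continue
        related = field.related_model
        if related in seen:
            continue
        if related is tenant_model:
            chains.append([field])
        else:
            for chain in tenant_fk_chains(related, tenant_model, seen | {model}):
                chains.append([field] + chain)
    return chains

Each chain could then be turned into a nested IN (or EXISTS) clause like the one above.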

Also, I’d like to be able to come up with a way that generates this SQL using more of the ORM, but I’m not sure it’s really necessary, since the resulting code is quite simple.

As for applying these changes - the two options are to write a RunSQL call for each required statement directly into the migration file, or to have a migration operation that executes the SQL. I’m not sure which way I’ll go with that just yet.
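
If I went the RunSQL route, the migration might look something like this (the SQL is the same as above; the app label and dependency are hypothetical, taken from the example models):

# A possible migration using RunSQL - one of the two options mentioned above.
from django.db import migrations


class Migration(migrations.Migration):

    dependencies = [
        ('school', '0001_initial'),  # hypothetical previous migration
    ]

    operations = [
        migrations.RunSQL(
            sql=[
                'ALTER TABLE school_student ENABLE ROW LEVEL SECURITY',
                'ALTER TABLE school_student FORCE ROW LEVEL SECURITY',
                """CREATE POLICY access_tenant_data ON school_student
                USING (school_id::TEXT = current_setting('occupation.active_tenant'))
                WITH CHECK (school_id::TEXT = current_setting('occupation.active_tenant'))""",
            ],
            reverse_sql=[
                'DROP POLICY access_tenant_data ON school_student',
                'ALTER TABLE school_student NO FORCE ROW LEVEL SECURITY',
                'ALTER TABLE school_student DISABLE ROW LEVEL SECURITY',
            ],
        ),
    ]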


I do have a proof of concept for this up and running (code is available at django-occupation). There are still some things I want to figure out.

  • Cross-tenant queries are a thing in my domain - what is the best mechanism for doing this? Should there be a Postgres session variable that bypasses the restriction, or could we enumerate permitted tenants? That would allow restricted cross-tenant queries.
  • Just how well does this perform at scale?
  • How much of this stuff is not really related to multi-tenancy, and could be extracted out into a more generic RLS package?

Apple Watch

I’ve been running for several years now, and one of the biggest things that keeps me engaged (especially early on, but also now) is being able to track everything about my running. To that end, I’ve had a running watch, and I almost always run with it. Initially, I had a Garmin Forerunner 405, which I replaced with a Forerunner 610. I’ve recently upgraded to an Apple Watch LTE.

So far, I’m really enjoying it. It’s really nice to not carry a phone around (either when running, or just at other times), and although I never really minded a chest heart-rate strap, it’s very liberating not to be wearing one.

I have found a few things that are not quite how I would like them:

  • You can’t get directions to your car on your watch. You need to “Continue on iPhone”. I’m hoping this will be fixed at some point. There are a handful of other operations that are like this too (like “What is this song?”).
  • I’m yet to find a satisfactory workout app that provides structured workouts: interval repeats and the like, possibly nested, and with duration and targets of different types (ie, pace, heart rate, time, distance).
  • The Strava app seems to have fairly long intervals between samples. I always set my Forerunner to sample at 1 second, and some watches I believe will do multiple samples per second.
  • The iOS Bedtime feature is not available on the Apple Watch: you need to have it turned on on an iPhone; and then the alarm will trigger on there too. I’ve moved to a watch-only alarm, so as to not wake my partner when I get up early for a run.
  • The “Standing Hours” feature isn’t really about standing: I have a standing desk, and stand all day at work, and if I don’t move around, it thinks I’m not standing. Also, I’ve had a few times where the standing alarm seems to not fire at all.
  • Apple Pay transactions made on my Watch don’t trigger an alert on my Phone (or watch). However, my last payment on my phone did trigger an alert on my watch, so I’ll test that a bit more.

However, the things that I really like are tipping me overwhelmingly in favour:

  • Being able to leave my phone on my desk and still track my activity is really nice. This is the key reason I bought a Misfit Shine some time ago, but that was less useful (and I ended up losing it because their wrist strap was insane).
  • Unlocking my Mac just by hitting the space bar is awesome.
  • Siri works really well (with the caveat above). I’m adding things to my shopping list like a madman.
  • Likewise, dictating iMessages seems like it’s more reliable than on iOS10. I’m still unlikely to do it in a public space when there are other people around though.
  • I’m now totally into “Mindful Minutes”, and I’ll do this whenever I get a chance. Breathe…
  • Battery life has been really good: I charge it when I’m in the shower, and when I’m driving. I’ll also put it on the charge to top it up before I go to sleep (I’m using a sleep tracker) if it’s a bit low, because I’ll generally be running in the morning.
  • Apple Pay on my watch really feels like the future.
  • Did I mention that running without a phone and chest strap is quite liberating?

Interestingly, I don’t think I’ve made a phone call from my watch yet. I don’t make that many phone calls to begin with (and most of them are actually when I’m in my car).

NeMeSiS

A long time ago, I was lucky enough to get selected to attend the National Mathematics Summer School, an annual camp run by the Australian National University and the Australian Association of Mathematics Teachers. Around 60 promising students who have just completed Year 11 are invited to attend a 12-day retreat at ANU in Canberra.

In the time since, I’ve been getting the bi-annual (maybe, everything blurs together) newsletter, and reading it with some interest, coupled at times with a feeling of inadequacy as I thought about how little further mathematics study I actually did.

It turns out that I attended NeMeSiS (as the students refer to it) back in 1992, and that was 25 years ago.

In the time since then, I’ve come across exactly two other alumni: one was a student with me at The Levels in 1993 (we had attended NeMeSiS at the same time, I think we met there, and we were friends for a year or two, but I can’t remember his name), and the other I met several years later through Touch Football, and several years after that discovered that we had this shared-although-offset-by-a-year history. She’s a surgeon, and we still bump into one another occasionally.

NeMeSiS was for me a real eye-opener. I went from being (I felt at the time) the smartest kid in the room at all times, to just being some kid. In some ways, I probably didn’t actually deal with it quite the right way - although I knew I was nowhere near as smart as some of the other students (Ben Burton, for instance), I came out of it still feeling superior to everyone I studied with at school/university after that point in time. That probably explains how someone with “lots of potential” ended up failing first-year Engineering Mathematics.

I remember catching the plane from Adelaide, and I think I was seated with a group of other NeMeSiS students. They all knew one another, and I was somewhat of an outsider, as I was actually part of the Victorian quota (having been at school in Hamilton, in western Victoria). I have a feeling now that there was more segregation and aloofness among some students, but I did find a home within a small group. Perhaps we were the outsiders, but I didn’t feel at the time that I was being ostracised.

After dropping out of my Engineering degree, I then went and completed an Education degree (which was much less work, I must say). I taught for nearly 10 years, and then did a graduate entry Computer Science degree. I’d taught myself almost everything that was covered in the course work of that degree, so sailed through it with mostly High Distinctions.

I hear lots of people talk about imposter syndrome, and it’s really interesting (to me, at least) that I don’t often feel that in my second career. I think maybe I did have it when I was a teacher, and I feel so much more confident about what I am doing within the scope of my work now that it doesn’t affect me so much. Maybe that’s Dunning-Kruger, but I hope not. I think it’s more about having felt, not exactly out of my depth, but like I was doing something that I was never really supposed to be doing.

Anyway, these thoughts were brought on by the arrival today of the latest newsletter, with mention of how the next one will be the 50th. I’m yet to attend one of my school reunions (both 10 and 20-year versions have passed me by), but maybe I’ll think about going to one for NeMeSiS.

Update: I found a list of “lost alumni”, and it seems that I can remember way too many names: Michael Plavins was (I think) the friend from Uni, and I know that Robert Dunbabin (Bobbit), Kathryn Vaughan and I were quite good friends. Effie Hatzigiannis, Irina Shainsky and Zoran Vukojevic are all names that I had forgotten I knew.

Tree data as a nested list redux

Some time ago, I wrote about using python to aggregate data that is stored with a Materialized Path into a Nested List structure.

But we should be able to do that same aggregation using Postgres, and from an Adjacency List structure.

Let’s start with a table definition:

CREATE TABLE location (
  node_id SERIAL PRIMARY KEY,
  name TEXT,
  parent_id INTEGER REFERENCES location(node_id)
);

And some data:

INSERT INTO location (node_id, name, parent_id) VALUES
  (1, 'Australia', NULL),
  (2, 'South Australia', 1),
  (3, 'Victoria', 1),
  (4, 'South-East', 2),
  (5, 'Western Districts', 3),
  (6, 'New Zealand', NULL),
  (7, 'Barossa Valley', 2),
  (8, 'Riverland', 2),
  (9, 'South Island', 6),
  (10, 'North Island', 6),
  (11, 'Eastern Bay of Plenty', 10);

To begin with, we need to get all of the items, and their depth:

WITH RECURSIVE location_with_level AS (
  SELECT *,
         0 AS lvl
    FROM location
   WHERE parent_id IS NULL

  UNION ALL

  SELECT child.*,
         parent.lvl + 1
    FROM location child
    JOIN location_with_level parent ON parent.node_id = child.parent_id
)
SELECT * FROM location_with_level;
 node_id │         name          │ parent_id │ lvl
─────────┼───────────────────────┼───────────┼─────
       1 │ Australia             │    <NULL> │   0
       6 │ New Zealand           │    <NULL> │   0
       2 │ South Australia       │         1 │   1
       3 │ Victoria              │         1 │   1
       9 │ South Island          │         6 │   1
      10 │ North Island          │         6 │   1
       4 │ South-East            │         2 │   2
       5 │ Western Districts     │         3 │   2
       7 │ Barossa Valley        │         2 │   2
       8 │ Riverland             │         2 │   2
      11 │ Eastern Bay of Plenty │        10 │   2
(11 rows)

Because of the way recursive queries work, we need to find the deepest node(s), and start there:

WITH RECURSIVE location_with_level AS (
  SELECT *,
         0 AS lvl
    FROM location
   WHERE parent_id IS NULL

  UNION ALL

  SELECT child.*,
         parent.lvl + 1
    FROM location child
    JOIN location_with_level parent ON parent.node_id = child.parent_id
),
maxlvl AS (
  SELECT max(lvl) maxlvl FROM location_with_level
)

SELECT * FROM maxlvl;

We then need to build up the tree (this clause is the next one in our CTE chain, I’ve omitted the first two for clarity):

c_tree AS (
  SELECT location_with_level.*,
         NULL::JSONB children
    FROM location_with_level, maxlvl
   WHERE lvl = maxlvl

   UNION

   (
     SELECT (branch_parent).*,
            jsonb_agg(branch_child)
       FROM (
         SELECT branch_parent,
                to_jsonb(branch_child) - 'lvl' - 'parent_id' - 'node_id' AS branch_child
           FROM location_with_level branch_parent
           JOIN c_tree branch_child ON branch_child.parent_id = branch_parent.node_id
       ) branch
       GROUP BY branch.branch_parent

       UNION

       SELECT c.*,
              NULL::JSONB
       FROM location_with_level c
       WHERE NOT EXISTS (SELECT 1
                           FROM location_with_level hypothetical_child
                          WHERE hypothetical_child.parent_id = c.node_id)
   )
)

The first part of this query gets all of the deepest leaf nodes.

This is then combined with another recursive subquery, that creates branches. This relies on the fact that it’s possible to use a table’s row type, and have whole records as values of a single column in a query. The second part of this subquery finds all remaining leaf nodes, and combines them in. This subquery will keep executing until it doesn’t find any new rows, which will happen when all root nodes have been processed.

We can see from the results of this last clause that we just need to limit this to root nodes:

 node_id │         name          │ parent_id │ lvl │                                                   children
─────────┼───────────────────────┼───────────┼─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────
       4 │ South-East            │         2 │   2 │ <NULL>
       5 │ Western Districts     │         3 │   2 │ <NULL>
       7 │ Barossa Valley        │         2 │   2 │ <NULL>
       8 │ Riverland             │         2 │   2 │ <NULL>
      11 │ Eastern Bay of Plenty │        10 │   2 │ <NULL>
       3 │ Victoria              │         1 │   1 │ [{"name": "Western Districts", "children": null}]
      10 │ North Island          │         6 │   1 │ [{"name": "Eastern Bay of Plenty", "children": null}]
       9 │ South Island          │         6 │   1 │ <NULL>
       2 │ South Australia       │         1 │   1 │ [{"name": "Riverland", "children": null}, {"name": "Barossa Valley", "children": null}, {"name": "South-East…
         │                       │           │     │…", "children": null}]
       6 │ New Zealand           │    <NULL> │   0 │ [{"name": "South Island", "children": null}, {"name": "North Island", "children": [{"name": "Eastern Bay of …
         │                       │           │     │…Plenty", "children": null}]}]
       1 │ Australia             │    <NULL> │   0 │ [{"name": "South Australia", "children": [{"name": "Riverland", "children": null}, {"name": "Barossa Valley"…
         │                       │           │     │…, "children": null}, {"name": "South-East", "children": null}]}, {"name": "Victoria", "children": [{"name": …
         │                       │           │     │…"Western Districts", "children": null}]}]
(11 rows)

So our final query, using the new jsonb_pretty function:

WITH RECURSIVE location_with_level AS (
  SELECT *,
         0 AS lvl
    FROM location
   WHERE parent_id IS NULL

  UNION ALL

  SELECT child.*,
         parent.lvl + 1
    FROM location child
    JOIN location_with_level parent ON parent.node_id = child.parent_id
),
maxlvl AS (
  SELECT max(lvl) maxlvl FROM location_with_level
),
c_tree AS (
  SELECT location_with_level.*,
         NULL::JSONB children
    FROM location_with_level, maxlvl
   WHERE lvl = maxlvl

   UNION

   (
     SELECT (branch_parent).*,
            jsonb_agg(branch_child)
       FROM (
         SELECT branch_parent,
                to_jsonb(branch_child) - 'lvl' - 'parent_id' - 'node_id' AS branch_child
           FROM location_with_level branch_parent
           JOIN c_tree branch_child ON branch_child.parent_id = branch_parent.node_id
       ) branch
       GROUP BY branch.branch_parent

       UNION

       SELECT c.*,
              NULL::JSONB
       FROM location_with_level c
       WHERE NOT EXISTS (SELECT 1
                           FROM location_with_level hypothetical_child
                          WHERE hypothetical_child.parent_id = c.node_id)
   )
)

SELECT jsonb_pretty(
         array_to_json(
           array_agg(
             row_to_json(c_tree)::JSONB - 'lvl' - 'parent_id' - 'node_id'
           )
         )::JSONB
       ) AS tree
  FROM c_tree
  WHERE lvl=0;

And our results:

                           tree
 ──────────────────────────────────────────────────────────
  [
      {
          "name": "New Zealand",
          "children": [
              {
                  "name": "South Island",
                  "children": null
              },
              {
                  "name": "North Island",
                  "children": [
                      {
                          "name": "Eastern Bay of Plenty",
                          "children": null
                      }
                  ]
              }
          ]
      },
      {
          "name": "Australia",
          "children": [
              {
                  "name": "South Australia",
                  "children": [
                      {
                          "name": "Riverland",
                          "children": null
                      },
                      {
                          "name": "Barossa Valley",
                          "children": null
                      },
                      {
                          "name": "South-East",
                          "children": null
                      }
                  ]
              },
              {
                  "name": "Victoria",
                  "children": [
                      {
                          "name": "Western Districts",
                          "children": null
                      }
                  ]
              }
          ]
      }
  ]
 (1 row)

Oh, that is rather neat.

This query is mostly cribbed from a fantastic Stack Overflow answer by David Guillot.

Django bulk_update without upsert

Postgres 9.5 brings a fantastic feature that I’ve really been looking forward to: upsert, via INSERT ... ON CONFLICT. However, I’m not on 9.5 in production yet, and I had a situation that would really have benefitted from being able to use it.

I have to insert lots of objects, but if there is already an object in a given “slot”, then I need to instead update that existing object.

Doing this using the Django ORM can be done on a “one by one” basis, by iterating through the objects, finding which one (if any) matches the criteria, updating that, or creating a new one if there wasn’t a match.

However, this is really slow, as it does two queries for each object.

Instead, it would be great to:

  • fetch all of the instances that could possibly overlap (keyed by the matching criteria)
  • iterate through the new data, looking for a match
    • modify the instance if an existing match is made, and stash into pile “update”
    • create a new instance if no match is found, and stash into the pile “create”
  • bulk_update all of the “update” objects
  • bulk_create all of the “create” objects

Those familiar with Django may recognise that there is only one step in that list that cannot currently be done with the ORM - the bulk update (a rough sketch of the overall flow follows).
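
Here is that sketch, assuming the matching criteria is a single (hypothetical) key field, and a bulk_update along the lines of the ones described below:

def bulk_upsert_by_key(model, incoming, *fields):
    # Fetch everything that could overlap, keyed by the matching criteria.
    existing = {
        obj.key: obj
        for obj in model.objects.filter(key__in=[o.key for o in incoming])
    }
    to_update, to_create = [], []
    for obj in incoming:
        match = existing.get(obj.key)
        if match:
            # Copy the new values onto the existing instance.
            for field in fields:
                setattr(match, field, getattr(obj, field))
            to_update.append(match)
        else:
            to_create.append(obj)

    if to_update:
        bulk_update(model, to_update, *fields)  # the missing piece, described below
    if to_create:
        model.objects.bulk_create(to_create)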

So, how can we do a bulk update?

There are two ways I can think of doing it (at least with Postgres):

  • create a temporary table (cloning the structure of the table)
  • insert all of the data into this table
  • update the rows in the original table from the temporary table, based on pk column

and:

  • come up with some mechanism of using the UPDATE the_table SET ... FROM () sq WHERE sq.pk = the_table.pk syntax

It’s possible to use some of the really nice features of Postgres to create a temporary table, that clones an existing table, and will automatically be dropped at the end of the transaction:

BEGIN;

CREATE TEMPORARY TABLE upsert_source (LIKE my_table INCLUDING ALL) ON COMMIT DROP;

-- Bulk insert into upsert_source

UPDATE my_table
   SET foo = upsert_source.foo,
       bar = upsert_source.bar
  FROM upsert_source
 WHERE my_table.id = upsert_source.id;

The drawback of this approach is that it does two extra queries, but it is possible to implement fairly simply:

from django.db import transaction, connection

@transaction.atomic
def bulk_update(model, instances, *fields):
    cursor = connection.cursor()
    db_table = model._meta.db_table

    try:
        cursor.execute(
            'CREATE TEMPORARY TABLE update_{0} (LIKE {0} INCLUDING ALL) ON COMMIT DROP'.format(db_table)
        )

        model._meta.db_table = 'update_{}'.format(db_table)
        model.objects.bulk_create(instances)

        query = ' '.join([
            'UPDATE {table} SET ',
            ', '.join(
                ('%(field)s=update_{table}.%(field)s' % {'field': field})
                for field in fields
            ),
            'FROM update_{table}',
            'WHERE {table}.{pk}=update_{table}.{pk}'
        ]).format(
            table=db_table,
            pk=model._meta.pk.get_attname_column()[1]
        )
        cursor.execute(query)
    finally:
        model._meta.db_table = db_table

The advantage of this is that it mostly just uses the ORM. There’s limited scope for SQL injection (although you’d probably want to validate the field names).
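
Validating the field names could be as simple as checking them against the model’s own fields before building the query (again, just a sketch, not part of the snippet above):

def validate_fields(model, fields):
    # Only allow names that actually exist on the model.
    valid = {field.name for field in model._meta.fields}
    unknown = set(fields) - valid
    if unknown:
        raise ValueError('Unknown fields: {}'.format(', '.join(sorted(unknown))))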

It’s also possible to do the update directly from a subquery, but without the nice column names:

UPDATE my_table
   SET foo = upsert_source.column2,
       column2 = upsert_source.column3
  FROM (
    VALUES (...), (...)
  ) AS upsert_source
 WHERE upsert_source.column1 = my_table.id;

Note that you must make sure your values are in the correct order (with the primary key first).

Attempting to prevent some likely SQL injection vectors, we want to build up the fixed parts of the query (and the parts that are controlled by the django model, like the table and field names), and then pass the values in as query parameters.

from django.db import connection

def bulk_update(model, instances, *fields):
    set_fields = ', '.join(
        ('%(field)s=update_{table}.column%(i)s' % {'field': field, 'i': i + 2})
        for i, field in enumerate(fields)
    )
    value_placeholder = '({})'.format(', '.join(['%s'] * (len(fields) + 1)))
    values = ','.join([value_placeholder] * len(instances))
    query = ' '.join([
        'UPDATE {table} SET ',
        set_fields,
        'FROM (VALUES ', values, ') update_{table}',
        'WHERE {table}.{pk} = update_{table}.column1'
    ]).format(table=model._meta.db_table, pk=model._meta.pk.get_attname_column()[1])
    params = []
    for instance in instances:
        params.append(instance.pk)
        for field in fields:
            params.append(getattr(instance, field))

    connection.cursor().execute(query, params)

This feels like a reasonable first draft; however, I’d probably want to go and look at how the query for bulk_create is built, and modify that. There’s a fair bit going on there that I haven’t followed as yet though. Note that this does not need the @transaction.atomic decorator, as it is only a single statement.

From here, we can build an upsert that assumes all objects with a PK need to be updated, and those without need to be inserted:

from django.utils.functional import partition
from django.db import transaction

@transaction.atomic
def bulk_upsert(model, instances, *fields):
    update, create = partition(lambda obj: obj.pk is None, instances)
    if update:
        bulk_update(model, update, *fields)
    if create:
        model.objects.bulk_create(create)