Django multitenancy using Postgres Row Level Security

2017-12-20 @ 22:14:56

here.

Quite some time ago, I did some experiments in using Postgres Row Level Security (RLS) from within Django.

It occurred to me that this philosophy could be used to model a multi-tenant application.

The main big problem with django-boardinghouse is that you have to apply migrations to multiple schemata. With many tenants, this can take a long time. It’s not easy to do this in a way that would be conducive to having limited downtime.

On the other hand, RLS means that the database restricts which rows of specific tables need to be shown in a given circumstance. Normally, examples of RLS show this by using a different user, but this is not necessary.

In fact, in most modern web applications, a single database user is used for all connections. This has some big benefits (in that a connection to the database can belong to a pool, and be shared by different requests). Luckily, there are other ways to have RLS applied.

One method is to use Postgres’ session variables. This is outlined quite well in Application users vs. Row Level Security. I’m going just use simple session variables, as the facility for doing this will be encapsulated, and based on a key in the Django session - which users cannot set directly. If someone has access to this (or access to setting a Postgres session variable directly, then they have enough access to do whatever they want).

There are some caveats: specifically, the Postgres user must not be a SUPERUSER, but that’s easy to sort out. We’ll be able to continue to use PGBouncers or similar, but only if we use use session pooling (not transaction pooling).

Now, mirroring the previous post, we have a few things that need to happen:

We will need some middleware that sets the (postgres) session variable.
We may want to have some mechanism for switching tenants (unless a user is tied to a single tenant).
We must have a Tenant model of some sort (because we’ll be using foreign keys to this to indicate a given row belongs to a given tenant).
We’ll want to be able to enable/force/disable RLS for a given table.
We should be able to detect the USING clause (and WITH CHECK clause) for a given table.
We must allow the user to overwrite the USING/WITH CHECK clauses for a given table.

It turns out this is much simpler than all of the things that django-boardinghouse needs to do.

It also turns out that we can cascade the USING/WITH CHECK clauses for dependent tables, but we’ll get to that. I’m not sure how well that will perform, but it might be reasonable.

Since all good projects need a clever name, I’ve chosen django-occupation for this one (as a play on multi-tenancy). Thus, you may see the name occupation used in a few places. Also, this will be a strictly Django 2.0+ (and therefore Python3) app!

Let’s start with the easy bits:

# occupation/middleware.py
def ActivateTenant(get_response):
    def middleware(request):
        connection.cursor().execute(
            'SET occupation.active_tenant = %s',
            [request.session.get('active_tenant', '')]
        )
        return get_response(request)
    return middleware

This middleware will set a session variable. Importantly, it always sets this variable, because the rules we will be creating later rely on this being present: exceptions will result from a missing current_setting. Setting it to an empty string will mean that no rows will be returned when no tenant is selected, which is acceptable.

The code for switching tenants is a bit more complicated, and it probably needs to be. It will need some method of detecting if the given user is indeed permitted to switch to the target tenant, which could be dependent on a range of other things. For instance, in our multi-tenant application, an employee needs to be currently (or in the future) employed in order to get access, but some users may get access for other reasons (ie, a Payroll company).

We can use a view that specifically handles this, but with django-boardinghouse I also came up with a middleware that can handle this. There are, in that project, three mechanisms for switching tenants: a query parameter, an HTTP header, and a raw view. The rationalé for this was that a URL (containing a query parameter) could be used to have a permanent link to an object (which works across tenants). The drawback is that it does leak some information (about the tenant id). In practice, having this as a UUID may be nice.

Having a view that switches tenant makes doing a switch (and getting a success code if it works) easy, and having a header might make it easier for an API to switch.

Anyway, we can ignore this requirement for now.

I’ve used the same “swappable” concept in django-boardinghouse that django.contrib.auth uses for swappable user models. This has some nice side effects, but an understanding of how this works is not necessary for understanding what is about to happen next. Instead, let’s look at the definition of some models. Please keep in mind that this is a simplified example, and some parts have been omitted for clarity.

class School(models.Model):
    "This is our Tenant model."
    name = models.CharField(unique=True)

    def __str__(self):
        return self.name


class Student(models.Model):
    name = models.CharField(max_length=128)
    student_number = models.CharField(max_length=16)
    school = models.ForeignKey('School', related_name='students', on_delete=models.CASCADE)

    class Meta:
        unique_together = (
            ('school', 'student_number'),
        )

    def __str__(self):
        return self.name


class Subject(models.Model):
    name = models.CharField(unique=True, max_length=64)

    def __str__(self):
        return self.name

GRADES = [
  # ...
]


class Enrolment(models.Model):
    student = models.ForeignKey(Student, related_name='enrolments', on_delete=models.CASCADE)
    subject = models.ForeignKey(Subject, related_name='enrolments', on_delete=models.CASCADE)
    grade = models.CharField(choices=GRADES, max_length=3, null=True, blank=True)

    def __str__(self):
        if self.grade:
            return '{student} studied {subject}. Grade was {grade}.'.format(
                student=self.student.name,
                subject=self.subject.name,
                grade=self.get_grade_display(),
            )
        return '{student} is enrolled in {subject}.'.format(
            student=self.student.name,
            subject=self.subject.name,
        )

Okay, wall of code done. There are a few things to note about these models:

School is the tenant model.
Student has a direct relationship to the tenant model. This is a candidate for RLS.
Subject has no relationship to the tenant model. This is a non-tenant (ie, global) model. All instances will be visible to all users.
Enrolment has a chained relationship to the tenant model. Because of this, it’s likely that this will also be an RLS model (if the prior models in the chain have RLS restrictions).

Now a digression into some mechanics of RLS.

Enabling RLS for a given table is quite simple. We’ll do two a FORCE, because we are probably the table owner, and without FORCE, table owners may view all rows.

ALTER TABLE school_student ENABLE ROW LEVEL SECURITY;
ALTER TABLE school_student FORCE ROW LEVEL SECURITY;

In the case of a student, the user should only be able to view them if they are currently viewing the school the student belongs to:

CREATE POLICY access_tenant_data ON school_student
USING (school_id::TEXT = current_setting('occupation.active_tenant'))
WITH CHECK (school_id::TEXT = current_setting('occupation.active_tenant'))

Notice that we used the current_setting('occupation.active_tenant') that we configured before. As I mentioned, this policy will throw an exception if the setting is not set, so our middleware sets it to an empty string - which should not match any rows.

The other thing that may look out of place is that we are coercing the school_id to a TEXT. This is because current_setting() returns a text value, even if it was set using a number.

So, what does this actually do?

It restricts the query to only rows that match the USING clause (in the case of a SELECT, UPDATE or DELETE), and then ensures that any rows that are being written (in the case of UPDATE or INSERT) meet the same restriction. This prevents a user accidentally (or on purpose) writing a row that they could not currently view.

So, that’s the SQL. Can we generate this in a nice way from our Django models?

CREATE_POLICY = '''
CREATE POLICY access_tenant_data ON {table}
USING ({fk}::TEXT = current_setting('occupation.active_tenant'))
WITH CHECK ({fk}::TEXT = current_setting('occupation.active_tenant'))'''

def build_policy_clause(model):
    for field in model._meta.fields:
        if field.related_model is School:
            return CREATE_POLICY.format(fk=field.db_column, table=model._meta.db_table)

Again, this is simplified. It only works for a direct link, and naïvely assumes the db_column exists. In practice there’s more to it than that. But that will do for now.

So, given our knowledge of our models, we don’t need to enable RLS for our Subject model, but we want to enable it for our Enrolment model. In fact, we will need to - otherwise a user would be able to load up an Enrolment object, but not be able to see the related Student.

In fact, we use this relation (and the fact that the restriction is already applied to all queries) to make our policy for that table simpler:

CREATE POLICY access_tenant_data ON school_enrolment
USING (student_id IN (SELECT id FROM school_student))
WITH CHECK (student_id IN (SELECT id FROM school_student))

Notably, this sort of CHECK happens every time Postgres writes a FOREIGN KEY reference: we need to repeat it because FK references are not subject to RLS, but we basically want to make it so they are.

Interestingly, because of the cascading nature of this configuration, we don’t need to include the current_setting call at all, because that happens in the inner query.

However, it does concern me that this will result in more work in the database. I’ll have to run some tests on larger data sets to see how this performs.

Building up the SQL to use there is slightly more complicated: we need to look at every foreign key on the model and see which of them can trace a chain up to the tenant model. Then we’d need a clause in the USING/WITH CHECK for each of those foreign keys.

I do have some code that does this, but it’s not very pretty.

Also, I’d like to be able to come up with a way that generates this SQL using more of the ORM, but I’m not sure it’s really necessary, since the resulting code is quite simple.

As for applying these changes - the two solutions are to create a RunSQL call for each required statement, and writing this directly to the migration file, or having a migration operation that executes the SQL. I’m not sure which way I’ll drop with that just yet.

I do have a proof of concept for this up and running (code is available at django-occupation). There are still some things I want to figure out.

Cross-tenant queries are a thing in my domain - what is the best mechanism for doing this. Should there be a postgres session variable that ignores it, or could we enumerate tenants? That would allow restricted cross-tenant queries.
Just how well does this perform at scale?
How much of this stuff is not really related to multi-tenancy, and could be extracted out into a more generic RLS package?