Capture and test sys.stdout/sys.stderr in unittest.TestCase

Testing in Django is usually done using the unittest framework, which comes with Python. You can also test using doctest, with a little bit of work.

One advantage of doctest is that it’s super-easy to test for an exception: you just expect the traceback (which can be trimmed using \n ... \n).
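As a minimal sketch of what that looks like, the function below carries a doctest that expects a traceback, with the stack trimmed to an ellipsis. The names parse_age and run_docstring are made up for this illustration; only the doctest module calls are real.

```python
import doctest


def parse_age(text):
    """Convert text to a non-negative integer age.

    >>> parse_age("42")
    42
    >>> parse_age("-1")
    Traceback (most recent call last):
        ...
    ValueError: age must not be negative
    """
    age = int(text)
    if age < 0:
        raise ValueError("age must not be negative")
    return age


def run_docstring(func):
    """Run the examples in func's docstring; return (failed, attempted)."""
    parser = doctest.DocTestParser()
    test = parser.get_doctest(
        func.__doc__, {func.__name__: func}, func.__name__, None, 0
    )
    runner = doctest.DocTestRunner(verbose=False)
    results = runner.run(test)
    return results.failed, results.attempted
```

doctest only compares the first line of the traceback and the final exception line, so the stack frames in between can be replaced by the ellipsis.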

In a unittest.TestCase, you can do a similar thing, but it’s a little more work.

Basically, you want to temporarily replace sys.stdout (or sys.stderr) with a StringIO instance, and set it back after the block you care about has finished.

Python has had a nice feature for some time called Context Managers. These enable you to ensure that cleanup code will be run, regardless of what happens in the block.

The syntax for running code within a context manager is:

with context_manager(thing) as other:
  # Code we want to run
  # Can use 'other' in here.
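To make the cleanup guarantee concrete, here is a small sketch using contextlib.contextmanager. The recorded helper and the events list are hypothetical names, used only to show the order in which things run when the block raises.

```python
from contextlib import contextmanager


@contextmanager
def recorded(name, events):
    """Record entry and exit in events; the exit is recorded even on error."""
    events.append("enter " + name)
    try:
        # The yielded value is what 'as other' binds to.
        yield name.upper()
    finally:
        events.append("exit " + name)


def demo():
    events = []
    try:
        with recorded("block", events) as other:
            events.append("using " + other)
            raise RuntimeError("boom")
    except RuntimeError:
        events.append("handled")
    return events
```

The exit entry is recorded before the exception reaches the except clause, which is exactly the behaviour we want when swapping out sys.stdout.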

One place you can see this syntax when testing with unittest is assertRaises. Used as a context manager, it can check that a specific exception is raised by any block of code, including a statement that is not a call at all:

class FooTest(TestCase):
  def test_one_way(self):
    self.assertRaises(ExceptionType, callable, arg1, arg2)

  def test_another_way(self):
    with self.assertRaises(ExceptionType):
      callable(arg1, arg2)
      # Could also be:
      #     callable(arg1, arg2=arg2)
      # (test_one_way can pass keyword arguments along too), or even:
      #     foo = bar + baz
      # which is not possible in the test_one_way call.

So, we could come up with a similar way of calling our code that we want to capture the sys.stdout from:

class BarTest(TestCase):
  def test_and_capture(self):
    with capture(callable, *args, **kwargs) as output:
      self.assertEqual("Expected output", output)

And the context manager:

import sys
try:
  from cStringIO import StringIO  # Python 2
except ImportError:
  from io import StringIO  # Python 3
from contextlib import contextmanager

@contextmanager
def capture(command, *args, **kwargs):
  out, sys.stdout = sys.stdout, StringIO()
  try:
    command(*args, **kwargs)
    sys.stdout.seek(0)
    yield sys.stdout.read()
  finally:
    sys.stdout = out

It’s simple enough to do the same with sys.stderr.
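As a sketch of what that might look like on Python 3, the variant below swaps both streams at once. It differs from the version above in that the code under test runs inside the with block rather than being passed in as a command; capture_streams and noisy are names invented for this example.

```python
import sys
from contextlib import contextmanager
from io import StringIO


@contextmanager
def capture_streams():
    """Temporarily replace sys.stdout and sys.stderr with StringIO buffers."""
    old_out, old_err = sys.stdout, sys.stderr
    sys.stdout, sys.stderr = StringIO(), StringIO()
    try:
        yield sys.stdout, sys.stderr
    finally:
        # Always restore the real streams, even if the block raised.
        sys.stdout, sys.stderr = old_out, old_err


def noisy():
    """Stand-in for the code under test: writes to both streams."""
    print("to stdout")
    print("to stderr", file=sys.stderr)
```

In a test, you would write `with capture_streams() as (out, err): noisy()` and then assert on out.getvalue() and err.getvalue(). Newer Pythons also ship contextlib.redirect_stdout and contextlib.redirect_stderr, which cover the same ground.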

Update: thanks to Justin Patrin for pointing out that we should wrap the command in a try:finally: block.

My own private PyPI

PyPI, formerly the CheeseShop, is awesome. It’s a central repository of Python packages. Knowing you can just do a pip install foo, and it looks on PyPI for a package named foo, is superb. Using pip requirements files, or setuptools install_requires, means you can install all the packages you need, really simply.

And, the nice thing about pip is that it won’t bother downloading a package you already have installed, subject to version requirements, unless you specifically force it to. This is better than installing using pip install -e <scm>+https://... from a mercurial or git repository. This is a good reason to have published version numbers.

However, when installing into a new virtualenv, it still may take some time to download all of the packages, and not everything I do can be put onto pypi: quite a lot of my work is confidential and copyrighted by my employer. So, there is quite a lot of value to me to be able to have a local cache of packages.

You could use a shared (between all virtualenvs) --build directory, but the point of virtualenv is that every environment is isolated. So, a better option is a local cache server. And for publishing private packages, a server is required for this too. Being able to use the same workflow for publishing a private package as an open source package is essential.

Because we deploy using packages, our private package server is located outside of our office network. We need to be able to install packages from it on our production servers. However, this negates the other advantage of a pypi cache. It does mean we control all of the infrastructure required to install: no more “We can’t deploy because github is down.”

So, the ideal situation is to actually have two levels of server: our private package server, and then a local cache server on each developer’s machine. You could also have a single cache server in the local network, or perhaps three levels. I’m not sure how much of a performance hit not having the cache on the local machine is.

To do this, you need two things. Your local cache needs to be able to use an upstream cache (no dicking around with /etc/hosts please), and your private server needs to be able to provide data to this.

The two tools I have been using handle neither of these. pypicache does not handle upstream caching; however, this was easy to patch. My fork handles upstream caching, plus uses setuptools, enabling it to install its own dependencies.

localshop, however, will not work as an upstream cache, at least with pypicache, which uses some other APIs than those used by pip. However, it does have nice security features, and to move away from it would require me to extract the package data out. pypicache works to a certain extent with itself as an upstream cache, until you try to use its ‘requirements.txt caching’ feature. Which I tried to do tonight.

Oh well.

Django and RequireJS

Until very recently, I was very happy with django-compressor. It does a great job of combining and minifying static media files, specifically JavaScript and CSS files. It will manage compilation, allowing you to use, for example, SASS and CoffeeScript. Not that I do.

But, for me, the best part was the cache invalidation. By combining JavaScript (or CSS) into files that get named according to a hash of their contents, it’s trivial for clients to not have an old cached JS or CSS file.

However, recently I have begun using RequireJS. This enables me to declare dependencies, and greatly simplify the various pages within my site that use specific JavaScript modules. But this does not play so well with django-compressor. The problem lies with the fact that there is no real way to tell RequireJS that “instead of js/file.js, it should use js/file.123ABC.js”, where 123ABC is determined by the static files caching storage. RequireJS will do optimisation, and this includes combining files, but that’s not exactly what I want. I could create a built script for each page that has a require() call in it, but that would mean jQuery etc. get downloaded separately for each different script.

I have tried using django-require, but using the {% require_module %} tag fails spectacularly (with a SuspiciousOperation exception). And even then, the files that get required by a dependency hierarchy do not have the relevant version string.

That is, it seems that the only way to get the version numbering is to use django’s templating system over each of the javascript files.

There appear to be two options.

** List every static file in require.config({paths: ...}). **

This could be done manually, but it may be possible to generate a config.js file instead, as we do have access to all of the processed files as part of the collectstatic process.

Basically, you need to use {% static 'js/file.js' %}, but strip off the trailing .js.
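A rough sketch of the first option: given a mapping from original static file names to their hashed names (assumed here to be available after collectstatic; the mapping and both function names are hypothetical), build the paths object with the trailing .js stripped from each side.

```python
import json


def require_paths(hashed_names, static_url="/static/"):
    """Map RequireJS module ids to hashed URLs, without the trailing .js."""
    paths = {}
    for original, hashed in sorted(hashed_names.items()):
        if not original.endswith(".js"):
            continue  # only JavaScript modules belong in require.config
        module_id = original[: -len(".js")]
        paths[module_id] = static_url + hashed[: -len(".js")]
    return paths


def render_config(hashed_names, static_url="/static/"):
    """Render a require.config(...) call suitable for writing to config.js."""
    config = {"paths": require_paths(hashed_names, static_url)}
    return "require.config(%s);" % json.dumps(config, sort_keys=True)
```

This only illustrates the string handling; how the mapping is obtained from the storage backend, and how RequireJS treats the resulting URLs relative to its baseUrl, would still need checking against a real project.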

** Rewrite the static files. **

Since we are uglifying the files anyway, we could look at each require([...], function(){ ... }) call, and replace the required modules. I think this would actually be more work, as you would need to reprocess every file.

So, the former looks like the solution. django-require goes close, but, as mentioned, doesn’t quite get there.

Python deployment using fabric and pip

I’ve spent a not insignificant amount of time working on a deployment function in my fabfile.py (the configuration file used by Fabric). It’s well worth the investment, as being able to deploy with a single command (potentially to many servers) is not only faster, but much less prone to human error.

Currently, I’m using Mercurial as my source control. I’m also using it for deployment, but I’d like to get away from that.

My deployment process looks something like this:

  1. Ensure the local repository has no uncommitted changes.
  2. Ensure the requirements.txt file is exactly the same as the output from pip freeze.
  3. Copy our public key to the remote server, for the user www-data, if it is not already installed there.
  4. Create a virtualenv in the desired location on the server, if there is not one already there.
  5. Ensure mercurial is installed on the server.
  6. Push the local repository to the remote server. This will include any subrepositories. I do a little bit of fancy magic to ensure the remote subrepositories exist.
  7. Update the remote server’s repository to the same revision as we are at locally. This means we don’t necessarily need to always deploy to tip.
  8. Install the dependencies on the remote server.
  9. Run some django management commands to ensure everything is set up correctly.
    • collect static files
    • sync the database
    • run migrations
    • ensure permissions are correct
    • compress static files
  10. Restart the various services that need to be restarted.

This process is based around requirements files for a very good reason. pip is very good at recognising which packages are already installed, and not reinstalling them if the version requirements are met. I use pip freeze > requirements.txt to ensure that what will be deployed matches exactly with what I have been developing (and testing) against.
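The check in step 2 could be sketched along these lines. The function names are hypothetical, and this version is looser than an exact comparison: it ignores ordering, blank lines and comments. It takes the two texts as arguments, so the caller is assumed to have already captured the output of pip freeze and read requirements.txt.

```python
def normalise(requirements_text):
    """Return the set of non-blank, non-comment requirement lines."""
    lines = (line.strip() for line in requirements_text.splitlines())
    return {line for line in lines if line and not line.startswith("#")}


def requirements_diff(frozen_text, requirements_text):
    """Return (installed but not listed, listed but not installed)."""
    frozen = normalise(frozen_text)
    declared = normalise(requirements_text)
    return sorted(frozen - declared), sorted(declared - frozen)


def requirements_match(frozen_text, requirements_text):
    """True when the requirements file describes exactly what is installed."""
    return requirements_diff(frozen_text, requirements_text) == ([], [])
```

A deployment function can abort early when requirements_match returns False, and print the two lists so it is obvious what needs to be added or removed.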

However, this process has some issues.

  • Files must be committed to SCM before they can be deployed. This is fine for deployment to production, but is annoying for deploying to test servers. I have plenty of commits that turn on some extra debugging, and then a commit or two later, I turn it off.
  • I have some packages that I have installed locally using pip install -e /path/to/package. To deploy these, I need to:
    1. Uninstall the editable installation.
    2. Package up a new version of the app.
    3. Push the package to my package repository (I use localshop).
    4. Install the package from the package repository.
    5. Run pip freeze > requirements.txt.
    6. Commit the changes.
    7. Deploy to the test server.
  • Then, I usually need to develop further, so I reinstall using pip install -e ....

Today, I finally got around to spending some time looking at how pip can help improve this workflow.

With pip==1.3.1, we have a command that was not in pip==1.1, which was what I had been using until now: pip bundle.

My ‘deploy-to-development/test’ process now looks something like:

  1. Get a list of packages installed as editable: pip list -e
  2. Create a bundle, without dependencies, of these packages.
  3. Get a list of all packages, other than those installed as editable: pip freeze | grep -v "^-e".
  4. Ensure the server is set up (virtualenv, etc)
  5. Push the local repository to the remote server.
  6. Upload the bundle and requirements files.
  7. Install from the requirements file on the server.
  8. Force install from the bundle file on the server, without dependencies.
  9. Repeat the post-installation stuff from above.
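Steps 1 and 3 amount to partitioning the output of pip freeze. A minimal sketch, assuming the freeze output has already been captured as a string (split_freeze is a made-up name):

```python
def split_freeze(freeze_output):
    """Split pip freeze output into (editable lines, pinned lines)."""
    editable, pinned = [], []
    for line in freeze_output.splitlines():
        line = line.strip()
        if not line:
            continue
        # Editable installs are reported with a leading "-e".
        if line.startswith("-e"):
            editable.append(line)
        else:
            pinned.append(line)
    return editable, pinned
```

The pinned list is what goes into the requirements file that gets uploaded; the editable list tells us which packages need to go into the bundle.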

Some of this I’m never going to be able to avoid: ensuring we have the virtualenv, and the post-installation stuff. Migrations gotta migrate. However, I would like to move away from the pushing of the local repository.

My plan: turn my project into a package (complete with setup.py), so that it becomes just another entry in the requirements file. It will be editable, which means it will be bundled up for deployment.

However, it will mean I can get away from having the nested repositories that I currently have. Ultimately, I plan to be able to:

  1. Build a bundle of editable packages.
  2. Create a requirements file of non-editable packages.
  3. Upload both of these files to the server.
  4. Install the requirements.
  5. Install the bundle.
  6. Run the post installation tasks.

That would be bliss.

hg commit --prevent-stupidity

I’ve had a pre-commit hook in my mercurial ~/.hgrc for some time, that prevents me from committing code that contains the string import pdb; pdb.set_trace().

I’ve pushed commits containing this out to testing lots of times, and I think even onto production once or twice…

So, the pre-commit hook that has been doing this is:

[hooks]
pretxncommit.pdb_found = hg export tip | (! egrep -q '^\+[^\+].*set_trace\(\)')

This uses a regular expression check to see if the string matches. However, it does not show the filename, and the other day I was burned by leaving in a console.time(...) statement in a javascript file. So, I’ve improved the pre-commit hook, so it can do a bit more.

## <somewhere-in-your-PYTHONPATH>/hg_hooks/debug_statements.py

import sys
import _ast
import re

class colour:
    HEADER = '\033[95m'
    OKBLUE = '\033[94m'
    OKGREEN = '\033[92m'
    WARNING = '\033[93m'
    FAIL = '\033[91m'
    END = '\033[0m'

ERROR_HEADER = "*** Unable to commit. There were errors in %s files. ***"
ERROR_MESSAGE = """  File "%s/%s", line %i,
%s%s%s"""

def syntax_check(filename, data):
    try:
        tree = compile(data, filename, "exec", _ast.PyCF_ONLY_AST)
    except SyntaxError:
        value = sys.exc_info()[1]
        msg = value.args[0]

        (lineno, offset, text) = value.lineno, value.offset, value.text
        
        if text is None:
            raise
        
        return lineno, ("%s: %s" % (msg, text)).strip('\n')

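ERRORS = {
    'py': [
        # re.MULTILINE anchors ^ at the start of every line, and [^#\n]*
        # keeps the match on one line with no '#' before the statement.
        re.compile(r'(^[^#\n]*import pdb; pdb\.set_trace\(\))', re.MULTILINE),
        syntax_check
    ],
    'js': [
        # (?:(?!//).)* skips lines where '//' appears before the statement.
        re.compile(r'(^(?:(?!//).)*console\.[a-zA-Z]+\(.*\))', re.MULTILINE),
        re.compile(r'(^(?:(?!//).)*debugger;)', re.MULTILINE)
    ],
}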

def test_data(filename, data):
    for matcher in ERRORS.get(filename.split('.')[-1], []):
        if hasattr(matcher, 'finditer'):
            search = matcher.finditer(data)
            if search:
                for match in search:
                    line_number = data[:match.end()].count('\n')
                    yield line_number + 1, data.split('\n')[line_number]
        elif callable(matcher):
            errors = matcher(filename, data)
            if errors:
                yield errors

def test_repo(ui, repo, node=None, **kwargs):
    changeset = repo[node]
    
    errors = {}
    
    for filename in changeset:
        data = changeset.filectx(filename).data()
        our_errors = list(test_data(filename, data))
        if our_errors:
            errors[filename] = our_errors        
    
    if errors:
        print colour.HEADER + ERROR_HEADER % len(errors) + colour.END
        for filename, error_list in errors.items():
            print 
            for line_number, message in error_list:
                print ERROR_MESSAGE  % (
                    repo.root, filename, line_number,
                    colour.FAIL, message, colour.END,
                )
        print
        return True

Then, add the hook to your .hgrc:

[hooks]
pretxncommit.debug_statements = python:hg_hooks.debug_statements.test_repo

Note: I’ve updated the script to correctly show the line number since the start of the file, rather than the line number within the currently processed segment. Thanks to Vinay for reminding me about that!
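The line-number calculation is easy to try outside of Mercurial. This is a standalone Python 3 sketch of the same idea, not the hook itself: the patterns are simplified and find_debug_statements is a name made up for the example.

```python
import re

DEBUG_PATTERNS = {
    "py": [re.compile(r"^[^#\n]*\bpdb\.set_trace\(\)", re.MULTILINE)],
    "js": [
        re.compile(r"^(?:(?!//).)*\bconsole\.[a-zA-Z]+\(", re.MULTILINE),
        re.compile(r"^(?:(?!//).)*\bdebugger;", re.MULTILINE),
    ],
}


def find_debug_statements(filename, text):
    """Return sorted (line_number, line) pairs for leftover debug statements."""
    extension = filename.rsplit(".", 1)[-1]
    lines = text.split("\n")
    found = []
    for pattern in DEBUG_PATTERNS.get(extension, []):
        for match in pattern.finditer(text):
            # Newlines before the end of the match give the zero-based
            # line index counted from the start of the file.
            index = text.count("\n", 0, match.end())
            found.append((index + 1, lines[index].strip()))
    return sorted(found)
```

Counting newlines up to match.end() over the whole file text is what gives a line number relative to the start of the file, rather than to whatever segment is being processed.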

Multi-tenanted Django

TL;DR : I made a thing: django-multi-schema. As pointed out in the comments, it’s now known as django-boardinghouse.

Software as a Service (SaaS)

This is a term that has been around for a while. Basically, it means providing software that runs on a server, and selling access to that system. Instead of charging people for the software, you charge them a recurring fee for access to the hosted service.

If you have more than one customer, then you need to ensure that each customer can only access data that is ‘theirs’, rather than seeing everything. You need some way to partition the data, and there are two main ways to do this.

  1. Every customer has their own server/database.
  2. All the data is stored in one server/database.

To each their own.

One way of partitioning data is to provision an application server, and a database for each customer.

These web servers, and indeed the databases may not even be on different virtual machines, let alone physical machines. I manage a legacy PHP/MySQL application that runs each ‘server’ as an Apache VirtualHost, shares the codebase, and uses some configuration to route the database connections.

The advantage of this type of system is that it is very easy to move a single instance of the application onto a separate server, if limits of server performance are reached. Depending upon the configuration, you still only need to upgrade one codebase, although typically there would be a separate code installation per customer. The advantage of that is that you can upgrade customers individually, perhaps moving specific customers to a beta version.

The disadvantages of this type of setup are similar to what you would have if you had a big enough customer base anyway: multiple servers need to be upgraded, although getting them all done at exactly the same time is not as important if all requests to a given domain always go to a given server (and then to a given database).

However, if you are sharing code between installations, and each has a separate database, then you do need to migrate all of the databases at the same time, and at the same time as the codebase is updated.

The real disadvantages are about adding a new customer. You need to provision a new server, or at the least, set up a new VirtualHost, and create a new database. You also need to run any database migrations on several databases, but that should be part of the deployment process to each installation anyway.

Another issue that may arise is that each installation requires a separate set of connections to its own database. If the databases are on the same database server, then this may be a limit that you reach sooner than you would like: but at that time you can just split off some databases to a separate database server.

There may be only one.

The alternative method of segmenting data is to use Foreign Keys. There is one database table, corresponding to a customer (which may not be a single user). Then, relevant other tables contain a foreign key back to that table, or to a table that links back to that table, or so on.

This is how the main system I work on works. We have one django installation (or, possibly, several installations that share one database, but are effectively identical clones, just used for load handling). We have a Company table, and everything that should be limited to a given company links back to that, either directly or through a parent. For instance, a RosteredShift does not have a link to a company, but it does link to a RosterUnit, which is linked to a company.

The advantages of this are that you have a single server that needs to be upgraded, or multiple identical servers (that you can just upgrade in parallel). You have a single database, that, again, only needs to be upgraded once. You only have to manage a single database backup, and, importantly, your database connections are equally shared across all of your customers. Indeed, your load is evenly shared, too.

Scaling up is still possible: we can easily stick an extra N app servers into our app server pool, and the load balancer will just farm requests off to that. Sharding databases becomes a bit harder, as you cannot just shift a single customer’s data off onto a separate database (or indeed push a highly used customer’s app server onto a separate machine).

The big danger is that a customer may get access to data that belongs to a different customer. In some ways this is the same issue as some of a customer’s users seeing data they shouldn’t, but it is a bit scarier.

With a single server, and a single domain name, usernames must be unique across all of your customers’ users. This sounds easy: just use email addresses. They are unique. That works well, until an employee of one customer moves to a different employer, that also happens to be your customer, and BANG, email conflict. You don’t want to just move the user, as that would break history for the previous employer. Indeed, they may even be employed by both of your customers at the same time. Possibly without those employers knowing about one another. Privacy, man.

Another useful thing is being able to shift data to a different customer. This is a bit of a double-edged sword - in reality we stopped doing this, and now create a copy of the relevant data for the new customer (for instance, when a Subway store changes hands). That means the previous owner can retain access to their historical data.

Finally, support staff only need to look in one place to see all customer data. They don’t need to extract from a frazzled user who they work for in order to check why their login is not working correctly, for instance.

A middle ground

One of the advantages of a single server is that you share database connections. No more running out of connections because each customer requires X connections. But, Postgres has a feature called schemas, that sits between database and table:

<database>.<schema>.<table>.<column>

Any query can use a fully qualified, or partially qualified name to refer to a database, schema, table or column.

Postgres uses the term schema differently to the idea of an SQL schema. Instead of meaning the table definitions, a schema is a named sub-database. Every postgres database has at least one schema, usually called public.

Postgres determines which schema should be searched for a matching table using a search_path. This can be a list of schemata: the first one in the list with a matching table will be used for queries against that table name. The default search path is "$user","public", which looks in the schema matching the connected user, then the public schema. A schema that does not exist is just ignored in the search path.
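The lookup can be modelled in a few lines of Python. This is only a toy simulation of the rules described above, not how Postgres implements them; resolve_table and the dictionary of schemata are invented for the illustration.

```python
def resolve_table(table, search_path, schemata, user):
    """Return the schema an unqualified table name resolves to, or None.

    schemata maps each existing schema name to the set of tables in it.
    """
    for entry in search_path:
        schema = user if entry == "$user" else entry
        tables = schemata.get(schema)
        if tables is None:
            continue  # a schema that does not exist is just ignored
        if table in tables:
            return schema  # first schema in the path with a match wins
    return None
```

For example, with schemata {"public": {"auth_user", "roster"}, "acme": {"roster"}} and the default path ["$user", "public"], a connection as user acme resolves roster to the acme schema, while auth_user still comes from public.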

So, we can split our data into one schema per customer. That has the nice side effect of preventing data leakage due to programmer error, as a connection to the database (which has a specific search path at that time), cannot possibly return data to which it should not have access.

Starting to narrow

I work in Django, so from here on in, it starts to get rather specific to that framework.

Which schema, when?

One solution to this problem is to mirror the ‘one server per customer’ approach. That is, each customer gets a separate domain. Incoming requests are matched against the domain name used in the request, and the search path is set accordingly. This is quite simple, and is how django-schemata works. Some middleware matches up the incoming domain to the relevant schema settings, and sets the search path. Indeed, this is the simplest possible approach, as was intended. The schemata are set in the django settings file, which means you cannot create a new schema on the fly. django-appschema does allow this, by using a model in the public schema that contains the list of schemata.

This is not the approach I want, as I want everything to come in on the same domain.

So, I came up with a different concept. Base the schema selection upon the logged in user.

When a request comes in, use a lookup table of some sort to determine which schema should be used. Now, since this needs to happen after authentication, we will still need to store auth_user and friends in the public schema. But that’s alright. We can have a related object that determines which schema should be used for looking up the rest of the data, and ensure that those requests which should be schema-aware use that schema.

Indeed, it may be possible for a user to be able to choose between schemata, so we also need a way to pass this information. I settled on storing the desired schema in request.session, and the middleware checks that the authenticated user is allowed to access that schema.
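Stripped of the Django plumbing, the middleware’s decision might look something like this sketch. It is an assumption-laden outline rather than the actual code: the "schema" session key, the SchemaNotAllowed exception and both function names are hypothetical, and the session is modelled as a plain dictionary.

```python
class SchemaNotAllowed(Exception):
    """Raised when the session asks for a schema the user may not access."""


def choose_schema(session, allowed_schemata):
    """Return the schema requested in the session, or None if none is set."""
    requested = session.get("schema")
    if requested is None:
        return None
    if requested not in allowed_schemata:
        raise SchemaNotAllowed(requested)
    return requested


def search_path_sql(schema):
    """Build the statement that puts the chosen schema ahead of public."""
    parts = [schema, "public"] if schema else ["public"]
    # Quote each identifier, doubling any embedded double quotes.
    quoted = ", ".join('"%s"' % part.replace('"', '""') for part in parts)
    return "SET search_path TO " + quoted
```

The real middleware would run the resulting statement on the request’s database connection, after authentication has populated the user.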

That was the easy part

Working out what the search path should be is indeed the easy part. Set it at the start of the request (our middleware should be early in the chain), and away you go.

The hard problems are avoided by django-schemata because each ‘site’ has its own schema, and all of its tables are stored in there. Thus, you can simply run ./manage.py syncdb or ./manage.py migrate for each schema, and away you go. They provide a manage_schemata tool, which does just this. They also avoid shared apps, to simplify things.

I needed to be able to share models: indeed, by default, a model is shared. They live in the public schema, and queries will be on this schema. The approach I used was that, instead of using the SHARED_APPS, you need to explicitly mark a Model as “schema aware”.

The only time this really matters is at DDL time. When you query a table, Postgres looks in the search path. As long as you have schema_name,public, you will be fine for all reads and writes. However, to create the tables, you need to use some smarts.

syncdb

Whenever a syncdb happens, we need to do a few things:

  • Make sure we have our clone_schema SQL function loaded. This will be used to clone the schema structure from __template__ to new schemata.
  • Make sure we have a schema with the name __template__.
  • Set the search path: in this case it will be public,__template__. Usually, it will be the other way around, but for this case, we want tables to be created in public, unless we explicitly mark them to be created in __template__.
  • Run the syncdb command.
  • Create schemata for any Schema objects we have in the database.

The syncdb command just runs the normal old syncdb (or south’s, if that is installed). However, we do have a couple of bits of evil, evil magic.

Firstly, we don’t use the standard postgres database backend. Instead, we have one that subtly changes the DatabaseCreation.sql_create_model process. If a model is schema aware, we inject into the CREATE TABLE commands the schema. Thus:

CREATE TABLE "foo" ...;

Becomes:

CREATE TABLE "__template__"."foo" ...;

In fact, it’s a little more than that. We can pass in the name of the schema we are writing to, so, in our second trick, we add a post_syncdb listener, which iterates through all of the schemata, re-running just this command, with the schema passed in.
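The rewrite itself is just string manipulation. As a rough sketch (the real backend hooks into sql_create_model; the regular expression and function names here are my own illustration), qualifying the table name and repeating the statement per schema could look like:

```python
import re

# Matches the start of a CREATE TABLE statement with a quoted table name.
CREATE_TABLE = re.compile(r'^(CREATE TABLE )("[^"]+")', re.MULTILINE)


def qualify_create_table(sql, schema="__template__"):
    """Prefix the table name in CREATE TABLE statements with a schema."""
    def add_schema(match):
        return '%s"%s".%s' % (match.group(1), schema, match.group(2))
    return CREATE_TABLE.sub(add_schema, sql)


def statements_for_schemata(sql, schemata):
    """Return the same CREATE TABLE statement once per named schema."""
    return [qualify_create_table(sql, schema) for schema in schemata]
```

Using a replacement function rather than a replacement string avoids surprises if a schema name ever contains a backslash.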

loaddata

An override of the loaddata command adds the ability to declare the schema: the search path is then set before running the command. Simple.

dumpdata

The override for this command also allows passing in the schema name. Data in schema aware models will only be fetched from here (or the template schema, if nothing is passed in, which should be empty).

Migrations

Unless you are crazy, you probably already use South for migrations. So, the second really hard problem is how to apply the migrations to each schema.

Basically, we want to run each operation, if the model is schema aware, on the template schema, and then on each real schema. But, we can’t just run the migrations multiple times with the search path altered, because (a) South stores its migration record in a table in public, so the subsequent runs of the migration would not do anything, and (b) even if we could run them, any operations on the public schema would fail, as they have already been performed.

This looks like a really hard problem, and initially seemed insurmountable. However, there is one thing which makes it really quite simple. South looks in your settings file for SOUTH_DATABASE_ADAPTERS. I’m not sure this normally gets used, but it is required if you are not using a django database backend that South recognises. Like ours.

So, the database adapter is just a thin wrapper over South’s builtin postgres backend. It expects to find a class called DatabaseOperations, and wraps all of the methods that create/rename/delete or whatever tables and columns.

And the wrapper is quite simple. It finds the django model that matches the database table name, and then, if that model is schema aware, repeats the operation on each schema, and then the template schema. If the model is not schema aware, then we set the search path just to the public schema, so create operations will affect that.
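To show the shape of that wrapper without depending on South, here is a toy version. Everything in it is a stand-in: RecordingOperations plays the part of the wrapped backend, add_column takes a simplified signature, and the set of schema-aware tables is passed in rather than looked up from django models.

```python
class RecordingOperations:
    """Hypothetical stand-in for the wrapped backend: it only logs calls."""

    def __init__(self):
        self.search_path = None
        self.log = []

    def set_search_path(self, *schemata):
        self.search_path = schemata

    def add_column(self, table, column):
        self.log.append((self.search_path, table, column))


class SchemaAwareOperations:
    """Repeat an operation per schema when the table is schema aware."""

    def __init__(self, wrapped, schema_aware_tables, schemata):
        self.wrapped = wrapped
        self.schema_aware_tables = set(schema_aware_tables)
        self.schemata = list(schemata)

    def add_column(self, table, column):
        if table in self.schema_aware_tables:
            # Each real schema first, then the template schema.
            for schema in self.schemata + ["__template__"]:
                self.wrapped.set_search_path(schema, "public")
                self.wrapped.add_column(table, column)
        else:
            # Shared tables live in public only.
            self.wrapped.set_search_path("public")
            self.wrapped.add_column(table, column)
```

The real adapter wraps every method that creates, renames or deletes tables and columns in the same way; a single method is enough to show the pattern.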

More hackery

Admin users are permitted to see data from all schemata, so it’s possible they’ll see a link to an object that is not in their schema. Primary keys are unique across all schemata (but this is simply because the index is shared between them in this implementation), so they’ll get a 404 if they try to follow it. If PKs were not unique across schemata, then they might see the wrong object. Anyway, we can leverage the schema-switching middleware by passing in the correct schema in the URL. But to do this, we need to know at LogEntry creation time which schema is active, if the object is schema aware. So, I monkey-patch LogEntry to store this information, and generate URLs that include it.

I’ve also done a bit of work on adding the schema in when you serialise data. This is mainly for dumpdata, as I don’t think you should be passing around objects for deserialisation to untrusted sources: it really should be going through form validation or similar. But for loaddata/dumpdata, it might be useful. At this stage the schema value is not used, but eventually the deserialiser should look at that, and somehow ensure the object is created in the correct schema. For now, just use loaddata --schema .... That’s better anyway.

ImproperlyConfigured

One really nice thing about django is that it has an ImproperlyConfigured exception, which I have leveraged. If the database engine(s), south database adapter(s) or middleware are missing or incorrect, it refuses to load. This is conservative: you may have a database of a different engine type, or have no schema aware models, but for now it’s not a bad idea.

Also, if South is not installed before us, we need to bail out.

Well, that’s most of it. There’s some more gravy (signals before and after activation of schema, as well as when a new schema is created), but while it has been tested, there is no automated test suite as yet. Nor is there a public example project. But they are coming.

Oh, and it’s up on BitBucket: django-multi-schema. Although, I’m not that happy with the name.