Plain-text bad-news

For the past few years, I’ve run in the City-Bay, Adelaide’s largest Fun Run.

This year, when entering, I needed to reset my password.

Imagine my surprise when they sent me my password in plain text.

Then, imagine my surprise, or lack of it, when they sent me:

Dear Matthew,

We would like to advise that we have removed access to our online entry system after customers reported data missing from their records. Although investigations are underway, we would like to advise no financial information of race entrants is stored on our system and is therefore not at risk.

We recommend that you change the password you have used on our website (once it is deemed safe to reopen). As an added precaution we also advise that you change your password on any other website it is used.

Thank you for your understanding and continued interest in our event,

Kind Regards,

City-Bay Fun Run team

www.city-bay.org.au

Well, you know, if you store passwords in plain text, then you are just asking for trouble.

Review Django Essentials

Django Essentials. Note it appears the name of this book has been changed from “Getting started with Django”.

I’ll be clear from the outset: I have some pretty strong issues with the first part of this book, and I’m going to be quite specific about the things that I think are wrong with it. Having said that, the later chapters are far better than the earlier ones.

I am not sure, however, that it’s any more accessible than the official documentation. There’s probably a market for a more thorough tutorial than the one on the Django website; however, I’m not sure this book, as it stands, is that tutorial.

How could this book be better?

I think it gets bogged down providing detail in areas that are just not that important at that point. I also think it misses a good overview of the product that is being built: indeed, it’s never clear, even after completing the book, exactly what the product is supposed to do.

In my opinion, the code examples are hard to read. This is a combination of the styling of the source code, and the layout. That bold blue is quite jarring in comparison to the rest of the text, and the repeated lack of PEP8 compliance, especially when coupled with reading on a narrow device, makes it hard to follow the code. Multiple code blocks (which should be in separate files) flow together, making it hard to see where one stops and the next begins.

The book fails early on to push some basic Python standards and best practices. In some cases these are addressed later on; however, it is not obvious what is gained by not starting there. Similarly, there are some security issues that should never have passed through editing. Again, these are addressed later, but I feel that the damage has already been done. Friends don’t let friends store passwords in plain text, and very little is gained by disabling the CSRF protection.

But it’s not just the source code that seems lacking. The technical translation at times varies between the obtuse and the absurd. Early chapters in particular (the ones that I think are more important when teaching basic concepts) contain sentences or paragraphs that I had to re-read several times before I could translate them into something that made sense to me. And I’ve been writing Django code for about 6 years (and Python code for probably another 6 before it).

Would I recommend it?

After hitting the plain-text-password section, I said no. I work with a couple of guys much newer to Django than I am, and at that point I did not want them to read the book.

However, after I’d cooled down, and actually started to draft this review, I re-read the start, and read the rest. There is some good information, but I’m not sure that it’s presented in a way that is better than the official documentation, or some other resources out there.

So, I’m really not sure I’d recommend it to a beginner. There are too many things early in the book that set up for future failures (or at least, unlearning). And I’m not sure I’d recommend it to an intermediate developer. It’s not that it’s bad (with the caveats below); it’s just not as good as what you can read on the Django website.

The more important specific issues that I feel are wrong with this book follow. These are often things that beginners struggle with. You’ll notice less about the later chapters: that’s because they are better.


Code standards

Throughout the book, there are inconsistencies in how individual models and modules are named. Whilst this seems pedantic, computers are pedantic when it comes to source code. It does matter if you use Work_manager in one place, and Workmanager in another.

Further, in Python, we always (unless the project we are working on has different standards) use snake_case for module names, TitleCase for class names, and snake_case again for variables, methods and functions, and ANGRY_SNAKE_CASE for constants. There’s just no reason to go against these guidelines.

Okay, I may have made up the name ANGRY_SNAKE_CASE.

Finally, Python code should be compliant with PEP8. I’m not sure that a single line of code in this book would pass through a PEP8 checker.
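
To illustrate the conventions above (a made-up example; the names here are mine, not the book’s):

# work_manager.py: module names are snake_case

MAX_RETRIES = 3  # constants are ANGRY_SNAKE_CASE


class WorkManager(object):  # class names are TitleCase

    def retry_task(self, task_id):  # methods and variables are snake_case
        for attempt_number in range(MAX_RETRIES):
            print('Attempt {0} of task {1}'.format(attempt_number, task_id))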


MVC/MVT

The section on “The MVC Framework” (tip: Django isn’t) seems superfluous. It would be far better to avoid this term, and instead describe the typical flow of data that one might see in a request-response cycle handled by Django:

  1. The client sends a request to the server
  2. The server passes the request to the correct view function (according to the url)
  3. The view function performs the required work, and returns an HttpResponse object.
  4. The HttpResponse object is sent back to the client.

Depending upon the view, it may do any or all of the following:

  • Process data provided by the client using a Form
  • Load and/or save data to/from the database
  • Render an HTML template or return a JSON (or XML) response.
  • Perform any other action that is required
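
For example, the simplest possible version of step 3 (a sketch; nothing here is from the book):

from django.http import HttpResponse


def hello(request):
    # Perform the required work, and return an HttpResponse object.
    return HttpResponse('Hello, world')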

The whole concept of a Controller doesn’t really make sense in the context of a web page, although purely within the client-side of a Single-Page-Application it could.


Installation

I’ve written about installation before, notably discussing how every project should be installed into a new virtualenv. Indeed, I even install every command-line application in its own environment. And most of the experienced Pythonistas I have come across always use a new virtualenv for each project, both in development and in deployment. So it was worrying to see a non-best-practice approach used for installation.

Although this is addressed later in the book (in the chapter on deployment), I fail to understand the benefit of not mentioning it now. There are so many reasons to use virtualenv in development, and none I can think of for avoiding it.


Security

There are two things in this book that set off alarm bells for me, with respect to security. I’ve mentioned them above, but I’ll go into a little more detail.

The more minor error is the disabling of CSRF checking. The inbuilt Django CSRF protection ensures a range of attacks are ineffective, and the mental cost of using this protection is fairly low: in any view that you are POSTing back to the server, you need to include the CSRF token. This is usually done as a form field, using the csrf_token template tag.

Disabling it is almost never a good idea.

Suggesting that you disable it “just for now” as the only thing you change in the initial settings file is even worse. A beginning programmer may begin routinely disabling CSRF protection as they start a new project, and not re-enabling it. Bad form.

The severe error is storing user passwords in plain text. This flaw is so basic that, even though it is “fixed” later in the book, as is CSRF protection, by then I feel it is too late. Even hinting that either of these things is acceptable to do as an interim measure (do you have any idea how much “interim” or temporary code I have in production still, years after it was written?) makes me really struggle to continue reading.

However, I am glad I did.


URL routing and regular expressions

This book contains a reasonable explanation of regular expressions, but I think it would have been better to include a more concrete set of examples of how regular expressions are used for URL routing in Django. For instance:
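
(Something along these lines; the app and view names are invented, not the book’s.)

from django.conf.urls import url

from blog import views  # a hypothetical app


urlpatterns = [
    # A literal match: the empty path.
    url(r'^$', views.index),
    # Capture four digits as the named parameter "year".
    url(r'^archive/(?P<year>\d{4})/$', views.year_archive),
    # Capture a slug: letters, digits, hyphens and underscores.
    url(r'^article/(?P<slug>[-\w]+)/$', views.article_detail),
]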

You could use a series of examples, like these, to describe some of the key rules of regular expressions, and at the same time discuss parameters. Alternatively, you could skip regular expressions altogether at this point, and use simple strings.

When discussing URL routing, the following paragraph is a great example of a failure to explain what is essentially a simple process.

“After having received a request from a web client, the controller goes through the list of URLs linearly and checks whether the URL is correct with regular expressions. If it is not in conformity, the controller keeps checking the rest of the list. If it is in conformity, the controller will call the method of the corresponding view by sending the parameters in the URL.”

Phrased in a simpler manner:

“The URL resolver looks at each pattern in the order it is defined. If the regular expression of the url route matches the request’s path, the matching view method is called with the request object and any other parameters defined, otherwise it is passed on to the next route.”


Templates

This book presents a reasonable discussion of the Django template language. There are some parts that made me do a double-take (legacy of templates? Oh, you mean inheritance), and there are lots of typos, missing characters, and just plain wrong source code.

And then there’s render_to_response.

Back in the day, we used a function called render_to_response(), which required you to manually pass a RequestContext instance to it: we have since moved on to render(). There is really no need to mention render_to_response() in anything other than a footnote: “You might see older code that uses…”

Talking about the context itself is good, but I think it should be more explicit: “You pass three arguments to render(): the request object, the template path and a dict containing the variables from your view that you want available in the rendering context”.
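
In code (a sketch; the model and template names are invented):

from django.shortcuts import render

from blog.models import Article  # a hypothetical model


def article_list(request):
    articles = Article.objects.all()
    # render(request, template_path, context_dict)
    return render(request, 'blog/article_list.html', {'articles': articles})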

Oh, and later in the book, locals() is passed as the context. The Zen of Python: explicit is better than implicit. Yes, in the box immediately afterward, it is suggested that you don’t do this.

Doing something, and then suggesting that you don’t do it is counterproductive.


Models

Django’s ORM gets some criticism at times. I find it’s mostly good enough for my needs; indeed, it sometimes does a better job of writing queries than I do. However, it is an Object Relational Mapper, and discussing how that works in simple terms would probably be useful. A strong background in relational databases and SQL is not strictly necessary to use it correctly, but understanding how access through the ORM can cause issues, and indeed how the data is even represented in the database, can only be a positive.

“To make a connection between databases and SQL, we can say that a model is represented by a table in the database, and a model property is represented by a field in the table.”

Cumbersome language again, and not totally wrong, but probably slightly misleading. Perhaps:

“…a Model class is represented by a table in the database, and an instance of that Model is represented by a row/tuple. Fields on the model (which appear as special attributes) are the columns of that row.”
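
Or, in code (the model is invented, purely for illustration):

from django.db import models


class Article(models.Model):                  # maps to a table, e.g. "blog_article"
    title = models.CharField(max_length=100)  # a column of that table
    published = models.DateField(null=True)   # another column

# Each instance, e.g. Article(title='...'), is one row in that table.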

A discussion of South is also welcome. Even though the soon-to-be-released Django 1.7 contains a superior (but written by the same person) implementation of migrations, it’s certainly still worth understanding a little about how South works.

However, there is one false statement when discussing South:

“Never perform the Django syncdb command. After running syncdb --migrate for the first time, never run it again. Use migrate afterwards.”

This is a broken statement. If you were to add a new app that did not have migrations, then without a syncdb command, the tables for its models would not be created.

This chapter suddenly gets a whole lot worse as soon as a model is defined with a plain-text password field, but I’ve already discussed that.


django.contrib.admin

I spend a lot of time trying to talk people out of using the admin module as anything other than a top-level admin tool. Really, this is a tool that is fantastic for viewing and manipulating data in the early stages of development, and great for emergency access for a developer or trusted admin to view or change data on a live system, but trying to push too much into it is problematic. I say that as someone who has a project that relies far too much on the admin.

It’s also hard not to discuss the admin, as it really is a great tool, but it’s really important to understand its limitations.

I quote Django core contributor Russ Keith-Magee:

“Django’s admin is not meant to be the interface for your website”


QuerySets

Interestingly, the chapters on QuerySets and Forms are actually far better than those preceding. The source code isn’t formatted any better, but it really does seem that the translations make (mostly) more sense.

I do think the manner of adding data to the database is bunkum, however. Given that we just covered the admin interface, it would make sense to use this to add some data, before looking at QuerySets. And we could delve into manage.py shell in order to illustrate how QuerySets, their various methods, and some model methods actually work.
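
For instance, a shell session something like this (the model and data are invented) shows what the ORM is actually doing:

$ python manage.py shell
>>> from blog.models import Article
>>> Article.objects.count()        # SELECT COUNT(*) FROM "blog_article"
2
>>> qs = Article.objects.filter(title__contains='django')
>>> qs                             # QuerySets are lazy: the query runs here
[<Article: My django article>]
>>> qs[0].title
u'My django article'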

And while we are on anti-patterns: queryset[:1].get() is pointless. You might as well just use queryset[0]. It is exactly the same SQL, and easier to read.


Forms

And then we get to Forms. I’m a really big fan of Django’s form handling: it’s something that makes dealing with user input much safer, and simpler. And this chapter explains that, but, from an educational perspective, I’m cautious that showing someone the wrong way to do something first is counter-productive.

Sure, I understand that it makes a point: having done something a laborious, error-prone way for some time, and then being shown a safer, faster, easier method, is eye-opening. But I fear that some percentage of readers will take away that not using Forms is a valid choice.

Even beginning with a ModelForm is probably a nice approach, as you can get a lot of functionality with almost no code at all.
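
Something like this (reusing the invented Article model from earlier) gets you field definitions, validation and saving, with almost no code:

from django import forms

from blog.models import Article  # hypothetical, as before


class ArticleForm(forms.ModelForm):
    class Meta:
        model = Article
        fields = ['title', 'published']

The form’s fields, and their validation rules, all come from the model definition.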


CBV

The section on Class Based Views is okay too. They are another part of Django that is often hard to understand, and the initial official documentation on them was sadly lacking. Once you have your head around how they work they can be really powerful, and this book takes the right approach in cautioning against using them everywhere. Similarly, it is great that these were not used as a starting point.

However, I find that the explanations and descriptions are not always clear. Certainly, as an experienced Django user I can read and understand what is going on, but as a beginner I think this chapter would be hard to follow. Perhaps a simple discussion of what the different CBV are used for, how the ViewClass.as_view() pattern works and why it is required, followed by some examples, would have helped.

Perhaps a better approach would have been to write Model and Form classes earlier, then write the function-based views to CRUD those objects, and then rewrite the exact same views using CBV.
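
That is, something like the following (a sketch, again with invented names):

from django.shortcuts import render
from django.views.generic import ListView

from blog.models import Article  # hypothetical


# The function-based view...
def article_list(request):
    articles = Article.objects.all()
    return render(request, 'blog/article_list.html', {'object_list': articles})


# ...and the exact same view, rewritten as a class-based view.
class ArticleList(ListView):
    model = Article

In the URLconf, the class-based version is hooked up with ArticleList.as_view() rather than a plain function reference.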


django.contrib.auth

Although far less impressive than the admin, I think that auth is a more important module. Especially now, given we can easily swap out auth.User to get the desired user functionality, I think this is something that should be given more weight. It doesn’t necessarily need to come before the chapter about the admin, but it should be discussed, or at least introduced, before anything is done with a User-ish model.

I think this book does not do justice to django.contrib.auth. There are lots of views and forms that can (and should) be used to save writing your own code, which is more likely to have bugs. Also, even if the basic User model is used in the example, a discussion of how easy it is to swap it out, and get “email as username” functionality, is certainly deserved.
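
To show how little is needed (a sketch with an invented app name; a real project would also want a custom manager providing create_user() and create_superuser()):

# settings.py
AUTH_USER_MODEL = 'accounts.User'

# accounts/models.py
from django.contrib.auth.models import AbstractBaseUser
from django.db import models


class User(AbstractBaseUser):
    email = models.EmailField(unique=True)

    USERNAME_FIELD = 'email'  # "email as username"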


AJAX

I’m probably 50-50 on the AJAX chapter. I guess I understand why you’d want to include a chapter on it, but I worry that this chapter doesn’t do enough. If it’s meant to be an introduction to AJAX, I’m not sure it’s up to that.

I do often use jQuery, but it’s probably not too tricky to rewrite the code to delete an object using vanilla JavaScript. And if you are going to use jQuery, you should get the idioms right.

var cases = $('nav ul li').each(function() {
  $(this).addClass('nav_item');
})

This can more simply be written as:

$('nav ul li').addClass('nav_item');

And we probably shouldn’t use $(foo).html($(foo).html() + bar), when really we want to use $(foo).append(bar).

Also, I don’t think that using csrf_exempt is a great idea: the official documentation has details about how to use AJAX and still keep CSRF protection.


Thanks to:

  • Maior in #django for proofreading.

Dear ABC, please unbreak iView

Dear ABC,

The latest iOS iView apps use a custom media player, rather than the system-supplied one. This actually degrades performance and usability in a few areas.

Specifically, when the stream is still loading, and you select AirPlay, then it will sometimes not actually send to the target device (and will not play on the local device either). This did not (to my recollection) occur with the old version.

Even more annoying, when attempting to scrub to a later point within the programme, it would more often than not just revert back to the previous point (prior to scrubbing). This happened about 60% of the time I was using the app last night. Again, this is something I have not seen when using either the old iView app, or other apps that use the default media player controls (but is something I see in apps that use custom players, such as the AU commercial channels' apps).

Finally, after interacting with the controls, they do not fade after a period of time. Again, this is a significant regression compared to the previous version(s), and standard media controls.

Regards,

Matt.

(Note: I’ve also posted this as a complaint on http://www.abc.net.au/contact/contactabc.htm).

tox and coverage.py

Tox makes it really easy to run your project’s tests in multiple environments: against different versions of Python, or different versions of a related library.

It’s still lacking proper matrix testing: you need to manually define each environment. Apparently, that is going to change.

However, that’s not what I’m writing about today.

Today is about coverage testing, using coverage.py.

It’s possible, using tox, to get coverage.py to run:

[testenv]
commands=
  coverage run setup.py test
  coverage report
deps=
  coverage

However, this will generate a coverage report for just that environment. It would be better if you generated a coverage report for the whole project (although you may want per-environment coverage testing too).

So, we can abuse the fact that the environments in the tox envlist will be created and processed in the order they appear:

[tox]
envlist = clean,py27,py34,stats

[testenv]
commands=
  coverage run -a setup.py test
deps=
  coverage

[testenv:clean]
commands=
  coverage erase

[testenv:stats]
commands=
  coverage report
  coverage html

You’ll then get a nice html report in htmlcov/, and a printed coverage report in your console.

Multiple Homebrew Pythons

The Homebrew project brings a really nice package installer to OS X, allowing you to install command-line programs in a simple way. For instance, to install the latest version of Python 3, you would do:

$ brew install python3

Because not all projects are as aware of old versions as Python is, when brew upgrades a package, it removes the old version’s linked binaries and support files. This is not a good thing for Python: it means you can no longer access the older interpreter.

Python keeps version-named interpreters, and then just symlinks the most recently installed one to the python executable. Thus, it’s not uncommon to see, for python2:

$ ls -1 /usr/bin/python*
/usr/bin/python
/usr/bin/python-config
/usr/bin/python2.5
/usr/bin/python2.5-config
/usr/bin/python2.6
/usr/bin/python2.6-config
/usr/bin/python2.7
/usr/bin/python2.7-config
/usr/bin/pythonw
/usr/bin/pythonw2.5
/usr/bin/pythonw2.6
/usr/bin/pythonw2.7

This means, if you want to run an older version (for instance, say you use tox and want to do some testing against a range of versions), you can just use:

$ python2.5
Python 2.5.6 (r256:Unversioned directory, Mar  9 2014, 22:15:03) 
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> 

But Homebrew breaks this.

However, if you have the old versions already installed, you can easily recreate the symlinks. Indeed, here is a script that will visit all Homebrew-installed python3 versions, creating the appropriate symlinks:

cd /usr/local/Cellar/python3/

for VERSION in `ls`
do
  cd $VERSION
  
  find bin -name \*3\* -exec ln -s -f `pwd`/{} /usr/local/{} \;
  find lib -name \*python\* -maxdepth 1 -exec ln -s -F `pwd`/{} /usr/local/{} \;
  find share -name python\* -exec ln -s -f `pwd`/{} /usr/local/{} \;
  find Frameworks -name 3.\* -exec ln -s -f `pwd`/{} /usr/local/{} \;
  
  cd ..
done

It worked for me for python3 with the following versions installed:

  • 3.2.3
  • 3.3.3
  • 3.4.1

Now I just need to figure out how to get Homebrew to download and build specific versions of packages.

Adding JSON operators to PostgreSQL

Notably, the new JSONB data type in Postgres is missing some of the features that the hstore data type has had for some time:

hstore - text     : delete key from left operand
hstore - text[]   : delete keys from left operand
hstore - hstore   : delete matching pairs from left operand

However, using two awesome features of Postgres, it’s possible to add these operators in.

Firstly, Python as a language in Postgres:

matt=# CREATE LANGUAGE plpythonu;

Then, you can write functions in Python that import standard system libraries, like json.

CREATE OR REPLACE FUNCTION json_subtract(json_object json, key text) RETURNS json AS $body$
import json
data = json.loads(json_object)
data.pop(key, None)
return json.dumps(data)
$body$ LANGUAGE plpythonu;

Finally, you can now overload the operator to get the syntactic sugar you want:

CREATE OPERATOR - (
  LEFTARG = json,
  RIGHTARG = text,
  PROCEDURE = json_subtract
);

Now, you can use the same syntax as for hstore:

matt=# SELECT '{"a":1, "b":2}'::json - 'a'::text;
 ?column? 
----------
 {"b": 2}
(1 row)

It’s possible to repeat these for the other subtraction operators:

CREATE OR REPLACE FUNCTION json_subtract(json_object json, keys text[]) RETURNS json AS $body$
import json
data = json.loads(json_object)
for key in keys:
    data.pop(key, None)
return json.dumps(data)
$body$ LANGUAGE plpythonu;

CREATE OPERATOR - (
  LEFTARG = json,
  RIGHTARG = text[],
  PROCEDURE = json_subtract
);

CREATE OR REPLACE FUNCTION json_subtract(json_object json, pairs json) RETURNS json AS $body$
import json
data = json.loads(json_object)
pairs_data = json.loads(pairs)
for key,value in pairs_data.items():
  if key in data and data[key] == value:
    data.pop(key)
return json.dumps(data)
$body$ LANGUAGE plpythonu;

CREATE OPERATOR - (
  LEFTARG = json,
  RIGHTARG = json,
  PROCEDURE = json_subtract
);

I’ll leave it as an exercise to write functions for the other operators.

In-Database Audit Trail

I’ve been thinking about audit trails, object versioning and the like a bit lately. We have situations where it’s actually important to be able to know what changes were made, when, and by whom.

The tool we are currently using in Django for this is django-reversion, but I’m not that happy with it. Part of the problem is that it adds a fair amount of overhead into each request, which we have noticed, and also that querying the data it produces is not simple, unless you are just looking for changes made to a specific object.

For our case, we want to know all the changes made to a set of objects (that may not all be of the same type), made within a given range, where there is (eventually) a foreign key relationship back to our Company model.

Part of the problem is just that: a custom solution could store the Company relationship explicitly, but also, we really care about what the changes were, rather than what the object looks like after the changes. Well, technically, we want both.

However, it has gotten me thinking about other ways to solve this problem.

In most DBMSs, it’s possible to get a trigger function to run when an update happens to each row, which makes it possible to get access to this information. Indeed, googling “Postgres audit trigger” pops up some hits that are highly relevant. I looked at Audit trigger 91plus, since it’s “improved”. It has some really nice features like storing the audit data in an hstore column, which means you can query it.

However, one place where this type of trigger breaks down is that it’s not generally possible to get the “application user” associated with a request, only the “database user”, which for our system (and most other web applications) is fixed for all access.

One way to get around this might be to, at the start of every database transaction, inject a call that creates a temporary table, with the extra bits of data that you want to log, and then use that in the trigger function.

CREATE TEMP TABLE IF NOT EXISTS
  "_app_user" (user_id integer, ip_address inet);

Then we need to add (or update) the one row that will contain our data. We must ensure that we only ever have one row in this table.

UPDATE _app_user SET user_id=%s, ip_address=%s;
INSERT INTO _app_user (user_id, ip_address)
  SELECT %s, %s WHERE NOT EXISTS (SELECT * FROM _app_user);

This code will ensure that the first statement (UPDATE) will affect all rows in the table (of which there will be at most one), and the second statement (INSERT ... SELECT ... WHERE NOT EXISTS ...) will only create a new row if there are no rows currently in the table.

It’s up to you to then pass the correct data to this. I’m currently looking at doing this using Django middleware, although I suspect this may fall down using the newer transaction handling, as otherwise we could have just ensured our middleware ran after the TransactionMiddleware. It may be possible to do it with a custom database backend, but it needs to somehow get access to the request object (which contains the user, and the ip address). Obviously, you could log other data about the request, too.
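
A rough sketch of that middleware (the class name is mine; it assumes old-style middleware, installed after AuthenticationMiddleware so that request.user is available):

from django.db import connection


class AuditContextMiddleware(object):
    def process_request(self, request):
        user_id = request.user.pk if request.user.is_authenticated() else None
        ip_address = request.META.get('REMOTE_ADDR')
        cursor = connection.cursor()
        # Same statements as above: create the temp table, then upsert one row.
        cursor.execute(
            'CREATE TEMP TABLE IF NOT EXISTS '
            '"_app_user" (user_id integer, ip_address inet)')
        cursor.execute(
            'UPDATE _app_user SET user_id=%s, ip_address=%s',
            [user_id, ip_address])
        cursor.execute(
            'INSERT INTO _app_user (user_id, ip_address) '
            'SELECT %s, %s WHERE NOT EXISTS (SELECT * FROM _app_user)',
            [user_id, ip_address])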

The final part of the puzzle is to inject this data into the row that will be used for the audit table entry. I modified the table definition so it included columns for the data I wanted: app_user_id and app_ip_address.

Then, inside the actual trigger function, after the audit_row object has been created, but before it is written to the table, we inject the data we want.

We need to be a little careful, as it’s possible the table does not exist:

BEGIN
  PERFORM 
    n.nspname, c.relname 
  FROM
    pg_catalog.pg_class c 
  LEFT JOIN 
    pg_catalog.pg_namespace n
  ON n.oid = c.relnamespace
  WHERE
    n.nspname like 'pg_temp_%' 
  AND
    c.relname = '_app_user';

  IF FOUND THEN
    FOR r IN SELECT * FROM _app_user LIMIT 1 LOOP
      audit_row.app_user_id = r.user_id;
      audit_row.app_ip_address = r.ip_address;
    END LOOP;
  END IF;
END;

This checks to see if the _app_user table exists in any of the valid temporary table namespaces, and if so, grabs the first (and only, from above) entry, using the values to update the row.

This function then works: if there is a temporary table with this name, it uses these fields when creating the audit row; if not, it creates the audit row with empty values. This would mean that some audit statements may not contain proper user data, but in the case of Django, it’s possible to make changes outside of the request-response cycle. You could require that a user starting a shell session authenticates with a valid Django username and password, but that still leaves management commands. I guess you could have a system account, but leaving these entries blank is effectively an explicit system user.

I haven’t got any production code using anything like this: I’d still want to test that it works as expected as part of the request, and would want to build up some method of querying it. There’s probably no reason you couldn’t do the table definition as a Django model (and indeed, have the function definition as a migration).

rrule to RFC-string

I’ve been playing around with Postgres lots lately, and I had the idea to store RRULE data in a Postgres Composite Type.

It then occurred to me I didn’t need to reimplement all of the great stuff that is in the Python dateutil module: I can just use PL/Python and import it.

The next step was realising I didn’t need to use a custom type, but just use a new Domain of type text that validates that the string is an RFC 2445-compatible string.

That’s all well and good, but dateutil doesn’t come with a way to convert rrule objects back into a string.

There is a patch, but it wasn’t quite right. So here is my method (which can be monkey-patched, as seen at the end):

FREQNAMES = ['YEARLY', 'MONTHLY', 'WEEKLY', 'DAILY', 'HOURLY', 'MINUTELY', 'SECONDLY']

def rrule_to_string(rule):
    output = []
    h, m, s = None, None, None
    if rule._dtstart:
        output.append(rule._dtstart.strftime('DTSTART:%Y%m%dT%H%M%S'))
        h, m, s = rule._dtstart.timetuple()[3:6]
    
    parts = ['FREQ='+FREQNAMES[rule._freq]]
    if rule._interval != 1:
        parts.append('INTERVAL='+str(rule._interval))
    if rule._wkst:
        parts.append('WKST='+str(rule._wkst))
    if rule._count:
        parts.append('COUNT='+str(rule._count))
    
    for name, value in [
            ('BYSETPOS', rule._bysetpos),
            ('BYMONTH', rule._bymonth),
            ('BYMONTHDAY', rule._bymonthday),
            ('BYYEARDAY', rule._byyearday),
            ('BYWEEKNO', rule._byweekno),
            ('BYWEEKDAY', rule._byweekday),
            ]:
        if value:
            parts.append(name+'='+','.join(str(v) for v in value))
    
    # Only include these if they differ from rule._dtstart
    if rule._byhour and rule._byhour[0] != h:
        parts.append('BYHOUR=' + ','.join(str(v) for v in rule._byhour))
    if rule._byminute and rule._byminute[0] != m:
        parts.append('BYMINUTE=' + ','.join(str(v) for v in rule._byminute))
    if rule._bysecond and rule._bysecond[0] != s:
        parts.append('BYSECOND=' + ','.join(str(v) for v in rule._bysecond))
    
    
    output.append(';'.join(parts))
    return '\n'.join(output)

from dateutil.rrule import rrule
rrule.__str__ = rrule_to_string
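
With the monkey-patch applied, a quick check in the Python shell (matching the SQL session below):

>>> from datetime import datetime
>>> from dateutil.rrule import rrule, DAILY
>>> print(rrule(DAILY, dtstart=datetime(1997, 9, 2, 9, 0), interval=10, count=5))
DTSTART:19970902T090000
FREQ=DAILY;INTERVAL=10;COUNT=5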

I’ve only lightly tested it (in both regular Python and PL/Python). I’ve also come up with a neat way of caching rrule objects between function calls, but that’s a topic for another day. Here it is in an SQL session:

SELECT rrulestr('DTSTART:19970902T090000 FREQ=DAILY;INTERVAL=10;COUNT=5');

            rrulestr            
--------------------------------
 DTSTART:19970902T090000       +
 FREQ=DAILY;INTERVAL=10;COUNT=5
(1 row)

Querying JSON in Postgres

Yesterday, I discovered how you can enable jsonb in postgres/psycopg2.

Today, I experimented around with how to query the data in json columns. There is documentation, but it wasn’t initially clear to me how the different operations worked.

CREATE TABLE json_test (
  id serial primary key,
  data jsonb
);

INSERT INTO json_test (data) VALUES 
  ('{}'),
  ('{"a": 1}'),
  ('{"a": 2, "b": ["c", "d"]}'),
  ('{"a": 1, "b": {"c": "d", "e": true}}'),
  ('{"b": 2}');
  

So far, so good. Let’s see what’s in there, to check:

SELECT * FROM json_test;
 id |                 data                 
----+--------------------------------------
  1 | {}
  2 | {"a": 1}
  3 | {"a": 2, "b": ["c", "d"]}
  4 | {"a": 1, "b": {"c": "d", "e": true}}
  5 | {"b": 2}
(5 rows)  

Super. Let’s have a go at filtering those results. There are several operators that we can use (and we’ll soon see why we chose jsonb).

Equality

Only available for jsonb, we can test that two JSON objects are identical:

SELECT * FROM json_test WHERE data = '{"a":1}';
 id |   data   
----+----------
  2 | {"a": 1}
(1 row)

Containment

Again, jsonb only, we can see if one JSON object contains another. In this case, containment means “is a subset of”.

SELECT * FROM json_test WHERE data @> '{"a":1}';

Give me all objects that contain the key "a", with the value 1 associated with that key:

 id |                 data                 
----+--------------------------------------
  2 | {"a": 1}
  4 | {"a": 1, "b": {"c": "d", "e": true}}
(2 rows)

Containment goes both ways:

SELECT * FROM json_test WHERE data <@ '{"a":1}';

In this case, we can see that the query object is a superset of the empty object, as well as matching exactly to object 2.

 id |   data   
----+----------
  1 | {}
  2 | {"a": 1}
(2 rows)

Key/element existence

The last batch of jsonb only operators: we can test for the existence of a key (or an element of type string in an array, but we’ll get to those later).

SELECT * FROM json_test WHERE data ? 'a';

Give me all objects that have the key a.

 id |                 data                 
----+--------------------------------------
  2 | {"a": 1}
  3 | {"a": 2, "b": ["c", "d"]}
  4 | {"a": 1, "b": {"c": "d", "e": true}}
(3 rows)

We can also test for objects that have any of a list of keys:

SELECT * FROM json_test WHERE data ?| array['a', 'b'];
 id |                 data                 
----+--------------------------------------
  2 | {"a": 1}
  3 | {"a": 2, "b": ["c", "d"]}
  4 | {"a": 1, "b": {"c": "d", "e": true}}
  5 | {"b": 2}
(4 rows)

And, as you may expect, for objects that have all of the keys:

SELECT * FROM json_test WHERE data ?& array['a', 'b'];
 id |                 data                 
----+--------------------------------------
  3 | {"a": 2, "b": ["c", "d"]}
  4 | {"a": 1, "b": {"c": "d", "e": true}}
(2 rows)

Key-path traversal

We can also filter records that have a matching key-path. In simple cases, using the containment operators might be simpler, but in more complex situations, we would need to use these. These operations can also be used to extract a value, although at this stage I’m only interested in using them as part of a WHERE clause.

SELECT * FROM json_test WHERE data ->> 'a' > '1';

Give me all the records where the value of the element associated with key a is greater than 1. Notice the need to use a text value, rather than a number. I’m still investigating how this will play out.

 id |           data            
----+---------------------------
  3 | {"a": 2, "b": ["c", "d"]}
(1 row)

We can also do comparisons between primitives, objects and arrays:

SELECT * FROM json_test WHERE data -> 'b' > '1';
 id |                 data                 
----+--------------------------------------
  3 | {"a": 2, "b": ["c", "d"]}
  4 | {"a": 1, "b": {"c": "d", "e": true}}
  5 | {"b": 2}
(3 rows)

So, it seems that arrays and objects sort greater than numbers.

We can also look deeper down the path:

SELECT * FROM json_test WHERE data #> '{b,c}' = '"d"';

Give me objects where element b has a child object that has element c equal to the string "d". Neat.

 id |                 data                 
----+--------------------------------------
  4 | {"a": 1, "b": {"c": "d", "e": true}}

There are also versions of these operators that return a text, rather than a json object. In the case of the last query, that means we don’t need to compare to a JSON object (in the case where we actually want a string).

SELECT * FROM json_test WHERE data #>> '{b,c}' = 'd';
 id |                 data                 
----+--------------------------------------
  4 | {"a": 1, "b": {"c": "d", "e": true}}
(1 row)

Don’t cross the streams…

So, all good so far. We can query stuff, and this same stuff can be used to index jsonb columns, too.

However, the more astute reader may have noticed that I’ve been dealing with json data that has an object as its root. This needn’t be the case: arrays are also valid json; indeed, so are any of the allowable atoms:

SELECT 
  'null'::json, 
  'true'::json, 
  'false'::json, 
  '2'::json,
  '1.00001'::json,
  '"abc"'::json, 
  '1E7'::jsonb;

Note the last one is a jsonb, which converts to canonical form:

 json | json | json  | json |  json   | json  |  jsonb   
------+------+-------+------+---------+-------+----------
 null | true | false | 2    | 1.00001 | "abc" | 10000000
(1 row)

Note also that a json null is different to an SQL NULL.

So, what happens when we start storing objects of mixed “type” in a json column?

I’m glad you asked.

INSERT INTO json_test (data) 
VALUES ('[]'), ('[1,2,"a"]'), ('null'), ('1E7'), ('"abc"');

SELECT * FROM json_test;
 id |                 data                 
----+--------------------------------------
  1 | {}
  2 | {"a": 1}
  3 | {"a": 2, "b": ["c", "d"]}
  4 | {"a": 1, "b": {"c": "d", "e": true}}
  5 | {"b": 2}
  6 | []
  7 | [1, 2, "a"]
  8 | null
  9 | 10000000
 10 | "abc"
(10 rows)

So far, so good. We can store those objects. And query?

Equality testing works fine:

SELECT * FROM json_test WHERE data = '{"a":1}';
SELECT * FROM json_test WHERE data = 'null';

Containment, too, works as expected.

SELECT * FROM json_test WHERE data @> '{"a":1}';
SELECT * FROM json_test WHERE data <@ '{"a":1}';

Key and element existence perform reliably: perhaps surprisingly, the one query will match elements in an array, as well as keys in an object.

SELECT * FROM json_test WHERE data ? 'a';
 id |                 data                 
----+--------------------------------------
  2 | {"a": 1}
  3 | {"a": 2, "b": ["c", "d"]}
  4 | {"a": 1, "b": {"c": "d", "e": true}}
  7 | [1, 2, "a"]
(4 rows)
SELECT * FROM json_test WHERE data ?| array['a', 'b'];
 id |                 data                 
----+--------------------------------------
  2 | {"a": 1}
  3 | {"a": 2, "b": ["c", "d"]}
  4 | {"a": 1, "b": {"c": "d", "e": true}}
  5 | {"b": 2}
  7 | [1, 2, "a"]
(5 rows)
SELECT * FROM json_test WHERE data ?& array['a', 'b'];
 id |                 data                 
----+--------------------------------------
  3 | {"a": 2, "b": ["c", "d"]}
  4 | {"a": 1, "b": {"c": "d", "e": true}}
(2 rows)

But, as soon as we start doing key or element ‘get’, we hit a problem:

SELECT * FROM json_test WHERE data ->> 'a' > '1';

ERROR: cannot call jsonb_object_field_text 
       (jsonb ->> text operator) on an array

We can still use the key-path traversal, though, unless we have scalar values:

SELECT * FROM json_test WHERE data #> '{b,c}' = '"d"';
ERROR:  cannot call extract path from a scalar
SELECT * FROM json_test WHERE data #> '{b,c}' = '"d"' AND id < 8;
 id |                 data                 
----+--------------------------------------
  4 | {"a": 1, "b": {"c": "d", "e": true}}
(1 row)

Note the syntax for a key path: it only allows for strings (which json keys must be), or integers (which array indices are).

This seems like a pretty severe limitation. I’m not sure how things like MongoDB handle this, but in hindsight, if you are storing both array-based and object-based json data in the one column, you are probably going to be in a world of hurt anyway.

…or, maybe, do cross the streams

All is not lost, however: it’s possible to get just the object-based rows:

SELECT * FROM json_test WHERE data @> '{}';
 id |                 data                 
----+--------------------------------------
  1 | {}
  2 | {"a": 1}
  3 | {"a": 2, "b": ["c", "d"]}
  4 | {"a": 1, "b": {"c": "d", "e": true}}
  5 | {"b": 2}
(5 rows)

You could then combine this with a previously-forbidden query:

SELECT * FROM json_test WHERE data @> '{}' AND data ->> 'a' > '1';
 id |           data            
----+---------------------------
  3 | {"a": 2, "b": ["c", "d"]}
(1 row)

Indeed, postgres is so awesome you don’t even need to ensure the data @> '{}' bit comes first!

But what about limiting to just array-typed data? Turns out we can use the same trick:

SELECT * FROM json_test WHERE data @> '[]';
 id |    data     
----+-------------
  6 | []
  7 | [1, 2, "a"]
(2 rows)

And, again, combine with the other required operator:

SELECT * FROM json_test WHERE data @> '[]' AND data ->> 1 = '2';
 id |    data     
----+-------------
  7 | [1, 2, "a"]
(1 row)

Worth noting is that the @> operator is only available on jsonb columns, so you won’t be able to query mixed-form data in a regular json column.

Wow! What’s next?

This foray into querying jsonb data in postgres was an aside to a project I’m working on to bring json(b) querying to Django. With Django 1.7’s new custom lookup features, it will be possible to write things like:

# Exact
MyModel.objects.filter(data={'a': 1})
MyModel.objects.exclude(data={})
# Key/element existence
MyModel.objects.filter(data__has='a')
MyModel.objects.filter(data__has_any=['a', 'b'])
MyModel.objects.filter(data__has_all=['a', 'b'])
# Sub/superset of key/value pair testing
MyModel.objects.filter(data__contains={'a': 1})
MyModel.objects.filter(data__in={'a': 1, 'b': 2})
# Get element/field (compare with json)
MyModel.objects.filter(data__get=(2, {'a': 1}))
# Get element/field (compare with scalar, including gt/lt comparisons)
MyModel.objects.filter(data__get=(2, 'a'))
MyModel.objects.filter(data__get__gt=('a', 1))
# key path traversal, compare with json or scalar.
MyModel.objects.filter(data__get=('{a,2}', {'foo': 'bar'}))
MyModel.objects.filter(data__get=('{a,2}', 2))
MyModel.objects.filter(data__get__lte=('{a,2}', 2))

I’m still not sure about the lookup names, especially the last set. The name “get” seems a little generic, and maybe we could use different lookup names for the input type, although only integer and string values are permitted.

Python, postgres and jsonb

I maintain a json field for Django, and was working today on getting the new (1.7+) lookup code to play nicely: in order for this to happen, you basically need to be running Postgres 9.4, and using a jsonb column. Otherwise, querying kind-of sucks.

After a significant amount of work, in which I drifted backwards and forwards between having the old and new code working, I had an idea.

Some time ago I discovered that psycopg2 has really nice support for some custom types. Indeed, it’s super-easy to get it to handle UUID and json data. But it seems that it hasn’t yet been made to work with jsonb.

However, the registration process for handling the data makes it possible to do so, and trivial, since the serialised form will be essentially identical for both:

psycopg2.extras.register_json(
    conn_or_curs=None,
    globally=False,
    loads=None,
    oid=None,
    array_oid=None
)

Note the last two arguments. We can trick psycopg2 into using jsonb instead of json.

In your database, execute:

SELECT oid, typarray FROM pg_type WHERE typname = 'jsonb';
-- oid      --> 3802
-- typarray --> 3807

(Syntax highlighting fail means I can’t include the actual results).

Your values may vary (I’m really not sure), but you’ll simply need to call register_json with those two values:

register_json(oid=3802, array_oid=3807)

Now, assuming you have a jsonb column, when you fetch data from it, it will already be turned into python objects.

Python 2.7.5 (default, Mar  9 2014, 22:15:05) 
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import psycopg2
>>> conn = psycopg2.connect("")
>>> cur = conn.cursor()
>>> cur.execute("SELECT * FROM jsonb_test;")
>>> data = cur.fetchone()
>>> data
(1, '{}')
>>> from psycopg2.extras import register_json
>>> register_json(oid=3802, array_oid=3807)
(<psycopg2._psycopg.type 'JSON' at 0x101713418>, <psycopg2._psycopg.type 'JSONARRAY' at 0x101721208>)
>>> cur.execute("SELECT * FROM jsonb_test;")
>>> data = cur.fetchone()
>>> data
(1, {})