Updated: acronymmer.py for ecto

I’ve improved acronymmer.py, my script for adding acronym tags to posts in ecto. It will, however, work with any file that is passed as an argument to the script. There is one issue with ecto: abbr tags are not recognised by the Rich Text parser, so for the time being I’ve set it to convert everything to acronym tags only. (Note to Adriaan: you should fix this!) Don’t run this script over the same text twice: it will re-tag the acronyms, resulting in messy (but still legal) code. Also, this needs a little fixing: the dictionary should yield its keys in order of size, rather than in arbitrary order.

#! /usr/bin/env python
'A script for ecto that adds abbr and acronym tags to the text'

import sys, re

acronyms = {'WYSIWYG': 'What You See Is What You Get',
            'DOM': 'Document Object Model',
            'XHTML': 'eXtensible HyperText Markup Language',
            'NSLU2': '[Linksys] Network Storage Link (USB) 2.0'
           }

# get input data - depends on implementation.  For ecto:
data = open(sys.argv[1]).read()

# replace only the first instance of each acronym/abbreviation
for each in acronyms:
    d = re.search(r'\b%s\b' % each, data)
    if d:
        data = data[:d.start()] + '<acronym title="' + \
               acronyms[each] + '">' + \
               each + '</acronym>' + data[d.end():]

# return data to ecto
open(sys.argv[1], 'w').write(data)
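The size-order issue could be addressed by iterating the keys longest-first, so that (for example) XHTML is tagged before a shorter key could match inside it. A sketch of that fix as a standalone function (tag_acronyms is an invented name, not part of the script above):

```python
import re

def tag_acronyms(data, acronyms):
    # Longest keys first, so a shorter key can't match inside a longer one.
    for each in sorted(acronyms, key=len, reverse=True):
        match = re.search(r'\b%s\b' % re.escape(each), data)
        if match:
            data = (data[:match.start()] + '<acronym title="' +
                    acronyms[each] + '">' + each + '</acronym>' +
                    data[match.end():])
    return data
```

Note the re.escape, which keeps keys with regex metacharacters (like NSLU2's expansion would have) from being misread as patterns.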

ecto: Auto abbr/acronym

There are a couple of scripts out there that automatically apply abbr and acronym tags to pages, but I wanted to be able to do the same in ecto. This is also the first plugin script I’ve written for ecto, and I wanted to do it in Python. Please note that this script is untested until I get onto my Mac and test the hell out of it. The script works, with the caveat listed in the TODO.

#! /usr/bin/env python
'A script for ecto that adds abbr and acronym tags to the text'

TODO = '''
Fix it so that acronyms without a space either side (for example,
ones that finish a sentence) work.

Look up a list of acronyms/abbreviations on the internet?
'''

acronyms = {'WYSIWYG': 'What You See Is What You Get',
            'DOM': 'Document Object Model'}
abbrs = {'XHTML': 'eXtensible HyperText Markup Language',
         'NSLU2': '[Linksys] Network Storage Link (USB) 2.0'}

# Add more values to your heart's content…

# get input data - depends on implementation.  For ecto:
import sys
data = open(sys.argv[1]).read()

# replace only the first instance of each acronym/abbreviation
# (str.replace returns a new string, so assign the result back;
#  keep the surrounding spaces in the replacement text)
for each in acronyms:
    data = data.replace(' '+each+' ', ' <acronym title="'+acronyms[each]+'">'+each+'</acronym> ', 1)
for each in abbrs:
    data = data.replace(' '+each+' ', ' <abbr title="'+abbrs[each]+'">'+each+'</abbr> ', 1)

# return data to ecto
open(sys.argv[1], 'w').write(data)
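One gotcha worth spelling out: str.replace returns a new string rather than modifying in place, so the result must be assigned back or it is silently lost. A minimal illustration:

```python
data = "The DOM is neat"
data.replace("DOM", "tree")            # result silently discarded
assert data == "The DOM is neat"       # unchanged!

data = data.replace("DOM", "tree", 1)  # assign the result back
assert data == "The tree is neat"
```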

Note: Comments turned off: too much Spam on this entry.

Find tracks not in iTunes library

I’ve written a python script that walks through a path tree, checking to see if each file in the tree is a track in the current user’s iTunes Music Library.xml file.

import os, re

startpath = "/Volumes/Media/Music/"
prefix = "file://localhost"
library = os.path.expanduser("~") + "/Music/iTunes/iTunes Music Library.xml"

def eachpath(arg, path, tracks):
    for track in tracks:
        if os.path.isfile(os.path.abspath(path) + '/' + track):
            trackpath = os.path.join(os.path.abspath(path), track)
            grepstr = prefix + trackpath.replace(" ", "%20")
            if grepstr not in data:
                arg.append(grepstr)

data = open(library).read()
missing = []
os.path.walk(startpath, eachpath, missing)

print missing

It’s not flawless: on my machine it eats up over 11 MB of memory and takes ages to run, but as a proof of concept it works okay. The memory use is mostly because it stores the whole iTunes library file in memory, which is 9 MB on my system already. The main loop is doing a string1 not in string2 test, which is probably not optimal, but it was easy to code, for now. I’m still waiting to see how long it takes to do my whole library, but I’m getting bored with waiting. Edit: to reduce the time taken, I used the following code in the final if clause in the function:

try:
    if not re.search(grepstr, data):
        arg.append(grepstr)
except re.error:
    if grepstr not in data:
        arg.append(grepstr)

The re one is much faster, but fails in some cases: the second one, while slower, is a fallback. There are also some other issues, at this stage I have not cared that much about escaped characters, which iTunes uses when storing the information. But, I came up with a quicker method than python’s os.path.walk(). Using the find command is much quicker:

find /Volumes/Media/Music -type f -not -name .aacgained -not -name '._*' -not -name .DS_Store

takes between 12 and 36 seconds for my 5700+ track library stored on my NSLU2. If I telnet into the NSLU2 and run the equivalent command:

find ~media/Music -type f -not -name .aacgained -not -name '._*' -not -name .DS_Store

it takes on average less than one second to complete. So, that’s more than an order of magnitude, even if the network traffic is low. Oh, and it compares very favourably with the python version, which takes at least one minute to run.
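Back on the Python side, my guess is that the cases where re.search fails are file names containing regex metacharacters (brackets, parentheses and the like); escaping the pattern with re.escape should remove the need for the fallback entirely. A sketch, with a made-up track name:

```python
import re

# A track name with regex metacharacters matches literally once escaped:
grepstr = "file://localhost/Music/Best%20Of%20[Disc%201].mp3"
data = "...<string>" + grepstr + "</string>..."

assert re.search(grepstr, data) is None              # [Disc%201] read as a character class
assert re.search(re.escape(grepstr), data) is not None
```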

Extracting Data from XML

Python does have tools for grepping XML files, but I’ve never been able to get them to work to my liking. I’ve generally just stripped out the data I need. And I will continue to do so, as it’s probably much faster than filtering through all of the crud I don’t need.

import os
import urllib

library = os.path.expanduser('~') + '/Music/iTunes/iTunes Music Library.xml'
data = open(library).readlines()

tracks = {}
this_track = 0
for line in data:
    if line.count('<key>Track ID'):
        this_track = line.split('integer>')[1][:-2]
    elif line.count('<key>Location</key>'):
        tracks[this_track] = urllib.url2pathname(line.split('string>')[1][16:-2]).replace('&#38;','&')

The above code searches through the library file and grabs info on each track: just the database ID, and the location (which is a URI, encoded to remove spaces and dodgy characters). The info is then put into a dictionary, where the key is the database ID and the value is the location. Note the replace() at the end of the last line - for some reason python’s urllib.url2pathname() function doesn’t replace & characters - I guess that’s because these aren’t really intended to be in a filename. Also, on my NSLU2 the extended characters are replaced by underscores, but I’m going to update to samba 3 (at the risk of mucking up the entire library…) to see if this fixes that issue. Anyway, after coding this, I had a bit of a think, and came up with the following method of doing the same (ensure it’s all on one line):

grep Location ~/Music/iTunes/iTunes\ Music\ Library.xml |
  awk 'sub("<key>Location</key><string>file://localhost","",$1)' |
  sed 'sx</string>xx'

The python version uses between 5 and 8 seconds of CPU time, the grep version around 1.5, but the latter does not associate the database IDs with the locations, which I need. It also looks to be much easier to do the changing of characters (%20, for instance, into a space) that I need so I can check whether files exist. Actually, using urllib.urlopen(), I can use the escaped/quoted version to see if the file exists, but it might be slow.
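For reference, the &#38; wrinkle is an XML numeric entity rather than URL encoding, which is why url2pathname leaves it alone. In current Python 3 the two layers can be peeled off separately: html.unescape for the entities, then urllib.parse.unquote for the %20-style escapes (the file name here is made up):

```python
from html import unescape
from urllib.parse import unquote

uri = 'file://localhost/Volumes/Media/Music/Simon%20&#38;%20Garfunkel.mp3'

# First decode XML entities (&#38; -> &), then URL escapes (%20 -> space).
path = unquote(unescape(uri)[len('file://localhost'):])
assert path == '/Volumes/Media/Music/Simon & Garfunkel.mp3'
```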

(kind of) Fix for XMLRPC bug.

I’ve been playing around with the version of the WP-mu source that’s used on Blogsome’s servers, trying to find the exact point where the bug is that escapes apostrophes and quotes.

Basically, the XMLRPC client contacts the server, and sends the data in. According to the console, the content of the post is actually in a field called description.

Searching through the XMLRPC file I find only five references to the word description. Two of these are in functions to do with posting. Both are basically the same, one is for blogger, the other metaweblog type connections:

1     $post_content = apply_filters('content_save_pre', $content_struct['description']);

A bit of research showed that apply_filters is a function that allows plugins and their ilk to access the data before it gets saved to the database. I’m fairly sure, though, that it is not a plugin doing this.

I also discovered that the update to XMLRPC.php was likely accompanied by a change to another file, one that calls stripslashes() (a PHP built-in). The XMLRPC update was, after all, a fix that removed the ability for XMLRPC calls to run unescaped code. So it makes sense that it escapes stuff.

In the short term, I discovered ecto has the ability to automatically run a script as you post: in the New Post window, make sure Options are showing, and choose the Formatting tab. (Incidentally, if you are only using double-quotes, it seems the Smarten Quotes will help, but it may mess with code).

I use a script that is like this to fix everything up:

import sys
data = open(sys.argv[1]).read()

# convert smart quotes back to straight quotes
data = data.replace('”', '"')
data = data.replace('“', '"')

open(sys.argv[1], 'w').write(data)

This on its own is not enough - I seemed to have to go into the HTML editing mode before it would work. I think ecto does its own conversion of certain HTML entities to real characters.

This post is a test post to see how it all goes with <pre> tags and the like.

iTunes Shared Library Checker

I now have the code to check the library XML file and see if there are missing tracks (ie, the files are not where they are expected to be). This code is quite slow, but simple to follow. (It took around 2 min to run). I also have the code that gets a list of the files in the library directory. It must be simple to combine this information, and work out which ones are:

  1. Files that are ‘new’: they only exist on disk, not in the library. Chances are they were added by another user.
  2. Files that have become detached: there is a file and a library location, but they don’t quite match up. This is probably because the users have ‘Keep Library Arranged’ turned on, and one of them has made a change to a track name, artist or album; or made a change to the compilation flag.

The trick will be having the list of files, and removing items from the list that have been located in the library. This will leave a list of files that just need to be sorted into categories 1 and 2 above.
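The combining step described above is essentially set arithmetic; a sketch with made-up paths:

```python
# Paths from the library XML, and paths found on disk (hypothetical data):
library_paths = {'/Music/a.mp3', '/Music/b.mp3'}
disk_paths = {'/Music/b.mp3', '/Music/c.mp3'}

missing = library_paths - disk_paths   # in the library, not on disk
surplus = disk_paths - library_paths   # on disk, with no library entry

assert missing == {'/Music/a.mp3'}
assert surplus == {'/Music/c.mp3'}
```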

import os
import urllib

library = os.path.expanduser('~') + '/Music/iTunes/iTunes Music Library.xml'
startpath = '/Volumes/Media/Music'

def greppy(library):
    data = open(library).readlines()
    tracks = {}
    this_track = 0
    for line in data:
        if line.count('<key>Track ID'):
            this_track = line.split('integer>')[1][:-2]
        elif line.count('<key>Location</key>'):
            tracks[this_track] = urllib.url2pathname(
                     line.split('string>')[1][16:-2]).replace(
                     '&#38;', '&')
    return tracks

findstr = "find " + startpath + " -type f -not -name .aacgained -not -name '._*' -not -name .DS_Store | sort"
treedata = os.popen(findstr).readlines()

data = greppy(library)

missing = {}
surplus = treedata[:]

for i in data:
    try:
        surplus.remove(urllib.urlopen(data[i]).url[7:] + '\n')
    except IOError:
        missing[i] = data[i]
    except ValueError:
        pass

This leaves two data structures of interest: missing, a dictionary with the ‘missing tracks’ from iTunes, and surplus, a list with files that do not have an associated iTunes library entry. Note: I’ve turned off comments, as this post seems to get a lot of comment spam.

Multi-table Inheritance and the Django Admin

Django’s admin interface is a great way to be able to interact with your models without having to write any view code, and, within limits, it’s useful in production too. However, it can quickly get very crowded when you register lots of models.

Consider the situation where you are using Django’s multi-table inheritance:

from django.db import models

from model_utils.managers import InheritanceManager

class Sheep(models.Model):
    sheep_id = models.AutoField(primary_key=True)
    tag_id = models.CharField(max_length=32)
    date_of_birth = models.DateField()
    sire = models.ForeignKey('sheep.Ram', blank=True, null=True, related_name='progeny')
    dam = models.ForeignKey('sheep.Ewe', blank=True, null=True, related_name='progeny')

    objects = InheritanceManager()

    class Meta:
        verbose_name_plural = 'sheep'

    def __str__(self):
        return '{}: {}'.format(self._meta.verbose_name, self.tag_id)

class Ram(Sheep):
    sheep = models.OneToOneField(Sheep, parent_link=True)

    class Meta:
        verbose_name = 'ram'
        verbose_name_plural = 'rams'

class Ewe(Sheep):
    sheep = models.OneToOneField(Sheep, parent_link=True)

    class Meta:
        verbose_name = 'ewe'
        verbose_name_plural = 'ewes'

Ignore the fact there is no specialisation on those child models: in practice you’d normally have some.

Also note that I’ve manually included the primary key, and the parent link fields. This has been done so that the actual columns in the database match, and in this case will all be sheep_id. This will make writing joins slightly simpler, and avoids the (not specific to Django) ORM anti-pattern of “always have a column named id”.

We can use the models like this, but it might be nice to have all sheep in the one admin changelist, and just allow filtering by subclass model.

First, we’ll put some extra stuff onto the parent model, to make obtaining the subclasses simpler. Some of these will use a new decorator, which creates a class version of the @property decorator.

class classproperty(property):
    def __get__(self, cls, owner):
        return self.fget.__get__(None, owner)()

class Sheep(models.Model):
    # Fields, etc. defined as above.

    @classproperty
    @classmethod
    def SUBCLASS_OBJECT_CHOICES(cls):
        "All known subclasses, keyed by a unique name per class."
        return {
            rel.name: rel.related_model
            for rel in cls._meta.related_objects
            if rel.parent_link
        }

    @classproperty
    @classmethod
    def SUBCLASS_CHOICES(cls):
        "Available subclass choices, with nice names."
        return [
            (name, model._meta.verbose_name)
            for name, model in cls.SUBCLASS_OBJECT_CHOICES.items()
        ]

    @classmethod
    def SUBCLASS(cls, name):
        "Given a subclass name, return the subclass."
        return cls.SUBCLASS_OBJECT_CHOICES.get(name, cls)

Note that we don’t need to enumerate the subclasses: adding a new subclass later in development will automatically add it to these properties, even though in this case it would be unlikely to happen.
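As a quick sanity check of the classproperty decorator itself, outside of Django (Demo and NAME are invented names here; note that the wrapped function needs to be a classmethod for the binding in __get__ to work):

```python
class classproperty(property):
    def __get__(self, cls, owner):
        return self.fget.__get__(None, owner)()

class Demo:
    @classproperty
    @classmethod
    def NAME(cls):
        return cls.__name__.lower()

# Accessed on the class, with no parentheses needed:
assert Demo.NAME == 'demo'
```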

From these, we can write some nice neat stuff to enable using these in the admin.

from django import forms
from django.conf.urls import url
from django.contrib import admin
from django.utils.translation import ugettext as _

from .models import Sheep

class SubclassFilter(admin.SimpleListFilter):
    title = _('gender')
    parameter_name = 'gender'

    def lookups(self, request, model_admin):
        return Sheep.SUBCLASS_CHOICES

    def queryset(self, request, queryset):
        if self.value():
            return queryset.exclude(**{self.value(): None})
        return queryset

class SheepAdmin(admin.ModelAdmin):
    list_display = ['__str__', 'date_of_birth', 'gender']  # for example
    list_filter = [SubclassFilter]

    def get_queryset(self, request):
        return super(SheepAdmin, self).get_queryset(request).select_subclasses()

    def gender(self, obj):
        return obj._meta.verbose_name

    def get_form(self, request, obj=None, **kwargs):
        if obj is None:
            Model = Sheep.SUBCLASS(request.GET.get('gender'))
        else:
            Model = obj.__class__

        # When we change the selected gender in the create form, we want to reload the page.
        RELOAD_PAGE = "window.location.search='?gender=' + this.value"
        # We should also grab all existing field values, and pass them as query string values.

        class ModelForm(forms.ModelForm):
            if not obj:
                gender = forms.ChoiceField(
                    choices=[('', _('Please select...'))] + Sheep.SUBCLASS_CHOICES,
                    widget=forms.Select(attrs={'onchange': RELOAD_PAGE}),
                )

            class Meta:
                model = Model
                exclude = ()

        return ModelForm

    def get_fields(self, request, obj=None):
        # We want gender to be the first field.
        fields = super(SheepAdmin, self).get_fields(request, obj)

        if 'gender' in fields:
            fields = ['gender'] + [f for f in fields if f != 'gender']

        return fields

    def get_urls(self):
        # We want to install named urls that match the subclass ones, but bounce to the relevant
        # superclass ones (since they should be able to handle rendering the correct form).
        urls = super(SheepAdmin, self).get_urls()
        existing = '{}_{}_'.format(self.model._meta.app_label, self.model._meta.model_name)
        subclass_urls = []
        for name, model in Sheep.SUBCLASS_OBJECT_CHOICES.items():
            opts = model._meta
            replace = '{}_{}_'.format(opts.app_label, opts.model_name)
            subclass_urls += [
                url(pattern.regex.pattern, pattern.callback, name=pattern.name.replace(existing, replace))
                for pattern in urls if pattern.name and pattern.name.startswith(existing)
            ]

        return urls + subclass_urls

Wow. There’s quite a lot going on there, but the summary is:

  • We create a custom filter that filters according to subclass.
  • The .select_subclasses() means that objects are downcast to their subclass when fetched.
  • There is a custom form, that, when in create mode, has a selector for the desired subclass.
  • When the subclass is changed (only on the create form), the page is reloaded. This is required in a situation where there are different fields on each of the subclass models.
  • We register the subclass admin url paths, but use the superclass admin views.

I’ve had ideas about this for some time, and have just started using something like this in development: in my situation, there will be an arbitrary number of subclasses, all of which will have several new fields. The code in this page is extracted (and changed) from those ideas, so may not be completely correct. Corrections welcome.

Using other Python versions with Codeship.

Codeship is pretty cool, other than their requirement to log in to view even public builds. They support Python to some extent, even going as far as creating and activating a virtualenv for your test environment.

However, I like to use tox to do matrix testing against packages, and try to cover as many cases as possible. For instance, for django-boardinghouse, I currently test against:

  • Python 2.7
  • Python 3.3
  • Python 3.4
  • Python 3.5
  • pypy
  • pypy3

…and Django 1.7 through 1.9. In most cases, each version of python should be tested with each version of django. In practice, there are some exceptions.
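The matrix itself is driven by tox; as an illustration only (the factor names and dependency pins here are hypothetical, not the project’s actual config), a tox.ini for this kind of matrix might look like:

```ini
[tox]
envlist = {py27,py33,py34,py35,pypy,pypy3}-django{17,18,19}

[testenv]
deps =
    django17: Django>=1.7,<1.8
    django18: Django>=1.8,<1.9
    django19: Django>=1.9,<1.10
commands = python -m pytest
```

The factor-conditional deps lines are what let one [testenv] section serve the whole matrix.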

However, Codeship only have Python 2.7.6 and 3.4.0 installed.

You can run arbitrary code as part of your test/setup, but you can’t install stuff using sudo. Instead, I wrote a script that can be called from within the test setup that installs other pythons:

# We already have some versions of python, but want some more...
cd ~/src

mkdir -p pypy
cd pypy
wget https://bitbucket.org/squeaky/portable-pypy/downloads/pypy-5.0.1-linux_x86_64-portable.tar.bz2
tar --strip-components 1 -xvf pypy-5.0.1-linux_x86_64-portable.tar.bz2
cd ..

mkdir -p pypy3
cd pypy3
wget https://bitbucket.org/squeaky/portable-pypy/downloads/pypy3-2.4-linux_x86_64-portable.tar.bz2
tar --strip-components 1 -xvf pypy3-2.4-linux_x86_64-portable.tar.bz2
cd ..

mkdir -p ~/.local
wget https://www.python.org/ftp/python/3.5.1/Python-3.5.1.tar.xz
tar xvf Python-3.5.1.tar.xz
cd Python-3.5.1
./configure --prefix=/home/$USER/.local/
make install

# You actually need to put this line in the tests section. Not sure of a better solution.
export PATH=$PATH:~/src/pypy3/bin:~/src/pypy/bin:~/.local/bin/

I have this as a reusable snippet on BitBucket: codeship helper scripts; however, as mentioned, you need to grab the export PATH=... section and stick that in the tests section. Notably, you also get a different URL for the raw version of each revision, which is actually really good, because it means someone cannot change the code between you inspecting it and executing it.

In my case, I have a line in the test setup:

curl https://bitbucket.org/\!api/2.0/snippets/schinckel/oKXKy/c7cc02bcd96d4a8f444cd997d5c3bc0bb92106d6/files/install-python.sh | sh

Also of note: pypy and pypy3 have pre-built portable versions, which are much faster to set up than building from source; there doesn’t seem to be a non-rpm equivalent for Python 3.5, though.

Tree data as a nested list

One of the nice things about Adjacency Lists as a method of storing tree structures is that there is not much redundancy: you only store a reference to the parent, and that’s it.

It does mean that getting that data in a nested object is a bit complicated. I’ve written before about getting data out of a database: I’ll revisit that again I’m sure, but for now, I’m going to deal with data that has the following shape: that is, has been built up into a Materialized Path:

[
  {"node": 1, "ancestors": [], "label": "Australia"},
  {"node": 2, "ancestors": [1], "label": "South Australia"},
  {"node": 3, "ancestors": [1], "label": "Victoria"},
  {"node": 4, "ancestors": [1, 2], "label": "South-East"},
  {"node": 5, "ancestors": [1, 3], "label": "Western Districts"},
  {"node": 6, "ancestors": [], "label": "New Zealand"},
  {"node": 7, "ancestors": [1, 2], "label": "Barossa Valley"},
  {"node": 8, "ancestors": [1, 2], "label": "Riverland"}
]

From here, we want to build up something that looks like:

  • Australia
    • South Australia
      • Barossa Valley
      • Riverland
      • South-East
    • Victoria
      • Western Districts
  • New Zealand

Or, a nested python data structure:

[('Australia', [
    ('South Australia', [
      ('Barossa Valley', []),
      ('Riverland', []),
      ('South-East', []),
    ]),
    ('Victoria', [
      ('Western Districts', []),
    ]),
  ]),
 ('New Zealand', []),
]

You’ll see that each node is a 2-tuple, and each set of siblings is a list. Even a node with no children still gets an empty list.

We can build up this data structure in two steps, based on the fact that a dict’s key-value pairs match our 2-tuples. That is, we will start by creating:

{
  1: {
    2: {
      4: {},
      7: {},
      8: {},
    },
    3: {
      5: {},
    },
  },
  6: {},
}

You might be reaching for python’s defaultdict class at this point, but there is a slightly nicer way:

class Tree(dict):
    def __missing__(self, key):
        value = self[key] = type(self)()
        return value

(Note: This class, and the seed of the idea, came from this answer on StackOverflow).
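To see what __missing__ buys us: merely looking up a key creates it (and returns a new Tree), so intermediate nodes spring into existence on access:

```python
class Tree(dict):
    def __missing__(self, key):
        value = self[key] = type(self)()
        return value

t = Tree()
t['a']['b']   # neither key existed; both are created by the lookup itself

assert t == {'a': {'b': {}}}
assert isinstance(t['a'], Tree)   # subtrees are Trees too
```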

We can also create a recursive method on this class that creates a node and all of its ancestors:

    def insert(self, key, ancestors):
        if ancestors:
            self[ancestors[0]].insert(key, ancestors[1:])
        else:
            self[key]  # touch the node so leaf nodes are created too
>>> tree = Tree()
>>> for node in data:
...     tree.insert(node['node'], node['ancestors'])
>>> print tree
{1: {2: {8: {}, 4: {}, 7: {}}, 3: {5: {}}}, 6: {}}

Looking good.

Let’s make another method that allows us to actually insert the labels (and apply a sort, if we want):

    def label(self, label_dict, sort_key=lambda x: x[0]):
        return sorted([
            (label_dict.get(key), value.label(label_dict, sort_key))
            for key, value in self.items()
        ], key=sort_key)

We also need to build up the simple key-value store to pass as label_dict, but that’s pretty easy.
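Building label_dict really is just a comprehension over the source rows (the two rows here are a subset of the data above):

```python
data = [
    {"node": 1, "ancestors": [], "label": "Australia"},
    {"node": 6, "ancestors": [], "label": "New Zealand"},
]
labels = {row['node']: row['label'] for row in data}
assert labels == {1: 'Australia', 6: 'New Zealand'}
```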

Let’s look at the full code: first the complete class:

class Tree(dict):
    """Simple Tree data structure

    Stores data in the form:

    {
        "a": {
            "b": {},
            "c": {},
        },
        "d": {
            "e": {},
        },
    }

    And can be nested to any depth.
    """

    def __missing__(self, key):
        value = self[key] = type(self)()
        return value

    def insert(self, node, ancestors):
        """Insert the supplied node, creating all ancestors as required.

        This expects a list (possibly empty) containing the ancestors,
        and a value for the node.
        """
        if not ancestors:
            self[node]  # touch the node so leaf nodes are created too
        else:
            self[ancestors[0]].insert(node, ancestors[1:])

    def label(self, labels, sort_key=lambda x: x[0]):
        """Return a nested 2-tuple with just the supplied labels.

        Realistically, the labels could be any type of object.
        """
        return sorted([
            (
                labels.get(key),
                value.label(labels, sort_key)
            ) for key, value in self.items()
        ], key=sort_key)

Now, using it:

>>> tree = Tree()
>>> labels = {}
>>> for node in data:
...     tree.insert(node['node'], node['ancestors'])
...     labels[node['node']] = node['label']
>>> from pprint import pprint
>>> pprint(tree.label(labels))
[('Australia',
  [('South Australia',
    [('Barossa Valley', []), ('Riverland', []), ('South-East', [])]),
   ('Victoria', [('Western Districts', [])])]),
 ('New Zealand', [])]

Awesome. Now use your template rendering of choice to turn this into a nicely formatted list.

(Directly) Testing Django Formsets

Django Forms are excellent: they offer a really nice API for validating user input. You can quite easily pass a dict of data instead of a QueryDict, which is what the request handling mechanism provides. This makes it trivial to write tests that exercise a given Form’s validation directly. For instance:

def test_my_form(self):
    form = MyForm({
        'foo': 'bar',
        'baz': 'qux',
    })
    self.assertTrue('foo' in form.errors)

Formsets are also really nice: they expose a neat way to update a group of homogeneous objects. It’s possible to pass a list of dicts to the formset for the initial argument but, alas, you may not do the same when passing data. Instead, it needs to be structured as the QueryDict would be:

def test_my_formset(self):
    formset = MyFormSet({
        'formset-INITIAL_FORMS': '0',
        'formset-TOTAL_FORMS': '2',
        'formset-0-foo': 'bar1',
        'formset-0-baz': 'qux1',
        'formset-1-foo': 'spam',
        'formset-1-baz': 'eggs',
    })

This is fine if you only have a couple of forms in your formset, but it’s tiresome to have to write out all of the prefixes, and it is far noisier.

Here’s a nice little helper, that takes a FormSet class, and a list (of dicts), and instantiates the formset with the data coerced into the correct format:

def instantiate_formset(formset_class, data, instance=None, initial=None):
    prefix = formset_class().prefix
    formset_data = {}
    for i, form_data in enumerate(data):
        for name, value in form_data.items():
            if isinstance(value, list):
                for j, inner in enumerate(value):
                    formset_data['{}-{}-{}_{}'.format(prefix, i, name, j)] = inner
            else:
                formset_data['{}-{}-{}'.format(prefix, i, name)] = value
    formset_data['{}-TOTAL_FORMS'.format(prefix)] = len(data)
    formset_data['{}-INITIAL_FORMS'.format(prefix)] = 0

    if instance:
        return formset_class(formset_data, instance=instance, initial=initial)
    return formset_class(formset_data, initial=initial)

This handles a formset or a model formset. Much easier to use:

def test_my_formset(self):
    formset = instantiate_formset(MyFormSet, [
        {
            'foo': 'bar1',
            'baz': 'qux1',
        },
        {
            'foo': 'spam',
            'baz': 'eggs',
        },
    ])
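The data-coercion part of the helper can be exercised without Django at all; here it is pulled out as a standalone function (flatten_formset_data is an invented name) using a hypothetical prefix of 'form':

```python
def flatten_formset_data(prefix, data):
    # Turn a list of dicts into the flat, prefixed mapping a FormSet expects.
    formset_data = {}
    for i, form_data in enumerate(data):
        for name, value in form_data.items():
            if isinstance(value, list):
                # Multi-value fields get one key per value.
                for j, inner in enumerate(value):
                    formset_data['{}-{}-{}_{}'.format(prefix, i, name, j)] = inner
            else:
                formset_data['{}-{}-{}'.format(prefix, i, name)] = value
    formset_data['{}-TOTAL_FORMS'.format(prefix)] = len(data)
    formset_data['{}-INITIAL_FORMS'.format(prefix)] = 0
    return formset_data

assert flatten_formset_data('form', [{'foo': 'bar'}]) == {
    'form-0-foo': 'bar',
    'form-TOTAL_FORMS': 1,
    'form-INITIAL_FORMS': 0,
}
```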