Extracting Data from XML

Python does have tools for grepping XML files, but I’ve never been able to get them to work to my liking. I’ve generally just stripped out the data I need. And I will continue to do so, as it’s probably much faster than filtering through all of the crud I don’t need.

 1     library = os.path.expanduser('~')+'/Music/iTunes/iTunes Music Library.xml'
 2     data = open(library).readlines()
 3         
 4     tracks = {}
 5     this_track = 0
 6     for line in data:
 7         if line.count('<key>Track ID'):
 8             this_track = line.split('integer>')[1][:-2]
 9         elif line.count('<key>Location</key>'):
10             tracks[this_track] = urllib.url2pathname(line.split('string>')[1][16:-2]).replace('&#38;','&')

The above code will search through the library file, and grab info on each track: just the database ID, and the location (which is a URI, encoded to remove spaces and dodgy characters. The info is then put into a dictionary, where the key is the database ID, and the value is the location. Note that there is a replace() at the end of the last line - for some reason python’s urllib.url2pathname() function doesn’t replace & characters - I guess that’s because these aren’t really intended to be in a filename. Also, on my NSLU2 the extended characters are replaced by underscores, but I’m going to update to samba 3 (at the risk of mucking up the entire library…) to see if this fixes that issue. Anyway, after coding this, I had a bit of a think, and came up with the following method of doing the same (ensure it’s all on one line):

1     grep Location ~/Music/iTunes/iTunes\ Music\ Library.xml |
2       awk 'sub("<key>Location</key%gt;<string>file://localhost","",$1)' |
3       sed 'sx</string>xx'

The python version uses between 5-8 seconds of CPU time, the grep version around 1.5, but does not associate the database ID’s with the locations, which I need. It also looks to be much easier to do the changing of characters (%20, for instance, into a space) that I need to do so I can check to see if files exist. Actually, using urllib.urlopen(), I can use the escaped/quoted version to see if the file exists, but it might be slow.

Find Missing Tracks

Continuing on from my last programming related post, I’ve been giving some thought to the opposite problem: when tracks are in the iTunes Library, but the file they refer to is not there. iTunes marks these with an exclaimation mark in front of the track, but only if it realises they are in fact missing. Often, just clicking on a track does not mean iTunes knows that track is missing. A Get Info on such a track prompts the user for a choice between removing the track from the library, or locating it. It might be possible to use the following AppleScript paradigm to get a list of the tracks that are missing:

1 tell application "iTunes"
2     set theTracks to every track in playlist 1
3     set missing to {}
4     repeat with theTrack in theracks
5         if location of theTrack is [unknown] then
6             set missing to missing + theTrack
7         end if
8     end repeat
9 end tell

Of course, that would be dead slow, as it relies on a couple of AppleScript events for each track. (Also, it may not work, but some sort of try - else - end try clause might work.) A better solution might be to sort through the iTunes Music Library.xml file.

 1     import os
 2         
 3     data = open("iTunes Music Library.xml").readlines()
 4     missing = []
 5         
 6     for line in data:
 7         if "<location>" in line:
 8             #extract location
 9             location = line.split("/Volumes")[1].split("</location>")[0]
10             #check if location is not a file
11             if not os.path.isfile(location):
12                 missing.append(location)

Assuming that the mount point is the same as the last time the XML file was written (and iTunes seems to be pretty clever about working out things like /Volumes/Media vs. /Volumes/Media-1, then your list missing should contain a list of files that are in the library, but not where iTunes expects them to be. It might be possible to then match these two lists (from this script), and see which ones are likely to be ‘modified tracks’, where one user has changed the track name, artist or album. It may even be worthwhile having a list of all track locations, and seeing if there are any duplicates, that will solve the problem of duplicates int he library that are the same file. I’d like to have python parse the XML file, and create data structures reflecting the XML data - but skip everything but the database ID and filename, as these are the things that we really need - but I haven’t had much luck with this type of process in the past. I jsut had a thought on how to do it, though.

Sing-along Music

Note to self: Classical Music is good public transport music. Why? Because you are less likely to sing along. People look strangely at you when you sing on the train, for some reason.