Extracting Data from XML

Python does have tools for grepping XML files, but I’ve never been able to get them to work to my liking. I’ve generally just stripped out the data I need. And I will continue to do so, as it’s probably much faster than filtering through all of the crud I don’t need.

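    import os
    import urllib

    # Path to the iTunes library file in the user's home directory.
    library = os.path.expanduser('~') + '/Music/iTunes/iTunes Music Library.xml'
    data = open(library).readlines()

    tracks = {}      # database ID -> file location
    this_track = 0
    for line in data:
        if line.count('<key>Track ID'):
            # Take the text between <integer> and </integer>.
            this_track = line.split('integer>')[1][:-2]
        elif line.count('<key>Location</key>'):
            # Drop 'file://localhost' (16 chars) and the trailing '</', then unescape.
            location = line.split('string>')[1][16:-2]
            tracks[this_track] = urllib.url2pathname(location).replace('&#38;', '&')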

The above code searches through the library file and grabs two pieces of info for each track: the database ID and the location (which is a URI, encoded to remove spaces and dodgy characters). The info is then put into a dictionary, where the key is the database ID and the value is the location.

Note that there is a replace() at the end of the last line. Python’s urllib.url2pathname() function only decodes URL escapes such as %20, whereas &#38; is an XML character reference for &, so it has to be replaced separately. Also, on my NSLU2 the extended characters are replaced by underscores, but I’m going to update to Samba 3 (at the risk of mucking up the entire library…) to see if this fixes that issue.

Anyway, after coding this, I had a bit of a think, and came up with the following method of doing the same (ensure it’s all on one line):

    grep Location ~/Music/iTunes/iTunes\ Music\ Library.xml |
      awk 'sub("<key>Location</key><string>file://localhost","",$1)' |
      sed 'sx</string>xx'

The Python version uses between 5 and 8 seconds of CPU time; the grep version uses around 1.5, but it does not associate the database IDs with the locations, which I need. Python also looks to make it much easier to change the characters (%20 into a space, for instance), which I need to do so I can check whether the files exist. Actually, using urllib.urlopen(), I could use the escaped/quoted version to see if the file exists, but it might be slow.
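
For what it’s worth, the decoding and the file check described above could be sketched roughly like this. It is only a sketch, and it makes some assumptions: it targets Python 3 (so it uses urllib.parse.unquote and html.unescape rather than urllib.url2pathname), it assumes each key and its value sit on a single line as in the snippets above, and the function names parse_tracks, location_to_path and missing_tracks are made up for illustration.

```python
import html
import os
import re
from urllib.parse import unquote

# Patterns for the two library lines of interest (line format assumed from the post).
TRACK_ID = re.compile(r'<key>Track ID</key><integer>(\d+)</integer>')
LOCATION = re.compile(r'<key>Location</key><string>(.*?)</string>')
PREFIX = 'file://localhost'


def location_to_path(location):
    """Turn an escaped Location URI into a plain filesystem path."""
    # Undo XML escaping first (&#38; -> &), then URL escapes (%20 -> space).
    location = html.unescape(location)
    if location.startswith(PREFIX):
        location = location[len(PREFIX):]
    return unquote(location)


def parse_tracks(lines):
    """Map each database ID (as a string) to its decoded file path."""
    tracks = {}
    this_track = None
    for line in lines:
        id_match = TRACK_ID.search(line)
        if id_match:
            this_track = id_match.group(1)
            continue
        loc_match = LOCATION.search(line)
        if loc_match and this_track is not None:
            tracks[this_track] = location_to_path(loc_match.group(1))
    return tracks


def missing_tracks(tracks):
    """Return the IDs whose files do not exist on disk."""
    return [track_id for track_id, path in tracks.items()
            if not os.path.exists(path)]
```

To use it, something like `tracks = parse_tracks(open(library))` followed by `missing_tracks(tracks)` should do, with `library` being the path to the XML file as before. Checking with os.path.exists() on the decoded path avoids opening each file, which is likely cheaper than going through urlopen(), though I haven’t timed it.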