Aggregating ranges in Python

2014-11-21 @ 12:35:53

python

here.

This is a bit of a follow-up to Aggregating Ranges in Postgres.

Since we don’t have a nice range type in Python, we will just use a tuple that contains a lower and upper bound. We will assume that this is a canonical form: the lower bound is inclusive, the upper bound is non-inclusive. We will also assume (for simplicity) that the 0-th item is always less than the 1-th item in the tuple.

Given a list of these range-tuples, we want to do the two things we saw in the previous post.

Aggregate them into the simplest possible array of range-tuples, using a union operation.
Determine the parts that are missing.

We will use in in a similar way:

>>> data =[(2,3),(3,4),(11,12),(5,12),(4,5)]
>>> aggregate_union(data)
(2, 12)
>>> missing_ranges(data)
[(None, 2), (12, None)]

Luckily, None sorts before any integer in python, so we will just be able to use the normal sort.

def aggregate_union(data):
    if not data:
      return None

    sorted_data = sorted(data)
    result = sorted_data[0]

    for lower, upper in sorted_data[1:]:
        # If we ever find result[1] is None, we know it covers any
        # other possible values, so we can stop at that point.
        if result[1] is None:
            break

        if lower > result[1]:
            return None  # Or raise an exception, if you want.

        result = (result[0], max(upper, result[1]))

    return result

The alternative, missing_ranges(data) takes cues from the SQL version too.

def missing_ranges(data):
    if not data:
      return (None, None)

    result = []
    # We do a little fancy stuff here: append an extra item that
    # mimics what we were able to use lead for, but in a different
    # way so we can use [i + 1] later.
    sorted_data = sorted(data) + [(None, None)]

    if sorted_data[0][0] is not None:
        # Note: the upper bound here is not quite correct!
        result.append((None, sorted_data[0][1]))

    for i, (lower, upper) in enumerate(sorted_data):
        # Grab the next item in our sorted list. Normally, we would check
        # to see if we are at the end, but we will rely on the break later
        # on: we can always stop processing when the upper bound of our
        # next item is `None`
        next = sorted_data[i + 1]

        if upper < next[0] or (next[0] is None and upper is not None):
            result.append((upper, next[0]))

        # Now, exit before we ever get to a bounds error on getting the next item.
        if next[1] is None:
          break

    return result

However, there is a problem here that is nicely solved by the Range types within postgres:

>>> missing_ranges(data)
[(None, 3), (12, None)]

We need to subtract from the first object’s upper bound. Which is easy with integer tuple-ranges (basically, any discrete range type), but not so much with continuous ranges.