Aggregating Ranges in Postgres
-
Comments:
- here.
Postgres’ range types are very cool. You can use them to store, as you may guess, a value that is a range. There are several included range types, and it’s possible to create your own range type. For now, we’ll just look at using a simple int4range
: although everything in principle could be applied to any range type.
Firstly, a quick discussion of how the range types work.
There is a literal form, and then a functional form:
SELECT '(2,17]'::int4range;
int4range
-----------
[3,18)
(1 row)
SELECT int4range(2, 17, '(]');
int4range
-----------
[3,18)
(1 row)
The first thing you’ll notice is that postgres converts them to canonical form. We can provide the bound-types: inclusive or exclusive upper and lower values. This is the same notation you might see if you have done some mathematics.
You can also omit the upper and/or lower bounds to get a range object that goes to ±Infinity. Finally, you can also have empty ranges.
There are several operations that can be performed on a range. The most interesting one from the perspective of this article is the UNION
operation.
SELECT '[3,17)'::int4range + '[10,20)';
?column?
----------
[3,20)
(1 row)
You’ll get an error if your union does not result in a contiguous range. There is no way to store a discontinuous range in postgres (but you could store them in an array of int4range[]
, for instance).
What about if we want to get the aggregate range from a set of ranges?
SELECT value FROM range_test ;
value
---------
[2,3)
[3,4)
[11,12)
[5,12)
[4,5)
(4 rows)
What we’d like to be able to do is something like:
SELECT aggregate_union(value) FROM range_test;
aggregate_union
-----------------
'[2,12)'
Obviously, this would be subject to the same limitation on union: we would get an error (or a NULL) if the aggregate range was not continuous.
Perhaps even more useful might be a function that would show us what ranges are missing from a set of ranges. With the same data above, we might see:
SELECT missing_ranges(value) FROM range_test ; missing_ranges
------------------
{"(,2)","[12,)"}
(1 row)
Indeed, the latter is probably more useful, and, as it turns out, is simpler to perform.
We’ll start with the aggregate_union
though, because it’s fun. It’s also the way I worked out the nicer solution for the last problem.
We need to create a postgres AGGREGATE to, well, aggregate the data from columns in a table. A naïve solution might be:
CREATE FUNCTION _aggregate_union(int4range, int4range)
RETURNS int4range AS $$
SELECT COALESCE(
$1 + $2, $1, $2
);
$$ LANGUAGE SQL;
CREATE AGGREGATE aggregate_union (int4range) (
sfunc = _aggregate_union,
stype = int4range
);
This actually works to some degree: but only if the ranges in the query are already sorted, as each iteration of the aggregation function must result in a valid range:
SELECT aggregate_union(value) FROM (SELECT value FROM range_test ORDER BY value) x;
aggregate_union
-----------------
[2,12)
(1 row)
However, if the data is not sorted, it will fail.
Instead, we have to either collect all of the items in an array, sort this, and then attempt to aggregate, or, at each step, aggregate as much as possible, and add the current element to the array if we cannot perform a union.
The simpler of these (but will take more memory) is to stick them all into an array, and then sort and apply the union. We only need to define one new function to do it this way:
CREATE OR REPLACE FUNCTION _aggregate_union(int4range[])
RETURNS int4range AS $$
DECLARE
_range int4range;
_current int4range;
BEGIN
_current := (SELECT x FROM unnest($1) x ORDER BY x LIMIT 1);
FOR _range IN SELECT unnest($1) x ORDER BY x LOOP
IF _range && _current OR _range -|- _current THEN
_current := _current + _range;
ELSE
RETURN NULL;
END IF;
END LOOP;
RETURN _current;
END;
$$ LANGUAGE plpgsql;
CREATE AGGREGATE aggregate_union (int4range) (
stype = int4range[],
sfunc = array_append,
finalfunc = _aggregate_union
);
Lets test it out:
SELECT aggregate_union(value) FROM range_test;
aggregate_union
-----------------
[2,12)
(1 row)
Bingo.
But what about the other case? What if we care more about what data is missing?
After spending way too many hours on playing around with this, I hit on the idea of using a window function, lead
to get the data from the “next” row in the query.
SELECT
lower(value),
upper(value),
lead(lower(value)) OVER (ORDER BY lower(value) NULLS FIRST)
FROM
range_test
ORDER BY value;
This gets us most of the way. We can see the ones with upper
>= lead
indicate there is no gap between the ranges, so we can filter. However, we need to do this with a subquery to be able to get access to the columns correctly:
SELECT upper, lead FROM (
SELECT
lower(value),
upper(value),
lead(lower(value)) OVER (ORDER BY lower(value) NULLS FIRST)
FROM
range_test
ORDER BY value
) x WHERE upper < lead OR (lead IS NULL AND upper IS NOT NULL);
We can aggregate these into an array, and then prefix with an element if our first object is not infinite-lower-bounded.
CREATE OR REPLACE FUNCTION missing_ranges(int4range[])
RETURNS int4range[] AS $$
DECLARE
_range int4range;
_missing int4range[];
BEGIN
_missing := (SELECT
array_agg(int4range(upper, lead, '[)'))
FROM (
SELECT lower(x), upper(x), lead(lower(x)) OVER (ORDER BY lower(x) NULLS FIRST)
FROM unnest($1) x ORDER BY lower NULLS FIRST
) x
WHERE upper < lead OR (lead IS NULL AND upper IS NOT NULL)
);
_range := (SELECT x FROM unnest($1) x ORDER BY x LIMIT 1);
IF NOT lower_inf(_range) THEN
_missing := array_prepend(int4range(NULL, lower(_range), '[)'), _missing);
END IF;
RETURN _missing;
END;
$$ LANGUAGE plpgsql;
CREATE AGGREGATE missing_ranges (int4range) (
sfunc = array_append,
stype = int4range[],
finalfunc = missing_ranges
);
All too easy.
It is possible to rewrite this function just using SQL, but it’s not pretty:
SELECT array_agg(value)
FROM
(SELECT value
FROM
(SELECT int4range(UPPER, lead, '[)') AS value
FROM
(SELECT NULL AS LOWER,
NULL::integer AS UPPER,
LOWER(a.value) AS lead
FROM range_test a
ORDER BY a.value LIMIT 1) w
WHERE lead IS NOT NULL
UNION SELECT int4range(UPPER, lead, '[)')
FROM
(SELECT LOWER(b.value),
UPPER(b.value),
lead(LOWER(b.value)) OVER (ORDER BY LOWER(b.value) NULLS FIRST)
FROM range_test b
ORDER BY b.value) x
WHERE UPPER IS NOT NULL
AND (LEAD IS NULL OR UPPER < lead)
) y
ORDER BY value) z;
I haven’t tried to see which is faster.