Merging Adjacent Ranges in Postgres

Previously, I detailed a solution to split/trim/replace overlapping items in a table. Subsequently, I decided I needed to merge all adjacent items that could be merged. In this case, that was with two other fields (only one of which was subject to the exclusion constraint) being identical in adjacent periods.

CREATE EXTENSION IF NOT EXISTS btree_gist;

CREATE TABLE team_membership (
  membership_id SERIAL,
  player_id INTEGER,
  team_id INTEGER,
  period DATERANGE,
  CONSTRAINT prevent_overlapping_memberships EXCLUDE USING gist(player_id WITH =, period WITH &&)
);

Before we can implement the plpgsql trigger function, we need to tell Postgres how to aggregate ranges:

CREATE AGGREGATE sum(anyrange) (
  stype = anyrange,
  sfunc = range_union
);

We should note at this point that range_union, or + (hence the reason I’ve called it SUM) will fail with an error if the two ranges that are being combined do not overlap or touch. We must make sure that in any queries where we are going to use it, that all of the ranges will overlap (and I believe they also must be in “order”, so that as we perform the union on each range in a “reduce” manner we never end up with non-contiguous ranges).

So, let’s look at the trigger function. Initially, I wrote this as two queries:

CREATE OR REPLACE FUNCTION merge_adjacent()
  RETURNS TRIGGER AS $$
  BEGIN
    NEW.period = (SELECT SUM(period) FROM (SELECT NEW.period UNION ALL ...));
    DELETE FROM team_membership ...;
    RETURN NEW;
  END;
  $$ LANGUAGE plpgsql STRICT;

This required me to duplicate the WHERE clauses, and was messy.

Then I remembered you can use the RETURNING clause, and use a CTE, with a SELECT INTO:

CREATE OR REPLACE FUNCTION merge_adjacent()
  RETURNS TRIGGER AS $$

  BEGIN
    WITH matching AS (
      DELETE FROM team_membership mem
            WHERE mem.player_id = NEW.player_id
              AND mem.team_id = NEW.team_id
              AND (mem.period -|- NEW.period OR mem.period && NEW.period)
              AND mem.membership_id <> NEW.membership_id
        RETURNING period
    )
    SELECT INTO NEW.period (
      SELECT SUM(period) FROM (
        SELECT NEW.period
         UNION ALL
        SELECT period FROM matching
        ORDER BY period
    ) _all
    );
    RETURN NEW;
  END;

  $$ LANGUAGE plpgsql STRICT;

CREATE TRIGGER merge_adjacent
BEFORE INSERT OR UPDATE ON team_membership
FOR EACH ROW EXECUTE PROCEDURE merge_adjacent();

The other thing to note about this construct is that it will only work on “already merged” data: if you had ranges:

[2019-01-01, 2019-01-04)
[2019-01-04, 2019-02-02)
# Note there is a gap here...
[2019-05-01, 2019-05-11)
[2019-05-11, 2020-01-01)

and you added in a value to the missing range:

INSERT INTO range (period) VALUES ('[2019-02-02, 2019-05-01)')

You would not merge all of the ranges, only those immediately adjacent. That is, you would wind up with rows:

[2019-01-01, 2019-01-04)
[2019-01-04, 2019-05-11)
[2019-05-11, 2020-01-01)

However, if this trigger is active on the table you would never get to the stage where your data was adjacent but not merged.


This post was updated on 2021-01-04 to ensure that updates to a single row will not result in an error.

Aggregating Ranges in Postgres

Postgres’ range types are very cool. You can use them to store, as you may guess, a value that is a range. There are several included range types, and it’s possible to create your own range type. For now, we’ll just look at using a simple int4range: although everything in principle could be applied to any range type.

Firstly, a quick discussion of how the range types work.

There is a literal form, and then a functional form:

SELECT '(2,17]'::int4range;

  int4range
 -----------
  [3,18)
 (1 row)

SELECT int4range(2, 17, '(]');

 int4range
-----------
 [3,18)
(1 row)

The first thing you’ll notice is that postgres converts them to canonical form. We can provide the bound-types: inclusive or exclusive upper and lower values. This is the same notation you might see if you have done some mathematics.

You can also omit the upper and/or lower bounds to get a range object that goes to ±Infinity. Finally, you can also have empty ranges.

There are several operations that can be performed on a range. The most interesting one from the perspective of this article is the UNION operation.

SELECT '[3,17)'::int4range + '[10,20)';

 ?column?
----------
 [3,20)
(1 row)

You’ll get an error if your union does not result in a contiguous range. There is no way to store a discontinuous range in postgres (but you could store them in an array of int4range[], for instance).

What about if we want to get the aggregate range from a set of ranges?

SELECT value FROM range_test ;
  value
---------
 [2,3)
 [3,4)
 [11,12)
 [5,12)
 [4,5)
(4 rows)

What we’d like to be able to do is something like:

SELECT aggregate_union(value) FROM range_test;
 aggregate_union
-----------------
     '[2,12)'

Obviously, this would be subject to the same limitation on union: we would get an error (or a NULL) if the aggregate range was not continuous.

Perhaps even more useful might be a function that would show us what ranges are missing from a set of ranges. With the same data above, we might see:

SELECT missing_ranges(value) FROM range_test ;                                   missing_ranges
------------------
 {"(,2)","[12,)"}
(1 row)

Indeed, the latter is probably more useful, and, as it turns out, is simpler to perform.

We’ll start with the aggregate_union though, because it’s fun. It’s also the way I worked out the nicer solution for the last problem.

We need to create a postgres AGGREGATE to, well, aggregate the data from columns in a table. A naïve solution might be:

CREATE FUNCTION _aggregate_union(int4range, int4range)
RETURNS int4range AS $$

SELECT COALESCE(
  $1 + $2, $1, $2
);

$$ LANGUAGE SQL;

CREATE AGGREGATE aggregate_union (int4range) (
  sfunc = _aggregate_union,
  stype = int4range
);

This actually works to some degree: but only if the ranges in the query are already sorted, as each iteration of the aggregation function must result in a valid range:

SELECT aggregate_union(value) FROM (SELECT value FROM range_test ORDER BY value) x;
 aggregate_union
-----------------
 [2,12)
(1 row)

However, if the data is not sorted, it will fail.

Instead, we have to either collect all of the items in an array, sort this, and then attempt to aggregate, or, at each step, aggregate as much as possible, and add the current element to the array if we cannot perform a union.

The simpler of these (but will take more memory) is to stick them all into an array, and then sort and apply the union. We only need to define one new function to do it this way:

CREATE OR REPLACE FUNCTION _aggregate_union(int4range[])
RETURNS int4range AS $$

DECLARE
  _range int4range;
  _current int4range;

BEGIN

  _current := (SELECT x FROM unnest($1) x ORDER BY x LIMIT 1);

  FOR _range IN SELECT unnest($1) x ORDER BY x LOOP
    IF _range && _current OR _range -|- _current THEN
      _current := _current + _range;
    ELSE
      RETURN NULL;
    END IF;
  END LOOP;

  RETURN _current;
  END;

$$ LANGUAGE plpgsql;

CREATE AGGREGATE aggregate_union (int4range) (
  stype = int4range[],
  sfunc = array_append,
  finalfunc = _aggregate_union
);

Lets test it out:

SELECT aggregate_union(value) FROM range_test;
 aggregate_union
-----------------
 [2,12)
(1 row)

Bingo.

But what about the other case? What if we care more about what data is missing?

After spending way too many hours on playing around with this, I hit on the idea of using a window function, lead to get the data from the “next” row in the query.

SELECT
  lower(value),
  upper(value),
  lead(lower(value)) OVER (ORDER BY lower(value) NULLS FIRST)
FROM
  range_test
ORDER BY value;

This gets us most of the way. We can see the ones with upper >= lead indicate there is no gap between the ranges, so we can filter. However, we need to do this with a subquery to be able to get access to the columns correctly:

SELECT upper, lead FROM (
  SELECT
    lower(value),
    upper(value),
    lead(lower(value)) OVER (ORDER BY lower(value) NULLS FIRST)
  FROM
    range_test
  ORDER BY value
) x WHERE upper < lead OR (lead IS NULL AND upper IS NOT NULL);

We can aggregate these into an array, and then prefix with an element if our first object is not infinite-lower-bounded.

CREATE OR REPLACE FUNCTION missing_ranges(int4range[])
RETURNS int4range[] AS $$

DECLARE
  _range int4range;
  _missing int4range[];

BEGIN
  _missing := (SELECT
    array_agg(int4range(upper, lead, '[)'))
    FROM (
      SELECT lower(x), upper(x), lead(lower(x)) OVER (ORDER BY lower(x) NULLS FIRST)
      FROM unnest($1) x ORDER BY lower NULLS FIRST
    ) x
    WHERE upper < lead OR (lead IS NULL AND upper IS NOT NULL)
  );

  _range := (SELECT x FROM unnest($1) x ORDER BY x LIMIT 1);

  IF NOT lower_inf(_range) THEN
    _missing := array_prepend(int4range(NULL, lower(_range), '[)'), _missing);
  END IF;

  RETURN _missing;
END;

$$ LANGUAGE plpgsql;

CREATE AGGREGATE missing_ranges (int4range) (
  sfunc = array_append,
  stype = int4range[],
  finalfunc = missing_ranges
);

All too easy.

It is possible to rewrite this function just using SQL, but it’s not pretty:

SELECT array_agg(value)
FROM
  (SELECT value
   FROM
     (SELECT int4range(UPPER, lead, '[)') AS value
      FROM
        (SELECT NULL AS LOWER,
                NULL::integer AS UPPER,
                LOWER(a.value) AS lead
         FROM range_test a
         ORDER BY a.value LIMIT 1) w
      WHERE lead IS NOT NULL
      UNION SELECT int4range(UPPER, lead, '[)')
      FROM
        (SELECT LOWER(b.value),
                UPPER(b.value),
                lead(LOWER(b.value)) OVER (ORDER BY LOWER(b.value) NULLS FIRST)
         FROM range_test b
         ORDER BY b.value) x
      WHERE UPPER IS NOT NULL
        AND (LEAD IS NULL OR UPPER < lead)
      ) y
   ORDER BY value) z;

I haven’t tried to see which is faster.