Bowled over by Postgres

There’s a nice article at Bowled Over by SQL Window Functions by a chap called Dwain Camps. It’s written from the perspective of T-SQL, which has some differences to Postgres’s DDL and querying. I’ve reworked his stuff into what feels nice for me from a Postgressy perspective.

I’ll recap a couple of things he mentions, but you’ll probably want to head there and read that first.

  • Strike: all pins knocked down on the first ball of a frame. Scores 10 (from this frame), plus the total of whatever the next two balls knock down.
  • Spare: all pins knocked down on the second ball of a frame. Scores 10 (from this frame), plus how ever many pins you knock down on the next ball.
  • Open: at least one pin remains standing at the end of the frame. Score is how ever many pins you knocked down.

By convention, only the running tally is shown on the scoresheet. I’ve kept the frame score as the score for this frame only, and the total will contain the running total.


The first thing I do a bit differently is to use Postgres’ DOMAIN structures, which enables us to remove some of the check constraints, and simplify some others:

CREATE SCHEMA bowling;

CREATE DOMAIN bowling.frame_number AS integer
  CHECK ('[1,10]'::int4range @> VALUE)
  NOT NULL;

CREATE DOMAIN bowling.ball AS integer
  CHECK ('[0,10]'::int4range @> VALUE);

So, now we have two integer domains: the frame number may be between 1 and 10, and the ball pin count may be null, or between 0 and 10.

We’ll start by recreating the table structure: initially without constraints:

CREATE TABLE bowling.frame
(
  game_id INTEGER NOT NULL,
  player_id INTEGER NOT NULL,
  frame bowling.frame_number NOT NULL,
  ball1 bowling.ball NOT NULL,
  ball2 bowling.ball NULL,
  ball3 bowling.ball NULL,
  PRIMARY KEY (game_id, player_id, frame)
);

Not much different to the original, other than using those fresh new domain types.

The other approach I’ve used is to use more constraints, but make them simpler. I’m also relying on the fact that X + NULL => NULL, which means we can leave off a heap of the constraint clauses.

We’ll start by preventing the (non-final) frames from exceeding the number of available pins. In the case of final frame, we still only allow a spare unless we have a strike already.

ALTER TABLE bowling.frame
ADD CONSTRAINT max_spare_unless_frame_10_strike CHECK
(
  ball1 + ball2 <= 10 OR (frame = 10 AND ball1 = 10)
);

This is as simple as it can be. Because ball 2 may be null but ball 1 may not, and each ball must be no greater than 10, this is enough to encapsulate the requirement. There is one slightly incorrect circumstance: a value of (10, 0) would be valid, which is strictly incorrect (ball 2 was never bowled). In the case of the calculations it’s all correct, but if a 0 was bowled immediately after a strike, it may be possible to insert that as ball 2, which would be misleading.

ALTER TABLE bowling.frame
ADD CONSTRAINT ball_2_never_bowled_after_strike CHECK
(
  ball2 IS NULL OR ball1 < 10 OR frame = 10
);

We can now prevent ball 3 from being set unless we are on frame 10.

ALTER TABLE bowling.frame
ADD CONSTRAINT ball_3_only_in_frame_10 CHECK
(
  ball3 IS NULL OR frame = 10
);

A follow-up to the previous constraint: we only get to bowl ball 3 if we have bowled ball 2, and scored a strike or spare already. Note, we may have a strike on the first ball, which means we might have more than 10.

ALTER TABLE bowling.frame
ADD CONSTRAINT ball3_only_if_eligible CHECK
(
  ball3 IS NULL OR (ball2 IS NOT NULL AND ball1 + ball2 >= 10)
);

Finally, we have some specific allowable conditions for that last ball. We already know that we must have scored a strike or spare with the first two balls, but we need to know how many pins are available to us.

If we scored a spare or two strikes, then we may have any number up to 10 to score now. Otherwise, we have however many pins were left by ball 2 (which means ball 1 must have been a strike).

ALTER TABLE bowling.frame
ADD CONSTRAINT ball3_max_spare_or_strike CHECK
(
  ball2 + ball3 <= 10
  OR
  ball1 + ball2 = 20
  OR
  ball1 + ball2 = 10
);

I find those constraints much easier to read that the original ones.


I’ve written a view, that uses a couple of Common Table Expressions (CTEs), as well as the window functions Dwain discussed.

CREATE OR REPLACE VIEW bowling.frame_score AS (
  WITH pin_counts AS (
    SELECT
      game_id,
      player_id,
      frame,
      ball1, ball2, ball3,
      -- Get the first ball from the next frame.
      -- Used by strike and spare.
      LEAD(ball1, 1) OVER (
        PARTITION BY game_id, player_id
        ORDER BY frame
      ) AS next_ball_1,
      -- Get the second ball from the next frame.
      -- Used by strike.
      LEAD(ball2, 1) OVER (
        PARTITION BY game_id, player_id
        ORDER BY frame
      ) AS next_ball_2,
      -- Get the first ball from the next next frame.
      -- Used by double-strike.
      LEAD(ball1, 2) OVER (
        PARTITION BY game_id, player_id
        ORDER BY frame
      ) AS next_next_ball_1
    FROM bowling.frame
  ),
  frame_counts AS (
    SELECT
      game_id,
      player_id,
      frame,
      ball1, ball2, ball3,
      CASE
      -- We will start with frame 10: when we have a strike
      -- or spare, we get all three balls.
      WHEN frame = 10 AND ball1 + ball2 >= 10 THEN
        ball1 + ball2 + ball3
      -- On a strike, we get the next two balls. This could
      -- be from the next frame, or include the first ball
      -- of the frame after that. Note that in frame 9, we will
      -- also look at the second ball from fram 10, rather than
      -- looking for a non-existent frame 11.
      WHEN ball1 = 10 THEN
        ball1 + next_ball_1 + (
          CASE WHEN next_ball_1 = 10 AND frame < 9 THEN
            next_next_ball_1
          ELSE
            next_ball_2
          END
        )
      -- In the case of a spare, grab the next ball. We already
      -- handled a spare on frame 10 above.
      WHEN ball1 + ball2 = 10 THEN
        ball1 + ball2 + next_ball_1
      -- Otherwise, it's just the two balls we bowled in this frame.
      ELSE
        ball1 + ball2
      END AS score
    FROM pin_counts
  )

  -- We have everything we need in the previous CTE, except that
  -- shows us the frame score, rather than the running tally.
  -- We need to do that in another window function here, but
  -- only calculate a value when the frame's wave function has
  -- collapsed (ie, it's score is known).
  SELECT
    frame_counts.*,
    CASE WHEN score IS NOT NULL THEN
      SUM(score) OVER (
        PARTITION BY game_id, player_id
        ORDER BY frame
        ROWS UNBOUNDED PRECEDING
      )
    ELSE NULL END AS total
  FROM frame_counts
);

Again, I think this is a simpler query, and easier to read. But, I guess I wrote it.

We can insert the same data as used there, and look at our results:

-- Game 1
INSERT INTO bowling.frame VALUES
  (1, 1, 1, 7, 2, NULL),
  (1, 1, 2, 3, 7, NULL),
  (1, 1, 3, 6, 4, NULL),
  (1, 1, 4, 10, NULL, NULL),
  (1, 1, 5, 10, NULL, NULL),
  (1, 1, 6, 10, NULL, NULL),
  (1, 1, 7, 9, 1, NULL),
  (1, 1, 8, 10, NULL, NULL),
  (1, 1, 9, 8, 1, NULL),
  (1, 1, 10, 6, 3, NULL);

-- Game 2
INSERT INTO bowling.frame VALUES
  (2, 1, 1, 10, NULL, NULL),
  (2, 1, 2, 3, 7, NULL),
  (2, 1, 3, 10, NULL, NULL),
  (2, 1, 4, 6, 4, NULL),
  (2, 1, 5, 10, NULL, NULL),
  (2, 1, 6, 9, 1, NULL),
  (2, 1, 7, 10, NULL, NULL),
  (2, 1, 8, 8, 2, NULL),
  (2, 1, 9, 10, NULL, NULL),
  (2, 1, 10, 7, 3, 10);

-- Game 3
INSERT INTO bowling.frame VALUES
  (3, 1, 1, 10, NULL, NULL),
  (3, 1, 2, 10, NULL, NULL),
  (3, 1, 3, 10, NULL, NULL),
  (3, 1, 4, 10, NULL, NULL),
  (3, 1, 5, 10, NULL, NULL),
  (3, 1, 6, 10, NULL, NULL),
  (3, 1, 7, 10, NULL, NULL),
  (3, 1, 8, 10, NULL, NULL),
  (3, 1, 9, 10, NULL, NULL),
  (3, 1, 10, 10, 10, 10);
$ SELECT * FROM frame_score;

 game_id | player_id | frame | ball1 | ball2  | ball3  | score | total
---------+-----------+-------+-------+--------+--------+-------+-------
       1 |         1 |     1 |     7 |      2 | <NULL> |     9 |     9
       1 |         1 |     2 |     3 |      7 | <NULL> |    16 |    25
       1 |         1 |     3 |     6 |      4 | <NULL> |    20 |    45
       1 |         1 |     4 |    10 | <NULL> | <NULL> |    30 |    75
       1 |         1 |     5 |    10 | <NULL> | <NULL> |    29 |   104
       1 |         1 |     6 |    10 | <NULL> | <NULL> |    20 |   124
       1 |         1 |     7 |     9 |      1 | <NULL> |    20 |   144
       1 |         1 |     8 |    10 | <NULL> | <NULL> |    19 |   163
       1 |         1 |     9 |     8 |      1 | <NULL> |     9 |   172
       1 |         1 |    10 |     6 |      3 | <NULL> |     9 |   181
       2 |         1 |     1 |    10 | <NULL> | <NULL> |    20 |    20
       2 |         1 |     2 |     3 |      7 | <NULL> |    20 |    40
       2 |         1 |     3 |    10 | <NULL> | <NULL> |    20 |    60
       2 |         1 |     4 |     6 |      4 | <NULL> |    20 |    80
       2 |         1 |     5 |    10 | <NULL> | <NULL> |    20 |   100
       2 |         1 |     6 |     9 |      1 | <NULL> |    20 |   120
       2 |         1 |     7 |    10 | <NULL> | <NULL> |    20 |   140
       2 |         1 |     8 |     8 |      2 | <NULL> |    20 |   160
       2 |         1 |     9 |    10 | <NULL> | <NULL> |    20 |   180
       2 |         1 |    10 |     7 |      3 |     10 |    20 |   200
       3 |         1 |     1 |    10 | <NULL> | <NULL> |    30 |    30
       3 |         1 |     2 |    10 | <NULL> | <NULL> |    30 |    60
       3 |         1 |     3 |    10 | <NULL> | <NULL> |    30 |    90
       3 |         1 |     4 |    10 | <NULL> | <NULL> |    30 |   120
       3 |         1 |     5 |    10 | <NULL> | <NULL> |    30 |   150
       3 |         1 |     6 |    10 | <NULL> | <NULL> |    30 |   180
       3 |         1 |     7 |    10 | <NULL> | <NULL> |    30 |   210
       3 |         1 |     8 |    10 | <NULL> | <NULL> |    30 |   240
       3 |         1 |     9 |    10 | <NULL> | <NULL> |    30 |   270
       3 |         1 |    10 |    10 |     10 |     10 |    30 |   300
(30 rows)

Building the average scores for a player is likewise similar. Because I’m using a VIEW, I can jut reference that.

SELECT
  player_id,
  AVG(total) as average
FROM frame_score
WHERE frame=10
GROUP BY player_id;
 player_id |       average
-----------+----------------------
         1 | 227.0000000000000000
(1 row)

I’m fairly sure I’ve rewritten the constraints correctly, but may have missed some. Here are some of the condition tests that show invalid constraints:

$ INSERT INTO bowling.frame VALUES(1, 2, 0, 9, NULL, NULL);
ERROR:  value for domain frame_number violates check constraint "frame_number_check"
Time: 0.405 ms
$ INSERT INTO bowling.frame VALUES(1, 2, 1, 11, NULL, NULL);
ERROR:  value for domain ball violates check constraint "ball_check"
Time: 0.215 ms
$ INSERT INTO bowling.frame VALUES(1, 2, 1, -1, NULL, NULL);
ERROR:  value for domain ball violates check constraint "ball_check"
Time: 0.218 ms
$ INSERT INTO bowling.frame VALUES(1, 2, 1, 8, 3, NULL);
ERROR:  new row for relation "frame" violates check constraint "max_spare_unless_frame_10_strike"
DETAIL:  Failing row contains (1, 2, 1, 8, 3, null).
Time: 0.332 ms
$ INSERT INTO bowling.frame VALUES(1, 2, 1, 8, 1, 1);
ERROR:  new row for relation "frame" violates check constraint "ball3_only_if_eligible"
DETAIL:  Failing row contains (1, 2, 1, 8, 1, 1).
Time: 0.392 ms
$ INSERT INTO bowling.frame VALUES(1, 2, 1, 8, 2, 1);
ERROR:  new row for relation "frame" violates check constraint "ball_3_only_in_frame_10"
DETAIL:  Failing row contains (1, 2, 1, 8, 2, 1).
Time: 0.327 ms
$ INSERT INTO bowling.frame VALUES(1, 2, 10, 8, 3, 1);
ERROR:  new row for relation "frame" violates check constraint "max_spare_unless_frame_10_strike"
DETAIL:  Failing row contains (1, 2, 10, 8, 3, 1).
Time: 0.340 ms
$ INSERT INTO bowling.frame VALUES(1, 2, 10, 8, 2, 11);
ERROR:  value for domain ball violates check constraint "ball_check"
Time: 0.200 ms
$ INSERT INTO bowling.frame VALUES(1, 2, 10, 10, NULL, 10);
ERROR:  new row for relation "frame" violates check constraint "ball3_only_if_eligible"
DETAIL:  Failing row contains (1, 2, 10, 10, null, 10).
Time: 0.316 ms
$ INSERT INTO bowling.frame VALUES(1, 2, 5, 10, 0, NULL);
ERROR:  new row for relation "frame" violates check constraint "ball_2_never_bowled_after_strike"
DETAIL:  Failing row contains (1, 2, 5, 10, 0, null).
Time: 0.307 ms
$ INSERT INTO bowling.frame VALUES(1, 2, 10, 10, 2, 9);
ERROR:  new row for relation "frame" violates check constraint "ball3_max_spare_or_strike"
DETAIL:  Failing row contains (1, 2, 10, 10, 2, 9).
Time: 0.323 ms

Finally, I rewrote the pretty printer. It’s not quite perfect (I don’t like how I get the plus signs at the newline character), but it is workable:

WITH symbols AS (
  SELECT
    game_id, player_id, frame,
    CASE WHEN ball1 = 10 THEN 'X' ELSE ball1::text END as ball1,
    CASE WHEN ball2 IS NULL THEN ' '
         WHEN ball1 + ball2 = 10 THEN '/'
         WHEN ball1 = 10 AND ball2 = 10 THEN 'X'
         ELSE ball2::text
         END as ball2,
    CASE WHEN ball3 IS NULL THEN ' '
    WHEN ball3 = 10 THEN 'X'
    WHEN ball3 + ball2 = 10 THEN '/'
    ELSE ball3::text
    END as ball3,
    lpad(total::text, 5, ' ') as total
  FROM
    frame_score
  ORDER BY game_id, player_id, frame
), grouped_data AS (
  SELECT
    game_id,
    player_id,
    array_agg(ball1) ball1,
    array_agg(ball2) ball2,
    array_agg(ball3) ball3,
    array_agg(total) total
  FROM
    symbols
  GROUP BY
    game_id, player_id
)
SELECT
  game_id,
  player_id,
  ball1[1] || ' | ' || ball2[1] || ' ' || chr(10) || total[1] AS "1",
  ball1[2] || ' | ' || ball2[2] || ' ' || chr(10) || total[2] AS "2",
  ball1[3] || ' | ' || ball2[3] || ' ' || chr(10) || total[3] AS "3",
  ball1[4] || ' | ' || ball2[4] || ' ' || chr(10) || total[4] AS "4",
  ball1[5] || ' | ' || ball2[5] || ' ' || chr(10) || total[5] AS "5",
  ball1[6] || ' | ' || ball2[6] || ' ' || chr(10) || total[6] AS "6",
  ball1[7] || ' | ' || ball2[7] || ' ' || chr(10) || total[7] AS "7",
  ball1[8] || ' | ' || ball2[8] || ' ' || chr(10) || total[8] AS "8",
  ball1[9] || ' | ' || ball2[9] || ' ' || chr(10) || total[9] AS "9",
  ball1[10] || ' | ' || ball2[10] || ' | ' || ball3[10] || ' ' || chr(10) || lpad(total[10], 9, ' ') AS "10"
FROM grouped_data;
 game_id | player_id |   1    |   2    |   3    |   4    |   5    |   6    |   7    |   8    |   9    |     10
---------+-----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+------------
       1 |         1 | 7 | 2 +| 3 | / +| 6 | / +| X |   +| X |   +| X |   +| 9 | / +| X |   +| 8 | 1 +| 6 | 3 |   +
         |           |     9  |    25  |    45  |    75  |   104  |   124  |   144  |   163  |   172  |       181
       2 |         1 | X |   +| 3 | / +| X |   +| 6 | / +| X |   +| 9 | / +| X |   +| 8 | / +| X |   +| 7 | / | X +
         |           |    20  |    40  |    60  |    80  |   100  |   120  |   140  |   160  |   180  |       200
       3 |         1 | X |   +| X |   +| X |   +| X |   +| X |   +| X |   +| X |   +| X |   +| X |   +| X | X | X +
         |           |    30  |    60  |    90  |   120  |   150  |   180  |   210  |   240  |   270  |       300
(3 rows)

That will do for now. Corrections welcome!

Humax 4tune

I recently bought a Humax 4tune (hms1000t). I’m rather liking in so far: 4 tuners, a reasonable UI.

I would like to be able to get it to do a bit more, so I’ve started investigating it. It shows the following ports as being open:

Port Scan has started…

Port Scanning host: 10.0.1.30

	 Open TCP Port: 	21     		ftp
	 Open TCP Port: 	139    		netbios-ssn
	 Open TCP Port: 	445    		microsoft-ds
	 Open TCP Port: 	9000   		cslistener
	 Open TCP Port: 	9251
	 Open TCP Port: 	9252
	 Open TCP Port: 	36666
	 Open TCP Port: 	36667
	 Open TCP Port: 	48629
	 Open TCP Port: 	50001
	 Open TCP Port: 	55200
	 Open TCP Port: 	56789
	 Open TCP Port: 	56790
	 Open TCP Port: 	58816
	 Open TCP Port: 	60001

Port Scan has completed…

Obviously, I’ve turned on the FTP and Samba services.

Port 9000 appears to be an RTSP service:

$ curl http://<ip-address>:9000
RTSP

Ports 9251 and 9252 respond with:

{"errorCode": 404, "errorDescription":"Specified WID not found"}

The latter requires HTTPS, but uses a certificate for the domain name htvdev.humaxtvportal.com. The humaxtvportal.com domain is registered.

Port 36666 responds with:

RTSP/1.0 200 OK
CSeq: (null)
Apple-Jack-Status: connected; type=analog

Port 50001 responds with an XML document:

<?xml version="1.0" encoding="utf-8"?>
<root xmlns="urn:schemas-upnp-org:device-1-0" xmlns:dlna="urn:schemas-dlna-org:device-1-0" xmlns:hmx="urn:schemas-humax-co-kr:metadata-1-0"  >
  <specVersion>
    <major>1</major>
    <minor>0</minor>
  </specVersion>
  <device>
    <deviceType>urn:schemas-upnp-org:device:MediaServer:1</deviceType>
    <hmx:X_HMXCAP>TSR,MULTIVALUE,RCUTYPE_LIME,HX_N_ALL_P</hmx:X_HMXCAP>
    <dlna:X_DLNACAP xmlns:dlna="urn:schemas-dlna-org:device-1-0"></dlna:X_DLNACAP>
    <dlna:X_DLNADOC xmlns:dlna="urn:schemas-dlna-org:device-1-0">DMS-1.50</dlna:X_DLNADOC>
    <friendlyName>Lounge 10.0.1.30</friendlyName>
    <manufacturer>HUMAX</manufacturer>
    <manufacturerURL>www.humaxdigital.com</manufacturerURL>
    <modelDescription>HUMAX Set-Top Box OCTO2.0 Platform</modelDescription>
    <modelName>HMS1000T</modelName>
    <modelNumber>undefined</modelNumber>
    <modelURL>http://www.humaxdigital.com</modelURL>
    <serialNumber>None</serialNumber>
    <UDN>uuid:B38D5042-B86E-44F0-A758-F4F8BD6E3053</UDN>
    <icube:X_WOON_PAIRING xmlns:icube="urn:schemas-icube-co-kr:WoonDevice-1-0">hmx_base_08eb74e20b66</icube:X_WOON_PAIRING>
    <iconList>
      <icon>
        <mimetype>image/jpeg</mimetype>
        <width>48</width>
        <height>48</height>
        <depth>24</depth>
        <url>/icon/dms_48.jpg</url>
      </icon>
      <icon>
        <mimetype>image/png</mimetype>
        <width>48</width>
        <height>48</height>
        <depth>24</depth>
        <url>/icon/dms_48.png</url>
      </icon>
      <icon>
        <mimetype>image/jpeg</mimetype>
        <width>120</width>
        <height>120</height>
        <depth>24</depth>
        <url>/icon/dms_120.jpg</url>
      </icon>
      <icon>
        <mimetype>image/png</mimetype>
        <width>120</width>
        <height>120</height>
        <depth>24</depth>
        <url>/icon/dms_120.png</url>
      </icon>
    </iconList>
    <serviceList>
      <service>
        <serviceType>urn:schemas-upnp-org:service:ConnectionManager:1</serviceType>
        <serviceId>urn:upnp-org:serviceId:ConnectionManager</serviceId>
        <SCPDURL>ConnectionManager/scpd.xml</SCPDURL>
        <controlURL>ConnectionManager/control</controlURL>
        <eventSubURL>ConnectionManager/event</eventSubURL>
      </service>
      <service>
        <serviceType>urn:schemas-upnp-org:service:ContentDirectory:1</serviceType>
        <serviceId>urn:upnp-org:serviceId:ContentDirectory</serviceId>
        <SCPDURL>ContentDirectory/scpd.xml</SCPDURL>
        <controlURL>ContentDirectory/control</controlURL>
        <eventSubURL>ContentDirectory/event</eventSubURL>
      </service>
    </serviceList>
  </device>
</root>

This appears to be DLNA-related.

Likewise with port 55200:

<?xml version="1.0" encoding="utf-8"?>
<root xmlns="urn:schemas-upnp-org:device-1-0">
  <specVersion>
    <major>1</major>
    <minor>0</minor>
  </specVersion>
  <device>
    <dlna:X_DLNADOC xmlns:dlna="urn:schemas-dlna-org:device-1-0">DMR-1.50</dlna:X_DLNADOC>
    <deviceType>urn:schemas-upnp-org:device:MediaRenderer:1</deviceType>
    <friendlyName>Lounge</friendlyName>
    <manufacturer>HUMAX</manufacturer>
    <manufacturerURL>http://www.humaxdigital.com</manufacturerURL>
    <modelDescription>HUMAX Set-Top Box OCTO2.0 Platform</modelDescription>
    <modelName>HMS1000T</modelName>
    <modelNumber>undefined</modelNumber>
    <modelURL>http://www.humaxdigital.com</modelURL>
    <serialNumber>[redacted]</serialNumber>
    <UDN>uuid:B38D5042-B86E-44F0-A758-F4F8BD6E3053</UDN>
    <serviceList>
      <service>
        <serviceType>urn:schemas-upnp-org:service:AVTransport:1</serviceType>
        <serviceId>urn:upnp-org:serviceId:AVTransport</serviceId>
        <SCPDURL>AVTransport/scpd.xml</SCPDURL>
        <controlURL>AVTransport/control</controlURL>
        <eventSubURL>AVTransport/event</eventSubURL>
      </service>
      <service>
        <serviceType>urn:schemas-upnp-org:service:ConnectionManager:1</serviceType>
        <serviceId>urn:upnp-org:serviceId:ConnectionManager</serviceId>
        <SCPDURL>ConnectionManager/scpd.xml</SCPDURL>
        <controlURL>ConnectionManager/control</controlURL>
        <eventSubURL>ConnectionManager/event</eventSubURL>
      </service>
      <service>
        <serviceType>urn:schemas-upnp-org:service:RenderingControl:1</serviceType>
        <serviceId>urn:upnp-org:serviceId:RenderingControl</serviceId>
        <SCPDURL>RenderingControl/scpd.xml</SCPDURL>
        <controlURL>RenderingControl/control</controlURL>
        <eventSubURL>RenderingControl/event</eventSubURL>
      </service>
    </serviceList>
  </device>
</root>

Port 56789 returns a 404: I’m guessing we can figure out a path that might work.

Port 56790 is a “DIAL SERVER”:

<?xml version="1.0" encoding="utf-8"?>
<root xmlns="urn:schemas-upnp-org:device-1-0" xmlns:r="urn:restful-tv-org:schemas:upnp-dd">
  <specVersion>
    <major>1</major>
    <minor>0</minor>
  </specVersion>
  <device>
    <deviceType>urn:schemas-upnp-org:device:tvdevice:1</deviceType>
    <friendlyName>HMS1000T - DIAL SERVER:10.0.1.30</friendlyName>
    <manufacturer>HUMAX</manufacturer>
    <manufacturerURL>http://www.humaxdigital.com</manufacturerURL>
    <modelDescription>HUMAX Set-Top Box OCTO2.0 Platform</modelDescription>
    <modelName>HMS1000T</modelName>
    <modelNumber>undefined</modelNumber>
    <modelURL>http://www.humaxdigital.com</modelURL>
    <serialNumber>[redacted]</serialNumber>
    <UDN>uuid:B38D5042-B86E-44F0-A758-F4F8BD6E3053</UDN>
    <serviceList>
      <service>
        <serviceType>urn:dial-multiscreen-org:service:dial:1</serviceType>
        <serviceId>urn:dial-multiscreen-org:serviceId:dialControl</serviceId>
        <SCPDURL>dial/scpd.xml</SCPDURL>
        <controlURL/>
        <eventSubURL/>
      </service>
    </serviceList>
  </device>
</root>

Port 58816 rejects connections, but port 60001 is an interesting one:

$ telnet [address] 60001
Trying [address]...
Connected to [address].
Escape character is '^]'.
root:> help
* help [root:]
** registered process list
    corsair                     homma                       prism                       obama
    sama                        ipepg
** helpall : display all command set
root:> helpall
* help all
[corsair]
  [mem]
     printmem         :  print memory usage to outputfile
     memconf          :  configuration of printmem
     memcheck         :  check memory corruption - if corrupted, log in corrupted.log
     currsec          :  print the current second from the boot
  [RLIB]
     print_rsc        :  print Resource Info
  [OAPI]
     oapi_output_vid  :  oapi_output_video debug api call
     oapi_output_hdm  :  oapi_output_hdmi debug api call
[homma]
  [mem]
     printmem         :  print memory usage to outputfile
     memconf          :  configuration of printmem
     memcheck         :  check memory corruption - if corrupted, log in corrupted.log
     currsec          :  print the current second from the boot
  [RLIB]
     print_rsc        :  print Resource Info
[prism]
  [mem]
     printmem         :  print memory usage to outputfile
     memconf          :  configuration of printmem
     memcheck         :  check memory corruption - if corrupted, log in corrupted.log
     currsec          :  print the current second from the boot
  [ONDK]
     hello            :  help..
     vkcmd            :  vkcmd [mem/sem/task/timer]
     timerlist        :  timerlist
     gwmlist          :  gwmlist
  [OAPI]
     oapi_output_hdm  :  oapi_output_hdmi debug api call
[obama]
  [mem]
     printmem         :  print memory usage to outputfile
     memconf          :  configuration of printmem
     memcheck         :  check memory corruption - if corrupted, log in corrupted.log
     currsec          :  print the current second from the boot
  [pal_hdmi]
     hdcp_enable      :  HDCP on/off
     status           :  Print HDMI status.
     capa             :  Print HDMI TV capability.
     hdcp_event       :  Send HDCP event to appkit
     output           :  Enable/Disable HDMI output.
     avinfo           :  Set AVI infoframe.
  [OM]
     om_output_hdmi   :  om_output_hdmi debug api call
     om_output_video  :  om_output_video debug api call
  [PAL]
     pal_scl          :  pal scaler debug api call
     pal_scart        :  pal scart debug api call
     pal_out          :  pal out debug api call
     pal_video        :  pal video debug api call
     pal_audio        :  pal audio debug api call
  [RLIB]
     print_rsc        :  print Resource Info
  [SVC]
     svc_av           :  tv service /AV debug api call
     svc_out          :  svc out debug api call
[sama]
  [mem]
     printmem         :  print memory usage to outputfile
     memconf          :  configuration of printmem
     memcheck         :  check memory corruption - if corrupted, log in corrupted.log
     currsec          :  print the current second from the boot
  [RLIB]
     print_rsc        :  print Resource Info
  [SamaCmd]
     print_systime    :  print system time
     print_rpc        :  print SAMA rpc
     print_slot       :  print SAMA slot status
     print_sch        :  print SAMA Schedules
     print_timer      :  print SAMA Timer
     make_sch         :  make SAMA Schedule
     del_sch          :  delete SAMA Schedule
     make_ransch      :  make SAMA Schedule (TBR)
     del_sch_all      :  delete ALL SAMA Schedule
[ipepg]
  [mem]
     printmem         :  print memory usage to outputfile
     memconf          :  configuration of printmem
     memcheck         :  check memory corruption - if corrupted, log in corrupted.log
     currsec          :  print the current second from the boot
  [RLIB]
     print_rsc        :  print Resource Info
root:>

Oh my. More to come, I suspect…

Postgres Tree Shootout part 2: Adjacency List using CTEs

This is the second post in an ongoing series dealing with storing Hierarchical or Tree data structures in Postgres. You should read the Introduction if you haven’t already.

This post contains the queries that illustrate how an Adjacency List model can be used to represent a Hierarchical set of data, including data definitions, and the various operations that have been defined in the aforementioned introduction.

I’ve discussed Adjacency Lists in the past, but I’ll quickly recap why I think they are good.

  • They are conceptually simple to understand
  • They enforce referential integrity
  • They can be modelled with most ORMs without any extra infrastructure
  • Many of the operations are non-complex
  • Recursive queries allow us to perform the complex queries in reasonable time

To help build suspense (but more because I haven’t yet come up with a way to generate a nice reproducible, yet complex tree), this post may discuss the complexity of the queries, but will not contain any measurements.

Initial tree

Before each operation, our data will look like this (where parents point to children):

2014-11-26 11:27ZCanvas 1Layer 112345678910111213141516

We will assume reversion back to this structure after each operation: we could do this using a TRUNCATE followed by an INSERT; or we could run the operation in a transaction and rollback.

There may be a post which shows the effects of each of the queries below in a graphical form.

Table structure

Adjacency Lists are dead simple. Each node simply contains a reference to it’s parent node.

CREATE TABLE nodes (
  node_id SERIAL PRIMARY KEY,
  parent_id INTEGER REFERENCES nodes(node_id)
);

We can insert our data using a single statement:

INSERT INTO nodes VALUES
  (1, NULL),
  (2, 1),
  (3, 1),
  (4, 2),
  (5, 2),
  (6, 3),
  (7, 3),
  (8, 4),
  (9, 8),
  (10, NULL),
  (11, 10),
  (12, 11),
  (13, 11),
  (14, 12),
  (15, 12),
  (16, 12);

Insertions

Inserting a single leaf node is simple. To insert a single node as a child of node 13, for example:

INSERT INTO nodes (parent_id) VALUES (13);

Inserting a new root node is slightly more complicated. In most ORMs, we would probably do it in two queries: one to create the new node, and a second to update the other root nodes to point to this one. We’ll see that below.

We can do slightly better in raw SQL: UPDATE our table with the result of an INSERT ... RETURNING that occurs inside a Common Table Expression (CTE, or WITH query).

WITH parent AS (
  INSERT INTO nodes(parent_id)
  VALUES (NULL)
  RETURNING node_id
)
UPDATE nodes
SET parent_id = parent.node_id
FROM parent
WHERE parent_id IS NULL;

We can use the same pattern to insert an intermediate node in our tree. For instance, inserting a new node between nodes 11 and 12:

WITH created_node AS (
  INSERT INTO nodes(parent_id)
  VALUES (11)
  RETURNING node_id
)
UPDATE nodes
SET parent_id = created_node.node_id
FROM created_node
WHERE nodes.node_id = 12;

And our last insert, adding a child node, that gets it’s siblings as children. For instance, adding a new node under node 12, which gets all of node 12’s children as it’s children.

WITH created_node AS (
  INSERT INTO nodes(parent_id)
  VALUES (12)
  RETURNING node_id
)
UPDATE nodes
SET parent_id = created_node.node_id
FROM created_node
WHERE nodes.parent_id = 12;

All of these queries should perform relatively well: the CTE will be very simple (as it is no different to the single leaf insert), and the UPDATE should likewise be fairly simple: it needs to filter out which existing nodes do not need to be updated; and then it needs to update the remainder of the rows with the value pulled from the CTE.

This theoretically is only marginally more complex than just a simple UPDATE foo SET bar=baz WHERE quux IS NULL style query.

If we were using an ORM, we might need to do this in two queries: something like this (in Django):

# Insert new root node: all other root nodes now have this as a parent
new_node = Node.objects.create()
Node.objects.filter(parent=None).exclude(pk=new_node.pk).update(parent=new_node)
# Could possibly do as:
Node.objects.filter(parent=None).update(parent=Node.objects.create().pk)

# Insert new node, with a single child as it's child (and that child's previous
# parent as it's parent)
new_node = Node.objects.create(parent=old_node.parent)
old_node.parent = new_node
old_node.save()

# Insert new node, with children of that node's parent now children of the node.
new_node = parent_node.children.create()
parent_node.children.exclude(pk=new_node.pk).update(parent=new_node)
# Again, may be able to do:
parent_node.children.update(parent=parent_node.children.create().pk)

Note the required exclusion of the newly created node: we don’t have to do this in the CTE versions, as that doesn’t “exist” at the time the other part of the query runs.

Removals

Removing a single leaf node is no different than removing a row from a normal table. Removing node 9, for instance:

DELETE FROM nodes WHERE node_id = 9;

Because the information about parent-child relationships is stored in the child, we do not need to do anything else to maintain the tree.

To remove a single root node (in this case, node 1), and promote all children to root nodes themselves, we can do two queries:

UPDATE nodes SET parent_id = NULL WHERE parent_id = 1;
DELETE FROM nodes WHERE node_id = 1;

It may be possible to do this in a single query, similar to the CTE queries above, but I’m not sure of the benefit.

WITH deleted AS (
  DELETE FROM nodes
  WHERE node_id = 1
)
UPDATE nodes SET parent_id = NULL WHERE parent_id = 1;

The same pattern can be used for removing a node, and attaching it’s children to it’s parent. Here, we will remove node 2, and attach it’s children (4 and 5) as children of it’s parent, node 1:

UPDATE nodes
SET parent_id = (SELECT parent_id FROM nodes WHERE node_id = 2)
WHERE parent_id = 2;

DELETE from nodes WHERE node_id = 2;

This is a place where using a CTE might make things clearer - especially if we have the node-to-be-deleted’s id, but not it’s parent:

WITH deleted AS (
  DELETE FROM nodes
  WHERE node_id = 2
  RETURNING node_id, parent_id
)
UPDATE nodes
SET parent_id = deleted.parent_id
FROM deleted
WHERE nodes.parent_id = deleted.node_id;

Righto, now we are up to the traditionally “hard” things for an Adjacency List to perform. Dealing with removing an arbitrary depth of (sub)tree.

We’ll need to create a recursive CTE, and delete according to that. Let’s just select from that initially, so we can see what the CTE data will look like:

WITH RECURSIVE tree AS (
  SELECT node_id, ARRAY[]::integer[] AS ancestors
  FROM nodes WHERE parent_id IS NULL

  UNION ALL

  SELECT nodes.node_id, tree.ancestors || nodes.parent_id
  FROM nodes, tree
  WHERE nodes.parent_id = tree.node_id
)
SELECT * FROM tree;
 node_id | ancestors
---------+------------
       1 | {}
      10 | {}
       2 | {1}
       3 | {1}
      11 | {10}
       4 | {1,2}
       5 | {1,2}
       6 | {1,3}
       7 | {1,3}
      12 | {10,11}
      13 | {10,11}
       8 | {1,2,4}
      14 | {10,11,12}
      15 | {10,11,12}
      16 | {10,11,12}
       9 | {1,2,4,8}
(16 rows)

Coolio. So, on to our operations. Let’s remove the subtree starting with node 2. I’ll hide the CTE, since it will be the same for quite a few of these operations:

WITH RECURSIVE tree AS (...)
DELETE FROM nodes
WHERE node_id IN (
  SELECT node_id FROM tree WHERE 2 = ANY(tree.ancestors)
) OR node_id = 2;

The query is identical for a full tree (node 1 and descendants):

WITH RECURSIVE tree AS (...)
DELETE FROM nodes
WHERE node_id IN (
  SELECT node_id FROM tree WHERE 1 = ANY(tree.ancestors)
) OR node_id = 1;

And it’s nearly identical for just the descendants of a given node. Here, for all of node 2’s descendants, but not that node itself:

WITH RECURSIVE tree AS (...)
DELETE FROM nodes
WHERE node_id IN (
  SELECT node_id FROM tree WHERE 2 = ANY(tree.ancestors)
);

Moves

Because the relationship is stored purely on the child element, moving around trees and subtrees is very easy. We can start with moving subtree starting with 3 to underneath node 4:

UPDATE nodes
SET parent_id = 4 WHERE node_id = 3;

Nothing surprising there. Similarly, the query is identical for moving a leaf to a different parent, a root node into a tree, and turning a subtree into a full tree (making that node a root node).

UPDATE nodes SET parent_id = 6 WHERE node_id = 5;
UPDATE nodes SET parent_id = 8 WHERE node_id = 10;
UPDATE nodes SET parent_id = NULL WHERE node_id = 2;

The final move: all of node’s children to a different node is almost as simple:

UPDATE nodes SET parent_id = 5 WHERE parent_id = 12;

This seems to be a situation where Adjacency Lists are really good. None of these queries are any more complex than the simplest UPDATE you could think of.

Fetches

Using the same CTE, we can perform our fetches. We may need to extend it to deal with depths, but since the ancestors column contains ancestors starting with the root node, we could count stuff in there. Let’s see how it goes.

Descendants

Fetching all descendants of a given node just means we want to see if the node occurs at all in each row’s ancestors. To get all of node 10’s descendants:

WITH RECURSIVE tree AS (
  SELECT node_id, ARRAY[]::integer[] AS ancestors
  FROM nodes WHERE parent_id IS NULL

  UNION ALL

  SELECT nodes.node_id, tree.ancestors || nodes.parent_id
  FROM nodes, tree
  WHERE nodes.parent_id = tree.node_id
)
SELECT node_id FROM tree WHERE 10 = ANY(tree.ancestors);

However, we could improve this by starting with just the node we care about, or more specifically, it’s children:

WITH RECURSIVE tree AS (
  SELECT node_id, ARRAY[10]::integer[] AS ancestors FROM nodes WHERE parent_id = 10

  UNION ALL

  SELECT nodes.node_id, tree.ancestors || nodes.parent_id
  FROM nodes, tree
  WHERE nodes.parent_id = tree.node_id
)
SELECT node_id FROM tree;

Obviously, this is far less generic, but it is also significantly less complex. For starters, it only builds up the part of the tree we care about, and then just returns the node ids, rather than building up the whole tree, and then discarding the parts that are not required.

The same code can be used for determining the number of descendants, but with a COUNT(node_id) in the final query.

To get our depth-limited query, we can approach from two directions. To get the subtree to depth 2 from above:

WITH RECURSIVE tree AS (
  SELECT node_id, ARRAY[10]::integer[] AS ancestors FROM nodes WHERE parent_id = 10

  UNION ALL

  SELECT nodes.node_id, tree.ancestors || nodes.parent_id
  FROM nodes, tree
  WHERE nodes.parent_id = tree.node_id
  AND cardinality(tree.ancestors) < 2
)
SELECT node_id FROM tree;

To do the same in the more generic form we can move the cardinality test into the final query: but notably we need to test using <= depth instead of < depth. This is because the test we are doing in the query above is on the parent’s ancestor depth.

WITH RECURSIVE tree AS (
  SELECT node_id, ARRAY[]::integer[] AS ancestors
  FROM nodes WHERE parent_id IS NULL

  UNION ALL

  SELECT nodes.node_id, tree.ancestors || nodes.parent_id
  FROM nodes, tree
  WHERE nodes.parent_id = tree.node_id
)
SELECT node_id
FROM tree
WHERE 10 = ANY(tree.ancestors)
AND cardinality(tree.ancestors) <= 2;

Ancestors

Fetching ancestors from our generic CTE is a bit simpler, because that data is already part of the query:

WITH RECURSIVE tree AS (
  SELECT node_id, ARRAY[]::integer[] AS ancestors
  FROM nodes WHERE parent_id IS NULL

  UNION ALL

  SELECT nodes.node_id, tree.ancestors || nodes.parent_id
  FROM nodes, tree
  WHERE nodes.parent_id = tree.node_id
) SELECT unnest(ancestors) FROM tree WHERE node_id = 15;

To do the equivalent of the hand-built CTE, we would need to start with the node, and build back the other way. It’s getting late here, so I can’t think of a way to do this right now that doesn’t get stuck doing infinite recursion.

The count query is an interesting one: we can just remove the need to unnest, and take the cardinality:

WITH RECURSIVE tree AS (
  SELECT node_id, ARRAY[]::integer[] AS ancestors
  FROM nodes WHERE parent_id IS NULL

  UNION ALL

  SELECT nodes.node_id, tree.ancestors || nodes.parent_id
  FROM nodes, tree
  WHERE nodes.parent_id = tree.node_id
) SELECT cardinality(ancestors) FROM tree WHERE node_id = 15;

The depth query is a little trickier. We want to know the ancestors of node 15, up to a depth of 2. If our ancestors array was in the reverse order, we should be able to unnest and limit.

WITH RECURSIVE tree AS (
  SELECT node_id, ARRAY[]::integer[] AS ancestors
  FROM nodes WHERE parent_id IS NULL

  UNION ALL

  SELECT nodes.node_id, nodes.parent_id || tree.ancestors
  FROM nodes, tree
  WHERE nodes.parent_id = tree.node_id
) SELECT unnest(ancestors) FROM tree WHERE node_id = 15 LIMIT 2;

We can do this because a node only has one parent: so limiting the number of ancestors (when sorted nearest ancestor first) is the same as limiting the depth.

Special queries

Fetching all leaf nodes is just a matter of excluding those that have a relationship to another node as it’s parent:

SELECT node_id FROM nodes
WHERE node_id NOT IN (
  SELECT parent_id FROM nodes WHERE parent_id IS NOT NULL
);

Trick for new players: if you leave off the WHERE clause in that sub-query, you won’t get any matches!

Fetching the number of leaf nodes is trivial.

Fetching root nodes (or the number of) is simpler than leaf nodes:

SELECT node_id FROM nodes
WHERE parent_id IS NULL;

Fetching non-leaf nodes, and non-root nodes is just negations of the two queries above:

SELECT node_id FROM nodes WHERE node_id IN (
  SELECT parent_id FROM nodes WHERE parent_id IS NOT NULL
);

SELECT node_id FROM nodes WHERE parent_id IS NOT NULL;

And the non-leaf, non-root nodes just combines these two queries:

SELECT node_id FROM nodes WHERE node_id IN (
  SELECT parent_id FROM nodes WHERE parent_id IS NOT NULL
) AND parent_id IS NOT NULL;

As an aside: there is also the inverse of this: nodes which are isolated (root AND leaf nodes):

SELECT node_id FROM nodes
WHERE parent_id IS NULL
AND node_id NOT IN (
  SELECT parent_id FROM nodes WHERE parent_id IS NOT NULL
);

Well, that’s our operations. Most of them are really simple. For anything that requires us to fetch or delete a whole subtree, we needed to revert to recursive CTEs, and for some of the other operations, using a CTE makes it simpler (and easier to understand).

Next, we will look at an alternative to the CTE operations, using a recursive view. From there, we should be able to look at a trigger-based approach that materializes our tree (node, ancestors) data in a table, and keeps it up to date. That’s, as I hinted, getting close to a Materialized Path approach, but keeps the conceptual simplicity of the Adjacency List, and hopefully prevents possible issues relating to referential integrity.

Postgres Tree Shootout part 1: Introduction.

I’ve written before about using Adjacency Lists, and even done some performance tests on querying them. Whilst reading a post today, it occurred to me that it might be worthwhile to do a comparison of the various methods of storing hierarchical data in Postgres, and the costs of the same operations on each of those.

This post is just an introduction, with an outline of what I plan to do to run these tests. Please feel free to suggest things that I have missed, or that might be an oversight at my end.


Tree Models

There are four methods of storing the relationships that might form a tree. This analysis will be limited to actual tree, rather than graph structures (no cycles). I plan to detail the data structures in a series of posts, one per method. In each case, where there are multiple ways to store the data, I will attempt to examine each of these.

Adjacency Lists, being the simplest to understand (and the ones I have spent more time on recently), will be discussed first.

Path Enumerations will be next, with a comparison of storing the data using the ltree extension, and using an ARRAY column.

Following this, I’ll make an attempt at using the Closure Table model: where each ancestor-descendant relationship is stored, rather than just the parent-child relationship.

Finally, I’ll have a crack at the Nested Set model. I’m not solidly behind this model for the types of data I’ve had to deal with, but it is a valid mechanism for storing and retrieving this data. Besides, it will be an interesting exercise to implement.

My plan to handle all of these is that all tree manipulation should be “automatic”, that is, adding a node (or removing one, or whatever) should not require explicit updating of the various metadata. This should all be handled by trigger functions on the tables themselves. Whether this turns out to be reasonable we shall see.


Operations

I plan to perform the same set of operations (on the same data, rather than randomly generated data) in all models, and compare the complexity and run-time of the various queries. I’m hoping to cover all of the operations that might be performed on a tree structure, so please add any more to the comments.

The data stored in the table will contain more than one tree: this means we can perform operations which add/remove root nodes/whole trees.

Insertions

  • Insert a single leaf node
  • Insert a single root node (all existing root nodes will then point to this)
  • Insert a single node partway through the tree, with the “replaced” node becoming a child of this (and this one keeps it’s children)
  • Insert a single node partway through the tree, with this node’s parent’s existing children all becoming children of the new node.

Removals

  • Remove a single leaf node
  • Remove a single root node (all children of this are promoted to root nodes)
  • Remove a single node partway through the tree: all children of this node then have their grand-parent as their parent.
  • Remove a subtree (a single non-root node and it’s descendants)
  • Remove a whole tree (a single root node and all descendants)
  • Remove all descendants of a specific node (but not the node itself)

Moves

  • Move a subtree from one parent to another
  • Move a single leaf node to a different parent
  • Move a root node into a tree
  • Make a subtree into a tree (turn a node into a root node).
  • Move all children of a node to a different parent

Fetches

  • Fetch all descendants of a given node
  • Fetch the number of descendants of a given node
  • Fetch descendants of a given node to a given depth
  • Fetch the number of descendants of a given node to a given depth
  • Fetch all ancestors of a given node
  • Fetch the number of ancestors of a given node
  • Fetch ancestors of a given node to a given depth
  • Fetch the number of ancestors of a given node to a given depth

    I don’t think this makes any sense.

  • Fetch all leaf nodes
  • Fetch the number of leaf nodes
  • Fetch all root nodes
  • Fetch the number of root nodes
  • Fetch all non-leaf nodes
  • Fetch the number of non-leaf nodes
  • Fetch all non-root nodes
  • Fetch the number of non-root nodes
  • Fetch all non-root, non-leaf nodes
  • Fetch the number of non-root, non-leaf nodes

Aggregating ranges in Python

This is a bit of a follow-up to Aggregating Ranges in Postgres.

Since we don’t have a nice range type in Python, we will just use a tuple that contains a lower and upper bound. We will assume that this is a canonical form: the lower bound is inclusive, the upper bound is non-inclusive. We will also assume (for simplicity) that the 0-th item is always less than the 1-th item in the tuple.

Given a list of these range-tuples, we want to do the two things we saw in the previous post.

  • Aggregate them into the simplest possible array of range-tuples, using a union operation.
  • Determine the parts that are missing.

We will use in in a similar way:

>>> data =[(2,3),(3,4),(11,12),(5,12),(4,5)]
>>> aggregate_union(data)
(2, 12)
>>> missing_ranges(data)
[(None, 2), (12, None)]

Luckily, None sorts before any integer in python, so we will just be able to use the normal sort.

def aggregate_union(data):
    if not data:
      return None

    sorted_data = sorted(data)
    result = sorted_data[0]

    for lower, upper in sorted_data[1:]:
        # If we ever find result[1] is None, we know it covers any
        # other possible values, so we can stop at that point.
        if result[1] is None:
            break

        if lower > result[1]:
            return None  # Or raise an exception, if you want.

        result = (result[0], max(upper, result[1]))

    return result

The alternative, missing_ranges(data) takes cues from the SQL version too.

def missing_ranges(data):
    if not data:
      return (None, None)

    result = []
    # We do a little fancy stuff here: append an extra item that
    # mimics what we were able to use lead for, but in a different
    # way so we can use [i + 1] later.
    sorted_data = sorted(data) + [(None, None)]

    if sorted_data[0][0] is not None:
        # Note: the upper bound here is not quite correct!
        result.append((None, sorted_data[0][1]))

    for i, (lower, upper) in enumerate(sorted_data):
        # Grab the next item in our sorted list. Normally, we would check
        # to see if we are at the end, but we will rely on the break later
        # on: we can always stop processing when the upper bound of our
        # next item is `None`
        next = sorted_data[i + 1]

        if upper < next[0] or (next[0] is None and upper is not None):
            result.append((upper, next[0]))

        # Now, exit before we ever get to a bounds error on getting the next item.
        if next[1] is None:
          break

    return result

However, there is a problem here that is nicely solved by the Range types within postgres:

>>> missing_ranges(data)
[(None, 3), (12, None)]

We need to subtract from the first object’s upper bound. Which is easy with integer tuple-ranges (basically, any discrete range type), but not so much with continuous ranges.

Aggregating Ranges in Postgres

Postgres’ range types are very cool. You can use them to store, as you may guess, a value that is a range. There are several included range types, and it’s possible to create your own range type. For now, we’ll just look at using a simple int4range: although everything in principle could be applied to any range type.

Firstly, a quick discussion of how the range types work.

There is a literal form, and then a functional form:

SELECT '(2,17]'::int4range;

  int4range
 -----------
  [3,18)
 (1 row)

SELECT int4range(2, 17, '(]');

 int4range
-----------
 [3,18)
(1 row)

The first thing you’ll notice is that postgres converts them to canonical form. We can provide the bound-types: inclusive or exclusive upper and lower values. This is the same notation you might see if you have done some mathematics.

You can also omit the upper and/or lower bounds to get a range object that goes to ±Infinity. Finally, you can also have empty ranges.

There are several operations that can be performed on a range. The most interesting one from the perspective of this article is the UNION operation.

SELECT '[3,17)'::int4range + '[10,20)';

 ?column?
----------
 [3,20)
(1 row)

You’ll get an error if your union does not result in a contiguous range. There is no way to store a discontinuous range in postgres (but you could store them in an array of int4range[], for instance).

What about if we want to get the aggregate range from a set of ranges?

SELECT value FROM range_test ;
  value
---------
 [2,3)
 [3,4)
 [11,12)
 [5,12)
 [4,5)
(4 rows)

What we’d like to be able to do is something like:

SELECT aggregate_union(value) FROM range_test;
 aggregate_union
-----------------
     '[2,12)'

Obviously, this would be subject to the same limitation on union: we would get an error (or a NULL) if the aggregate range was not continuous.

Perhaps even more useful might be a function that would show us what ranges are missing from a set of ranges. With the same data above, we might see:

SELECT missing_ranges(value) FROM range_test ;                                   missing_ranges
------------------
 {"(,2)","[12,)"}
(1 row)

Indeed, the latter is probably more useful, and, as it turns out, is simpler to perform.

We’ll start with the aggregate_union though, because it’s fun. It’s also the way I worked out the nicer solution for the last problem.

We need to create a postgres AGGREGATE to, well, aggregate the data from columns in a table. A naïve solution might be:

CREATE FUNCTION _aggregate_union(int4range, int4range)
RETURNS int4range AS $$

SELECT COALESCE(
  $1 + $2, $1, $2
);

$$ LANGUAGE SQL;

CREATE AGGREGATE aggregate_union (int4range) (
  sfunc = _aggregate_union,
  stype = int4range
);

This actually works to some degree: but only if the ranges in the query are already sorted, as each iteration of the aggregation function must result in a valid range:

SELECT aggregate_union(value) FROM (SELECT value FROM range_test ORDER BY value) x;
 aggregate_union
-----------------
 [2,12)
(1 row)

However, if the data is not sorted, it will fail.

Instead, we have to either collect all of the items in an array, sort this, and then attempt to aggregate, or, at each step, aggregate as much as possible, and add the current element to the array if we cannot perform a union.

The simpler of these (but will take more memory) is to stick them all into an array, and then sort and apply the union. We only need to define one new function to do it this way:

CREATE OR REPLACE FUNCTION _aggregate_union(int4range[])
RETURNS int4range AS $$

DECLARE
  _range int4range;
  _current int4range;

BEGIN

  _current := (SELECT x FROM unnest($1) x ORDER BY x LIMIT 1);

  FOR _range IN SELECT unnest($1) x ORDER BY x LOOP
    IF _range && _current OR _range -|- _current THEN
      _current := _current + _range;
    ELSE
      RETURN NULL;
    END IF;
  END LOOP;

  RETURN _current;
  END;

$$ LANGUAGE plpgsql;

CREATE AGGREGATE aggregate_union (int4range) (
  stype = int4range[],
  sfunc = array_append,
  finalfunc = _aggregate_union
);

Lets test it out:

SELECT aggregate_union(value) FROM range_test;
 aggregate_union
-----------------
 [2,12)
(1 row)

Bingo.

But what about the other case? What if we care more about what data is missing?

After spending way too many hours on playing around with this, I hit on the idea of using a window function, lead to get the data from the “next” row in the query.

SELECT
  lower(value),
  upper(value),
  lead(lower(value)) OVER (ORDER BY lower(value) NULLS FIRST)
FROM
  range_test
ORDER BY value;

This gets us most of the way. We can see the ones with upper >= lead indicate there is no gap between the ranges, so we can filter. However, we need to do this with a subquery to be able to get access to the columns correctly:

SELECT upper, lead FROM (
  SELECT
    lower(value),
    upper(value),
    lead(lower(value)) OVER (ORDER BY lower(value) NULLS FIRST)
  FROM
    range_test
  ORDER BY value
) x WHERE upper < lead OR (lead IS NULL AND upper IS NOT NULL);

We can aggregate these into an array, and then prefix with an element if our first object is not infinite-lower-bounded.

CREATE OR REPLACE FUNCTION missing_ranges(int4range[])
RETURNS int4range[] AS $$

DECLARE
  _range int4range;
  _missing int4range[];

BEGIN
  _missing := (SELECT
    array_agg(int4range(upper, lead, '[)'))
    FROM (
      SELECT lower(x), upper(x), lead(lower(x)) OVER (ORDER BY lower(x) NULLS FIRST)
      FROM unnest($1) x ORDER BY lower NULLS FIRST
    ) x
    WHERE upper < lead OR (lead IS NULL AND upper IS NOT NULL)
  );

  _range := (SELECT x FROM unnest($1) x ORDER BY x LIMIT 1);

  IF NOT lower_inf(_range) THEN
    _missing := array_prepend(int4range(NULL, lower(_range), '[)'), _missing);
  END IF;

  RETURN _missing;
END;

$$ LANGUAGE plpgsql;

CREATE AGGREGATE missing_ranges (int4range) (
  sfunc = array_append,
  stype = int4range[],
  finalfunc = missing_ranges
);

All too easy.

It is possible to rewrite this function just using SQL, but it’s not pretty:

SELECT array_agg(value)
FROM
  (SELECT value
   FROM
     (SELECT int4range(UPPER, lead, '[)') AS value
      FROM
        (SELECT NULL AS LOWER,
                NULL::integer AS UPPER,
                LOWER(a.value) AS lead
         FROM range_test a
         ORDER BY a.value LIMIT 1) w
      WHERE lead IS NOT NULL
      UNION SELECT int4range(UPPER, lead, '[)')
      FROM
        (SELECT LOWER(b.value),
                UPPER(b.value),
                lead(LOWER(b.value)) OVER (ORDER BY LOWER(b.value) NULLS FIRST)
         FROM range_test b
         ORDER BY b.value) x
      WHERE UPPER IS NOT NULL
        AND (LEAD IS NULL OR UPPER < lead)
      ) y
   ORDER BY value) z;

I haven’t tried to see which is faster.

Misfit Shine

iOS 8’s HealthKit is a really nice feature. Prior to this, I was using Azumio’s Argus app for “life-tracking”, but it was really not providing me with much. It showed my my daily steps, included my exercise from my Garmin Forerunner 610, and helped me track my water, coffee and alcohol intake. It integrated really well with a Heart Rate recording app that they also make, but there was no real way to do anything else with this data. I was able to take photographs of my food, but that’s really it.

Along comes HealthKit.

All of a sudden, I’m no longer tied to a single application, or suite of applications that are engineered to work together. Instead, I can use Strava for my run tracking (still via my Garmin, or directly on the phone if I left it at home). These workouts, including calorie estimations, are available in HealthKit.

Similarly, step counts directly from my iPhone are also available, as are Heart Rate readings, either from the app I was using before, or an alternative. I can also add in extra values, for instance my max HR from a workout. As an aside, it would be really nice if Strava could push this value, removing the need for me to manually do it.

Even better for me would be the ability for Strava to send step counts for a workout: I usually run with my FootPod, which gives cadence information. From this, it’s possible to infer the number of steps I took in that workout. Because of this missing feature, I wound up taking my iPhone when I run, a less than optimal situation.

A couple of weeks ago, a co-worker introduced me to myfitnesspal. This is a calorie tracking app and website. Since I have (had?) about 10kg to lose, and something that makes getting relatively accurate calorie counts on the food I’m eating (although lots of the data in there is really, really bad: I’m forever looking at the nutritional breakdown of a suggested match and in some cases adding my own versions), in conjuction with the calorie usage of my excercise has become simple. Even fun.

I’m less concerned about trying out different apps on the iPhone, simply because the data (or most of it) just goes into HealthKit, which means if a better alternative comes along, I can just switch. I’m unsure if HealthKit data is restored from a backup - that’s something that I’ll find out eventually when I upgrade my phone, I guess.

Using my iPhone for step tracking means I need to have it in my pocket all of the time. This is actually really bad, as I find myself taking it out and using it more than I (or my family) would like. So, last night, I bought a Misfit Shine.

To be honest, the main reason is so that I get my phone out less, but I’m also looking forward to not taking my phone running. However, I’ve become quite attached to listening to podcasts whilst running. May have to find the little iPod nano (the little square one) I have floating around: but I’ll want to investigate the podcast syncing, since the other place I listen to them is in the car (via Bluetooth).

Having had the Misfit Shine for less than 24 hours, it’s still too early to discuss if it’s going to work for me. I do have some early comments though.

  • It’s pretty. The nice matt grey finish is unobtrusive and classy.

  • The magnet is nice and strong. I haven’t clipped it on anywhere yet, but it feels like it may stay attached while running. I’ve only used the watchband style fastener. I

  • Syncing is not automatic, even with “Automatic Sync” enabled. Indeed, I have found syncing somewhat intermittent. I’m not sure, but I think I need to tap the Shine as well as the iPhone to get it to sync. That’s a bit shit. Even if the syncing was done through a Notification Area widget, that would be an improvement. If you are only using the Shine itself for your tracking, it would not be so necessary, but since I use that data within myfitnesspal (and other HealthKit related apps), I want it updated as frequently as possible. Every minute would be great, hourly would be acceptable.

    (It seems that you need to put the Shine on the phone to sync. That would explain my issues with it, but that’s even shitter than I thought. I’m not really prepared to take a device off to sync it. Hopefully this is something that improves, although I doubt it.)

  • Sleep tracking seems reasonable, but it should be possible to trim sleep without removing the “sleep quality” data. I was using an app-based sleep tracker, but that made my phone really, really hot. And it kept falling off the bed.

    Since then, I’ve been using “Lark”, which really drives me batty with it’s cute dialog, but seems to be the only thing that will use device motion to guess when you have been sleeping (assuming you put your phone down immediately before sleep, and move it as soon as you wake). Unfortunately, I’ll need to keep that going, or manually transfer data from Misfit to HealthKit.

  • Step count data is pushed to HealthKit. No other HealthKit integration is present. It really needs Sleep data to be pushed; Weight and Dietary Calorie ideally should be fetched from HealthKit too. There is integration with myfitnesspal, but I’d like to be able to switch Dietary Calorie tracker if necessary in the future. I suspect that HealthKit is something that really could be improved in software, fairly easily.

  • The “clock” feature is kind-of cool. Except the “hour” light does not change when the time is after half-past. I kept thinking the time was quarter to nine, when it was quarter to ten. Luckily, it’s Saturday.

  • It’s also waterproof, which means I don’t have to take it off to give the boys a bath, which is nice.

There’s a fairly reasonable review from over a year ago at A weekend with Misfit’s Shine: this makes me suspect that the syncing may not improve much.

I’m not unhappy with the device so far. I’d really like to get more integration with HealthKit, so hopefully that will be forthcoming.

Horizontal Partitioning in Postgres

It never surprises me when I find another neat feature of Postgres that makes doing a potentially difficult task simple. Today, I discovered that since 9.0, Postgres has supported a really powerful way to horizontally partition data into separate tables.

For those who haven’t heard of the concept before, horizontal partitioning is where different rows are stored in different tables, depending upon something about the data within the row.

For instance, we could partition audit data into tables based upon the timestamp of the action. Thus, all entries created during October, 2014 would be stored in a table called audit_2014_10, and entries created during March 2011 would be stored in a table called audit_2011_03. And so on. Alternatively, we could have a single table per year, or however we want to partition.

This is called “Range” partitioning.

There are a couple of ways you could think about doing this in a DBMS. You could have a writable view that redirects the writes to the correct table. The problem then is that when you add a new child table, you need to rewrite your view.

Instead, we can use Postgres’ neat table inheritance to handle all of this for you. Indeed, it is discussed in the Postgres documentation.

If we inherit one table from another, and do a query on the parent table, we will also get the rows that match the query from all child tables. That obviates the need for a view that uses UNION ALL or similar to fetch the data.

I’m going to use a toy example here, that just contains a single column.

CREATE TABLE "data" ("value" TIMESTAMPTZ);

CREATE TABLE "data_2014" (
  CHECK (
    "value" >= '2014-01-01' AND "value" < '2015-01-01')
) INHERITS ("data");

CREATE TABLE "data_2015" (
  CHECK (
    "value" >= '2015-01-01' AND "value" < '2016-01-01')
) INHERITS ("data");

This, however, is only part of the picture. Any data that is added to either of the child tables (or indeed the parent table, but we won’t be doing that), will be returned when we query the parent table.

But we want to ensure that data is partitioned out nicely. For that, we can use a trigger.

A naïve trigger may look something like:

CREATE OR REPLACE FUNCTION data_insert_trigger()
RETURNS TRIGGER AS $$

BEGIN

  IF (NEW.value >= '2014-01-01' AND NEW.value < '2015-01-01') THEN
    INSERT INTO data_2014 VALUES (NEW.*);
  ELSIF (NEW.value >= '2015-01-01' AND NEW.value < '2016-01-01') THEN
    INSERT INTO data_2015 VALUES (NEW.*);
  ELSE
    RAISE EXCEPTION 'Date out of range. Please fix the data_insert_trigger() function.';
  END IF;

  RETURN NULL;
END;

$$ LANGUAGE plpgsql;

As you can see by the ELSE clause, we will actually need to do maintainence on this function as we start to get data that falls outside of our existing ranges. We will also need to create a new table for those rows.

It would be nice if we could automatically create tables that are missing, and handle any arbitrary values.

CREATE OR REPLACE FUNCTION data_insert_trigger()
RETURNS TRIGGER AS $$

DECLARE
  table_name text;
  year integer;
  start text;
  finish text;

BEGIN
  year := date_part('year', NEW.value);
  table_name := 'data_' || year;
  start := year || '-01-01';
  finish := (year + 1) || '-01-01';

  PERFORM 1 FROM pg_tables WHERE tablename = table_name;

  IF NOT FOUND THEN
    EXECUTE
      'CREATE TABLE '
      || quote_ident(table_name)
      || ' (CHECK ("value" >= '
      || quote_literal(start)
      || ' AND "value" < '
      || quote_literal(finish)
      || ')) INHERITS (data)';
  END IF;

  EXECUTE
    'INSERT INTO '
    || quote_ident(table_name)
    || ' VALUES ($1.*)'
  USING NEW;

  RETURN NULL;
END;

$$ LANGUAGE plpgsql;

CREATE TRIGGER data_insert_trigger
BEFORE INSERT ON data
FOR EACH ROW EXECUTE PROCEDURE data_insert_trigger();

You would also want to create any indexes on the child tables, as this is where they need to be, rather than the parent table.

This function is pretty neat: it first stores what the table name should be in a variable, as well as the two bounds for this table (start and finish). Then, we see if that table exists, and if not, create it. Finally, we then insert the values into the correct child table. I’m not sure I’d recommend using it as-is: it’s quite possibly subject to a race condition if two new records came in at the same time.

The one thing that was concerning me was that DDL changes to the parent table would not propagate to the child tables: however this turned out to not be an issue at all. Since I mostly use Django, I want as little hard stuff that would require custom migration operations.

ALTER TABLE data ADD COLUMN confirmed BOOLEAN DEFAULT false;

The other thing worth noting is that Postgres will do a really good job of limiting the tables that are accessed to those that contain the relevant data:

INSERT INTO data VALUES
  ('2014-01-06 09:00:00'),
  ('2015-01-09 12:00:00'),
  ('2016-02-22 15:39:00');

EXPLAIN
SELECT * FROM data
WHERE value > '2014-01-01'
AND value < '2014-07-01';
                                      QUERY PLAN
---------------------------------------------------------------------------------------
 Append  (cost=0.00..42.10 rows=12 width=8)
   ->  Seq Scan on data  (cost=0.00..0.00 rows=1 width=8)
         Filter: ((value > '2014-01-01 00:00:00'::timestamp with time zone) AND
                  (value < '2014-07-01 00:00:00'::timestamp with time zone))
   ->  Seq Scan on data_2014  (cost=0.00..42.10 rows=11 width=8)
         Filter: ((value > '2014-01-01 00:00:00'::timestamp with time zone) AND
                  (value < '2014-07-01 00:00:00'::timestamp with time zone))
 Planning time: 0.292 ms
(6 rows)

We see that only the data_2014 table is hit by the query. If your constraints do something like cast to DATE, then this may not happen. This was causing me some concern earlier, but letting Postgres coerce the data types fixed it.

However, you can’t use a tstzrange to query if you want these constraints to help the query planner:

-- Actually hits every table.
SELECT * FROM data WHERE value <@ '[2014-01-01, 2014-07-01)'::tstzrange;

It’s worth noting that if you change a value that would cause that row to belong in a different partition, this will fail.

There are moves afoot to have this feature more tightly integrated into Postgres, perhaps using a syntax like:

CREATE TABLE data (value TIMESTAMPTZ)
PARTITION BY RANGE (value)
(PARTITION data_2014 VALUES LESS THAN '2015-01-01');

CREATE PARTITION data_2015 ON data VALUES LESS THAN '2016-01-01';

It’s not clear to me how you actually define the range. It also seems counter-productive to have to manually create the partition tables.

There are also tools that might be useful to handle the heavy lifting, like pg_partman. I haven’t used this, but it looks interesting.

Boiled Chocolate Cake

Boil together:

  • 125g butter
  • 1 cup water
  • 1 1/2 cups sugar
  • 2 tablespoons cocoa
  • 1/2 teaspon bicarb soda

Simmer 5 mins.

Add:

  • 2 eggs
  • 1 1/2 cups self raising flour

Bake about 1/2 an hour at 180°C in 2 log tins or a round tin.

Don’t eat it all at once.

iOS 8 knows what you want to say

Had a mind-blowing moment when messaging using iOS 8 tonight.

Me: Can you get some more coffee beans please?

Her: Ok, caf or decaf?

This is where it gets awesome. The three choices I had in my autocomplete bar were:

  1. Ok
  2. Caf
  3. Decaf

So, it apparently parses messages, and sets the choices accordingly.

I only wish I had a screenshot. That would have been less explaining.