Postgres Tree Shootout part 1: Introduction.

2014-11-22 @ 20:45:50

here.

I’ve written before about using Adjacency Lists, and even done some performance tests on querying them. Whilst reading a post today, it occurred to me that it might be worthwhile to do a comparison of the various methods of storing hierarchical data in Postgres, and the costs of the same operations on each of those.

This post is just an introduction, with an outline of what I plan to do to run these tests. Please feel free to suggest things that I have missed, or that might be an oversight at my end.

Tree Models

There are four methods of storing the relationships that might form a tree. This analysis will be limited to actual tree, rather than graph structures (no cycles). I plan to detail the data structures in a series of posts, one per method. In each case, where there are multiple ways to store the data, I will attempt to examine each of these.

Adjacency Lists, being the simplest to understand (and the ones I have spent more time on recently), will be discussed first.

Path Enumerations will be next, with a comparison of storing the data using the ltree extension, and using an ARRAY column.

Following this, I’ll make an attempt at using the Closure Table model: where each ancestor-descendant relationship is stored, rather than just the parent-child relationship.

Finally, I’ll have a crack at the Nested Set model. I’m not solidly behind this model for the types of data I’ve had to deal with, but it is a valid mechanism for storing and retrieving this data. Besides, it will be an interesting exercise to implement.

My plan to handle all of these is that all tree manipulation should be “automatic”, that is, adding a node (or removing one, or whatever) should not require explicit updating of the various metadata. This should all be handled by trigger functions on the tables themselves. Whether this turns out to be reasonable we shall see.

Operations

I plan to perform the same set of operations (on the same data, rather than randomly generated data) in all models, and compare the complexity and run-time of the various queries. I’m hoping to cover all of the operations that might be performed on a tree structure, so please add any more to the comments.

The data stored in the table will contain more than one tree: this means we can perform operations which add/remove root nodes/whole trees.

Insertions

Insert a single leaf node
Insert a single root node (all existing root nodes will then point to this)
Insert a single node partway through the tree, with the “replaced” node becoming a child of this (and this one keeps it’s children)
Insert a single node partway through the tree, with this node’s parent’s existing children all becoming children of the new node.

Removals

Remove a single leaf node
Remove a single root node (all children of this are promoted to root nodes)
Remove a single node partway through the tree: all children of this node then have their grand-parent as their parent.
Remove a subtree (a single non-root node and it’s descendants)
Remove a whole tree (a single root node and all descendants)
Remove all descendants of a specific node (but not the node itself)

Moves

Move a subtree from one parent to another
Move a single leaf node to a different parent
Move a root node into a tree
Make a subtree into a tree (turn a node into a root node).
Move all children of a node to a different parent

Fetches

Fetch all descendants of a given node
Fetch the number of descendants of a given node
Fetch descendants of a given node to a given depth
Fetch the number of descendants of a given node to a given depth
Fetch all ancestors of a given node
Fetch the number of ancestors of a given node
Fetch ancestors of a given node to a given depth
~~Fetch the number of ancestors of a given node to a given depth~~
I don’t think this makes any sense.
Fetch all leaf nodes
Fetch the number of leaf nodes
Fetch all root nodes
Fetch the number of root nodes
Fetch all non-leaf nodes
Fetch the number of non-leaf nodes
Fetch all non-root nodes
Fetch the number of non-root nodes
Fetch all non-root, non-leaf nodes
Fetch the number of non-root, non-leaf nodes