Closure Tables for Browsing Trees in SQL
The old "parent" problem in tree-based data sets is an old one. The question goes, "how do you get a set of data with parents and children quickly, without having to write a recursive or massive nested join query?" The most common use cases for this are page structure, folders, or threaded comments.
Nested Sets are one solution, but can often be heavy and inflexible. One of my preferred ways of grabbing such data is to use a structure called a "Closure Table", in which you store all paths from one node in the tree to another. This solution allows for node traversals of the tree in O(N) time or less. The tradeoff is obviously space, as you are now storing quite a bit more data in terms of pathing, but this is generally preferred.
Getting Started
The setup is simple. We'll use a threaded comments system as an example. Let's assume we have the following tables (simplified for this tutorial):
comment
- id
- text
- parent
- rank
comment_closure
- ancestor
- descendant
- depth
Our comments table has simply an auto-generated ID of the comment, the text, and the parent comment ID. It also has a "rank", which we'll get to later. Next is the closure table for the comments, in which we store "ancestor" (the parent), "descendant" (the child), and "depth", which is the depth level from the root of the tree.
Let's say we have a tree of comments like this:
1
| - 2
| - 3
| - 6
| - 5
| - 4
Say we want to grab comment 2 and all its children - in one query. Doing a straight order by ID won't work, as the 6th comment was posted after the 5th, but is at a different place in the tree browsing. A createdby won't work either. So, this is where our closure tables come in.
The Closure Table
Our closure data for this tree looks something like this:
INSERT INTO `comments_closure` (`ancestor`, `descendant`, `depth`) VALUES
(1, 1, 0),
(0, 1, 0),
(2, 2, 0),
(1, 2, 1),
(0, 2, 1),
(3, 3, 0),
(1, 3, 2),
(2, 3, 1),
(0, 3, 2),
(4, 4, 0),
(1, 4, 1),
(0, 4, 1),
(5, 5, 0),
(1, 5, 2),
(2, 5, 1),
(0, 5, 2),
(6, 6, 0),
(1, 6, 3),
(2, 6, 2),
(3, 6, 1),
(0, 6, 3);
```
You can see the ancestry in the table here - each comment has a full mapping of how it relates to every other comment in its lineage. So, now we can write our query:
```
SELECT `id`,`parent` FROM `comments` `comment`
JOIN `comments_closure` `closure`
ON `comment`.`id` = `closure`.`descendant`
WHERE `closure`.`ancestor` = 2
```
This returns us this nice table:
id parent
2 1
3 2
5 2
6 3
````
Great! Except one thing. The comments aren't sorted; 6 should be before 5. Well, that makes it hard to read through a comment thread, for sure. We need rank.
Ranking
Rank is actually a tricky thing to do; you can't just insert a number in a column and make it the rank - doing so would mean that every time a deep comment gets inserted, you'd have to readjust the rank for every comment in the thread. In large threads, this could take a loooong time, and generally isn't a smart way of doing it.
I prefer a neat little solution that relies on string padding. To show you, I'm going to calculate the rank of the comment ID 6, which is nice and deep in the thread (3 levels). To start, we'll need a field on the comments
table called "rank", and let's make it a TINYTEXT type.
We'll assume that the very root comment has its rank already filled out. The format we're going to use for rank for that comment is:
0000000001
Basically I'm just padding the ID of the comment up to 10 spaces. You can make it higher, if you want - it's an arbitrary padding - but 10 is mostly enough to fit most people's requirements. We'll also assume every other comment has its rank sorted out (except our ID 6 comment), and they should look like this:
id rank
1 0000000001
2 0000000001-0000000002
3 0000000001-0000000002-0000000003
4 0000000001-0000000004
5 0000000001-0000000002-0000000005
Note how we're basically mapping out the ancestry of the node with the rank, but using the auto-incremented ID to our benefit. We know that a lower ID means that it was created earlier, and we want to display earlier comments above later comments, so we can use the auto-ID to determine rank. (If you aren't using an auto-ID, you can replace this with some other enumerated, rankable field.) Secondly, we're separating them with a dash. You can use whatever delimiter you want here.
Back to ID 6. Every time we save that comment and change its parent, we'll need to update it's rank. To do so, we'll cheat and grab it's parent's rank. We have that stored on the comments
table, and therefore in our comment with ID 6, so let's assume we have that ID of the parent (which here is 3):
SELECT `rank` FROM `comments` WHERE `id` = 3
Now that will give us this:
0000000001-0000000002-0000000003
And we can just postfix that with -0000000006
0000000001-0000000002-0000000003-0000000006
And viola! We have a rank for our field. Now we can run our query from earlier to get a sorted result with a couple modifications:
SELECT `id`,`parent`,`rank` FROM `comments` `comment`
JOIN `comments_closure` `closure` ON `comment`.`id` = `closure`.`descendant`
WHERE `closure`.`ancestor` = 2
ORDER BY `rank` ASC
And get our result:
id parent rank
2 1 0000000001-0000000002
3 2 0000000001-0000000002-0000000003
6 3 0000000001-0000000002-0000000003-0000000006
5 2 0000000001-0000000002-0000000003
That's it! Closure tables provide a powerful lookup solution for large trees, which is used in a variety of threaded content applications. They push off the calculation and storage costs to destructive (POST/PUT) methods, and let the READ methods be nice and fast with minimal overhead. There's a lot more to this (such as paginating properly over sub-threads, the insert and remove methods, recalculating rank), but this tutorial should get you started.
Written by Shaun McCormick
Related protips
6 Responses
Hi
Why are you padding the rank?
Why can you not just use 1-2-3-6
Regards?
@freshface: I'm not entirely sure, but here are some ideas:
- padding it easier to write queries like: "%0000000004%" whereas the same thing without padding would require some thing like: "4" OR "4-%" OR "%-4-%" OR "%-4" to prevent any id with 4 in it to match under certain circumstances.
- padding makes it simpler and faster to select out the specific ids: rank.substr(0, 10) instead of rank.substr(0, rank.find('-') or len(rank)).
I think the output of ordering by rank is incorrect. Why it isn't outputted as:
...
3 2 0000000001-0000000002-0000000003
5 2 0000000001-0000000002-0000000003
6 3 0000000001-0000000002-0000000003-0000000006
Thank you so much for this article! One quick question though... what is the purpose of the comments_closure entries with a ancestor id of '0' such as (0, 1, 0)? I am still trying to wrap my head around closure tables, and many of the examples I have seen elsewhere do not have those rows. I get why you need the self references with a depth of '0", but the ancestor '0' rows seem redundant. I feel like I am missing something... is there some advantage to having root level rows like that?
The closing </pre>
tag in your first example is broken and makes the rest of the article nigh-unreadable. Please fix, because this is the best one about Closure Tables.
Hi,
thanks for the very useful article.
If I understood everything correctly, there seem to be two typos:
1) The rank of comment "5" is "0000000001-0000000002-0000000005", when first mentioned. In the result, it is listed as "0000000001-0000000002-0000000003", but still showing last. If it would still be listed as "0000000001-0000000002-0000000005", the sorting order would make sense and aireol would be happy. ;-)
2) The last entry of the closure table should be "(0, 6, 4);" instead of "(0, 6, 3);"
Best regards