The old "parent" problem in tree-based data sets is an old one. The question goes, "how do you get a set of data with parents and children quickly, without having to write a recursive or massive nested join query?" The most common use cases for this are page structure, folders, or threaded comments.
Nested Sets are one solution, but can often be heavy and inflexible. One of my preferred ways of grabbing such data is to use a structure called a "Closure Table", in which you store all paths from one node in the tree to another. This solution allows for node traversals of the tree in O(N) time or less. The tradeoff is obviously space, as you are now storing quite a bit more data in terms of pathing, but this is generally preferred.
The setup is simple. We'll use a threaded comments system as an example. Let's assume we have the following tables (simplified for this tutorial):
comment - id - text - parent - rank comment_closure - ancestor - descendant - depth
Our comments table has simply an auto-generated ID of the comment, the text, and the parent comment ID. It also has a "rank", which we'll get to later. Next is the closure table for the comments, in which we store "ancestor" (the parent), "descendant" (the child), and "depth", which is the depth level from the root of the tree.
Let's say we have a tree of comments like this:
1 | - 2 | - 3 | - 6 | - 5 | - 4
Say we want to grab comment 2 and all its children - in one query. Doing a straight order by ID won't work, as the 6th comment was posted after the 5th, but is at a different place in the tree browsing. A createdby won't work either. So, this is where our closure tables come in.
The Closure Table
Our closure data for this tree looks something like this:
INSERT INTO `comments_closure` (`ancestor`, `descendant`, `depth`) VALUES (1, 1, 0), (0, 1, 0), (2, 2, 0), (1, 2, 1), (0, 2, 1), (3, 3, 0), (1, 3, 2), (2, 3, 1), (0, 3, 2), (4, 4, 0), (1, 4, 1), (0, 4, 1), (5, 5, 0), (1, 5, 2), (2, 5, 1), (0, 5, 2), (6, 6, 0), (1, 6, 3), (2, 6, 2), (3, 6, 1), (0, 6, 3);
You can see the ancestry in the table here - each comment has a full mapping of how it relates to every other comment in its lineage. So, now we can write our query:
SELECT `id`,`parent` FROM `comments` `comment` JOIN `comments_closure` `closure` ON `comment`.`id` = `closure`.`descendant` WHERE `closure`.`ancestor` = 2
This returns us this nice table:
id parent 2 1 3 2 5 2 6 3
Great! Except one thing. The comments aren't sorted; 6 should be before 5. Well, that makes it hard to read through a comment thread, for sure. We need rank.
Rank is actually a tricky thing to do; you can't just insert a number in a column and make it the rank - doing so would mean that every time a deep comment gets inserted, you'd have to readjust the rank for every comment in the thread. In large threads, this could take a loooong time, and generally isn't a smart way of doing it.
I prefer a neat little solution that relies on string padding. To show you, I'm going to calculate the rank of the comment ID 6, which is nice and deep in the thread (3 levels). To start, we'll need a field on the
comments table called "rank", and let's make it a TINYTEXT type.
We'll assume that the very root comment has its rank already filled out. The format we're going to use for rank for that comment is:
Basically I'm just padding the ID of the comment up to 10 spaces. You can make it higher, if you want - it's an arbitrary padding - but 10 is mostly enough to fit most people's requirements. We'll also assume every other comment has its rank sorted out (except our ID 6 comment), and they should look like this:
id rank 1 0000000001 2 0000000001-0000000002 3 0000000001-0000000002-0000000003 4 0000000001-0000000004 5 0000000001-0000000002-0000000005
Note how we're basically mapping out the ancestry of the node with the rank, but using the auto-incremented ID to our benefit. We know that a lower ID means that it was created earlier, and we want to display earlier comments above later comments, so we can use the auto-ID to determine rank. (If you aren't using an auto-ID, you can replace this with some other enumerated, rankable field.) Secondly, we're separating them with a dash. You can use whatever delimiter you want here.
Back to ID 6. Every time we save that comment and change its parent, we'll need to update it's rank. To do so, we'll cheat and grab it's parent's rank. We have that stored on the
comments table, and therefore in our comment with ID 6, so let's assume we have that ID of the parent (which here is 3):
SELECT `rank` FROM `comments` WHERE `id` = 3
Now that will give us this:
And we can just postfix that with
And viola! We have a rank for our field. Now we can run our query from earlier to get a sorted result with a couple modifications:
SELECT `id`,`parent`,`rank` FROM `comments` `comment` JOIN `comments_closure` `closure` ON `comment`.`id` = `closure`.`descendant` WHERE `closure`.`ancestor` = 2 ORDER BY `rank` ASC
And get our result:
id parent rank 2 1 0000000001-0000000002 3 2 0000000001-0000000002-0000000003 6 3 0000000001-0000000002-0000000003-0000000006 5 2 0000000001-0000000002-0000000003
That's it! Closure tables provide a powerful lookup solution for large trees, which is used in a variety of threaded content applications. They push off the calculation and storage costs to destructive (POST/PUT) methods, and let the READ methods be nice and fast with minimal overhead. There's a lot more to this (such as paginating properly over sub-threads, the insert and remove methods, recalculating rank), but this tutorial should get you started.