PostgreSQL: Flatten hierarchy in one query
Storing data in a hierarchy allows for flexibility and cleanliness at once. A simple parent-child relationship, much like how files and directories are laid-out on a file system or in cloud storage.
A basic hierarchy would look like this, using populations as our data type (think classifications of creatures):
create table populations (
id integer not null,
parent_id integer null
name character varying not null
)
id parent_id name
-------------------
1 NULL Clam
2 1 Giant
3 2 Boring
4 NULL Moray Eel
5 4 Green
6 4 White-Eyed
7 5 Giant
This allows someone to easily add creatures and sub-classifications all they like without having many tables with similar or the same structure.
Now it comes to querying. You have to answer requests like:
- How many kinds of clam are there?
- Show me all types of creatures with their full "path"
- Show all related records that are any kind of Moray Eel
You can easily find yourself writing application code that causes N+1 queries, building big lists, nested hashes, strings, etc, all while degrading performance and decreasing code legibility.
Thanks to recursive common table expression in PostgreSQL we can solve most problems in a single query.
Here's the full query, and then we'll break it down:
WITH RECURSIVE pops (id, level, name, name_path) AS (
SELECT id, 0, name, ARRAY[name]
FROM populations
WHERE parent_id is null
UNION ALL
SELECT p.id, t0.level + 1, p.name, ARRAY_APPEND(t0.name_path, p.name)
FROM populations p
INNER JOIN pops t0 ON t0.id = p.parent_id
)
SELECT id, level, name_path[1] AS category, ARRAY_TO_STRING(name_path, ' > ')
FROM pops
This starts with WITH RECURSIVE
which tells PostgreSQL we're introducing a recursive CTE. pops
is the alias I've given it, followed by a list of fields that will be in the table. We don't specify types; they're copied from the "root" query.
Next, we have a simple SELECT
statement. This is where we seed our recursive CTE. We pick the rows from the underlying table that represent the root data: data that has no parent. I also added a few bits of data here, level
to indicate how deep a record is nested, and then initialised an ARRAY
that we can build on to generate paths for child records. This query is executed only once.
Next, there's UNION ALL
statement to indicate we're going to start the recursive query. The next SELECT
statement is where we find child records. You see here that we join over to pops
, which is the current set of records for that CTE. This is where we're adding all the child records. This query is executed until there are no results returned.
A couple things we're doing here to make things easier and cleaner:
- Using
ARRAY_APPEND
, we're adding the name of this element onto the names from previous elements. This allows us to construct the path later however we see fit. - We increase the
level
by 1. It may not matter to you, but there are cases where knowing the level can make things easier, or indicate a problem with your data depending on you business needs
Lastly, we query the pops
CTE. Using name_path[1]
we can indicate a "category" of the record. This is the name of the root nodes. Then. ARRAY_TO_STRING
concatenates the array elements using the second argument as the separator. Just like a join
method you'd find in JavaScript or Ruby. You could pass this array directly to the application and let it join/concat values as needed too.
Our results!
id level category path
1 0 Clam Clam
2 1 Clam Clam > Giant
3 1 Clam Clam > Boring
4 0 Moray Eel Moray Eel
5 1 Moray Eel Moray Eel > Green
6 1 Moray Eel Moray Eel > White-Eyed
7 2 Moray Eel Moray Eel > Green > Giant
This approach has the benefit of good performance, flexibility, and once you understand CTEs, it's quite approachable. Feel free to ask questions in the comments and I'll help if I'm able.