Last Updated: August 01, 2022

PostgreSQL: Flatten hierarchy in one query

Storing data in a hierarchy allows for flexibility and cleanliness at once. A simple parent-child relationship, much like how files and directories are laid-out on a file system or in cloud storage.

A basic hierarchy would look like this, using populations as our data type (think classifications of creatures):

create table populations (
  id        integer not null,
  parent_id integer null
  name      character varying not null
)

id  parent_id  name
-------------------
1   NULL       Clam
2   1          Giant
3   2          Boring
4   NULL       Moray Eel
5   4          Green
6   4          White-Eyed
7   5          Giant

This allows someone to easily add creatures and sub-classifications all they like without having many tables with similar or the same structure.

Now it comes to querying. You have to answer requests like:

How many kinds of clam are there?
Show me all types of creatures with their full "path"
Show all related records that are any kind of Moray Eel

You can easily find yourself writing application code that causes N+1 queries, building big lists, nested hashes, strings, etc, all while degrading performance and decreasing code legibility.

Thanks to recursive common table expression in PostgreSQL we can solve most problems in a single query.

Here's the full query, and then we'll break it down:

WITH RECURSIVE pops (id, level, name, name_path) AS (
    SELECT  id, 0, name, ARRAY[name]
    FROM    populations
    WHERE   parent_id is null

    UNION ALL

    SELECT  p.id, t0.level + 1, p.name, ARRAY_APPEND(t0.name_path, p.name)
    FROM    populations p
            INNER JOIN pops t0 ON t0.id = p.parent_id
)

SELECT  id, level, name_path[1] AS category, ARRAY_TO_STRING(name_path, ' > ')
FROM    pops

This starts with WITH RECURSIVE which tells PostgreSQL we're introducing a recursive CTE. pops is the alias I've given it, followed by a list of fields that will be in the table. We don't specify types; they're copied from the "root" query.

Next, we have a simple SELECT statement. This is where we seed our recursive CTE. We pick the rows from the underlying table that represent the root data: data that has no parent. I also added a few bits of data here, level to indicate how deep a record is nested, and then initialised an ARRAY that we can build on to generate paths for child records. This query is executed only once.

Next, there's UNION ALL statement to indicate we're going to start the recursive query. The next SELECT statement is where we find child records. You see here that we join over to pops, which is the current set of records for that CTE. This is where we're adding all the child records. This query is executed until there are no results returned.

A couple things we're doing here to make things easier and cleaner:

Using ARRAY_APPEND, we're adding the name of this element onto the names from previous elements. This allows us to construct the path later however we see fit.
We increase the level by 1. It may not matter to you, but there are cases where knowing the level can make things easier, or indicate a problem with your data depending on you business needs

Lastly, we query the pops CTE. Using name_path[1] we can indicate a "category" of the record. This is the name of the root nodes. Then. ARRAY_TO_STRING concatenates the array elements using the second argument as the separator. Just like a join method you'd find in JavaScript or Ruby. You could pass this array directly to the application and let it join/concat values as needed too.

Our results!

id level category   path
1  0     Clam       Clam
2  1     Clam       Clam > Giant
3  1     Clam       Clam > Boring
4  0     Moray Eel  Moray Eel
5  1     Moray Eel  Moray Eel > Green
6  1     Moray Eel  Moray Eel > White-Eyed
7  2     Moray Eel  Moray Eel > Green > Giant

This approach has the benefit of good performance, flexibility, and once you understand CTEs, it's quite approachable. Feel free to ask questions in the comments and I'll help if I'm able.