Data structure: The table has an PK 'id' and a jsonb column 'data'. The 'data' contains an array of objects 'instances'. Each 'instance' has some values and a 'path' array. The 'path' array is a flat representation of a deeply nested tree like hierarchical structure. Each 'path' consists of objects that have a string 'id', which is not unique, and an integer 'index' which is only unique on the same level (the same parent structure), but can repeat on different levels.
Example:
"instances": [
{
"path": [
{"id": "root", "index": 2},
{"id": "folder1", "index": 0},
{"id": "folder2", "index": 0},
{"id": "folder3", "index": 0}
],
"pdf": "pdf in 1,2,3",
"info": "some other data"
},
...
]
I need to be able to search for multiple specific values within the hierarchy of the same specified folder in a Postgres jsonb table.
For example, search for an item that has both "pdf in 1,2,3" AND "text in 1,2,3" values within the hierarchy of the same folder2 (meaning the folder2 within the same parent structure).
Here's the query I came up with:
WITH indexed_paths AS (
SELECT id, instance -> 'index' as inst_idx,
MIN(CASE WHEN path_element @> '{"id": "folder2"}' THEN path_idx END)
OVER (PARTITION BY id, instance -> 'index') AS searched_index,
path_idx, path_element, instance
FROM "flat",
jsonb_array_elements(data -> 'instances') AS instance,
jsonb_array_elements(instance -> 'path') WITH ORDINALITY arr(path_element, path_idx)
WHERE
instance -> 'path' @> '[{"id": "folder2"}]'
ORDER BY id, inst_idx, path_idx
), combined_paths AS (
SELECT id, jsonb_agg(path_element) as path, instance
FROM indexed_paths
WHERE path_idx <= searched_index
GROUP BY id, inst_idx, instance
), combined_instances AS (
SELECT id, path, jsonb_agg(instance) as instances
FROM combined_paths
GROUP BY id, path
)
SELECT *
FROM "flat" f
WHERE EXISTS (
SELECT 1
FROM combined_instances ci
WHERE
ci.id = f.id
AND ci.instances @> '[{"pdf": "pdf in 1,2,3"}, {"jpg": "jpg in 1,2,3"}]'::jsonb
);
-
indexed_paths CTE expands each row's instances and each path of each instance into a separate row of 'path_element's, enumerating all the path_element's with indexes. If the path_element is the searched one, it takes it's index and writes it to a new column of all the rows with a matching id and instance index. if there are multiple occurrences of the searched path_element within the same id and instance index it takes the smallest one.
combined_paths CTE aggregates path_elements grouping them by id's and instance index, checking if the element_path index is <= to the searched index, this way reconstructing element_paths back but only up to the searched path_element.
combined_instances CTE aggregates the data of all instances by matching id's and reconstructed paths.
final SELECT statement is searching for the specified data inside the combined_instances and joins it with the original table by id's.*
It works exactly the way I want, but it's just too verbose and long. Is there any way to simplify it? Changing the data structure of the jsonb column is an option. Some other algotythm is also welcome. Basically any haelp would be gretly appreciated.
I'd try to use subqueries and lateral joins more instead of expanding and re-grouping rows in multiple CTEs:
(online demo of this and the below approaches)
or with
EXISTS, if you want only the wholeflatrow (regardless how many matches there are within itsdata):An alternative to the
GROUP BYand searching in the aggregatedinstancesarray would be a self-join of a CTE relation:Instead of returning
ancestor_nameas a separate column from the subquery, which makes theGROUP BYorJOIN … USINGmore ugly, you can also just access the last element ofancestor_path:I guess the main trick that makes these queries shorter than yours is the use of array slicing to generate the ancestor path(s) and the use of
jsonb_to_recordfor converting tojsonbarray to a postgres array. This probably could have been achieved in any number of ways, including your window function (which notably only finds the upper path if there are two nested "folder2"s):