I have 20 tables with identical structure, and I need to aggregate the results.
I'm using UNION ALL to collect the results, in the following structure:
SELECT something, SUM(total) AS overall_total
FROM (
SELECT something, SUM(something_else) AS total
FROM table1
GROUP BY something
UNION ALL
SELECT something, SUM(something_else) AS total
FROM table2
GROUP BY something
UNION ALL
...
) AS X
GROUP BY something
ORDER BY overall_total DESC
LIMIT 10
In this case the DB uses Parallel Append with multiple workers, which is very fast.
However, each table returns a large amount of data so aggregation is slow.
To improve performance, I want to limit the rows returned from each table to its top 100 results only, since the overall top results will always come from those.
I tried using either LIMIT or ROW_NUMBER() OVER() to retrieve only the top rows from each table:
SELECT something, SUM(something_else) AS total
FROM table2
GROUP BY something
ORDER BY total DESC
LIMIT 100
SELECT something, SUM(something_else) AS total, ROW_NUMBER() OVER() as r
FROM table2
WHERE r <= 100
GROUP BY something
ORDER BY total DESC
But now the plan never uses Parallel Append, so the data fetch is very slow and overall performance is worse. I can't get a parallel plan even with settings like:
set force_parallel_mode = on;
set max_parallel_workers_per_gather = 10;
set parallel_setup_cost = 0;
set parallel_tuple_cost = 0;
Reading the PostgreSQL documentation, parallelization can't be used when a query involves parallel-unsafe operations, but it doesn't mention LIMIT or ROW_NUMBER() as being unsafe.
Why?
I can reproduce the issue. As soon as I add a LIMIT clause to one of the SELECT terms of the UNION ALL, the parallel plan goes away, even with set force_parallel_mode = on;. Maybe it's connected to this comment in the source code:

But I am not entirely sure. This applies to "subquery in FROM", and it would seem it shouldn't keep Postgres from scheduling parallel workers for whole subqueries of a UNION ALL in a Parallel Append node. Maybe it's just a shortcoming in current Postgres 15?

Test case
See:
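A per-table query for such a test case might look like the following sketch, reusing the question's placeholder names (table1, something, something_else):

```sql
-- Top 100 groups from one table.
-- NULLS LAST keeps NULL totals from sorting first;
-- "something" as second sort key makes the order deterministic.
SELECT something, sum(something_else) AS total
FROM   table1
GROUP  BY something
ORDER  BY total DESC NULLS LAST, something
LIMIT  100;
```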
You may or may not need / want NULLS LAST.

I made the sort order deterministic by adding something as second ORDER BY item. Else, repeated calls might get you different results.

Workaround
We can work around the limitation by encapsulating each root query in a function with a PARALLEL SAFE label.

I chose a generic function with dynamic SQL and EXECUTE, passing table names safely. If you are unfamiliar, here is more about the basics:

I also declared ROWS 100 (because we know that) and, more importantly, COST 1000000 to make the query planner consider a parallel plan. You may want to fiddle with the COST setting to influence the query plan ...

You can also write one simple SQL function for each table instead.
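A sketch of such a function; the name f_top_100 and the result column types (int key, numeric total) are assumptions to adapt to your actual schema:

```sql
CREATE OR REPLACE FUNCTION f_top_100(_tbl regclass)
  RETURNS TABLE (something int, total numeric)
  LANGUAGE plpgsql PARALLEL SAFE ROWS 100 COST 1000000 AS
$func$
BEGIN
   -- The regclass parameter makes passing the table name safe:
   -- its text representation is already schema-qualified and quoted
   -- as needed, so plain %s in format() is not open to SQL injection.
   RETURN QUERY EXECUTE format(
      'SELECT something, sum(something_else)::numeric AS total
       FROM   %s
       GROUP  BY 1
       ORDER  BY 2 DESC NULLS LAST, 1
       LIMIT  100', _tbl);
END
$func$;
```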
Either way, this now allows parallel plans:
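With a wrapper function as described above (assumed here to be called f_top_100(regclass) and to return (something, total)), the outer query could look like:

```sql
SELECT something, sum(total) AS overall_total
FROM  (
   SELECT * FROM f_top_100('table1')
   UNION ALL
   SELECT * FROM f_top_100('table2')
   -- ... one branch per table
   ) sub
GROUP  BY something
ORDER  BY overall_total DESC, something
LIMIT  10;
```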
Aside: your second attempt with the window function row_number() is illegal syntax. You cannot reference the output column r in the WHERE clause. You would need a subquery to make it work. But a single subquery level suffices, as you can wrap the window function around the aggregate function.

That said, the first approach with LIMIT is simpler and faster.
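For completeness, a working form of that window-function variant could look like this sketch (again with the question's placeholder names):

```sql
SELECT something, total
FROM  (
   SELECT something, sum(something_else) AS total,
          -- window function applied on top of the aggregate
          row_number() OVER (ORDER BY sum(something_else) DESC) AS r
   FROM   table2
   GROUP  BY something
   ) sub
WHERE  r <= 100          -- filter in the outer query, where r is visible
ORDER  BY total DESC;
```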