How to rank over the result of a window function in Postgres efficiently?

I have a DB table like the one shown below:

                              Table "public.counters_by_app_category_app"
                 Column                 |            Type             | Collation | Nullable | Default
----------------------------------------+-----------------------------+-----------+----------+---------
 timestamp                              | timestamp without time zone |           | not null |
 granularity                            | interval                    |           | not null |
 app_category                           | text                        |           | not null |
 app                                    | text                        |           | not null |
 data_volume_up                         | numeric                     |           | not null | 0
 data_volume_down                       | numeric                     |           | not null | 0

and some sample rows:

      timestamp      | granularity | app_category |  app  | data_volume_up | data_volume_down
---------------------+-------------+--------------+-------+----------------+------------------
 2023-09-24 04:30:00 | 00:30:00    | 17           | 25301 |         458890 |           479658
 2023-09-24 04:30:00 | 00:30:00    | 54           | 29877 |          38131 |           185370
 2023-09-24 04:30:00 | 00:30:00    | 81           | 22163 |          27319 |            32569

Suppose I want to find the top 10 (app_category, app) pairs by volume = data_volume_up + data_volume_down over a time range of one day, and then get all data from the table for those top 10 pairs over the same time range.

The easy way is with two queries. First, find the top 10 with a GROUP BY query and a SUM aggregate:

mydatabase=> select granularity, app_category, app, sum(data_volume_up+data_volume_down) as vol from counters_by_app_category_app
WHERE timestamp >= '2023-09-28 00:00:00' AND timestamp < '2023-09-29 00:00:00' AND granularity = '00:30:00'
group by granularity, app_category, app
order by vol desc limit 10;
 granularity | app_category |  app  |     vol
-------------+--------------+-------+--------------
 00:30:00    | 26           | 3470  | 264871859641
 00:30:00    | 17           | 8229  | 159708538857
 00:30:00    | 15           | 6583  |  90496671008
 00:30:00    | 24           | 2059  |  89593472005
 00:30:00    | 26           | 3212  |  70312526765
 00:30:00    | 66           | 8578  |  63943121581
 00:30:00    | 3            | 2783  |  58159595285
 00:30:00    | 2            | 4568  |  54144586115
 00:30:00    | 17           | 25296 |  35847681517
 00:30:00    | 4            | 12    |  27602121737
(10 rows)

Then get the rows from the table for the same time range and granularity, using the pairs found in the first step in the WHERE clause.
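
A sketch of what that second query might look like, assuming the ten pairs are pasted into a row-value IN list (only the first three pairs from the output above are written out):

SELECT timestamp, granularity, app_category, app, data_volume_up, data_volume_down
FROM counters_by_app_category_app
WHERE timestamp >= '2023-09-28 00:00:00'
  AND timestamp < '2023-09-29 00:00:00'
  AND granularity = '00:30:00'
  AND (app_category, app) IN (
        ('26', '3470'),
        ('17', '8229'),
        ('15', '6583')
        -- ... plus the remaining seven pairs from the first query
      );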

I tried to use a window function to achieve the same result with a single query. Here is my approach:

Get the total volume per (app_category, app) pair for the whole day with:

mydatabase=> SELECT
timestamp,
granularity,
app_category,
app,
data_volume_up,
data_volume_down,
SUM(data_volume_up+data_volume_down) OVER(PARTITION BY app_category, app) as total_vol_order
FROM counters_by_app_category_app
WHERE timestamp >= '2023-09-28 00:00:00' AND timestamp < '2023-09-29 00:00:00' AND granularity = '00:30:00'
order by timestamp, total_vol_order desc;
      timestamp      | granularity | app_category |  app  | data_volume_up | data_volume_down | total_vol_order
---------------------+-------------+--------------+-------+----------------+------------------+-----------------
 2023-09-28 00:00:00 | 00:30:00    | 26           | 3470  |      195256369 |       3862177889 |    264871859641
 2023-09-28 00:00:00 | 00:30:00    | 17           | 8229  |       58679885 |       4271976092 |    159708538857
 2023-09-28 00:00:00 | 00:30:00    | 15           | 6583  |      105577241 |       4569555704 |     90496671008
 2023-09-28 00:00:00 | 00:30:00    | 24           | 2059  |           7166 |            74220 |     89593472005
...

But I want to keep only the top 10 pairs for each timestamp, so I rank the rows based on the result of the window function and keep only those with rank at most 10. Here is how I did it:

mydatabase=> with data as (
SELECT
timestamp,
granularity,
app_category,
app,
data_volume_up+data_volume_down as total_vol,
total_vol_order,
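-- rank each (app_category, app) pair by its daily total (constant per pair)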
DENSE_RANK() OVER(order by total_vol_order DESC) as drank
FROM (
SELECT
timestamp,
granularity,
app_category,
app,
data_volume_up,
data_volume_down,
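-- daily total volume per (app_category, app) pair, repeated on every row of that pair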
SUM(data_volume_up+data_volume_down) OVER(PARTITION BY app_category, app) as total_vol_order
FROM counters_by_app_category_app
WHERE timestamp >= '2023-09-28 00:00:00' AND timestamp < '2023-09-29 00:00:00' AND granularity = '00:30:00'
) as ss
)
select * from data
where drank <=10
order by timestamp, drank;
      timestamp      | granularity | app_category |  app  | total_vol  | total_vol_order | drank
---------------------+-------------+--------------+-------+------------+-----------------+-------
 2023-09-28 00:00:00 | 00:30:00    | 26           | 3470  | 4057434258 |    264871859641 |     1
 2023-09-28 00:00:00 | 00:30:00    | 17           | 8229  | 4330655977 |    159708538857 |     2
 2023-09-28 00:00:00 | 00:30:00    | 15           | 6583  | 4675132945 |     90496671008 |     3
 2023-09-28 00:00:00 | 00:30:00    | 24           | 2059  |      81386 |     89593472005 |     4
 2023-09-28 00:00:00 | 00:30:00    | 26           | 3212  | 1640757311 |     70312526765 |     5
 2023-09-28 00:00:00 | 00:30:00    | 66           | 8578  | 1047484505 |     63943121581 |     6
 2023-09-28 00:00:00 | 00:30:00    | 3            | 2783  |  167416738 |     58159595285 |     7
 2023-09-28 00:00:00 | 00:30:00    | 2            | 4568  |  290666849 |     54144586115 |     8
 2023-09-28 00:00:00 | 00:30:00    | 17           | 25296 | 1149896605 |     35847681517 |     9
 2023-09-28 00:00:00 | 00:30:00    | 4            | 12    |  582563699 |     27602121737 |    10
 2023-09-28 00:30:00 | 00:30:00    | 26           | 3470  | 5608569787 |    264871859641 |     1
 2023-09-28 00:30:00 | 00:30:00    | 17           | 8229  | 4125175834 |    159708538857 |     2
...

Is there a more efficient way to do this, one that eliminates the subqueries and WITH statements?
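
For reference, one way to fold the two-query approach into a single statement is a CTE joined back to the table; a sketch with the same filters as above (it avoids the window functions but still uses a WITH):

WITH top_pairs AS (
    SELECT app_category, app
    FROM counters_by_app_category_app
    WHERE timestamp >= '2023-09-28 00:00:00'
      AND timestamp < '2023-09-29 00:00:00'
      AND granularity = '00:30:00'
    GROUP BY app_category, app
    ORDER BY SUM(data_volume_up + data_volume_down) DESC
    LIMIT 10
)
SELECT c.timestamp, c.granularity, c.app_category, c.app,
       c.data_volume_up, c.data_volume_down
FROM counters_by_app_category_app c
JOIN top_pairs USING (app_category, app)
WHERE c.timestamp >= '2023-09-28 00:00:00'
  AND c.timestamp < '2023-09-29 00:00:00'
  AND c.granularity = '00:30:00';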
