I have a database table, shown below:
Table "public.counters_by_app_category_app"
Column | Type | Collation | Nullable | Default
----------------------------------------+-----------------------------+-----------+----------+---------
timestamp | timestamp without time zone | | not null |
granularity | interval | | not null |
app_category | text | | not null |
app | text | | not null |
data_volume_up | numeric | | not null | 0
data_volume_down | numeric | | not null | 0
and some sample rows:
timestamp | granularity | app_category | app | data_volume_up | data_volume_down
---------------------+-------------+--------------+-------+----------------+------------------
2023-09-24 04:30:00 | 00:30:00 | 17 | 25301 | 458890 | 479658
2023-09-24 04:30:00 | 00:30:00 | 54 | 29877 | 38131 | 185370
2023-09-24 04:30:00 | 00:30:00 | 81 | 22163 | 27319 | 32569
Suppose I want to find the top 10 (app_category, app) pairs by volume = data_volume_up + data_volume_down for a one-day time range, and then fetch all rows from the table for those top 10 pairs in the same time range.
The easy way is with two queries. First, find the top 10 with a GROUP BY query and the SUM aggregate function:
mydatabase=> select granularity, app_category, app, sum(data_volume_up+data_volume_down) as vol from counters_by_app_category_app
WHERE timestamp >= '2023-09-28 00:00:00' AND timestamp < '2023-09-29 00:00:00' AND granularity = '00:30:00'
group by granularity, app_category, app
order by vol desc limit 10;
granularity | app_category | app | vol
-------------+--------------+-------+--------------
00:30:00 | 26 | 3470 | 264871859641
00:30:00 | 17 | 8229 | 159708538857
00:30:00 | 15 | 6583 | 90496671008
00:30:00 | 24 | 2059 | 89593472005
00:30:00 | 26 | 3212 | 70312526765
00:30:00 | 66 | 8578 | 63943121581
00:30:00 | 3 | 2783 | 58159595285
00:30:00 | 2 | 4568 | 54144586115
00:30:00 | 17 | 25296 | 35847681517
00:30:00 | 4 | 12 | 27602121737
(10 rows)
Then get rows from the table for the same time range and granularity, using the pairs found in the first step in the WHERE clause.
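For reference, that second query can be sketched as below, plugging in the pairs returned by the first query (app_category and app are text columns, hence the quoted values; PostgreSQL supports row constructors on the left side of IN):

```sql
-- Second step of the two-query approach: fetch all rows for the
-- top-10 pairs found above, for the same time range and granularity.
SELECT *
FROM counters_by_app_category_app
WHERE timestamp >= '2023-09-28 00:00:00'
  AND timestamp <  '2023-09-29 00:00:00'
  AND granularity = '00:30:00'
  AND (app_category, app) IN (
        ('26', '3470'), ('17', '8229'), ('15', '6583'), ('24', '2059'),
        ('26', '3212'), ('66', '8578'), ('3', '2783'),  ('2', '4568'),
        ('17', '25296'), ('4', '12')
      )
ORDER BY timestamp;
```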
I tried to use a window function to achieve the same result with a single query. Here is my approach:
Get total volume per (app_category, app) for the whole day with:
mydatabase=> SELECT
timestamp,
granularity,
app_category,
app,
data_volume_up,
data_volume_down,
SUM(data_volume_up+data_volume_down) OVER(PARTITION BY app_category, app) as total_vol_order
FROM counters_by_app_category_app
WHERE timestamp >= '2023-09-28 00:00:00' AND timestamp < '2023-09-29 00:00:00' AND granularity = '00:30:00'
order by timestamp, total_vol_order desc;
timestamp | granularity | app_category | app | data_volume_up | data_volume_down | total_vol_order
---------------------+-------------+--------------+-------+----------------+------------------+-----------------
2023-09-28 00:00:00 | 00:30:00 | 26 | 3470 | 195256369 | 3862177889 | 264871859641
2023-09-28 00:00:00 | 00:30:00 | 17 | 8229 | 58679885 | 4271976092 | 159708538857
2023-09-28 00:00:00 | 00:30:00 | 15 | 6583 | 105577241 | 4569555704 | 90496671008
2023-09-28 00:00:00 | 00:30:00 | 24 | 2059 | 7166 | 74220 | 89593472005
...
But I want to keep only the top 10 pairs for each timestamp, so I rank rows on the window-function result and keep the top 10. Here is how I did it:
mydatabase=> with data as (
SELECT
timestamp,
granularity,
app_category,
app,
data_volume_up+data_volume_down as total_vol,
total_vol_order,
DENSE_RANK() OVER(order by total_vol_order DESC) as drank
FROM (
SELECT
timestamp,
granularity,
app_category,
app,
data_volume_up,
data_volume_down,
SUM(data_volume_up+data_volume_down) OVER(PARTITION BY app_category, app) as total_vol_order
FROM counters_by_app_category_app
WHERE timestamp >= '2023-09-28 00:00:00' AND timestamp < '2023-09-29 00:00:00' AND granularity = '00:30:00'
) as ss
)
select * from data
where drank <=10
order by timestamp, drank;
timestamp | granularity | app_category | app | total_vol | total_vol_order | drank
---------------------+-------------+--------------+-------+------------+-----------------+-------
2023-09-28 00:00:00 | 00:30:00 | 26 | 3470 | 4057434258 | 264871859641 | 1
2023-09-28 00:00:00 | 00:30:00 | 17 | 8229 | 4330655977 | 159708538857 | 2
2023-09-28 00:00:00 | 00:30:00 | 15 | 6583 | 4675132945 | 90496671008 | 3
2023-09-28 00:00:00 | 00:30:00 | 24 | 2059 | 81386 | 89593472005 | 4
2023-09-28 00:00:00 | 00:30:00 | 26 | 3212 | 1640757311 | 70312526765 | 5
2023-09-28 00:00:00 | 00:30:00 | 66 | 8578 | 1047484505 | 63943121581 | 6
2023-09-28 00:00:00 | 00:30:00 | 3 | 2783 | 167416738 | 58159595285 | 7
2023-09-28 00:00:00 | 00:30:00 | 2 | 4568 | 290666849 | 54144586115 | 8
2023-09-28 00:00:00 | 00:30:00 | 17 | 25296 | 1149896605 | 35847681517 | 9
2023-09-28 00:00:00 | 00:30:00 | 4 | 12 | 582563699 | 27602121737 | 10
2023-09-28 00:30:00 | 00:30:00 | 26 | 3470 | 5608569787 | 264871859641 | 1
2023-09-28 00:30:00 | 00:30:00 | 17 | 8229 | 4125175834 | 159708538857 | 2
Is there a more efficient way to do this, eliminating the subquery and the WITH statement?