Better way to rewrite windowing function

The production table has more than 32M records; for testing purposes I created a synthetic table with 6M records. The query below ran for more than 10 minutes before I killed the process in the middle. Any suggestions on how to optimize this query?

select 
    name
    ,sales
    ,sum(sales) over(partition by name order by sales_dated_on rows between unbounded preceding and current row) as running_total
    ,first_value(sales) over(partition by name order by sales_dated_on) as first_sales
    ,last_value(sales) over(partition by name order by sales_dated_on desc) as last_sales
    ,sum(sales) over() as total_sales
    ,sum(sales) over(partition by name) as total_sales_by_name
from demo.window_test wt;

Explain Plan:

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
WindowAgg  (cost=1153310.26..2678815.36 rows=5999981 width=51) (actual time=99673.366..101631.354 rows=6000000 loops=1)                                                       |
  Output: name, sales, (sum(sales) OVER (?)), (first_value(sales) OVER (?)), (last_value(sales) OVER (?)), sum(sales) OVER (?), (sum(sales) OVER (?)), sales_dated_on         |
  Buffers: shared hit=15762 read=22455, temp read=22522984 written=199339                                                                                                     |
  ->  WindowAgg  (cost=1153310.26..2603815.59 rows=5999981 width=43) (actual time=30058.611..96819.860 rows=6000000 loops=1)                                                  |
        Output: name, sales_dated_on, sales, (last_value(sales) OVER (?)), (sum(sales) OVER (?)), (first_value(sales) OVER (?)), sum(sales) OVER (?)                          |
        Buffers: shared hit=15762 read=22455, temp read=22438358 written=156857                                                                                               |
        ->  WindowAgg  (cost=1153310.26..2513815.88 rows=5999981 width=35) (actual time=21230.484..91691.264 rows=6000000 loops=1)                                            |
              Output: name, sales_dated_on, sales, (last_value(sales) OVER (?)), (sum(sales) OVER (?)), first_value(sales) OVER (?)                                           |
              Buffers: shared hit=15762 read=22455, temp read=22373366 written=123159                                                                                         |
              ->  WindowAgg  (cost=1153310.26..2408816.21 rows=5999981 width=31) (actual time=21230.475..32732.391 rows=6000000 loops=1)                                      |
                    Output: name, sales_dated_on, sales, (last_value(sales) OVER (?)), sum(sales) OVER (?)                                                                    |
                    Buffers: shared hit=15762 read=22455, temp read=89153 written=89461                                                                                       |
                    ->  Incremental Sort  (cost=1153310.26..2303816.54 rows=5999981 width=23) (actual time=21230.460..29881.325 rows=6000000 loops=1)                         |
                          Output: name, sales_dated_on, sales, (last_value(sales) OVER (?))                                                                                   |
                          Sort Key: wt.name, wt.sales_dated_on                                                                                                                |
                          Presorted Key: wt.name                                                                                                                              |
                          Full-sort Groups: 9  Sort Method: quicksort  Average Memory: 30kB  Peak Memory: 30kB                                                                |
                          Pre-sorted Groups: 9  Sort Method: external merge  Average Disk: 22207kB  Peak Disk: 22208kB                                                        |
                          Buffers: shared hit=15762 read=22455, temp read=89153 written=89461                                                                                 |
                          ->  WindowAgg  (cost=1019809.47..1139809.09 rows=5999981 width=23) (actual time=20317.487..26224.097 rows=6000000 loops=1)                          |
                                Output: name, sales_dated_on, sales, last_value(sales) OVER (?)                                                                               |
                                Buffers: shared hit=15762 read=22455, temp read=58332 written=58487                                                                           |
                                ->  Sort  (cost=1019809.47..1034809.42 rows=5999981 width=19) (actual time=20317.469..22908.680 rows=6000000 loops=1)                         |
                                      Output: name, sales_dated_on, sales                                                                                                     |
                                      Sort Key: wt.name, wt.sales_dated_on DESC                                                                                               |
                                      Sort Method: external merge  Disk: 176024kB                                                                                             |
                                      Buffers: shared hit=15762 read=22455, temp read=58332 written=58487                                                                     |
                                      ->  Seq Scan on demo.window_test wt  (cost=0.00..98216.81 rows=5999981 width=19) (actual time=0.015..890.826 rows=6000000 loops=1)|
                                            Output: name, sales_dated_on, sales                                                                                               |
                                            Buffers: shared hit=15762 read=22455   

SQL script:

create table demo.window_test(name varchar(20),sales int,sales_dated_on timestamp);

insert into demo.window_test
select
    --concat('Group', id::text),
    concat('Group', floor(random()*(10-1)+1)::INT) as name
    ,floor(random()*(500-1)+1)::INT as sales
    ,now() - '1 days'::interval * random() * ((10000+1)-1) as sales_dated_on
from
    generate_series(1,10) as id
    ,generate_series(1,600000);

1 Answer

jjanes:
  1. Your use of DBeaver is fundamentally broken. Each time you fetch a new page, it reruns the entire query, and because of your window functions the query cannot stop early the way some queries can when only part of the rows is fetched. You can probably configure DBeaver not to do this, use a different tool, or work from the EXPLAIN (ANALYZE, BUFFERS) output rather than running the query raw (see the first sketch after this list). In that case you would be ignoring the time it takes to send the data over the network and whatever the client does with the data once it arrives, but since you are asking about window functions, and that network/client time is independent of the window functions, that is probably OK. Once you stop using DBeaver in this broken way, do you still have a problem, or is the current performance good enough?

  2. Your column last_value(sales) over(partition by name order by sales_dated_on desc) seems silly. Under the default framing, the last row in each frame is just the current row (barring ties in the ordering column), so this is an expensive way to duplicate the sales column; and if there are ties, which of them do you want returned? Including this calculation forces an extra sort, although the cost is partially saved by it being only an incremental sort, not a full sort. If you can't just get rid of it but instead have to fix the framing (see the second sketch after this list), doing so will probably make the query slower still, but you may just have to live with that.

  3. first_value with the default framing is not implemented very efficiently in PostgreSQL. It does not just cache the first value of each partition and copy it out; instead, it keeps jumping back and forth between adding the current row to the end of the frame and jumping to the first row of the frame to read the value. This is particularly inefficient when the frame spills to disk, so increase work_mem to as large a value as you think is safe to partially ameliorate this (see the third sketch after this list).

  4. If you can use an index-only scan to get the data already in order, you can avoid the sort. An index to support this on your current query might be (name, sales_dated_on, sales); see the last sketch after this list. But since sorting accounts for only about a quarter of your run time, there is a limit to how much improvement this can buy you.

  5. Why are you doing this at all? How many times do you really want to be fetching 32M records, and what are you going to do with them?
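
First sketch (point 1): a minimal way to time the query server-side with standard PostgreSQL EXPLAIN options, so the client never has to fetch the 6M-row result set. The elided columns are the remaining window expressions from the question, unchanged:

explain (analyze, buffers)
select
    name
    ,sales
    ,sum(sales) over(partition by name order by sales_dated_on rows between unbounded preceding and current row) as running_total
    -- ...the remaining window columns from the question, unchanged...
from demo.window_test wt;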
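Second sketch (point 2): assuming what is actually wanted is the true last value per name (an assumption about intent; the question does not say), one fix is to keep the ascending sort and widen the frame instead of reversing the order, so the extra descending sort disappears and the planner can reuse the same (name, sales_dated_on) ordering as the other window functions:

select
    name
    ,sales
    ,last_value(sales) over(partition by name order by sales_dated_on
                            rows between unbounded preceding and unbounded following) as last_sales
from demo.window_test wt;

first_value(sales) over(partition by name order by sales_dated_on desc) would compute the same thing, but it would reintroduce the descending sort this rewrite is trying to avoid.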
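Third sketch (point 3): raising work_mem for just the current session. 256MB is an illustrative value; how high you can safely go depends on available RAM and how many sessions run such sorts concurrently:

set work_mem = '256MB';
-- rerun the query (or the EXPLAIN (ANALYZE, BUFFERS) above) in this session
reset work_mem;  -- restore the default afterwards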
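Last sketch (point 4): the covering index from point 4 (the index name is illustrative). Note that an index-only scan also requires the visibility map to be reasonably up to date, hence the vacuum:

create index window_test_name_date_sales_idx
    on demo.window_test (name, sales_dated_on, sales);
vacuum analyze demo.window_test;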