This SQL query takes 90 minutes to run in DuckDB with my million row table. I am using a self join pattern as that's the only way I have found this query to work but I am sure there must be a faster query pattern.
The query starts with a Common Table Expression (CTE) that selects all columns from the
main.openorder_tbltable and assigns a row number to each row based on the "Invoice Date" column.The main query then performs a self join on the CTE where it joins each row (
self) with all previous rows (other) that belong to the same "group" and meet certain conditions related to the "Invoice Date" and "Due Date".The conditions for the join are:
- The "Invoice Date" of the
selfrow is more than 90 days after the "Due Date" of theotherrow. - The "Date Closed" of the
otherrow is more than 90 days after its "Due Date". - The "Due Date" of the
otherrow is within 200 days before the "Invoice Date" of theselfrow.
For each
selfrow, the query counts the number ofotherrows that meet these conditions.The result is a table that includes all columns from the
selfrow and the count ofotherrows for eachselfrow.The output is ordered by the row number of the
selfrow.
This query essentially calculates a rolling count of rows in the same group that meet certain date-related conditions. The self join is necessary because I am comparing each row with all previous rows in the same group.
Example Input Table
| row_number | Invoice Date | Due Date | Date Closed | days | amount | group |
|---|---|---|---|---|---|---|
| 1 | 2022-10-22 | 2022-11-21 | 2024-03-01 | 466 | 111 | B |
| 2 | 2022-10-22 | 2022-11-21 | 2023-03-01 | 100 | 111 | B |
| 3 | 2022-10-24 | 2022-11-23 | 2023-03-03 | 100 | 150 | B |
| 4 | 2022-10-31 | 2022-11-30 | 2023-03-05 | 95 | 300 | A |
| 5 | 2022-11-10 | 2022-12-10 | 2023-02-02 | 54 | 180 | A |
| 6 | 2022-11-21 | 2022-12-21 | 2023-02-04 | 45 | 220 | B |
| 7 | 2022-12-04 | 2023-01-03 | 2024-01-04 | 366 | 210 | B |
| 8 | 2022-12-19 | 2023-01-18 | 2023-01-20 | 2 | 100 | A |
| 9 | 2023-01-01 | 2023-01-31 | 2023-02-20 | 20 | 200 | B |
| 10 | 2023-01-22 | 2023-02-21 | 2023-06-01 | 100 | 280 | B |
| 11 | 2023-02-28 | 2023-03-30 | 2023-04-02 | 3 | 250 | A |
| 12 | 2023-05-14 | 2023-06-13 | 2023-09-01 | 80 | 21 | A |
| 13 | 2023-06-18 | 2023-07-18 | 2023-10-01 | 75 | 456 | A |
| 14 | 2023-07-02 | 2023-08-01 | 2023-08-01 | 0 | 320 | B |
SQL
WITH cte AS (
SELECT
ROW_NUMBER() OVER (ORDER BY "Invoice Date") AS row_number,
"Invoice Date",
"Due Date",
"Date Closed",
"days",
"amount",
"group"
FROM main.openorder_tbl
)
select
first(
columns(self.*)
),
count(other."row_number")
from
cte self left join cte other
on
self."group" = other."group"
and
self."row_number" > other."row_number"
AND
(self."Invoice Date" - other."Due Date") > 90
AND
(other."Date Closed" - other."Due Date") > 90
AND
other."Due Date" > (self."Invoice Date" -200)
group by
self."row_number"
order by
self."row_number"
Output
| first(self.row_number) | first(self."Invoice Date") | first(self."Due Date") | first(self."Date Closed") | first(self."days") | first(self.amount) | first(self."group") | count(other.row_number) |
|---|---|---|---|---|---|---|---|
| 1 | 2022-10-22 | 2022-11-21 | 2024-03-01 | 466 | 111 | B | 0 |
| 2 | 2022-10-22 | 2022-11-21 | 2023-03-01 | 100 | 111 | B | 0 |
| 3 | 2022-10-24 | 2022-11-23 | 2023-03-03 | 100 | 150 | B | 0 |
| 4 | 2022-10-31 | 2022-11-30 | 2023-03-05 | 95 | 300 | A | 0 |
| 5 | 2022-11-10 | 2022-12-10 | 2023-02-02 | 54 | 180 | A | 0 |
| 6 | 2022-11-21 | 2022-12-21 | 2023-02-04 | 45 | 220 | B | 0 |
| 7 | 2022-12-04 | 2023-01-03 | 2024-01-04 | 366 | 210 | B | 0 |
| 8 | 2022-12-19 | 2023-01-18 | 2023-01-20 | 2 | 100 | A | 0 |
| 9 | 2023-01-01 | 2023-01-31 | 2023-02-20 | 20 | 200 | B | 0 |
| 10 | 2023-01-22 | 2023-02-21 | 2023-06-01 | 100 | 280 | B | 0 |
| 11 | 2023-02-28 | 2023-03-30 | 2023-04-02 | 3 | 250 | A | 0 |
| 12 | 2023-05-14 | 2023-06-13 | 2023-09-01 | 80 | 21 | A | 1 |
| 13 | 2023-06-18 | 2023-07-18 | 2023-10-01 | 75 | 456 | A | 0 |
| 14 | 2023-07-02 | 2023-08-01 | 2023-08-01 | 0 | 320 | B | 2 |
I have tried using a window function, but the problem is if I do the window function SQL won't let me use conditionals inside of it, this is of course how it's supposed to work. Maybe there's a faster way by doing a sub query with the window function or some such thing. I am not an SQL expert by any stretch. Unfortunately I am at a loss, I know I need to filter down the table as much as possible before doing the rolling count so it runs fast but not sure if there's a way to do that in this case.
Here is the Explain Analyze output from a larger table so the timings are more realistic, the column names are somewhat different but the logic is the same:
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││ Query Profiling Information ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
EXPLAIN ANALYZE
---Start CTE with index
WITH cte
AS (SELECT ROW_NUMBER() OVER (ORDER BY "Invoice Date") AS "row_number",
"Doc|PayItem",
"Invoice Date",
"Date Closed",
"Due Date",
"DaysPastDue",
"Gross Amount",
"ParentName"
FROM main.df_CustLedger
)
---select the first value of
select first(columns(self."Doc|PayItem")) as "Doc|PayItem",
COALESCE(count(other."row_number"), 0) as "NumberOfPaidInvoices_std"
-- COALESCE(sum(other."Gross Amount"), 0) as "TotalGrossAmtPaid_std",
-- COALESCE(AVG(other."DaysPastDue"), 0) as "MeanDelay_std"
from cte self
left join cte other
on self."ParentName" = other."ParentName"
and self."row_number" > other."row_number"
and (self."Invoice Date" - other."Due Date") > 90
and (other."Date Closed" - other."Due Date") > 90
and (other."Due Date" > (self."Invoice Date" -200))
group by self."row_number"
order by self."row_number"
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││ Total Time: 272.71s ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
┌───────────────────────────┐
│ RESULT_COLLECTOR │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 0 │
│ (0.00s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ EXPLAIN_ANALYZE │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 0 │
│ (0.00s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ ORDER_BY │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ ORDERS: │
│ #3 ASC │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 1628109 │
│ (0.39s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ PROJECTION │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ Doc|PayItem │
│ Doc|PayItem │
│ NumberOfPaidInvoices_std │
│ row_number │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 1628109 │
│ (0.03s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ HASH_GROUP_BY │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ #0 │
│ first(#1) │
│ first(#2) │
│ count(#3) │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 1628109 │
│ (10.87s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ PROJECTION │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ row_number │
│ Doc|PayItem │
│ Doc|PayItem │
│ row_number │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 119091014 │
│ (0.88s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ BLOCKWISE_NL_JOIN │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ LEFT │
│ (((Invoice Date - 200) < │
│ Due Date) AND ((row_n... │
│ row_number) AND (((Invoice│
│ Date - Due Date) > 90... ├──────────────┐
│(ParentName = ParentName)))│ │
│ ) │ │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │ │
│ 119091014 │ │
│ (962.51s) │ │
└─────────────┬─────────────┘ │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│ PROJECTION ││ PROJECTION │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ││ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ row_number ││ row_number │
│ Doc|PayItem ││ Doc|PayItem │
│ Invoice Date ││ Due Date │
│ ParentName ││ ParentName │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ││ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 1628109 ││ 116421 │
│ (0.00s) ││ (0.00s) │
└─────────────┬─────────────┘└─────────────┬─────────────┘
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│ PROJECTION ││ PROJECTION │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ││ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ #0 ││ #1 │
│ #1 ││ #2 │
│ #2 ││ #3 │
│ #3 ││ #4 │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ││ #5 │
│ 1628109 ││ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ (0.00s) ││ 116421 │
│ ││ (0.00s) │
└─────────────┬─────────────┘└─────────────┬─────────────┘
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│ WINDOW ││ FILTER │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ││ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ROW_NUMBER() OVER (ORDER BY││((Date Closed - Due Date) >│
│ Invoice Date ASC NULLS ││ 90) │
│ LAST) ││ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ││ EC: 1628109 │
│ 1628109 ││ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ (0.74s) ││ 116421 │
│ ││ (0.01s) │
└─────────────┬─────────────┘└─────────────┬─────────────┘
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│ SEQ_SCAN ││ PROJECTION │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ││ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ df_CustLedger ││ #0 │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ││ #1 │
│ Invoice Date ││ #2 │
│ Doc|PayItem ││ #3 │
│ ParentName ││ #4 │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ││ #5 │
│ EC: 1628109 ││ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ││ 1628109 │
│ 1628109 ││ (0.00s) │
│ (0.09s) ││ │
└───────────────────────────┘└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ WINDOW │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ROW_NUMBER() OVER (ORDER BY│
│ Invoice Date ASC NULLS │
│ LAST) │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 1628109 │
│ (1.82s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ SEQ_SCAN │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ df_CustLedger │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ Invoice Date │
│ Doc|PayItem │
│ Date Closed │
│ Due Date │
│ ParentName │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ EC: 1628109 │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 1628109 │
│ (0.14s) │
└───────────────────────────┘
I welcome any ideas to optimize the query.
Writing it as a Correlated Subquery maxes out all my CPU cores at 100%.
It executes ~10x faster than the self-join approach. (for a 100_000 row sample)
Further Info