I'm trying to create query (duckdb) that generates a "Rolling Count Very Late (> 90 days) by group" column.
I created this table in excel as a prototype of what I am after:
This formula counts the number of past/prior rows where:
- The 'days' value (column D) is greater than 90,
- The 'group' value (column F) matches the current row's group,
- The 'Date Closed' value is less than the current row's 'Invoice Date'.
| A | B | C | D | E | F | G | H | |
|---|---|---|---|---|---|---|---|---|
| 1 | Invoice Date | Due Date | Date Closed | days | amount | group | Rolling Count Very Late (> 90 days) by group | Rolling Count Very Late (> 90 days) by group formula |
| 2 | 22-Oct-22 | 21-Nov-22 | 1-Mar-23 | 100 | 111 | B | 0 | =COUNTIFS(E$3:E3,">90",G$3:G3,G3,D$3:D3,"<"&B3) |
| 3 | 24-Oct-22 | 23-Nov-22 | 3-Mar-23 | 100 | 150 | B | 0 | =COUNTIFS(E$3:E4,">90",G$3:G4,G4,D$3:D4,"<"&B4) |
| 4 | 31-Oct-22 | 30-Nov-22 | 5-Mar-23 | 95 | 300 | A | 0 | =COUNTIFS(E$3:E5,">90",G$3:G5,G5,D$3:D5,"<"&B5) |
| 5 | 10-Nov-22 | 10-Dec-22 | 2-Feb-23 | 54 | 180 | A | 0 | =COUNTIFS(E$3:E6,">90",G$3:G6,G6,D$3:D6,"<"&B6) |
| 6 | 21-Nov-22 | 21-Dec-22 | 4-Feb-23 | 45 | 220 | B | 0 | =COUNTIFS(E$3:E7,">90",G$3:G7,G7,D$3:D7,"<"&B7) |
| 7 | 4-Dec-22 | 3-Jan-23 | 4-Jan-23 | 1 | 210 | B | 0 | =COUNTIFS(E$3:E8,">90",G$3:G8,G8,D$3:D8,"<"&B8) |
| 8 | 19-Dec-22 | 18-Jan-23 | 20-Jan-23 | 2 | 100 | A | 0 | =COUNTIFS(E$3:E9,">90",G$3:G9,G9,D$3:D9,"<"&B9) |
| 9 | 1-Jan-23 | 31-Jan-23 | 20-Feb-23 | 20 | 200 | B | 0 | =COUNTIFS(E$3:E10,">90",G$3:G10,G10,D$3:D10,"<"&B10) |
| 10 | 22-Jan-23 | 21-Feb-23 | 1-Jun-23 | 100 | 280 | B | 0 | =COUNTIFS(E$3:E11,">90",G$3:G11,G11,D$3:D11,"<"&B11) |
| 11 | 28-Feb-23 | 30-Mar-23 | 2-Apr-23 | 3 | 250 | A | 0 | =COUNTIFS(E$3:E12,">90",G$3:G12,G12,D$3:D12,"<"&B12) |
| 12 | 14-May-23 | 13-Jun-23 | 1-Sep-23 | 80 | 21 | A | 1 | =COUNTIFS(E$3:E13,">90",G$3:G13,G13,D$3:D13,"<"&B13) |
| 13 | 18-Jun-23 | 18-Jul-23 | 1-Oct-23 | 75 | 456 | A | 1 | =COUNTIFS(E$3:E14,">90",G$3:G14,G14,D$3:D14,"<"&B14) |
| 14 | 2-Jul-23 | 1-Aug-23 | 1-Aug-23 | 0 | 320 | B | 3 | =COUNTIFS(E$3:E15,">90",G$3:G15,G15,D$3:D15,"<"&B15) |
The current row piece is the obstacle, not sure how to fit that into the query
I have tried a window function but I cannot get the query to filter out results where "Date Closed" < current row "Invoice Date" because SQL is very set oriented and doesn't really have a nice easy way of doing this. Maybe there's a self join solution or something but that's beyond my ability unfortunately. I would appreciate any help. I have an extremely slow solution python that is basically too slow to even consider using with a large amount of data so hoping an SQL solution is available.
Here's my completely not working SQL
SELECT
"Invoice Date",
"Due Date",
"Date Closed",
"days",
"amount",
"group",
SUM(CASE WHEN days > 90 AND "Date Closed" < "Invoice Date" THEN 1 ELSE 0 END)
OVER (PARTITION BY "group" ORDER BY "Invoice Date" ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS "Rolling Count Very Late (> 90 days) by group"
FROM
main.example_tbl
ORDER BY
"Invoice Date";
As mentioned, you can use a self join.
Sample data using Python API
Self Join
We used
row_number() over () row_numberto add a row number column.This is used as part of the join condition to match "previous rows".
We can then:
GROUP BYtherow_numbercolumnself.*)row_numbervalues (other.row_number)