How do I calculate a running total over a specific time period?

Question

How do I calculate a running total over a specific time period?

64 Views Asked by pyspark-lab At 06 January 2024 at 12:43

I have a data frame with 3 columns, presented below.

I would like to calculate potential risk of each material. For each material, the potential risk starts with the date corresponding to the first non-zero value in cut column and ends with the date after which there are three zero values in cut column. The Cut_Total is calculated by taking the sum of cut values within this time frame.

Material	Date	Cut
Soap	01/01/2024	1
Soap	01/02/2024	5
Soap	01/03/2024	0
Soap	01/04/2024	0
Soap	01/05/2024	0
Soap	01/06/2024	0
Soap	01/07/2024	2
Soap	01/08/2024	0
Soap	01/09/2024	2
Soap	01/10/2024	0
Brush	01/01/2024	0
Brush	01/02/2024	0
Brush	01/03/2024	2
Brush	01/04/2024	3
Brush	01/05/2024	0
Brush	01/06/2024	0
Brush	01/07/2024	5
Brush	01/08/2024	4
Brush	01/09/2024	0
Brush	01/10/2024	1

I was trying to do this using window functions but could not achieve expected results. Please help me with the approach/codes to solve this. Thanks!

I tried this:

window = Window.partitionBy("Material").orderBy("Date")

# Define a window to look forward up to three rows
forward_window = window.rowsBetween(0, 3)

# Add a new column that checks for non-zero 'Cut' values
df = df.withColumn("non_zero_cut", F.when(F.col("Cut") > 0, True).otherwise(False))

# Add a column to count consecutive zeros using forward-looking window
df = df.withColumn("consecutive_zeros", F.sum(F.when(F.col("Cut") == 0, 1).otherwise(0)).over(forward_window))

# Find rows where risk period starts (first non-zero 'Cut')
df = df.withColumn("start_of_risk", F.when((F.col("non_zero_cut") == True) & (F.lag("Cut", 1).over(window) == 0) | (F.lag("Cut", 1).over(window).isNull()), F.col("Date")).otherwise(None))

# Fill down 'start_of_risk' to propagate the start date
df = df.withColumn("Risk_Start_Date", F.max("start_of_risk").over(window.rowsBetween(Window.unboundedPreceding, 0)))

# Identify the end of the risk period (three consecutive 'Cut' values of zero)
df = df.withColumn("end_of_risk", F.when(F.col("consecutive_zeros") >= 3, F.lag("Date", -3).over(window)).otherwise(None))

# Fill down 'end_of_risk' to propagate the end date
df = df.withColumn("Risk_End_Date", F.max("end_of_risk").over(window.rowsBetween(Window.unboundedPreceding, 0)))

# Filter to keep only rows where risk period ends
df = df.filter(F.col("Risk_End_Date").isNotNull())

# Calculate 'Cut_Total' by summing 'Cut' values between 'Risk_Start_Date' and 'Risk_End_Date'
cut_risk_summary = df.groupBy("Material", "Risk_Start_Date", "Risk_End_Date").agg(F.sum("Cut").alias("Cut_Total"))

cut_risk_summary.display()

The result obtained is below, but it is not the intended result:

Material	Risk_Start_Date	Risk_End_Date	Cut_Total
Soap	01/01/2024	01/05/2024	5
Soap	01/01/2024	01/06/2024	0
Soap	01/01/2024	01/07/2024	0
Soap	01/01/2024	01/08/2024	0
Soap	01/07/2024	01/08/2024	2
Soap	01/09/2024	01/08/2024	2

The expected result of this exercise is as follows:

Material	Cut_Total	Risk_Start_Date	Risk_End_Date
Soap	6	01/01/2024	01/02/2024
Soap	4	01/07/2024	01/09/2024
Brush	15	01/03/2024	01/10/2024

Original Q&A

There are 1 best solutions below

**Lajos Arpad** · Answer 1 · 2024-01-06T13:11:39.660000

Let's think this through. I will use native SQL in this answer, which you can then use as it is or break up into function calls. First, let's understand what are our risk start dates:

select main.Material, main.Date
from yourtable main
left join yourtable side
on main.Material = side.Material and
   date_sub(main.Date, interval 1 day) = side.Date
where side.Material is null

Here we find the record for which does not exist a pair with the same material with the Date being the previous day.

Similarly, risk end dates:

select main.Material, main.Date
from yourtable main
left join yourtable side
on main.Material = side.Material and
   date_add(main.Date, interval 1 day) = side.Date
where side.Material is null

Okay. So, our records will result from querying the table, joining the start and the end days with no overlaps:

select yourtable.Material, start_date.Date as Risk_Start_Date, end_date.Date as Risk_EndDate, sum(yourtable.Cut) as Cut_Total
from yourtable
join (
    select main.Material, main.Date
    from yourtable main
    left join yourtable side
    on main.Material = side.Material and
       date_sub(main.Date, interval 1 day) = side.Date
    where side.Material is null
) start_date
on yourtable.Date >= start_date.Date
join (
    select main.Material, main.Date
    from yourtable main
    left join yourtable side
    on main.Material = side.Material and
       date_add(main.Date, interval 1 day) = side.Date
    where side.Material is null
) end_date
on yourtable.Date <= end_date.Date
left join (
    select main.Material, main.Date
    from yourtable main
    left join yourtable side
    on main.Material = side.Material and
       date_sub(main.Date, interval 1 day) = side.Date
    where side.Material is null
) nonexistent_start_date
on nonexistent_start_date between date_add(start_date.Date interval 1 day) and date_sub(end_date.Date interval 1 day)
left join (
    select main.Material, main.Date
    from yourtable main
    left join yourtable side
    on main.Material = side.Material and
       date_add(main.Date, interval 1 day) = side.Date
    where side.Material is null
) nonexistent_end_date
on nonexistent_end_date between date_add(start_date.Date interval 1 day) and date_sub(end_date.Date interval 1 day)
where (nonexistent_start_date.Material is null) and
      (nonexistent_end_date.Material is null)
group by yourtable.Material, start_date.Date, end_date.Date

We apply the same logic here, we join the start date and the end date to our records, then, in order to exclude overextended intervals, we left join with supposed in-between start dates and end dates and those intervals which have no further start and end dates in-between are the actual results we are seeking. Now that this is clear, we group the results so we can compute the sum for cut.

Of course, this solution can be greatly simplified if we create temporary tables for start and end dates and insert-select the appropriate records and use those instead of the subselections here.

EDIT

This is how I converted your script data into row insertions:

function getData(Material, dt, Cut) {
    return {Material, dt, Cut};
}

let data = [
        ["Soap", "01/01/2024", 1],
        ["Soap", "01/02/2024", 5],
        ["Soap", "01/03/2024", 0],
        ["Soap", "01/04/2024", 0],
        ["Soap", "01/05/2024", 0],
        ["Soap", "01/06/2024", 0],
        ["Soap", "01/07/2024", 2],
        ["Soap", "01/08/2024", 0],
        ["Soap", "01/09/2024", 2],
        ["Soap", "01/10/2024", 0],
        ["Brush", "01/01/2024", 0],
        ["Brush", "01/02/2024", 0],
        ["Brush", "01/03/2024", 2],
        ["Brush", "01/04/2024", 3],
        ["Brush", "01/05/2024", 0],
        ["Brush", "01/06/2024", 0],
        ["Brush", "01/07/2024", 5],
        ["Brush", "01/08/2024", 4],
        ["Brush", "01/09/2024", 0],
        ["Brush", "01/10/2024", 1]];

let inserts = [];

for (let d of data) {
    let input = getData(d[0], d[1], d[2]);
    let date = new Date(input.dt);
    inserts.push(`
        insert into yourtable(Material, Date, Cut)
        values('${input.Material}', '${date.getFullYear()}-${date.getMonth() + 1}-${date.getDate()}', ${input.Cut});
    `);
};

console.log(inserts.join(''));

This results in these insertions:



        insert into yourtable(Material, Date, Cut)
        values('Soap', '2024-1-1', 1);
    
        insert into yourtable(Material, Date, Cut)
        values('Soap', '2024-1-2', 5);
    
        insert into yourtable(Material, Date, Cut)
        values('Soap', '2024-1-3', 0);
    
        insert into yourtable(Material, Date, Cut)
        values('Soap', '2024-1-4', 0);
    
        insert into yourtable(Material, Date, Cut)
        values('Soap', '2024-1-5', 0);
    
        insert into yourtable(Material, Date, Cut)
        values('Soap', '2024-1-6', 0);
    
        insert into yourtable(Material, Date, Cut)
        values('Soap', '2024-1-7', 2);
    
        insert into yourtable(Material, Date, Cut)
        values('Soap', '2024-1-8', 0);
    
        insert into yourtable(Material, Date, Cut)
        values('Soap', '2024-1-9', 2);
    
        insert into yourtable(Material, Date, Cut)
        values('Soap', '2024-1-10', 0);
    
        insert into yourtable(Material, Date, Cut)
        values('Brush', '2024-1-1', 0);
    
        insert into yourtable(Material, Date, Cut)
        values('Brush', '2024-1-2', 0);
    
        insert into yourtable(Material, Date, Cut)
        values('Brush', '2024-1-3', 2);
    
        insert into yourtable(Material, Date, Cut)
        values('Brush', '2024-1-4', 3);
    
        insert into yourtable(Material, Date, Cut)
        values('Brush', '2024-1-5', 0);
    
        insert into yourtable(Material, Date, Cut)
        values('Brush', '2024-1-6', 0);
    
        insert into yourtable(Material, Date, Cut)
        values('Brush', '2024-1-7', 5);
    
        insert into yourtable(Material, Date, Cut)
        values('Brush', '2024-1-8', 4);
    
        insert into yourtable(Material, Date, Cut)
        values('Brush', '2024-1-9', 0);
    
        insert into yourtable(Material, Date, Cut)
        values('Brush', '2024-1-10', 1);

I created the table as

create table yourtable(
    Material text,
    Date Date,
    Cut int
);

This way I managed to reproduce the issue and it turned out that I forgot to check for Material match in the on conditions of joins. Here's the fix:

select yourtable.Material, start_date.Date as Risk_Start_Date, end_date.Date as Risk_EndDate, sum(yourtable.Cut) as Cut_Total
from yourtable
join (
    select main.Material, main.Date
    from yourtable main
    left join yourtable side
    on main.Material = side.Material and
       date_sub(main.Date, interval 1 day) = side.Date
    where side.Material is null
) start_date
on yourtable.Date >= start_date.Date and
   yourtable.Material = start_date.Material
join (
    select main.Material, main.Date
    from yourtable main
    left join yourtable side
    on main.Material = side.Material and
       date_add(main.Date, interval 1 day) = side.Date
    where side.Material is null
) end_date
on yourtable.Date <= end_date.Date and
   yourtable.Material = end_date.Material
left join (
    select main.Material, main.Date
    from yourtable main
    left join yourtable side
    on main.Material = side.Material and
       date_sub(main.Date, interval 1 day) = side.Date
    where side.Material is null
) nonexistent_start_date
on nonexistent_start_date.Date between date_add(start_date.Date, interval 1 day) and date_sub(end_date.Date, interval 1 day) and
   nonexistent_start_date.Material = yourtable.Material
left join (
    select main.Material, main.Date
    from yourtable main
    left join yourtable side
    on main.Material = side.Material and
       date_add(main.Date, interval 1 day) = side.Date
    where side.Material is null
) nonexistent_end_date
on nonexistent_end_date.Date between date_add(start_date.Date, interval 1 day) and date_sub(end_date.Date, interval 1 day) and
   nonexistent_end_date.Material = yourtable.Material
where (nonexistent_start_date.Material is null) and
      (nonexistent_end_date.Material is null)
group by yourtable.Material, start_date.Date, end_date.Date;

EDIT

Further improvements:

select yourtable.Material, start_date.Date as Risk_Start_Date, end_date.Date as Risk_EndDate, sum(yourtable.Cut) as Cut_Total
from yourtable
join (
    select main.Material, main.Date
    from yourtable main
    left join yourtable side
    on main.Material = side.Material and
       date_sub(main.Date, interval 1 day) = side.Date and
       side.Cut > 0
    where side.Material is null and main.Cut > 0
) start_date
on yourtable.Date >= start_date.Date and
   yourtable.Material = start_date.Material
join (
    select main.Material, main.Date
    from yourtable main
    left join yourtable side
    on main.Material = side.Material and
       date_add(main.Date, interval 1 day) = side.Date and
       side.Cut > 0
    where side.Material is null and main.Cut > 0
) end_date
on yourtable.Date <= end_date.Date and
   yourtable.Material = end_date.Material
left join (
    select main.Material, main.Date
    from yourtable main
    left join yourtable side
    on main.Material = side.Material and
       date_sub(main.Date, interval 1 day) = side.Date and
       side.Cut > 0
    where side.Material is null and main.Cut > 0
) nonexistent_start_date
on nonexistent_start_date.Date between date_add(start_date.Date, interval 1 day) and end_date.Date and
   nonexistent_start_date.Material = yourtable.Material
left join (
    select main.Material, main.Date
    from yourtable main
    left join yourtable side
    on main.Material = side.Material and
       date_add(main.Date, interval 1 day) = side.Date and
       side.Cut > 0
    where side.Material is null and main.Cut > 0
) nonexistent_end_date
on nonexistent_end_date.Date between start_date.Date and date_sub(end_date.Date, interval 1 day) and
   nonexistent_end_date.Material = yourtable.Material
where (nonexistent_start_date.Material is null) and
      (nonexistent_end_date.Material is null) and
      (yourtable.Cut > 0)
group by yourtable.Material, start_date.Date, end_date.Date;

How do I calculate a running total over a specific time period?

There are 1 best solutions below

Related Questions in PYSPARK

Related Questions in MAX

Related Questions in CUMULATIVE-SUM

Related Questions in SUMMARIZE

Trending Questions

Popular # Hahtags

Popular Questions