Please can anyone help me navigate through this logic, I have tried using the lag function, but it doesn't correctly with my window.partition("component').Orderby("x"). Thanks!
conditions: if x value is 0 return current_date on new column inb_date_assigned if x > max of cum_inb return null else
- return inb_date[1], where cum_inb = 32 on x until x= 22, because 22 <= 36 (cum_inb)[1],
- return inb_date[2], where cum_inb = 68 on x = 52 because 52 <= 68 but greater than 36 (cum_inb)[1], continue till, where x <= max(cum_inb). otherwise return null
my df:
x | inb_date | cum_inb
0 | 2024-03-25 | 32
6 | 2024-05-01 | 64
14 | 2024-08-07 | 104
22 | null | null
32 | null | null
62 | null | null
82 | null | null
118 | null | null
120 | null | null
expected result: to get values in cum_inb and inb_date_assigned:
x | inb_date | cum_inb | inb_date_assigned
0 | 2024-03-25 | 32 | today's date
6 | 2024-05-01 | 32 | 2024-03-25
14 | 2024-08-07 | 32 | 2024-03-25
22 | null | 32 | 2024-03-25
52 | null | 64 | 2024-05-01
82 | null | 104 | 2024-08-07
118 | null | null | null
120 | null | null | null
I tried something like this but not working:
window_specx = Window.partitionBy("component").orderBy("x")
# Define the current_date as a constant column
current_date_col = F.current_date()
# Apply the conditions directly within withColumn
df_result = final_df.withColumn("inb_date_assigned",
F.when(F.col("x") == 0, current_date_col)
.when((F.col("cum_inb_qty").isNotNull()) & (F.col("cum_inb_qty") - F.col("x") \>= 0),
F.last("inbound_date", ignorenulls=True).over(window_specx)
).otherwise(None))