Cummulative substraction of 2 values using SQL or Python

62 Views Asked by At
rake_name origin destination created_At bpc_issue_date departure_time journey_distance bpc_balance_distance_km abc rn logic
ARIL-05 a b 09-02-2024 06:37 08-02-2024 18:30 09-02-2024 13:40 173.4600067 5271 5097.539993 1 5271-173.4600067
ARIL-05 a b 11-02-2024 02:58 08-02-2024 18:30 11-02-2024 04:15 1099.369995 5271 3998 2 5097-1099
ARIL-05 a b 14-02-2024 12:46 08-02-2024 18:30 14-02-2024 15:50 1099.369995 5271 2898 3 3998-1099
ARIL-05 a b 16-02-2024 17:29 08-02-2024 18:30 17-02-2024 02:30 175.6600037 5271 2713 4 2898-175
ARIL-05 a b 17-02-2024 10:54 08-02-2024 18:30 18-02-2024 11:30 1178.880005 5271 1544 5 2713-1178
ARIL-05 a b 20-02-2024 11:44 08-02-2024 18:30 20-02-2024 15:20 1179.369995 5271 364 6 1544-1179

Please look at the logic column for substraction i want.

so i want to update abc column with difference between bpc_balance_distance_km and journey_distance once i get that value i want to substract below abc values with the result of abc - respective journey_distance.

i have tried sql, however i am not sure how can i populate row >1, am able to populate 1st row using case when rn= 1 then bpc_balance_km - journey_distance

i have tried following query however, i awnt to covert this query int databricks sql supported query. this query works

SET @newabc := 0;

WITH cte AS (
 SELECT *,
       CASE WHEN rn=1 THEN @newabc := bpc_balance_distance_km-journey_distance
         ELSE @newabc := @newabc-journey_distance END AS nabc
 FROM derived_abc ORDER BY rn)
 SELECT *, ROUND(nabc) AS final_result
 FROM cte;

any suggastion in python or sql will be helpful. here is DB fiddle link : https://dbfiddle.uk/JBzaylSb i am using databricks sql

2

There are 2 best solutions below

5
FanoFN On BEST ANSWER

Here, try this:

SET @newabc := 0;

WITH cte AS (
 SELECT *,
       CASE WHEN rn=1 THEN @newabc := bpc_balance_distance_km-journey_distance
         ELSE @newabc := @newabc-journey_distance END AS nabc
 FROM derived_abc ORDER BY rn)
 SELECT *, ROUND(nabc) AS final_result
 FROM cte;

In the query above, the value of bpc_balance_distance_km-journey_distance for rn=1 is assigned to the @newabc variable. Then for rows where rn > 1, it'll take the current assigned @newabc value to subtract with the row's journey_distance value.

There'll be a slight different result in which causes by what I already point out in the comment - the inconsistencies of the decimal treatment. You can see the example in this demo fiddle.

1
bisk d On
def calculate_nabc(rn, bpc_balance_distance_km, journey_distance, newabc):
    if rn == 1:
        newabc[0] = bpc_balance_distance_km - journey_distance
    else:
        newabc[0] = newabc[0] - journey_distance
    return newabc[0]

def calculate_final_result(df):
    # Add a row number column
    df['rn'] = df.groupby('rake_name').cumcount() + 1
    
    # Initialize the value of newabc
    newabc = [0.0]

    # Calculate the nabc column using the UDF
    df['nabc'] = df.apply(lambda row: calculate_nabc(row['rn'], row['bpc_balance_distance_km'], row['journey_distance'], newabc), axis=1)

    # Round the nabc values
    df['final_result'] = df['nabc'].round()

    return df
df = calculate_final_result(df)
df.display()