Cumsum in Pandas Dataframe but reset to zero on a specific date every new year

68 Views Asked by At

I have a dataframe that is in descending order of time and indexed by Race_ID and Student_ID.

Race_ID   Date           Student_ID      Mark     
1         1/10/2023      1               5        
1         1/10/2023      2               8        
1         1/10/2023      3               7        
8         1/10/2023      4               4        
8         1/1/2023       1               9        
8         1/1/2023       2               3        
8         1/1/2023       3               5        
8         1/1/2023       4               10       
2         11/9/2022      1               2        
2         11/9/2022      2               4        
2         11/9/2022      3               9        
3         17/4/2022      5               3        
3         17/4/2022      2               4        
3         17/4/2022      3               3        
3         17/4/2022      4               7        
4         1/3/2022       1               4        
4         1/3/2022       2               9        
5         1/1/2021       1               6        
5         1/1/2021       2               1        
5         1/1/2021       3               8        

I want to create a new column Seasonal_Mark which is the cumsum of Mark by each student, with the caveat that the sum is set to zero on every 1 February. So the desired outcome looks like:

Race_ID   Date           Student_ID      Mark     Seasonal_Mark 
1         1/10/2023      1               5        5
1         1/10/2023      2               8        8
1         1/10/2023      3               7        7
8         1/10/2023      4               4        4
8         1/1/2023       1               9        15 (4+2+9)
8         1/1/2023       2               3        20 (4+9+4+3)
8         1/1/2023       3               5        17 (9+3+5)
8         1/1/2023       4               10       17 (7+10)                  
2         11/9/2022      1               2        6  (4+2)
2         11/9/2022      2               4        17 (4+9+4)
2         11/9/2022      3               9        12 (9+3)
3         17/4/2022      5               3        3
3         17/4/2022      2               4        13 (4+9)
3         17/4/2022      3               3        3
3         17/4/2022      4               7        7
4         1/3/2022       1               4        4
4         1/3/2022       2               9        9
5         1/1/2021       1               6        6
5         1/1/2021       2               1        1
5         1/1/2021       3               8        8

Thank you so much in advance.

2

There are 2 best solutions below

0
PaulS On BEST ANSWER

Another possible solution, which creates an adjusted year column that corresponds to the year of the date, if the date is February or later (in the year), and to the year minus 1, if the date is earlier than February (in the year). This new column is afterwards used inside groupby.

df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)

m = df['Date'].dt.month
y = df['Date'].dt.year

df.assign(
    Seasonal_Mark = (df.assign(
        adjusted_year = np.where(m >= 2, y, y - 1))
    .sort_values(by='Date')
    .groupby(['Student_ID', 'adjusted_year'])['Mark'].cumsum()))

Output:

    Race_ID       Date  Student_ID  Mark  Seasonal_Mark
0         1 2023-10-01           1     5              5
1         1 2023-10-01           2     8              8
2         1 2023-10-01           3     7              7
3         8 2023-10-01           4     4              4
4         8 2023-01-01           1     9             15
5         8 2023-01-01           2     3             20
6         8 2023-01-01           3     5             17
7         8 2023-01-01           4    10             17
8         2 2022-09-11           1     2              6
9         2 2022-09-11           2     4             17
10        2 2022-09-11           3     9             12
11        3 2022-04-17           5     3              3
12        3 2022-04-17           2     4             13
13        3 2022-04-17           3     3              3
14        3 2022-04-17           4     7              7
15        4 2022-03-01           1     4              4
16        4 2022-03-01           2     9              9
17        5 2021-01-01           1     6              6
18        5 2021-01-01           2     1              1
19        5 2021-01-01           3     8              8
0
mozway On

You can use a custom yearly Period starting on the 1st of February as a secondary grouper, for this use to_period with A-JAN as freq:

df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)

df['Seasonal_Mark'] = (df.sort_values(by='Date')
                         .groupby(['Student_ID', df['Date'].dt.to_period('A-JAN')])
                       ['Mark'].cumsum()
                      )

Output:

    Race_ID       Date  Student_ID  Mark  Seasonal_Mark
0         1 2023-10-01           1     5              5
1         1 2023-10-01           2     8              8
2         1 2023-10-01           3     7              7
3         8 2023-10-01           4     4              4
4         8 2023-01-01           1     9             15
5         8 2023-01-01           2     3             20
6         8 2023-01-01           3     5             17
7         8 2023-01-01           4    10             17
8         2 2022-09-11           1     2              6
9         2 2022-09-11           2     4             17
10        2 2022-09-11           3     9             12
11        3 2022-04-17           5     3              3
12        3 2022-04-17           2     4             13
13        3 2022-04-17           3     3              3
14        3 2022-04-17           4     7              7
15        4 2022-03-01           1     4              4
16        4 2022-03-01           2     9              9
17        5 2021-01-01           1     6              6
18        5 2021-01-01           2     1              1
19        5 2021-01-01           3     8              8