Just an example... I have a 5M row dataframe full of runners and a log of how many miles they ran on any given day. If a runner doesn't run on a day, there is no 0 entry.
| Runner | Day | Miles |
|---|---|---|
| A | 1 | 3 |
| A | 3 | 4 |
| A | 8 | 4 |
| A | 9 | 2 |
I'm trying to get a field for how much they ran in the last 7 calendar days (not including the current row), as well as the next 7 calendar days (including current row).
| Runner | Day | Miles | Last 7 | Next 7 |
|---|---|---|---|---|
| A | 1 | 3 | null | 7 |
| A | 3 | 4 | 3 | 10 |
| A | 8 | 4 | 7 | 6 |
| A | 9 | 2 | 10 | 2 |
I think can get the answer for a single row.
Next 7 for row 1 for example...
df.loc[(df['Runner'] == 'A') & (df['day'] >= 1) & (df['day'] < 1+7)]['Miles'].sum()
I'm just unsure how to apply this type of thing to every row in an efficient way. The methods I've seen so far only utilize data from the same row. Any help?
you may use rolling sum.
Let's say the maximum day number is 20
And these are the lists of all runners and all days
Then we can get a complete list of runner and day combinations
In the 1st line of the following 4-line code, we fill the zeros back, then in the 2nd we group the data by runner and in the 3rd we apply rolling sum on
Milesrequiring at least only 1 record to do asum. You may adjust this parameter. The last line takes care of the format.We only need to do 7-day sum once, and shift the records for 'last-7-day' and 'next-7-day' by assigning two new, shifted
Daycolumns to replace the original one, and merge the results.