I want to calculate the rolling sum of n rows in my dataset where the window size 'n' depends on the sum itself. For example, I want to slide the window as soon as the rolling sum of time exceeds 5 mins. Basically, I want to calculate how much distance the person traveled in the last 5 mins but the time steps are not equally spaced. Here's a dummy data.table for clarity (the last two columns are required):
I am looking for a data.table solution in R
Input data table:
| ID | Distance | Time |
|---|---|---|
| 1 | 2 | 2 |
| 1 | 4 | 1 |
| 1 | 2 | 1 |
| 1 | 2 | 2 |
| 1 | 3 | 3 |
| 1 | 6 | 3 |
| 1 | 1 | 1 |
Desired Output:
| ID | Distance | Time | 5.min.rolling.distance | 5.min.rolling.time |
|---|---|---|---|---|
| 1 | 2 | 2 | NA | NA |
| 1 | 4 | 1 | NA | NA |
| 1 | 2 | 1 | NA | NA |
| 1 | 2 | 2 | 10 | 6 |
| 1 | 3 | 3 | 5 | 5 |
| 1 | 6 | 3 | 9 | 6 |
| 1 | 1 | 1 | 10 | 7 |
Here is a solution that works with
doubletime units as well as a simpler solution that will work withintegertime units. I tested thedoublesolution on 10,000 records and on my 2015 laptop it executed instantly. I can't make any guarantees about performance on 40 GB of data.If you wanted to generalize this code I'd look at the RcppRoll package and learn how to implement c++ code in R.
Solution with
doubletime unitsI broke this down into two problems. First, figure out the window size by looking back until we get to at least 5 minutes (or run out of data). Second, take the sum of distances and time from the current observation to the look back unit.
Bad loop code in R usually tries to 'grow' a vector, its a huge efficiency gain to pre-allocate the vector length and then change elements in it.
Solution with
integertime unitsIf your time unit is in integers and your data isn't too big, it may work to
completeyour dataset. This is a little bit of a hack, but here I create a continuostimeidvariable that goes from the starting time to the maximum time, and create one row for each integer unit of time. From there its easy to calculate a rolling cumulative sum for the last five time units. Finally, we get rid of all the fake rows we added in (you want to make sure to do that because they will have invalid cumulative sum data. Also, important to note that I useroll_sumrand notroll_sum;roll_sumrincludes 4 padding NA's on the left side of the output vector for the first 4 units.