Pandas Apply Rolling Window to Every Column Efficiently

52 Views Asked by At

I'm running a Monte Carlo simulation. Part of the calculation requires applying a function to a rolling window for each simulation. However, I don't know how to do that efficiently. I'm concerned that this may be a duplicate post, but I was unable to find another like this.

My minimal reproducible example is this:

import pandas as pd
import numpy as np
from scipy.stats import norm

# Number of simulations
trials = 10000

# Generate random variables
df1 = pd.DataFrame(norm.rvs(size = (500, trials)))

f = lambda x: np.sum(x > 0) > 20

# Make a deep copy of df1
df2 = df1.copy(deep = True)

for col in df2.columns:
    
    df2[col] = df2[col].rolling(window = 30).apply(f)

Is there a way to write this without a for-loop or list comprehension? Since the simulations are columns, df2 would ideally have at least 10,000 columns. Having a data frame that's the transpose of this would also be fine. Within my code this part takes about 100 times more than the next longest process in my simulation.

1

There are 1 best solutions below

0
rpanai On BEST ANSWER

Your is not exactly a minimal example. So we first try to see what happen with a minimal example and then we test performaces on the full one.

Data

import pandas as pd
import numpy as np
from scipy.stats import norm

# Number of simulations
trials = 10000

# Generate random variables
df1 = pd.DataFrame(norm.rvs(size = (500, trials)))

Min example

here I both reduce the amount of data and change your function to work with less data

df_min = df1[range(3)][:10]
# backup
df_min_bk = df_min.copy()
f_min = lambda x: np.sum(x > 0) > 2

where df_min is

          0         1         2
0  0.407418  1.741455 -0.270929
1 -0.530294  1.248405  1.201781
2 -1.193793 -0.088235  0.991222
3 -0.941380  0.499053 -0.913778
4  0.951970 -2.073895 -1.179818
5 -1.666666  1.143326  1.266971
6  0.688032 -0.188798 -0.130474
7  0.618970 -0.595450  1.420563
8  1.370329 -0.904624  1.167164
9 -0.571588  0.547064 -1.169145

Run minimal example

Using apply

%%time
for col in df_min.columns:
    
    df_min[col] = df_min[col].rolling(window=3).apply(f_min)
CPU times: user 10.6 ms, sys: 693 µs, total: 11.3 ms
Wall time: 11 ms

and the output is

     0    1    2
0  NaN  NaN  NaN
1  NaN  NaN  NaN
2  0.0  0.0  0.0
3  0.0  0.0  0.0
4  0.0  0.0  0.0
5  0.0  0.0  0.0
6  0.0  0.0  0.0
7  0.0  0.0  0.0
8  1.0  0.0  0.0
9  0.0  0.0  0.0

Avoid apply

Set df_min = df_min_bk.copy() and using built-in function we can rewrite the same function as

for col in df_min.columns:
    
    df_min[col] = df_min[col].gt(0).rolling(window=3).sum().gt(2).astype(int)
CPU times: user 1.02 ms, sys: 3.21 ms, total: 4.24 ms
Wall time: 3.9 ms

Which is almost 3x the previous case and the output is still

   0  1  2
0  0  0  0
1  0  0  0
2  0  0  0
3  0  0  0
4  0  0  0
5  0  0  0
6  0  0  0
7  0  0  0
8  1  0  0
9  0  0  0

which is ok if we remember that the first n-1 columns of a rolling windows should be NaN.

Avoid loop columns

Set again df_min = df_min_bk.copy() we can use the precious function without looping columns

%%time
df_min = df_min.gt(0).rolling(window=3).sum().gt(2).astype(int)
CPU times: user 2.21 ms, sys: 0 ns, total: 2.21 ms
Wall time: 2.22 ms

Which is almost 2x the precious case and 6X the apply one. The output is the same as the previous example.

Full Example

%%time
df2 = df2.gt(0).rolling(window=30).sum().gt(20).astype(int)
CPU times: user 607 ms, sys: 27 ms, total: 634 ms
Wall time: 633 ms

this takes less than a second. While the apply and the loop through columns is taking several minutes

Timing apply

CPU times: user 8min 40s, sys: 150 ms, total: 8min 40s
Wall time: 8min 40s

Compared to the previous method the speedup is 820x.

Conclusion

Play first with a small amount of data you can visualize, then eventually play with few full columns and then with all the data.