I have a time series dataframe from which I want to generate a smoothed correlation matrix.
An example:
import datetime as dt
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame(
    data=np.random.randn(100, 3),
    columns=['Apple', 'Banana', 'Orange'],
    index=pd.date_range(start=dt.datetime(2023, 1, 1), periods=100),
)
Then I generate a rolling series of correlation matrices, exponentially weighted with a span of 20:
ewm_corr = df.ewm(span=20).corr()
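For reference, here is a minimal sketch of how that result is laid out, since the layout matters for the smoothing step. It rebuilds the same df as above; the variable name last_matrix is only for illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(
    np.random.randn(100, 3),
    columns=['Apple', 'Banana', 'Orange'],
    index=pd.date_range('2023-01-01', periods=100),
)
ewm_corr = df.ewm(span=20).corr()

# One 3x3 matrix per date, stacked vertically: 100 dates * 3 rows = 300 rows.
print(ewm_corr.shape)
# The row index has two levels: (date, fruit).
print(ewm_corr.index.nlevels)

# Pull out the correlation matrix for a single date via the first index level.
last_matrix = ewm_corr.xs(pd.Timestamp('2023-04-10'), level=0)
print(last_matrix)
```

So the "series of matrices" is really one long frame in which consecutive rows alternate between Apple, Banana and Orange for each date.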
Next, I want to smooth each element of the correlation matrix across time, i.e. smooth the series formed by the same (row, column) entry in every matrix.
I expected that the following code would do that:
ewm_corr_smoothed = ewm_corr.ewm(span=20).mean()
However, it does not produce the data I expect. Here is what I expect for the Apple and Orange data points across time. First I extract the Apple and Orange correlation data points across time and then apply the smoothing:
ewm_corr.unstack()['Apple','Orange'].ewm(span=20).mean()
>>>
2023-01-01 NaN
2023-01-02 NaN
2023-01-03 NaN
2023-01-04 NaN
2023-01-05 NaN
...
2023-04-06 0.017641
2023-04-07 0.025754
2023-04-08 0.037171
2023-04-09 0.047193
2023-04-10 0.058412
Freq: D, Name: (Apple, Orange), Length: 100, dtype: float64
If I then check the data from the first method, here is what I get:
ewm_corr_smoothed.unstack()['Apple','Orange']
>>>
2023-01-01 NaN
2023-01-02 0.265612
2023-01-03 0.396163
2023-01-04 0.360363
2023-01-05 0.348585
...
2023-04-06 0.223486
2023-04-07 0.235869
2023-04-08 0.244261
2023-04-09 0.249865
2023-04-10 0.254088
Freq: D, Name: (Apple, Orange), Length: 100, dtype: float64
The values are significantly different, so I assume the first method is performing a different calculation. I am trying to generate matrices of smoothed correlation values across time that match the expected data above, i.e. where 2023-04-10 has the value 0.058412. I assume this must be possible.
I hope that is clear. Thanks!
Update:
It seems like this achieves the goal:
ewm_corr_smoothed = df.ewm(span=25).corr().unstack().ewm(span=25).mean().stack()
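Now when I check the Apple and Orange data, I get values in line with what I expected (note that this snippet uses span=25 rather than 20, so they do not match the earlier figures exactly):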
ewm_corr_smoothed.unstack()['Apple','Orange']
>>>
2023-01-02 -1.000000
2023-01-03 -0.642511
2023-01-04 -0.659027
2023-01-05 -0.659796
2023-01-06 -0.599816
...
2023-04-06 0.031091
2023-04-07 0.035798
2023-04-08 0.042619
2023-04-09 0.048757
2023-04-10 0.055900
Freq: D, Name: (Apple, Orange), Length: 99, dtype: float64
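As a sanity check, here is a small sketch comparing the unstack-then-smooth approach with smoothing a single pair on its own. It assumes the same span (20) is used for both steps, as in the first half of the question; the names wide, wide_smoothed and pair_smoothed are just for illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(
    np.random.randn(100, 3),
    columns=['Apple', 'Banana', 'Orange'],
    index=pd.date_range('2023-01-01', periods=100),
)
span = 20  # assumption: same span for the correlation and the smoothing

# Wide layout: one row per date, one column per (fruit, fruit) pair.
wide = df.ewm(span=span).corr().unstack()
wide_smoothed = wide.ewm(span=span).mean()

# Smooth one pair on its own, as in the "expected" calculation.
pair_smoothed = wide[('Apple', 'Orange')].ewm(span=span).mean()

# Each column is smoothed independently, so the two results should agree.
same = np.allclose(
    wide_smoothed[('Apple', 'Orange')], pair_smoothed, equal_nan=True
)
print(same)
```

In the wide layout every column is a single pair's correlation through time, so smoothing down the rows is smoothing across dates only. Calling .stack() afterwards just converts the result back to the long (date, fruit) layout.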
However, I remain interested in an explanation as to what pandas is actually doing in the first method. Thanks!
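My current understanding, offered as a sketch rather than a definitive answer: calling .ewm(...).mean() on the long frame smooths down the rows of each column in order and does not group by the index levels. Since consecutive rows alternate between Apple, Banana and Orange, each column's smoothed values blend different pairs, including the self-correlation of 1.0. The check below compares the first method against a flat, index-free series (flat is an illustrative name).

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(
    np.random.randn(100, 3),
    columns=['Apple', 'Banana', 'Orange'],
    index=pd.date_range('2023-01-01', periods=100),
)
ewm_corr = df.ewm(span=20).corr()
first_method = ewm_corr.ewm(span=20).mean()

# Drop the (date, fruit) index and treat the 'Apple' column as one flat
# sequence of 300 values, then smooth that sequence.
flat = pd.Series(ewm_corr['Apple'].to_numpy()).ewm(span=20).mean()

# If the first method ignores the index structure, these should agree.
matches_flat = np.allclose(
    first_method['Apple'].to_numpy(), flat.to_numpy(), equal_nan=True
)
print(matches_flat)

# The rows feeding that sequence cycle through the fruits for every date,
# so corr(Apple, Apple) = 1.0 is mixed in with the cross-correlations.
print(ewm_corr['Apple'].tail(6))
```

If that reading is right, it would explain why the first method gives values around 0.25 for Apple and Orange: every third input in the smoothed sequence is exactly 1.0.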