How do I split-apply-combine on a dataframe with multi-index columns?

91 Views Asked by At
df = pd.DataFrame([[100,90,80,70,36,45], [101,78,65,88,55,78], [92,77,42,79,43,32], [103,98,76,54,45,65]], index = pd.date_range(start='2022-01-01' ,periods=4))
df.columns = pd.MultiIndex.from_tuples((("mkf", "Open"), ("mkf", "Close"), ("tdf", "Open"), ("tdf","Close"), ("ghi","Open"), ("ghi", "Close")))
df

            mkf        tdf        ghi
           Open Close Open Close Open Close
2022-01-01  100    90   80    70   36    45
2022-01-02  101    78   65    88   55    78
2022-01-03   92    77   42    79   43    32
2022-01-04  103    98   76    54   45    65

The above data frame is composed of three smaller data frames that are identical in structure (i.e. same columns and indices). You can think of it as a large data frame that has been concatenated from three smaller ones. In reality there are thousands of these smaller df's concatenated together but for simplicity I've kept it at three.

I'd like to be able to use the split-apply-combine principle to create a new column called r in each of the three sub data frames.

The split would be based on df.columns.levels[0]. The apply would be np.log(['Close']).diff() The combined data frame would produce the below:

            mkf                  tdf                  ghi
           Open Close         r Open Close         r Open Close         r
2022-01-01  100    90       NaN   80    70       NaN   36    45       NaN
2022-01-02  101    78 -0.143101   65    88  0.228842   55    78  0.550046
2022-01-03   92    77 -0.012903   42    79 -0.107889   43    32 -0.890973
2022-01-04  103    98  0.241162   76    54 -0.380464   45    65  0.708651
1

There are 1 best solutions below

2
mozway On

You could use:

(pd.concat([df, 
            (np.log(df.xs('Close', level=1, axis=1, drop_level=False))
               .rename(columns={'Close': 'r'}, level=1).diff())
            ], axis=1)
  .sort_index(axis=1, level=0, sort_remaining=False)
)

or with a groupby approach:

(df.stack(0)
   .assign(r=lambda d:
           d.groupby(level=1)['Close'].apply(lambda s: np.log(s).diff())
           )
   .unstack(1).swaplevel(axis=1)
   .sort_index(axis=1, level=0, sort_remaining=False)
)

output:

            ghi                  mkf                  tdf                
           Open Close         r Open Close         r Open Close         r
2022-01-01   36    45       NaN  100    90       NaN   80    70       NaN
2022-01-02   55    78  0.550046  101    78 -0.143101   65    88  0.228842
2022-01-03   43    32 -0.890973   92    77 -0.012903   42    79 -0.107889
2022-01-04   45    65  0.708651  103    98  0.241162   76    54 -0.380464