How to find the elementwise harmonic mean across two Pandas dataframes


Similarly to this post (efficient function to find harmonic mean across different pandas dataframes), I have two Pandas dataframes that are identical in shape, and I want to find the harmonic mean of each pair of elements, one from each dataframe in the same location. The solution given in that post was to use a Panel, but Panel is now deprecated.

If I do this:

import pandas as pd
import numpy as np
from scipy.stats.mstats import hmean

df1 = pd.DataFrame(dict(x=np.random.randint(5, 10, 5), y=np.random.randint(1, 6, 5)))
df2 = pd.DataFrame(dict(x=np.random.randint(5, 10, 5), y=np.random.randint(1, 6, 5)))
dfs_dictionary = {'DF1':df1,'DF2':df2}
df=pd.concat(dfs_dictionary)
print(df)

       x  y
DF1 0  9  4
    1  6  4
    2  7  2
    3  5  2
    4  5  2
DF2 0  9  2
    1  7  1
    2  7  1
    3  9  5
    4  8  3

x = df.groupby(level = 1).apply(hmean, axis = None).reset_index()
print(x)
   index         0
0      0  4.114286
1      1  2.564885
2      2  2.240000
3      3  3.956044
4      4  3.453237

I only get one column of values. Why? I was expecting two columns as per the original df, one for the hmean of the x values and one for the hmean of the y values. How can I achieve what I want to do?

There are 2 best solutions below

BEST ANSWER

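The reason is that you pass axis=None to hmean, which flattens each group into a single array before averaging, so you get one number per group. Remember that when you do groupby().apply(), the function receives the whole group as a DataFrame. With level=1, each group is the pair of rows that share an inner index label, e.g. df.xs(0, level=1). Just remove axis=None so that hmean falls back to its default axis=0 and averages each column separately: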

x = df.groupby(level = 1).apply(hmean).reset_index()

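And you get one array per group, holding the x and y means (the values differ from the question's output because the input data is randomly generated):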

   index                                        0
0      0                 [6.461538461538462, 3.0]
1      1  [5.833333333333333, 2.4000000000000004]
2      2                               [8.0, 3.0]
3      3  [6.857142857142858, 2.4000000000000004]
4      4   [6.461538461538462, 2.857142857142857]

Or you can use agg:

x = df.groupby(level = 1).agg({'x':hmean,'y':hmean})

and get:

          x         y
0  6.461538  3.000000
1  5.833333  2.400000
2  8.000000  3.000000
3  6.857143  2.400000
4  6.461538  2.857143

If you have more columns than just x and y:

x = df.groupby(level=1).agg({c:hmean for c in df.columns})
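
For exactly two frames of the same shape, the groupby step can also be skipped. The following is only a sketch, not the method from the answer above: it assumes both frames share the same index and columns, hard-codes the values printed in the question so the result is reproducible, and uses the illustrative names stacked, elementwise and closed_form. It stacks the two frames and averages over the new axis, then cross-checks against the two-value identity H(a, b) = 2ab / (a + b).

```python
import numpy as np
import pandas as pd
from scipy.stats.mstats import hmean

# Fixed values copied from the question's printed frames, so the result is reproducible.
df1 = pd.DataFrame({"x": [9, 6, 7, 5, 5], "y": [4, 4, 2, 2, 2]})
df2 = pd.DataFrame({"x": [9, 7, 7, 9, 8], "y": [2, 1, 1, 5, 3]})

# Stack the two frames into shape (2, rows, cols) and average over the first axis,
# so each output cell combines the two values found at the same position.
stacked = np.stack([df1.to_numpy(), df2.to_numpy()])
elementwise = pd.DataFrame(hmean(stacked, axis=0), index=df1.index, columns=df1.columns)

# Closed form for exactly two positive values: H(a, b) = 2ab / (a + b).
closed_form = 2 * df1 * df2 / (df1 + df2)

print(elementwise)
print(np.allclose(elementwise, closed_form))
```

The closed form only covers the two-frame case; the stacked version extends to more frames by adding them to the list passed to np.stack.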

SECOND ANSWER

Just try removing the axis=None parameter.
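
To see what the axis argument changes, here is a small sketch on a single group. It assumes the values printed in the question (hard-coded here instead of random) and picks out the group for inner index 0 with df.xs, which is roughly what groupby(level=1) hands to the applied function. The names group, flat and per_column are illustrative only.

```python
import pandas as pd
from scipy.stats.mstats import hmean

# Fixed values copied from the question's printed frames.
df1 = pd.DataFrame({"x": [9, 6, 7, 5, 5], "y": [4, 4, 2, 2, 2]})
df2 = pd.DataFrame({"x": [9, 7, 7, 9, 8], "y": [2, 1, 1, 5, 3]})
df = pd.concat({"DF1": df1, "DF2": df2})

# The two rows (one from DF1, one from DF2) that share inner index label 0.
group = df.xs(0, level=1)

# axis=None flattens all four values (9, 4, 9, 2) into one harmonic mean.
flat = hmean(group.to_numpy(), axis=None)

# axis=0 (the default) averages down the rows, giving one value per column.
per_column = hmean(group.to_numpy(), axis=0)

print(flat)
print(per_column)
```

The flattened result reproduces the first value in the question's single-column output, while the per-column result keeps x and y separate.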