pandas pd.cut with DataFrame.groupby not matching loop aggregation

I am calculating some statistics (mean values from binned data) for a heat map. I am somewhat skeptical of the results from the "pandorable" (idiomatic pandas) method, so I recalculated them using a nested for loop and DataFrame masking. The results are almost identical except for one or two records.

Aggregation approach #1

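import pandas as pd

# Bin BC into 16 equal-width intervals, closed on the right.
intervals = pd.cut(df['BC'], bins=16, right=True)

# Mean per (A, bin); stored under a new name so the original df stays intact for the loop-based check.
result = df.groupby(['A', intervals]).mean().unstack(fill_value=0).stack().reset_index(names=['A', 'BC_bins'])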

| A  | BC_bins    | Y Mean |
| ---|------------|--------|
| 0  | (.05,.07]  | 464    |
| 1  | (.07,.09]  | 417    | 
| 2  | (.09,.12]  | 377    |
| 3  | (.12,.14]  | 338    |
| 4  | (.14,.16]  | 309    |
| 5  | (.16,.18]  | 290    |
| 6  | (.18,.20]  | 277    |
| 7  | (.20,.23]  | 268    |
| 8  | (.23,.25]  | 234    |
| 9  | (.25,.27]  | 239    |
| 10 | (.27,.29]  | 233    |
| 11 | (.29,.31]  | 230    |
| 12 | (.31,.34]  | 228    |
| 13 | (.34,.36]  | 226    |
| 14 | (.36,.38]  | 223    |
| 15 | (.38,.40]  | 221    |

Aggregation approach #2

intervals = pd.cut(df['BC'], bins=16, right=True)

for a in df['A'].unique():
    for i in intervals.unique():
        # Rows of group a whose BC falls inside the interval (i.left, i.right].
        subset = df[(df['BC'] > i.left) & (df['BC'] <= i.right) & (df['A'] == a)]
        subset.mean()  # --> written to an Excel sheet and compared with the groupby result

| A  | BC_bins    | Y Mean |
| ---|------------|--------|
| 0  | (.05,.07]  | 464    |
| 1  | (.07,.09]  | 417    | 
| 2  | (.09,.12]  | 377    |
| 3  | (.12,.14]  | 338    |
| 4  | (.14,.16]  | 309    |
| 5  | (.16,.18]  | 290    |
| 6  | (.18,.20]  | 277    |
| 7  | (.20,.23]  | 254    |
| 8  | (.23,.25]  | no data? |
| 9  | (.25,.27]  | 239    |
| 10 | (.27,.29]  | 233    |
| 11 | (.29,.31]  | 230    |
| 12 | (.31,.34]  | 228    |
| 13 | (.34,.36]  | 226    |
| 14 | (.36,.38]  | 223    |
| 15 | (.38,.40]  | 221    |

Everything matches until A=7 and A=8. When I look at the summary stats from approach #2 for A=8, there are no samples in that bin, so the result is just NA. I suspect this is also what skews A=7. I am hoping someone can shed light on this behavior. I am going to change the number of bins and see whether I make the same observations.
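One possible cause (an assumption, not confirmed for this data) is that pd.cut assigns rows using the exact bin edges, while the Interval labels store edges rounded according to the `precision` argument (3 by default). A mask built from `i.left` and `i.right` could then miss rows that sit very close to an edge. A minimal sketch of how to check for this, using a synthetic DataFrame; the values and the names `codes`, `cats`, `inside` and `mismatched` are made up for illustration:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data; only the column names come from the question.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "A": rng.integers(0, 4, size=2000),
    "BC": rng.uniform(0.05, 0.40, size=2000),
    "Y": rng.normal(300, 50, size=2000),
})

intervals = pd.cut(df["BC"], bins=16, right=True)

# Integer code of the bin pd.cut assigned to each row (-1 would mean "no bin").
codes = intervals.cat.codes.to_numpy()
cats = intervals.cat.categories  # IntervalIndex holding the label edges
assigned = codes >= 0

# Edges of each row's assigned bin, as stored on the Interval labels.
left = cats.left.to_numpy()[codes]
right = cats.right.to_numpy()[codes]

# Does each row also pass the approach #2 style mask for its own bin?
bc = df["BC"].to_numpy()
inside = assigned & (bc > left) & (bc <= right)

# Rows that pd.cut placed in a bin but that the edge-based mask would miss.
mismatched = df.loc[assigned & ~inside, ["A", "BC", "Y"]]
print(len(mismatched), "rows disagree between the two approaches")
```

If `mismatched` is non-empty on the real data, the rows it lists are the ones the two approaches would count in different bins.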

Is my implementation of approach #1 valid? It seems to be: I've seen similar techniques using pd.cut and groupby in textbooks and in other posts.
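To make empty (A, bin) cells visible, the groupby can report a sample count next to the mean. The following is a sketch under assumptions, not a verdict on the original code: the data is synthetic, `observed=False` is passed explicitly, the `unstack(...).stack()` round trip is left out, and the name `summary` is hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic data; one group is deliberately left without samples in the upper bins.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "A": rng.integers(0, 3, size=500),
    "BC": rng.uniform(0.05, 0.40, size=500),
    "Y": rng.normal(300, 50, size=500),
})
df = df[~((df["A"] == 2) & (df["BC"] > 0.30))]

# Name the binned Series so the index level gets a readable label.
intervals = pd.cut(df["BC"], bins=16, right=True).rename("BC_bins")

# observed=False keeps every (A, bin) combination, including empty ones.
summary = (
    df.groupby(["A", intervals], observed=False)["Y"]
      .agg(["mean", "count"])
      .reset_index()
)

# Empty cells appear with count 0 and a NaN mean.
print(summary[summary["count"] == 0])
```

Comparing the `count` column with the number of rows selected by each mask in the loop shows directly which cells the two approaches populate differently.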

Also, I should note that the distribution of the samples within the bins is not necessarily normal, but that does not matter for this particular analysis at the moment. I might look at pd.qcut() to see if anything changes. I am just trying to make sense of the discrepancies.
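For reference, a small sketch of how pd.qcut differs from pd.cut: pd.cut makes bins of equal width, while pd.qcut places the edges at quantiles so each bin holds roughly the same number of samples. The skewed synthetic series and the names `equal_width` and `equal_count` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Skewed synthetic values standing in for the BC column.
rng = np.random.default_rng(2)
bc = pd.Series(rng.exponential(0.1, size=1000) + 0.05, name="BC")

equal_width = pd.cut(bc, bins=16)  # same width per bin, uneven counts
equal_count = pd.qcut(bc, q=16)    # uneven widths, roughly equal counts

# Samples per bin, listed in bin order.
width_counts = equal_width.value_counts(sort=False)
count_counts = equal_count.value_counts(sort=False)

print(width_counts)
print(count_counts)
```

With quantile bins, sparsely populated or empty bins become less likely, although the bin edges then depend on the data distribution.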
