I am calculating a statistic (the mean of binned data) for a heat map. I am somewhat skeptical of the results from the pandorable method, so I recalculated them using a nested for loop and DataFrame masking. The results are almost identical except for one or two records.
Approach #1

    import pandas as pd

    intervals = pd.cut(df['BC'], bins=16, right=True)
    df = df.groupby(['A', intervals]).mean().unstack(fill_value=0).stack().reset_index(names=['A', 'BC_bins'])

| A | BC_bins | Y Mean |
| ---|------------|--------|
| 0 | (.05,.07] | 464 |
| 1 | (.07,.09] | 417 |
| 2 | (.09,.12] | 377 |
| 3 | (.12,.14] | 338 |
| 4 | (.14,.16] | 309 |
| 5 | (.16,.18] | 290 |
| 6 | (.18,.20] | 277 |
| 7 | (.20,.23] | 268 |
| 8 | (.23,.25] | 234 |
| 9 | (.25,.27] | 239 |
| 10 | (.27,.29] | 233 |
| 11 | (.29,.31] | 230 |
| 12 | (.31,.34] | 228 |
| 13 | (.34,.36] | 226 |
| 14 | (.36,.38] | 223 |
| 15 | (.38,.40] | 221 |
Approach #2

    intervals = pd.cut(df['BC'], bins=16, right=True)
    for a in df['A'].unique():
        for i in intervals.unique():
            # Rows for this value of A whose BC falls inside the interval (left, right]
            subset = df[(df['BC'] > i.left) & (df['BC'] <= i.right) & (df['A'] == a)]
            subset.mean()  # --> send to Excel sheet and compare with approach #1

| A | BC_bins | Y Mean |
| ---|------------|--------|
| 0 | (.05,.07] | 464 |
| 1 | (.07,.09] | 417 |
| 2 | (.09,.12] | 377 |
| 3 | (.12,.14] | 338 |
| 4 | (.14,.16] | 309 |
| 5 | (.16,.18] | 290 |
| 6 | (.18,.20] | 277 |
| 7 | (.20,.23] | 254 |
| 8 | (.23,.25] | no data?|
| 9 | (.25,.27] | 239 |
| 10 | (.27,.29] | 233 |
| 11 | (.29,.31] | 230 |
| 12 | (.31,.34] | 228 |
| 13 | (.34,.36] | 226 |
| 14 | (.36,.38] | 223 |
| 15 | (.38,.40] | 221 |
Everything looks good until A=7 and A=8. When I look at the summary statistics from approach #2 for A=8, there are no samples in that bin, so the result is just NA. I think this is skewing the result for A=7. I am hoping someone can shed light on this behavior. I am going to change the number of bins and see whether I observe the same thing.
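One way to check whether a cell such as A=8 is truly empty is to count the samples per (A, bin) pair before averaging. The sketch below is only an illustration on synthetic data: the `demo` frame, its values, and the deliberately emptied BC range are assumptions, not the data from this question. `observed` is passed explicitly so the result does not depend on the installed pandas version's default.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (hypothetical values).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "A": rng.integers(0, 4, size=2000),
    "BC": rng.uniform(0.05, 0.40, size=2000),
    "Y": rng.normal(300, 50, size=2000),
})
# Drop all A == 2 rows in one BC range so that some cells are empty.
demo = demo[~((demo["A"] == 2) & demo["BC"].between(0.20, 0.30))]

bins = pd.cut(demo["BC"], bins=16, right=True)

# observed=False keeps every (A, bin) combination, including empty ones.
cells = demo.groupby(["A", bins], observed=False)["Y"]
counts = cells.size()  # 0 for an empty cell
means = cells.mean()   # NaN for an empty cell

empty_cells = counts[counts == 0]
print(empty_cells)
```

Listing the empty cells next to the means makes it easier to tell a genuine "no data" cell apart from a value that was filled in or shifted during reshaping.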
Is my implementation of approach #1 valid? It seems like it is. I've seen similar techniques using pd.cut and DataFrame.groupby in textbooks and in other posts.
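A cross-check of the groupby result against an explicit loop can be sketched as follows, again on synthetic data (the `demo` frame and its values are assumptions). Here the loop selects rows by the bin that pd.cut actually assigned, using the category codes, instead of comparing against `interval.left` and `interval.right`. It also counts the rows where those two selection rules disagree. The pd.cut documentation describes `precision` as the precision at which bin labels are stored and displayed, so a value very close to an edge might not satisfy the edge comparison for its assigned bin. Whether that accounts for the discrepancy described above is not established here.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (hypothetical values).
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "A": rng.integers(0, 4, size=2000),
    "BC": rng.uniform(0.05, 0.40, size=2000),
    "Y": rng.normal(300, 50, size=2000),
})
bins = pd.cut(demo["BC"], bins=16, right=True)
codes = bins.cat.codes  # integer position of each row's assigned bin

# Approach #1 style: one groupby over (A, bin), observed cells only.
grouped = demo.groupby(["A", bins], observed=True)["Y"].mean()

# Approach #2 style: explicit loop, masking on the assigned bin.
loop_means = []
edge_disagreements = 0
for a in sorted(demo["A"].unique()):
    in_a = demo["A"] == a
    for code, interval in enumerate(bins.cat.categories):
        by_label = in_a & (codes == code)
        by_edges = in_a & (demo["BC"] > interval.left) & (demo["BC"] <= interval.right)
        # Rows that one rule selects and the other does not.
        edge_disagreements += int((by_label != by_edges).sum())
        if by_label.any():
            loop_means.append(demo.loc[by_label, "Y"].mean())

print(np.allclose(grouped.to_numpy(), loop_means))
print(edge_disagreements)
```

If the first line prints `True`, the groupby and the loop agree when both use the same bin assignment. A nonzero second number would point at the edge comparison, not the aggregation, as the place to look.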
Also, I should note that the distribution of the samples within the bins is not necessarily normal, but I don't really care about that for this particular analysis at the moment. I might look at pd.qcut() to see if anything changes. I am just trying to make sense of the discrepancies.
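For the pd.qcut() idea, the sketch below contrasts equal-width bins from pd.cut with quantile bins from pd.qcut on a skewed synthetic sample (the lognormal data is an assumption for illustration only). Equal-width bins can leave some bins sparse or empty, whereas quantile bins hold roughly equal counts, which changes how many samples back each mean.

```python
import numpy as np
import pandas as pd

# Skewed synthetic sample standing in for the BC column (hypothetical values).
rng = np.random.default_rng(2)
bc = pd.Series(rng.lognormal(mean=-1.5, sigma=0.4, size=2000), name="BC")

# Equal-width bins: same width, very different sample counts.
width_counts = pd.cut(bc, bins=16).value_counts(sort=False)

# Quantile bins: different widths, roughly equal sample counts.
quantile_counts = pd.qcut(bc, q=16).value_counts(sort=False)

print(width_counts.to_numpy())
print(quantile_counts.to_numpy())
```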