How to find a Pearson correlation starting from a two-column Pandas DataFrame?


I have a Pandas DataFrame with two columns: CATEGORY (1-400, discrete, categorical) and RESPONSE (0.0-1.0, continuous):

CATEGORY RESPONSE
33       0.000
5        0.005
101      0.125
102      0.423
3        0.003
6        0.75
... etc 55k rows

I first group the DataFrame by CATEGORY and get the array of RESPONSE values for each of the 400 categories.

I want to calculate the Pearson correlation coefficient between the arrays for all CATEGORY pairs and show it as, say, a heatmap, with CATEGORY on the horizontal and vertical axes and the Pearson value as color/intensity.

Alternatively, I would like to make a 2D histogram of RESPONSE vs. CATEGORY, binning RESPONSE into 10 bins of width 0.1, and recalculate the Pearson coefficients from that.

After searching, I cannot find how to go from a two-column Pandas DataFrame to a 2D histogram that can be saved.

Answer by Johannes Schöck:

Pandas has a built-in method for computing pairwise correlations between the columns of a DataFrame, pandas.DataFrame.corr. Pearson is the default method.

The example from the documentation shows the call pattern you need. Note that it passes a custom callable as method; leaving method out gives Pearson:

>>> import numpy as np
>>> import pandas as pd
>>> def histogram_intersection(a, b):
...     v = np.minimum(a, b).sum().round(decimals=1)
...     return v
>>> df = pd.DataFrame([(.2, .3), (.0, .6), (.6, .0), (.2, .1)],
...                   columns=['dogs', 'cats'])
>>> df.corr(method=histogram_intersection)
      dogs  cats
dogs   1.0   0.3
cats   0.3   1.0
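
For the two-column layout in the question, one possible route is the 2D-histogram variant the asker describes: bin RESPONSE, count per CATEGORY, then correlate the per-category histograms. The following is only a sketch under assumptions: the data is synthetic (5 categories instead of 400), and the names `edges`, `bin_idx`, and `hist2d` are made up for illustration. Correlating histograms is used here because the raw RESPONSE arrays of two categories generally have different lengths and no natural pairing, which Pearson correlation needs.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the asker's data (hypothetical values).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "CATEGORY": rng.integers(1, 6, size=500),  # categories 1..5 instead of 1..400
    "RESPONSE": rng.random(500),               # values in [0.0, 1.0)
})

# Bin RESPONSE into 10 bins of width 0.1. labels=False returns the integer
# bin index (0..9); include_lowest=True keeps exact 0.0 values in bin 0.
edges = np.linspace(0.0, 1.0, 11)
bin_idx = pd.cut(df["RESPONSE"], bins=edges, labels=False, include_lowest=True)

# 2D histogram: one row per CATEGORY, one column per RESPONSE bin.
# reindex makes sure empty bins still appear as all-zero columns.
hist2d = pd.crosstab(df["CATEGORY"], bin_idx).reindex(columns=range(10), fill_value=0)

# DataFrame.corr correlates columns, so transpose to put categories in columns.
corr = hist2d.T.corr()  # Pearson is the default method

print(hist2d)
print(corr.round(2))
```

Because `hist2d` is an ordinary DataFrame, it can be saved with, for example, `hist2d.to_csv("hist2d.csv")` (the file name is arbitrary).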

Turning the correlation matrix into a heatmap works very well with seaborn (see the related Stack Overflow answers). Alternatively, you can use pandas styling to color the DataFrame cells according to their values.

import seaborn as sns

corr = df.corr()
sns.heatmap(corr, cmap="Blues", annot=True)
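
Since the question also asks for something that can be saved, here is a minimal sketch that draws a correlation matrix with plain matplotlib and writes it to disk. The input data, the output file name `corr_heatmap.png`, and the colormap are assumptions for illustration, not part of the original answer. With 400 categories, per-cell annotations would likely be unreadable, so none are drawn here.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for a category-by-category correlation matrix.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.random((50, 4)), columns=[1, 2, 3, 4])
corr = data.corr()  # Pearson by default

fig, ax = plt.subplots(figsize=(5, 4))
# Fix the color scale to [-1, 1], the full range of a Pearson coefficient.
im = ax.imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")

# Label both axes with the category identifiers.
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns)
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)
ax.set_xlabel("CATEGORY")
ax.set_ylabel("CATEGORY")

fig.colorbar(im, ax=ax, label="Pearson r")
fig.savefig("corr_heatmap.png", dpi=150)  # output file name is arbitrary
plt.close(fig)
```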