Finding the total probability under a shape in a bivariate KDE plot

Question

Finding the total probability under a shape in a bivariate KDE plot

101 Views Asked by cjmess01 At 27 February 2024 at 18:28

I have a set of points stored as (x,y) values. My goal is the map these onto a coordinate plane and create a continuous PDF distribution.

I would like to apply polygons to the figure and get the total summed probability under the shape and return that to the user. The shape would be stored as a series of coordinates, so [(0,0), (1,0), (1,1),(0,1),(0,0)] would represent a square.

So far, I have plotted the points using a seaborn.kdeplot, and that generates a beautiful plot, which the sum of every point adds to around 100%.

However, I am struggling to effectively apply shapes directly the the graph. I've tried a few online solutions, but have been unable to find any method to get the cumulative probability under a shape directly on the kde plot. My code is below:

def get_field_matrix(csv_name, coordinates):
    
    # ... Some code that loads in a csv and names it df_heatmap
    # df_heatmap has two columns, ActualX and ActualY which are the series of points for the kde
    
    # Create a KDE-based heatmap using seaborn
    kde = sns.kdeplot(x='ActualX', y='ActualY', data=df_heatmap, cmap='viridis', fill=True, cbar=True, weights=np.linalg.norm(df_heatmap, axis=1) ** .3)
    
    # Set labels and title
    plt.xlabel('X-axis')
    plt.ylabel('Y-axis')
    plt.title('KDE-based Heatmap of X, Y Coordinates')

    # Coordinates is the list of tuples that make up the shape to be applied. 
    # Any shape may be represented, but it will always be a shape draw with a single line
    # Creates a matplotlib path out of the coordinates
    shape_path = Path(coordinates)
    shape_patch = PathPatch(shape_path, lw=0)
    plt.gca().add_patch(shape_patch)

    print("Summed Probability over the shape:", "A")

    # Set aspect ratio to be equal
    plt.gca().set_aspect('equal', adjustable='box')

    # Show the plot
    plt.show()
    return kde

Is there some library function I am missing to apply the cdf?

Original Q&A

There are 1 best solutions below

**Muhammed Yunus** · Accepted Answer · 2024-02-28T16:25:52.023000

The example below shows how you can place a patch over a KDE plot, and integrate the area underneath.

I don't think the KDE data from seaborn (left) is directly accessible, so I've run a separate KDE (right) that tries to match seaborn.

The green points represent the user-defined coordinates, and the hatched area is just to confirm that the patch has been correctly interpolated onto the same grid as the KDE data.

A KDE estimator is first fitted to the data, and then used to get a KDE estimate over a fine grid. The user-defined patch coordinates are used to build a triangulated surface, which is then interpolated onto the fine rectangular grid from before. The patch and KDE are now in the same space. They are finally multiplied and integrated over.

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

#
# Create test dataset
#
from sklearn.datasets import make_moons
np.random.seed(0)
X, _ = make_moons(n_samples=500, noise=0.1)
X_df = pd.DataFrame({'x0': X[:, 0], 'x1': X[:, 1]})

kde_thresh = 0.05 #default sns clipping

if True: #View datapoints
    sns.kdeplot(
        X_df, x='x0', y='x1', cmap='inferno', alpha=0.4,
        thresh=kde_thresh, cbar=True
    )
    sns.scatterplot(X_df, x='x0', y='x1', color='black', marker='.')
    sns_limits = plt.axis()
    plt.gcf().set_size_inches(6, 3)
    plt.title('seaborn kde plot')
    plt.show()

#
# Model using kernel density estimator
#
from scipy.stats import gaussian_kde
kd_estimator = gaussian_kde(X_df.values.T, bw_method='scott')

#Create grid for plotting estimator results
x_axis = np.linspace(sns_limits[0], sns_limits[1], num=200) #200 is sns kde default
y_axis = np.linspace(sns_limits[2], sns_limits[3], num=200)
x_grid, y_grid = np.meshgrid(x_axis, y_axis)

xy_paired = np.column_stack([x_grid.ravel(), y_grid.ravel()])
kde_scores = kd_estimator.evaluate(xy_paired.T).reshape(x_grid.shape)

from matplotlib import ticker
f, ax = plt.subplots(figsize=(6.2, 3))
c = ax.contour(
    x_grid, y_grid, np.ma.masked_less(kde_scores, kde_thresh),
    cmap='inferno', alpha=0.5
)
ax.scatter('x0', 'x1', data=X_df, marker='.', color='black', s=7)
ax.figure.colorbar(c, ticks=ticker.MaxNLocator(10))
ax.set(xlabel='x0', ylabel='x1')
ax.set_title('manual kde plot')

#
# Place a patch over the KDE
#

#User-defined patch
# patch = np.array( [[-2, -2], [2, -2], [2, 2], [-2, 2]] )     #most of the area
# patch = np.array( [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)] ) #square
# patch = np.array( [(0,1), (1,1), (1,0), (0,1)] )             #triangle
patch = np.array( [(0,1), (1,1), (1,0), (.8, .8), (0,1)] )   #concave

patch_x = patch[:, 0]
patch_y = patch[:, 1]

#View patch
ax.scatter(patch_x, patch_y, marker='o', color='darkgreen', alpha=0.7)
ax.fill(patch_x, patch_y, color='darkgreen', linestyle='none', alpha=0.5)

#Map patch onto 2D xy grid
import matplotlib.path as mpath
patch_path = mpath.Path(np.column_stack([patch_x, patch_y]))
patch_interp = patch_path.contains_points(
    np.column_stack([x_grid.ravel(), y_grid.ravel()])
).reshape(x_grid.shape)
patch_interp = np.where(patch_interp, 1, 0)

#Visually confirm that the polygon has been correctly mapped onto the rectilinear grid
ax.contourf(x_grid, y_grid, patch_interp, levels=[0.9, 1], hatches=['//'], colors='none')

#
#Finally, extract the KDE portion and integrate its area
#
kde_patch = kde_scores * patch_interp
grid_area = np.diff(x_axis)[0] * np.diff(y_axis)[0]
area = (kde_patch * grid_area).sum()

ax.set_title(r'$\int KDE$ over patch is ' + f'{area:.3f}')

Finding the total probability under a shape in a bivariate KDE plot

There are 1 best solutions below

Related Questions in PYTHON

Related Questions in MATPLOTLIB

Related Questions in SEABORN

Related Questions in KERNEL-DENSITY

Trending Questions

Popular # Hahtags

Popular Questions