Mahalanobis distance not equal to Euclidean distance after PCA


I am trying to compute the Mahalanobis distance as the Euclidean distance after a PCA transformation; however, I do not get the same results. The following code:

import numpy as np
from scipy.spatial.distance import mahalanobis
from sklearn.decomposition import PCA

X = [[1,2], [2,2], [3,3]]

mean = np.mean(X, axis=0)
cov = np.cov(X, rowvar=False)
covI = np.linalg.inv(cov)

maha = mahalanobis(X[0], mean, covI)
print(maha)

pca = PCA()

X_transformed = pca.fit_transform(X)

stdev = np.std(X_transformed, axis=0)
X_transformed /= stdev

print(np.linalg.norm(X_transformed[0]))

prints

1.1547005383792515
1.4142135623730945

To my understanding, the PCA uncorrelates the dimensions, and the division by the standard deviation weights every dimension equally, so the Euclidean distance should equal the Mahalanobis distance. Where am I going wrong?

1 Answer

Accepted answer (jylls):

According to this discussion, the relationship between PCA and the Mahalanobis distance only holds when the PCA components have unit variance. You can obtain this by whitening the PCA output (more information here).
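To sanity-check the unit-variance claim, you can inspect the whitened scores directly. A minimal sketch (assuming scikit-learn's convention, where the per-component variance is the sample variance, i.e. normalized by n − 1 like `np.cov`):

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[1, 2], [2, 2], [3, 3]])

# With whiten=True, each principal component is rescaled so its
# sample variance (ddof=1, the same convention np.cov uses) is 1.
X_w = PCA(whiten=True).fit_transform(X)
print(np.var(X_w, axis=0, ddof=1))  # -> approximately [1. 1.]
```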

Once you do that, the Mahalanobis distance in the original space equals the Euclidean distance in the PCA space. You can see a demonstration of that in the code below:

import numpy as np
from scipy.spatial.distance import mahalanobis, euclidean
from sklearn.decomposition import PCA

X = np.array([[1, 2], [2, 2], [3, 3]])

# Mahalanobis distance between the first two points in the original space
cov = np.cov(X, rowvar=False)
covI = np.linalg.inv(cov)
maha = mahalanobis(X[0], X[1], covI)

# whiten=True rescales each principal component to unit variance
pca = PCA(whiten=True)
X_transformed = pca.fit_transform(X)

print('Mahalanobis distance: ' + str(maha))
print('Euclidean distance: ' + str(euclidean(X_transformed[0], X_transformed[1])))

The output gives:

Mahalanobis distance: 2.0
Euclidean distance: 2.0000000000000004
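For completeness, here is a likely explanation of the original mismatch (my reading, not stated in the accepted answer): `np.cov` normalizes by n − 1 (ddof=1) by default, while `np.std` normalizes by n (ddof=0), so the question's division rescaled the components by the wrong factor. Dividing the unwhitened PCA scores by the ddof=1 standard deviation instead reproduces the Mahalanobis distance from the question:

```python
import numpy as np
from scipy.spatial.distance import mahalanobis
from sklearn.decomposition import PCA

X = np.array([[1, 2], [2, 2], [3, 3]])

mean = np.mean(X, axis=0)
covI = np.linalg.inv(np.cov(X, rowvar=False))  # np.cov divides by n - 1
maha = mahalanobis(X[0], mean, covI)

X_t = PCA().fit_transform(X)
X_t /= np.std(X_t, axis=0, ddof=1)  # ddof=1 to match np.cov's normalization

print(maha, np.linalg.norm(X_t[0]))  # both ~1.1547
```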