distance metrics for clustering non-normally distributed data

593 Views Asked by At

The dataset I want to cluster consists of ~1000 samples and 10 features, which have different scales and ranges (negative, positive, both). Using scipy.stats.normaltest() I found that none of the features are normally-distributed (all p-values < 1e-4, small enough to reject the null hypothesis that the data are taken from a normal distribution). But all of the distance measures that I'm aware of assume normally-distributed data (I was using Mahalanobis until I realized how non-uniform the data was). What distance measures would one use in this situation? Or is this where one simply has to normalize every feature and hope that that doesn't introduce bias?

1

There are 1 best solutions below

3
Has QUIT--Anony-Mousse On

Why do you think all distances would assume normal (which btw. is not the same as uniform) data?

Consider Euclidean distance. In many physical applications this distance makes perfect sense, because it is "as the crow flies". Manhattan distance makes a lot of sense when movement is constrained to two axes that cannot be used at the same time. These are completely appropriate for non-normal distributed data.