Distance metrics for clustering non-normally distributed data

593 Views Asked by kevinafra At 26 August 2019 at 23:04

The dataset I want to cluster consists of ~1000 samples and 10 features, which have different scales and ranges (negative, positive, or both). Using scipy.stats.normaltest() I found that none of the features is normally distributed (all p-values < 1e-4, small enough to reject the null hypothesis that the data are drawn from a normal distribution). But all of the distance measures that I'm aware of assume normally distributed data (I was using Mahalanobis until I realized how non-uniform the data was). What distance measures would one use in this situation? Or is this where one simply has to normalize every feature and hope that doesn't introduce bias?
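For reference, the per-feature normality check described above might look roughly like the sketch below. The array `X` is a hypothetical stand-in (synthetic, skewed data), not the actual dataset, and the 1e-4 threshold is simply the one quoted in the question.

```python
import numpy as np
from scipy import stats

# Hypothetical stand-in for the real data: 1000 samples, 10 skewed features.
rng = np.random.default_rng(0)
X = rng.exponential(scale=2.0, size=(1000, 10))

# normaltest works along axis 0 by default, so it returns
# one test statistic and one p-value per feature (column).
statistic, p_values = stats.normaltest(X, axis=0)

# Features for which normality is rejected at the threshold from the question.
non_normal = p_values < 1e-4
print(non_normal)
```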


There is 1 best solution below

Has QUIT--Anony-Mousse On 27 August 2019 at 06:18

Why do you think all distances assume normally distributed data (which, by the way, is not the same as uniformly distributed)?

Consider Euclidean distance. In many physical applications this distance makes perfect sense, because it is "as the crow flies". Manhattan distance makes a lot of sense when movement is constrained to two axes that cannot be used at the same time. Neither makes any assumption about how the data are distributed, so both are completely appropriate for non-normally distributed data.
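To make that concrete, here is a minimal sketch of both distances, first for a single pair of points and then as pairwise matrices via scipy.spatial.distance.pdist. The points and the skewed sample `X` are made up for illustration; nothing in either computation refers to a distribution.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

# Euclidean: straight-line ("as the crow flies") distance.
euclid = np.sqrt(np.sum((a - b) ** 2))

# Manhattan: sum of the moves along each axis separately.
manhattan = np.sum(np.abs(a - b))

# The same metrics on clearly non-normal (exponential) data.
# X is a hypothetical sample, not the asker's dataset.
rng = np.random.default_rng(0)
X = rng.exponential(size=(5, 2))
D_euclid = squareform(pdist(X, metric="euclidean"))
D_manhattan = squareform(pdist(X, metric="cityblock"))
```

Either matrix could then be passed to a clustering routine that accepts precomputed distances. Whether the features should be rescaled first is a separate question about the meaning of the units, not about normality.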
