sklearn Clustering: Fastest way to determine the optimal number of clusters on large data sets


I use KMeans and silhouette_score from sklearn in Python to compute my clusters, but on >10,000 samples with >1,000 clusters, calculating the silhouette_score is very slow.

  1. Is there a faster method to determine the optimal number of clusters?
  2. Or should I change the clustering algorithm? If yes, which is the best (and fastest) algorithm for a data set with >300,000 samples and lots of clusters?

There are 3 best solutions below

Trishansh Bhardwaj On BEST ANSWER

The most common method to find the number of clusters is the elbow curve method, but it requires running the KMeans algorithm multiple times to plot the graph. The wiki page https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set mentions other common methods to determine the number of clusters.
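A minimal sketch of the elbow approach with scikit-learn, using the `inertia_` attribute (within-cluster sum of squares) that KMeans exposes after fitting; the data and the k range here are illustrative stand-ins for your own:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative data; replace with your own feature matrix.
X, _ = make_blobs(n_samples=2000, centers=5, random_state=42)

ks = range(2, 11)
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squares

# Look for the "elbow": the k after which inertia stops dropping sharply.
for k, inertia in zip(ks, inertias):
    print(k, round(inertia, 1))
```

Inertia always decreases as k grows, so you pick the k where the decrease levels off rather than the minimum.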

Has QUIT--Anony-Mousse On

The silhouette score, while one of the more attractive measures, is O(n^2). This means computing the score is much more expensive than computing the k-means clustering itself!

Furthermore, these scores are only heuristics. They will not yield "optimal" clusterings by any means. They only give a hint on how to choose k, but very often you will find that other k is much better! So don't trust these scores blindly.
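One practical way to soften the O(n^2) cost: scikit-learn's `silhouette_score` accepts a `sample_size` argument that scores only a random subsample of points. A sketch (the sample size of 2000 is an arbitrary choice for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Illustrative data; replace with your own feature matrix.
X, _ = make_blobs(n_samples=20000, centers=10, random_state=0)
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X)

# Score only 2000 random points instead of all 20000,
# reducing the pairwise-distance cost to O(sample_size^2).
approx = silhouette_score(X, labels, sample_size=2000, random_state=0)
print(approx)
```

The result is an estimate, which fits the point above: these scores are heuristics for choosing k, not ground truth.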

DSBLR On

MiniBatchKMeans is one popular option you can try: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html
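A minimal MiniBatchKMeans sketch; the data size, cluster count, and batch size below are illustrative, not a recommendation:

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

# Illustrative data; replace with your own feature matrix.
X, _ = make_blobs(n_samples=50_000, centers=50, random_state=1)

# Fits on small random batches instead of the full data each iteration,
# so it scales much better to large sample counts than plain KMeans.
mbk = MiniBatchKMeans(n_clusters=50, batch_size=1024, n_init=3, random_state=1)
labels = mbk.fit_predict(X)
print(labels.shape)
```

The trade-off is a slightly worse (noisier) solution than full-batch KMeans in exchange for a large speedup.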