sklearn Clustering: Fastest way to determine the optimal number of clusters on large data sets


I use KMeans and silhouette_score from sklearn in Python to compute my clusters, but on >10,000 samples with >1,000 clusters, calculating the silhouette_score is very slow.

  1. Is there a faster method to determine the optimal number of clusters?
  2. Or should I change the clustering algorithm? If yes, which is the best (and fastest) algorithm for a data set with >300,000 samples and lots of clusters?

There are 3 best solutions below

Trishansh Bhardwaj On BEST ANSWER

The most common method to find the number of clusters is the elbow curve method, but it requires running the KMeans algorithm multiple times to plot the graph. The wiki page https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set mentions other common methods to determine the number of clusters.
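A minimal sketch of the elbow approach with scikit-learn, using the `inertia_` attribute (within-cluster sum of squares) that KMeans exposes after fitting; the data and the k range here are illustrative stand-ins for your own:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative data; replace with your own feature matrix.
X, _ = make_blobs(n_samples=2000, centers=5, random_state=42)

ks = range(2, 11)
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squares

# Look for the "elbow": the k after which inertia stops dropping sharply.
for k, inertia in zip(ks, inertias):
    print(k, round(inertia, 1))
```

Inertia always decreases as k grows, so you pick the k where the decrease levels off rather than the minimum.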

Has QUIT--Anony-Mousse On

The silhouette score, while one of the more attractive measures, is O(n^2). This means computing the score is much more expensive than computing the k-means clustering itself!

Furthermore, these scores are only heuristics. They will not yield "optimal" clusterings by any means. They only give a hint on how to choose k, but very often you will find that other k is much better! So don't trust these scores blindly.
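One practical way to soften the O(n^2) cost: scikit-learn's `silhouette_score` accepts a `sample_size` argument that scores only a random subsample of points. A sketch (the sample size of 2000 is an arbitrary choice for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Illustrative data; replace with your own feature matrix.
X, _ = make_blobs(n_samples=20000, centers=10, random_state=0)
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X)

# Score only 2000 random points instead of all 20000,
# reducing the pairwise-distance cost to O(sample_size^2).
approx = silhouette_score(X, labels, sample_size=2000, random_state=0)
print(approx)
```

The result is an estimate, which fits the point above: these scores are heuristics for choosing k, not ground truth.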

DSBLR On

MiniBatchKMeans is one popular option you can try: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html
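A minimal MiniBatchKMeans sketch; the data size, cluster count, and batch size below are illustrative, not a recommendation:

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

# Illustrative data; replace with your own feature matrix.
X, _ = make_blobs(n_samples=50_000, centers=50, random_state=1)

# Fits on small random batches instead of the full data each iteration,
# so it scales much better to large sample counts than plain KMeans.
mbk = MiniBatchKMeans(n_clusters=50, batch_size=1024, n_init=3, random_state=1)
labels = mbk.fit_predict(X)
print(labels.shape)
```

The trade-off is a slightly worse (noisier) solution than full-batch KMeans in exchange for a large speedup.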