I have an application where I need to implement K-means myself. I was trying my implementation in a small program to see how things are going; however, I find it really confusing that I get the exact same cluster centers every time I run the program. I am surprised because my cluster centers are randomly initialized, and I am using FLANN to update the clusters (both should make the output nondeterministic, right?). I'm probably missing something, and would appreciate some help.
cv::Mat kMeans(cv::Mat data, int k, int max_itr){
cv::Mat centroids = cv::Mat(cv::Size(data.cols, k), data.type());
std::vector<int> cluster_ids(data.rows, 0);
//randomly initialize centroids
for(int i = 0; i < k; i++)
{
int idx = rand() % data.rows;
data.row(idx).copyTo(centroids.row(i));
}
int itr = 0;
cv::flann::Index kdtree(centroids, cv::flann::KDTreeIndexParams(4));
cv::Mat indices, dists;
//TODO: add tolerance stopping condition
while (itr < max_itr)
{
//assign data points to centroids
for(int i = 0; i < data.rows; i++)
{
kdtree.knnSearch(data.row(i), indices, dists, 1, cv::flann::SearchParams(32));
cluster_ids[i] = indices.data[0];
}
//update centroids
for(int i = 0; i < centroids.rows; i++)
{
cv::Mat sum = cv::Mat::zeros(cv::Size(data.cols, 1), data.type());
int count = 0;
for(int j = 0; j < data.rows; j++)
{
if(cluster_ids[j] != i) {continue;}
sum += data.row(j);
count++;
}
centroids.row(i) = sum / float(count);
}
itr += 1;
}
return centroids;
}
int main()
{
cv::Mat_<float> data;
cv::Mat_<float> centroids;
cv::Mat dst;
//initialize with dummy data
for(int i = 0; i < 1000; i++)
{
cv::Mat row = (cv::Mat_<float>(1, 3) << float(i), float(i), float(i));
data.push_back(row);
}
centroids = kMeans(data, 20, 20);
std::cout << centroids << std::endl;
return 0;
}
You need this line:
at the start of your program to get different sequences from the
randfunction every time you run the program.You can use some other seed if you want, but you still need to call
srandwith some source of randomness at the start of the program.