Identifying and resolving the source of NaNs when calculating k-means centroids for image compression in Python


I am working on an assignment where we have to implement a k-means clustering algorithm by hand and use it for image compression in Python. My code is below. When I run the algorithm for a single iteration (just to see how it turns out), the recalculated centroids often contain one or more rows of NaN. I don't know why this happens and am looking for help understanding the cause and how to fix it. My initial thought was that some cluster gets no data points assigned to it, which would lead to a division by zero, but I'm not certain that this is the case, nor do I know how it would happen or how to fix it.

# Import the necessary libraries
import numpy as np
import os
from os.path import abspath, exists
from PIL import Image
from numpy import asarray
from scipy.sparse import csc_matrix, find

# Read in the data
dirpath = os.getcwd()
image_path = dirpath + '//data/image.bmp'
image_1 = Image.open(image_path)
image_1_array = asarray(image_1)
image_1_array = image_1_array.reshape(-1, image_1_array.shape[-1])

# Randomly initialize cluster centers
k_clusters = 20
centroids = image_1_array[np.random.randint(image_1_array.shape[0], size=(1, k_clusters))[0]]

iterations = 1

for i in range(0,iterations):
    centroids_squared = np.sum(np.power(centroids, 2), axis=1, keepdims=True)
    print(centroids)
    print(centroids_squared)

    # Calculate the difference between data points and centroids, make assignments
    tmpdiff = (2 * np.dot(centroids, image_1_array.T) - centroids_squared)
    labels = np.argmax(tmpdiff, axis=0)

    # Update centroids
    dp_num = image_1_array.shape[0]
    P = csc_matrix((np.ones(dp_num), (labels, np.arange(0, dp_num, 1))), shape=(k_clusters,dp_num))

    # Count the data points in each cluster center.
    count = P.sum(axis=1)

    # Adjust cluster centers
    centroids = np.array((P.dot(image_1_array) / count))

I have tried different numbers of clusters, since I read elsewhere that this can help when no points are assigned to a cluster, but I still get NaNs. I borrowed the code from another question on the same assignment that implements the k-means algorithm. It works fine for that question, so I'm not sure whether I made an error adapting it to this situation. I've added print statements at each step to check the inputs and outputs, and the data looks right to me. I've also looked for other guides on implementing k-means for image compression, but most of the ones I found use a pre-made k-means function that I am not allowed to use.
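
A hedged sketch of what may be going on, not a confirmed diagnosis: an image loaded through PIL usually arrives as a uint8 array, and NumPy integer arithmetic wraps around silently, so np.power(centroids, 2) and np.dot(centroids, image_1_array.T) can produce meaningless scores. Bad scores, or duplicate colors picked as initial centroids, can leave a cluster with no points, and 0 / 0 in the update step then gives NaN. The function below casts to float before any arithmetic and leaves an empty cluster's centroid unchanged. The name kmeans_step is hypothetical, and np.bincount / np.add.at are swapped in for the csc_matrix product only to keep the sketch short.

```python
import numpy as np

def kmeans_step(pixels, centroids):
    """One assignment + update step (hypothetical helper, not from the question).

    Empty clusters keep their previous centroid instead of becoming NaN.
    """
    # Cast to float first: uint8 arithmetic wraps modulo 256,
    # e.g. np.power(np.array([200], dtype=np.uint8), 2) gives 64, not 40000.
    X = np.asarray(pixels, dtype=np.float64)
    C = np.asarray(centroids, dtype=np.float64)
    k = C.shape[0]

    # argmax of 2*c.x - ||c||^2 picks the same cluster as argmin of ||x - c||^2.
    scores = 2.0 * (C @ X.T) - np.sum(C ** 2, axis=1, keepdims=True)
    labels = np.argmax(scores, axis=0)

    # Number of points per cluster; zeros mark empty clusters.
    counts = np.bincount(labels, minlength=k)

    # Per-cluster sum of the assigned points.
    sums = np.zeros_like(C)
    np.add.at(sums, labels, X)

    # Only divide where the cluster actually has points.
    new_C = C.copy()
    nonempty = counts > 0
    new_C[nonempty] = sums[nonempty] / counts[nonempty, None]
    return new_C, labels, counts


# Tiny check with a duplicated centroid, which forces one empty cluster.
pixels = np.array([[0, 0, 0], [10, 10, 10],
                   [250, 250, 250], [240, 240, 240]], dtype=np.uint8)
centroids = np.array([[0, 0, 0], [250, 250, 250],
                      [250, 250, 250]], dtype=np.uint8)
new_C, labels, counts = kmeans_step(pixels, centroids)
print(counts)                  # points per cluster
print(np.isnan(new_C).any())   # whether any centroid became NaN
```

Inside the loop from the question this would be called as centroids, labels, counts = kmeans_step(image_1_array, centroids). Printing counts after each iteration shows directly whether an empty cluster is the source of the NaNs. Re-seeding an empty cluster with a random pixel, or drawing the initial centroids from the unique colors only, are alternative choices.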
