I am working on a recommendation system project where I use NLTK's KMeansClusterer to cluster a user-item rating matrix and predict user ratings. The goal is to compare different numbers of clusters ([2, 5, 10, 20, 40]) using RMSE and MAE as evaluation metrics; the clustering is performed on the user ratings, with missing values replaced by the average rating.
Here are the relevant parts of my code:
- Pearson distance

```python
## Pearson distance with NaNs
import numpy as np
from scipy.stats import pearsonr

def PearsonDist(x, y):
    if len(x.shape) < len(y.shape):  # cluster assignment: one vector vs. all centroids
        x = np.expand_dims(x, axis=0).repeat(y.shape[0], axis=0)
        nan_or = np.logical_or(np.isnan(x), np.isnan(y))
        corr = [pearsonr(x[i, ~nan_or[i]], y[i, ~nan_or[i]])[0]
                if (~nan_or[i]).sum() >= 2 else 1
                for i in range(y.shape[0])]
    elif len(x.shape) > len(y.shape):
        y = np.expand_dims(y, axis=0).repeat(x.shape[0], axis=0)
        nan_or = np.logical_or(np.isnan(x), np.isnan(y))
        corr = [pearsonr(x[i, ~nan_or[i]], y[i, ~nan_or[i]])[0]
                if (~nan_or[i]).sum() >= 2 else 1
                for i in range(y.shape[0])]
    else:  # training phase: two vectors of the same shape
        nan_or = np.logical_or(np.isnan(x), np.isnan(y))
        if (~nan_or).sum() < 2:
            return 1
        corr = pearsonr(x[~nan_or], y[~nan_or])[0]
    return 1 - np.abs(corr)
```
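For what it's worth, the batched branches above call `scipy.stats.pearsonr` once per row inside a Python list comprehension, which I suspect is slow. A fully vectorized NaN-aware equivalent of the one-vector-vs-many case would look roughly like this (`pearson_dist_vec` is a hypothetical name, not part of my actual code; rows with insufficient overlap get distance 1, as in the scalar branch above):

```python
import numpy as np

def pearson_dist_vec(x, Y):
    """NaN-aware Pearson distance of one vector x to every row of Y,
    vectorized in NumPy instead of calling scipy.stats.pearsonr per row.
    Rows sharing fewer than 2 non-NaN entries with x get distance 1."""
    x = np.asarray(x, dtype=float)
    Y = np.atleast_2d(np.asarray(Y, dtype=float))
    mask = ~(np.isnan(x) | np.isnan(Y))  # entries observed in both x and the row
    n = mask.sum(axis=1)
    with np.errstate(invalid="ignore", divide="ignore"):
        mx = np.where(mask, x, 0.0).sum(axis=1) / n  # masked row means
        my = np.where(mask, Y, 0.0).sum(axis=1) / n
        dx = np.where(mask, x - mx[:, None], 0.0)    # masked deviations
        dy = np.where(mask, Y - my[:, None], 0.0)
        corr = (dx * dy).sum(axis=1) / np.sqrt((dx ** 2).sum(axis=1) * (dy ** 2).sum(axis=1))
    dist = 1.0 - np.abs(corr)
    dist[n < 2] = 1.0  # mirror the scalar branch: insufficient overlap -> distance 1
    return dist
```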
```python
import nltk
from nltk.cluster.kmeans import KMeansClusterer

# Initialization and data preparation code...

list_RMSE_q2 = []
list_MAE_q2 = []
list_k_q2 = [2, 5, 10, 20, 40]

for i in range(nbre_replis):
    list_RMSE_i = []
    list_MAE_i = []
    # Code to split data into training and validation sets...
    for num_cluster in list_k_q2:
        data = MUI_numpy_train.copy()
        data[np.isnan(data)] = Biais_mat(MUI_numpy_train)[np.isnan(data)]
        kclusterer = KMeansClusterer(num_cluster, distance=EuclidDist,
                                     repeats=1, avoid_empty_clusters=True)
        assigned_clusters = kclusterer.cluster(data, assign_clusters=True)
        centroids = np.array([np.nanmean(MUI_numpy_train[assigned_clusters == k], axis=0)
                              for k in range(num_cluster)])
        assigned_clusters_valid = kclusterer.cluster(MUI_numpy_valid, assign_clusters=False)
        R_pred = centroids[assigned_clusters_valid]
        list_RMSE_i.append(RMSE_mat(R_pred, MUI_numpy_valid))
        list_MAE_i.append(MAE_mat(R_pred, MUI_numpy_valid))
    list_RMSE_q2.append(np.array(list_RMSE_i))
    list_MAE_q2.append(np.array(list_MAE_i))

# Code to calculate and print the final RMSE and MAE...
```
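For completeness, `RMSE_mat` and `MAE_mat` behave essentially like the following sketch (`rmse_mat`/`mae_mat` are illustrative names): both metrics are computed only over the observed (non-NaN) entries of the validation matrix.

```python
import numpy as np

def rmse_mat(pred, actual):
    """Root mean squared error over the observed (non-NaN) entries of actual."""
    mask = ~np.isnan(actual)
    return np.sqrt(np.mean((pred[mask] - actual[mask]) ** 2))

def mae_mat(pred, actual):
    """Mean absolute error over the observed (non-NaN) entries of actual."""
    mask = ~np.isnan(actual)
    return np.mean(np.abs(pred[mask] - actual[mask]))
```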
This block of code takes far longer to run than expected: a single run takes upwards of two hours, which seems excessive for the dataset size and the complexity of the task.
- Dataset details: the user-item matrix MUI_numpy has dimensions [number of users] x [number of items], with a substantial number of missing values (NaNs) that I replace with the mean rating per item (via the Biais_mat function).
- Goal: use k-means clustering to predict the votes (ratings) and compute RMSE and MAE for different numbers of clusters to determine the optimal cluster count.
- Execution environment: a VSCode Jupyter Notebook on a laptop with an AMD Ryzen 7 processor and 16 GB of RAM.
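For completeness, the Biais_mat step described above behaves essentially like this per-item mean imputation (`item_mean_impute` is an illustrative name, not my actual function):

```python
import numpy as np

def item_mean_impute(R):
    """Fill each NaN with the mean rating of its item (column)."""
    R = np.asarray(R, dtype=float)
    col_means = np.nanmean(R, axis=0)        # per-item average over observed ratings
    return np.where(np.isnan(R), col_means, R)
```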
I am looking for advice on how to optimize the execution time or understand why the process is taking so long. Are there any known performance issues with KMeansClusterer in NLTK when used in this manner, or could there be inefficiencies in my approach to data preparation and clustering?