scikit-learn HDBSCAN throws an error when trying to compute medoids/centroids


I have a precomputed distance matrix that I want to find the medoids for. According to the scikit-learn docs, there is a parameter you set and an attribute you read to retrieve these medoids. When I set the parameter store_centers="medoid" and access the attribute .medoids_ after fitting, I receive this error:

Traceback (most recent call last):
  File "C:\Users\Desktop\Clustering\Model.py", line 163, in <module>
    cluster(df, 'test.txt')
  File "C:\Users\Desktop\Clustering\Model.py", line 139, in cluster
    clustering = hdb.fit(distance_matrix.tocsr())
  File "C:\Users\Desktop\Clustering\venv\lib\site-packages\sklearn\cluster\_hdbscan\hdbscan.py", line 854, in fit
    self._weighted_cluster_center(X)
 in _weighted_cluster_center
    dist_mat = pairwise_distances(
  File "C:\Users\Desktop\Clustering\venv\lib\site-packages\sklearn\metrics\pairwise.py", line 2157, in pairwise_distances
    X, _ = check_pairwise_arrays(
  File "C:\Users\Desktop\Clustering\venv\lib\site-packages\sklearn\metrics\pairwise.py", line 184, in check_pairwise_arrays
    raise ValueError(
ValueError: Precomputed metric requires shape (n_queries, n_indexed). Got (9, 2292) for 9 indexed.

I'm unsure how my square precomputed matrix produces a 9x2292 array. Otherwise the model works fine, and I have no issue retrieving the medoids manually with an MSE-style operation. The reason I want to produce the medoids this way is in hopes of finding a variable eps for each cluster, so that I can fit more data to the clusters.
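For reference, this is roughly how I retrieve the medoids by hand (a minimal sketch, assuming a dense symmetric distance matrix and a labels array like the fitted clusterer's labels_; the helper name medoid_indices is my own): a cluster's medoid is the member with the smallest total distance to the other members.

```python
import numpy as np

def medoid_indices(dist, labels):
    """For each cluster label (skipping noise, -1), return the index of the
    point whose summed distance to the other cluster members is smallest."""
    medoids = {}
    for lbl in sorted(set(labels)):
        if lbl == -1:  # noise points have no medoid
            continue
        members = np.flatnonzero(labels == lbl)
        sub = dist[np.ix_(members, members)]  # intra-cluster distance block
        medoids[lbl] = members[sub.sum(axis=0).argmin()]
    return medoids

# tiny hand-made symmetric distance matrix, for illustration only
dist = np.array([
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.8, 0.9],
    [0.9, 0.8, 0.0, 0.2],
    [0.9, 0.9, 0.2, 0.0],
])
labels = np.array([0, 0, 1, 1])
print(medoid_indices(dist, labels))
```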

EDIT: my code, with a small example (which does not reproduce the 2292-sample case):

from fuzzywuzzy import fuzz
from sklearn.cluster import HDBSCAN
from scipy.sparse import lil_matrix
import itertools

def dis_matrix(word_list):
    kw_index = {}
    index_kw = {}
    n = len(word_list)
    distance_matrix = lil_matrix((n, n))

    # map each keyword to its row/column index and back
    for count, kw in enumerate(word_list):
        kw_index[kw] = count
        index_kw[count] = kw

    # fuzz.ratio is symmetric, so each unordered pair only needs to be scored once
    for x, y in itertools.combinations(word_list, 2):
        d = fuzz.ratio(x, y) / 100
        # the original `d <= 1` test was always true, making the fallback dead
        # code; `d < 1` keeps distinct strings with a perfect ratio from
        # collapsing to an (unstored) zero in the sparse matrix
        distance = 1 - d if d < 1 else 0.00000000000001
        index1 = kw_index[x]
        index2 = kw_index[y]
        distance_matrix[index1, index2] = distance
        distance_matrix[index2, index1] = distance

    return distance_matrix, index_kw

CLUSTERING_MIN_SAMPLES = 2
x = ['apple', 'app', 'banana', 'bannana', 'applesauce', 'peaches', 'peach', 'appban']
distance_matrix, index_kw = dis_matrix(x)
hdb = HDBSCAN(
    cluster_selection_epsilon=0.1,
    metric='precomputed',
    n_jobs=8,
    min_samples=CLUSTERING_MIN_SAMPLES,
    store_centers='medoid',
)
clustering = hdb.fit(distance_matrix.tocsr())
print(clustering.medoids_)
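As for the per-cluster eps I'm after, a minimal sketch of one way to derive it once medoids are known (assuming a dense distance matrix, a labels array, and a medoid row index per cluster; defining the radius as the maximum medoid-to-member distance is just one possible choice):

```python
import numpy as np

# hand-made symmetric distance matrix and labels, for illustration only
dist = np.array([
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.8, 0.9],
    [0.9, 0.8, 0.0, 0.2],
    [0.9, 0.9, 0.2, 0.0],
])
labels = np.array([0, 0, 1, 1])
medoids = {0: 0, 1: 2}  # medoid row per cluster (computed by hand, since .medoids_ is what fails here)

# per-cluster "eps": the farthest any member sits from its cluster's medoid;
# a new point within this radius of a medoid could be assigned to that cluster
eps = {lbl: dist[m, np.flatnonzero(labels == lbl)].max() for lbl, m in medoids.items()}
print(eps)
```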