Passing a sparse distance matrix to AgglomerativeClustering raises TypeError


I was getting a MemoryError when I passed the embeddings of 100,000 documents to the pairwise_distances function. To work around this, I computed the distance matrix in chunks, converted each chunk to a sparse matrix, and stacked the chunks at the end. However, AgglomerativeClustering does not accept sparse matrix input. What can I do as an alternative?

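    from scipy import sparse
    from sklearn.cluster import AgglomerativeClustering
    from sklearn.metrics import pairwise_distances

    # embeddings, similarity_threshold, linkage and verbose are defined elsewhere.

    ####################
    # SPARSE DISTANCE MATRIX (cosine distance, built in 10 row chunks)
    parts = []
    chunk_size = len(embeddings) // 10 + 1
    for i in range(10):
        print(i)
        # Dense distances from this chunk of rows to all documents
        M = pairwise_distances(embeddings[i*chunk_size : (i+1)*chunk_size], embeddings, metric='cosine', n_jobs=-1)
        # Zero out distances above 0.35 so the sparse format does not store them
        M[M > 0.35] = 0
        M = sparse.csr_array(M)
        print(M.data.nbytes)
        parts.append(M)
        print('--------')

    sm_matrix = sparse.vstack(parts)
    del parts
    print(sm_matrix.data.nbytes)
    ####################

    print(sm_matrix)

    ####################
    # Agglomerative Clustering
    # Note: newer scikit-learn releases use metric='precomputed' instead of affinity.
    clustering = AgglomerativeClustering(n_clusters=None, distance_threshold=1-similarity_threshold, affinity='precomputed', linkage=linkage)
    clustering.fit(sm_matrix)  # this call raises the TypeError
    if verbose:
        print('Clusters are calculated')
    # clusters created
    ####################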

TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.
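
One possible direction, sketched here under assumptions rather than offered as a confirmed fix: if the linkage is 'single', cutting the hierarchy at a distance threshold gives the same groups as the connected components of the graph that links every pair of points whose distance is at or below that threshold. That graph can be built chunk by chunk and kept sparse, and SciPy's connected_components accepts sparse input. The function name threshold_components and its n_chunks parameter are hypothetical names chosen for this illustration, and the equivalence does not hold for 'average', 'complete' or 'ward' linkage.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.csgraph import connected_components
from sklearn.metrics import pairwise_distances


def threshold_components(embeddings, distance_threshold, n_chunks=10):
    """Label points by connected components of the 'distance <= threshold' graph.

    Hypothetical helper: matches single-linkage clustering cut at
    distance_threshold, and no other linkage.
    """
    n = len(embeddings)
    chunk_size = n // n_chunks + 1
    parts = []
    for start in range(0, n, chunk_size):
        # Dense cosine distances for one block of rows only.
        d = pairwise_distances(
            embeddings[start:start + chunk_size], embeddings, metric="cosine"
        )
        # Store a 0/1 adjacency instead of the distances themselves, so pairs
        # at distance exactly 0 (duplicates) are not lost in the sparse format.
        adjacency = (d <= distance_threshold).astype(np.int8)
        parts.append(sparse.csr_matrix(adjacency))
    graph = sparse.vstack(parts).tocsr()
    # Treat the graph as undirected: an edge in either direction joins two points.
    n_components, labels = connected_components(graph, directed=False)
    return n_components, labels
```

With the variables from the question, a call might look like `n_clusters, labels = threshold_components(embeddings, 1 - similarity_threshold)`. Memory use then depends on how many pairs fall under the threshold, so a looser threshold produces a denser graph.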