In NumPy, the command numpy.corrcoef(X.T) is amazingly efficient at computing correlations between every possible pair of columns in a matrix X. I am looking for a similarly efficient method to compute Hamming distances between every possible pair of columns of a binary matrix B. Is there a NumPy method that I can adapt to do this?
I tried using SciPy's spatial.distance.pdist(X, metric = 'hamming'), but it is 100x slower than NumPy's pairwise correlation function.
Following @frank-yellin's comment, I also tried spatial.distance.pdist(X, metric='cityblock'), but this only sped up the calculation by 1.7x, which is nice, but I am looking for a ~100x speed-up if possible.
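(For reference, on 0/1 data the 'cityblock' metric is just the unnormalised Hamming count, so the two metrics only differ by a factor of the vector length. Here is a minimal sketch, using an arbitrary small example matrix, confirming that:)
import numpy as np
from scipy import spatial
# small illustrative binary matrix; columns are the vectors being compared
X = np.random.randint(0, 2, (100, 5))
# 'cityblock' on 0/1 data counts differing positions (an unnormalised Hamming count),
# while 'hamming' returns the fraction of differing positions
cityblock = spatial.distance.pdist(X.T, metric='cityblock')
hamming = spatial.distance.pdist(X.T, metric='hamming')
print(np.allclose(cityblock, hamming * X.shape[0]))  # prints True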
import numpy as np
from scipy import spatial
import time

# 1000 rows x 1500 columns of random 0/1 values
binary_matrix = np.random.randint(0, 2, (1000, 1500), dtype='int32')

# pairwise Hamming distances between columns via SciPy
start = time.time()
hamming_with_scipy = spatial.distance.pdist(binary_matrix.T, metric='hamming')
end = time.time()
print(f'Hamming takes {end-start} seconds with scipy')

# pairwise correlations between columns via NumPy
start = time.time()
corr_with_numpy = np.corrcoef(binary_matrix.T)
end = time.time()
print(f'Correlation takes {end-start} seconds with numpy')
Output:
Hamming takes 5.301102876663208 seconds with scipy
Correlation takes 0.03205609321594238 seconds with numpy
I just used pdist with a custom function my_hamming and decorated that with numba. I got almost exactly the same timings, so there is probably not much to gain from dropping to a low-level language. I suspect this is rather an issue of computational complexity, and indeed: the correlation coefficients are computed in quadratic time (on a log-log plot, the slope is 2 decades on the time axis per decade on the size axis for larger sizes), while the distance computation is cubic (slope 3).
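For reference, the test looked roughly like this (the body of my_hamming is my reconstruction here, not quoted verbatim):
import numpy as np
from numba import njit
from scipy import spatial

@njit
def my_hamming(u, v):
    # fraction of positions where u and v differ (same convention as scipy's 'hamming')
    diff = 0
    for i in range(u.shape[0]):
        if u[i] != v[i]:
            diff += 1
    return diff / u.shape[0]

binary_matrix = np.random.randint(0, 2, (1000, 1500)).astype(np.float64)
hamming_with_numba = spatial.distance.pdist(binary_matrix.T, metric=my_hamming)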
I think this applies to most distance metrics, as they all require iterating over all elements of the column vectors.
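To put numbers on that for the example above: with 1500 columns of length 1000, pdist evaluates 1500·1499/2 ≈ 1.1 million pairs, each touching 1000 elements, i.e. on the order of 10⁹ element comparisons per call.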
So, in conclusion, these algorithms are not comparable. To some degree you might be able to accelerate this via parallel processing, but only by a constant factor (at most the number of CPUs); a sketch of that follows below.
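As an illustration of that constant-factor speed-up, here is a minimal sketch using numba's prange (my own example names, not benchmarked against the timings above):
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def pairwise_hamming(B):
    # B has shape (n_rows, n_cols); compare every pair of columns
    n_rows, n_cols = B.shape
    D = np.zeros((n_cols, n_cols))
    for i in prange(n_cols):          # outer loop is split across CPU threads
        for j in range(i + 1, n_cols):
            diff = 0
            for k in range(n_rows):
                if B[k, i] != B[k, j]:
                    diff += 1
            D[i, j] = diff / n_rows   # same normalisation as scipy's 'hamming'
            D[j, i] = D[i, j]
    return D                          # full square matrix, not pdist's condensed form

binary_matrix = np.random.randint(0, 2, (1000, 1500))
D = pairwise_hamming(binary_matrix)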