In NumPy, the command numpy.corrcoef(X.T) is amazingly efficient at computing correlations between every possible pair of columns in a matrix X. I am looking for a similarly efficient method to compute Hamming distances between every possible pair of columns of a binary matrix B. Is there a NumPy method that I can adapt to do this?
I tried using SciPy's spatial.distance.pdist(X, metric = 'hamming'), but it is 100x slower than NumPy's pairwise correlation function.
Following @frank-yellin's comment, I also tried spatial.distance.pdist(X, metric='cityblock'), but this only sped up the calculation by 1.7x, which is nice, but I am looking for a ~100x speed-up if possible.
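(For reference, on 0/1 data the 'cityblock' metric is just the unnormalised Hamming count, so the two metrics only differ by a factor of the vector length. Here is a minimal sketch, using an arbitrary small example matrix, confirming that:)
import numpy as np
from scipy import spatial
# small illustrative binary matrix; columns are the vectors being compared
X = np.random.randint(0, 2, (100, 5))
# 'cityblock' on 0/1 data counts differing positions (an unnormalised Hamming count),
# while 'hamming' returns the fraction of differing positions
cityblock = spatial.distance.pdist(X.T, metric='cityblock')
hamming = spatial.distance.pdist(X.T, metric='hamming')
print(np.allclose(cityblock, hamming * X.shape[0]))  # prints True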
import numpy as np
from scipy import spatial
import time

# 1000 rows x 1500 columns of random 0/1 values
binary_matrix = np.random.randint(0, 2, (1000, 1500), dtype='int32')

# pairwise Hamming distances between columns via SciPy
start = time.time()
hamming_with_scipy = spatial.distance.pdist(binary_matrix.T, metric='hamming')
end = time.time()
print(f'Hamming takes {end-start} seconds with scipy')

# pairwise correlations between columns via NumPy
start = time.time()
corr_with_numpy = np.corrcoef(binary_matrix.T)
end = time.time()
print(f'Correlation takes {end-start} seconds with numpy')
Output:
Hamming takes 5.301102876663208 seconds with scipy
Correlation takes 0.03205609321594238 seconds with numpy
I just used pdist with a custom function my_hamming and decorated that with numba. I got almost exactly the same timings, so there is probably not much to gain from dropping to a low-level language. I suspect this is rather an issue of computational complexity, and indeed: the correlation coefficients are computed in quadratic time (on a log-log plot, the slope is 2 decades on the time axis per decade on the size axis for larger sizes), while the distance computation is cubic (slope 3).
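For reference, the test looked roughly like this (the body of my_hamming is my reconstruction here, not quoted verbatim):
import numpy as np
from numba import njit
from scipy import spatial

@njit
def my_hamming(u, v):
    # fraction of positions where u and v differ (same convention as scipy's 'hamming')
    diff = 0
    for i in range(u.shape[0]):
        if u[i] != v[i]:
            diff += 1
    return diff / u.shape[0]

binary_matrix = np.random.randint(0, 2, (1000, 1500)).astype(np.float64)
hamming_with_numba = spatial.distance.pdist(binary_matrix.T, metric=my_hamming)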
I think this applies to most distance metrics, as they all require iterating over all elements of the column vectors.
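To put numbers on that for the example above: with 1500 columns of length 1000, pdist evaluates 1500·1499/2 ≈ 1.1 million pairs, each touching 1000 elements, i.e. on the order of 10⁹ element comparisons per call.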
So, in conclusion, these algorithms are not comparable. To some degree you might be able to accelerate this via parallel processing, but only by a constant factor (at most the number of CPUs); a sketch of that follows below.
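As an illustration of that constant-factor speed-up, here is a minimal sketch using numba's prange (my own example names, not benchmarked against the timings above):
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def pairwise_hamming(B):
    # B has shape (n_rows, n_cols); compare every pair of columns
    n_rows, n_cols = B.shape
    D = np.zeros((n_cols, n_cols))
    for i in prange(n_cols):          # outer loop is split across CPU threads
        for j in range(i + 1, n_cols):
            diff = 0
            for k in range(n_rows):
                if B[k, i] != B[k, j]:
                    diff += 1
            D[i, j] = diff / n_rows   # same normalisation as scipy's 'hamming'
            D[j, i] = D[i, j]
    return D                          # full square matrix, not pdist's condensed form

binary_matrix = np.random.randint(0, 2, (1000, 1500))
D = pairwise_hamming(binary_matrix)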