I want to compute the squared Euclidean distance in a fast way, as described here:
What is the fastest way to compute an RBF kernel in python?
Note 1: I am only interested in the distance, not the RBF kernel itself.
Note 2: I am neglecting numexpr here and only use NumPy directly.
In short, I compute:
||x - y||^2 = ||x||^2 + ||y||^2 - 2 * (x @ y.T)
With this I am able to compute the distance matrix roughly a factor of 10 faster than scipy.pdist. However, I observe numerical issues, which get worse once I take the square root to obtain the Euclidean distance: I get values on the order of 1e-8 to 1e-7 that should be exactly zero (i.e. for duplicated points, or the distance of a point to itself).
Question:
Are there ways or ideas to overcome these numerical issues (preferably without sacrificing too much evaluation speed)? Or are the numerical issues the reason why this path is not taken (e.g. by scipy.pdist) in the first place?
Example:
This is a small code example to show the numerical issues (not the speed-ups; for those, please see the answers in the linked SO thread above).
import numpy as np

M = np.random.rand(1000, 10)
M_norm = np.sum(M**2, axis=1)  # squared row norms ||x||^2

# ||x||^2 + ||y||^2 - 2 * (x @ y.T)
res = M_norm[:, np.newaxis] + M_norm[np.newaxis, :] - 2. * M @ M.T

unique = np.unique(np.diag(res))  # analytically, all diagonal values are exactly zero
sqrt_unique = np.sqrt(unique)

print(unique)
print(sqrt_unique)
Example output:
[-2.66453526e-15 -1.77635684e-15 -8.88178420e-16 -4.44089210e-16
0.00000000e+00 4.44089210e-16 8.88178420e-16 1.77635684e-15
3.55271368e-15]
[ nan nan nan nan
0.00000000e+00 2.10734243e-08 2.98023224e-08 4.21468485e-08
5.96046448e-08]
As you can see, some values are also negative (which results in nan after taking the sqrt). Of course these are easy to catch, but the small positives end up with a large absolute error in the Euclidean case (e.g. abs_error=5.96046448e-08).
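For example, clamping at zero before taking the root:

np.sqrt(np.maximum(unique, 0.0))  # no more nan, but the ~1e-8 error on the small positives remains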
Answer:

As per my comment, using abs is probably your best option for cleaning up the numerical noise inherent in this algorithm. Since you are concerned about performance, you should probably also use the mutating assignment operators, as they create less garbage and hence can be much faster. Also, when running this with many features (e.g. 10k) I see pdist being slower than this implementation. Putting the above together, we get something like the following sketch:
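import numpy as np

# sketch reconstructing the approach described above; the helper name dist is mine
def dist(X):
    # squared row norms, kept 2-D so broadcasting builds the full matrix
    sx = np.sum(X**2, axis=1, keepdims=True)
    # start from the Gram matrix and update it in place to avoid temporaries
    res = X @ X.T
    res *= -2.0                   # -2 * (x @ y.T)
    res += sx                     # + ||x||^2, broadcast along rows
    res += sx.T                   # + ||y||^2, broadcast along columns
    np.abs(res, out=res)          # clamp the tiny negative values from cancellation
    return np.sqrt(res, out=res)  # in-place square root: no nan, no extra copy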
Timing this in IPython with something like:
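# benchmark sketch: the array shape is my choice; 10k features is the regime mentioned above
import numpy as np
from scipy.spatial.distance import pdist

M = np.random.rand(1000, 10_000)

%timeit dist(M)   # the implementation above (full square matrix)
%timeit pdist(M)  # SciPy's pdist for comparison (condensed matrix)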
I see the NumPy implementation come out ahead of pdist, and no errors/warnings from sqrt.

The linked question also points to scikit-learn as having good distance-kernel implementations, the Euclidean one being pairwise_distances, which can be benchmarked the same way and might be nice to use if you are already using that package.
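A minimal usage sketch, assuming scikit-learn is installed:

import numpy as np
from sklearn.metrics import pairwise_distances

M = np.random.rand(1000, 10)
D = pairwise_distances(M)  # metric='euclidean' is the default
# scikit-learn clips negative squared distances and zeroes the self-distance
# diagonal internally, so the nan issue above should not appear here
print(D.diagonal().max())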