I want to compute the squared Euclidean distance in a fast way, as described here:
What is the fastest way to compute an RBF kernel in python?
Note 1: I am only interested in the distance, not the RBF kernel itself.
Note 2: I am neglecting numexpr here and only use NumPy directly.
In short, I compute:
||x - y||^2 = ||x||^2 + ||y||^2 - 2 * (x @ y.T)
With this I am able to compute the distance matrix roughly a factor of 10 faster than scipy.pdist. However, I observe numerical issues, which get worse once I take the square root to obtain the Euclidean distance: I get values on the order of 1e-8 to 1e-7 that should be exactly zero (i.e. for duplicated points, or the distance of a point to itself).
Question:
Are there ways or ideas to overcome these numerical issues (preferably without sacrificing too much evaluation speed)? Or are the numerical issues the reason why this path is not taken (e.g. by scipy.pdist) in the first place?
Example:
This is a small code example to show the numerical issues (not the speed-ups; for those, please see the answers in the linked SO thread above).
import numpy as np

M = np.random.rand(1000, 10)
M_norm = np.sum(M**2, axis=1)  # squared row norms ||x||^2

# ||x||^2 + ||y||^2 - 2 * (x @ y.T)
res = M_norm[:, np.newaxis] + M_norm[np.newaxis, :] - 2. * M @ M.T

unique = np.unique(np.diag(res))  # analytically, all diagonal values are exactly zero
sqrt_unique = np.sqrt(unique)

print(unique)
print(sqrt_unique)
Example output:
[-2.66453526e-15 -1.77635684e-15 -8.88178420e-16 -4.44089210e-16
0.00000000e+00 4.44089210e-16 8.88178420e-16 1.77635684e-15
3.55271368e-15]
[ nan nan nan nan
0.00000000e+00 2.10734243e-08 2.98023224e-08 4.21468485e-08
5.96046448e-08]
As you can see, some values are also negative (which results in nan after taking the sqrt). Of course these are easy to catch, but the small positives end up with a large absolute error in the Euclidean case (e.g. abs_error=5.96046448e-08).
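For example, clamping at zero before taking the root:

np.sqrt(np.maximum(unique, 0.0))  # no more nan, but the ~1e-8 error on the small positives remains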
Answer:

As per my comment, using abs is probably your best option for cleaning up the numerical noise inherent in this algorithm. Since you are concerned about performance, you should probably also use the mutating assignment operators, as they create less garbage and hence can be much faster. Also, when running this with many features (e.g. 10k) I see pdist being slower than this implementation. Putting the above together, we get something like the following sketch:
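import numpy as np

# sketch reconstructing the approach described above; the helper name dist is mine
def dist(X):
    # squared row norms, kept 2-D so broadcasting builds the full matrix
    sx = np.sum(X**2, axis=1, keepdims=True)
    # start from the Gram matrix and update it in place to avoid temporaries
    res = X @ X.T
    res *= -2.0                   # -2 * (x @ y.T)
    res += sx                     # + ||x||^2, broadcast along rows
    res += sx.T                   # + ||y||^2, broadcast along columns
    np.abs(res, out=res)          # clamp the tiny negative values from cancellation
    return np.sqrt(res, out=res)  # in-place square root: no nan, no extra copy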
Timing this in IPython with something like:
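# benchmark sketch: the array shape is my choice; 10k features is the regime mentioned above
import numpy as np
from scipy.spatial.distance import pdist

M = np.random.rand(1000, 10_000)

%timeit dist(M)   # the implementation above (full square matrix)
%timeit pdist(M)  # SciPy's pdist for comparison (condensed matrix)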
I see the NumPy implementation come out ahead of pdist, and no errors/warnings from sqrt.

The linked question also points to scikit-learn as having good distance-kernel implementations, the Euclidean one being pairwise_distances, which can be benchmarked the same way and might be nice to use if you are already using that package.
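A minimal usage sketch, assuming scikit-learn is installed:

import numpy as np
from sklearn.metrics import pairwise_distances

M = np.random.rand(1000, 10)
D = pairwise_distances(M)  # metric='euclidean' is the default
# scikit-learn clips negative squared distances and zeroes the self-distance
# diagonal internally, so the nan issue above should not appear here
print(D.diagonal().max())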