How do I calculate the euclidean distance to the nearest neighbour for each coordinates pair in meters in Pandas dataframe?


I have a dataframe like this:

index      place_id  var_lat_fact   var_lon_fact
    0  167312091448  5.6679820000  -0.0144950000
    1  167312091448  5.6686320000  -0.0157910000
    2  167312091448  5.6653530000  -0.0181980000
    3  167312091448  5.6700970000  -0.0191400000
    4  167312091448  5.6689810000  -0.0104040000

For each coordinate pair (lat, lon), I'd like to calculate the Euclidean distance to its nearest neighbour within the dataframe, so that each point gets a value in an additional column (say, nearest_neighbour_dist) giving that distance in meters.

Something like this:

index      place_id  var_lat_fact   var_lon_fact  nearest_neighbour_dist
    0  167312091448  5.6679820000  -0.0144950000              160.588370
    1  167312091448  5.6686320000  -0.0157910000              160.588370
    2  167312091448  5.6653530000  -0.0181980000              451.525301
    3  167312091448  5.6700970000  -0.0191400000              404.794908
    4  167312091448  5.6689810000  -0.0104040000              466.104453

Just can't get my head around this... Any help would be greatly appreciated.

mozway On BEST ANSWER

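You can use sklearn's NearestNeighbors with the haversine metric, which expects (lat, lon) in radians and returns great-circle distances in radians: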

from sklearn.neighbors import NearestNeighbors
from numpy import deg2rad

# set up the nearest neighbors search; the haversine metric expects (lat, lon) in radians
neigh = NearestNeighbors(n_neighbors=2, metric='haversine')
data = deg2rad(df[['var_lat_fact', 'var_lon_fact']])
neigh.fit(data)

# find the closest two points:
# the first is the point itself (distance 0), the second is the closest non-self point
distances, _ = neigh.kneighbors(data, return_distance=True)

# convert radians to meters using a mean Earth radius of 6371 km
df['nearest_neighbour_dist'] = distances[:, -1] * 6371 * 1000

Output:

   index      place_id  var_lat_fact  var_lon_fact  nearest_neighbour_dist
0      0  167312091448      5.667982     -0.014495              160.588370
1      1  167312091448      5.668632     -0.015791              160.588370
2      2  167312091448      5.665353     -0.018198              451.525301
3      3  167312091448      5.670097     -0.019140              404.794908
4      4  167312091448      5.668981     -0.010404              466.104453

Points on a map

I wanted to double-check the validity of the computations:

The distance 1 -> 2 (index 0 -> 1 in your data) is indeed about 160.6 meters.
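
The same check can also be done by hand. Below is a minimal sketch of the haversine formula in plain Python, assuming the same mean Earth radius of 6371 km as in the code above; the function name haversine_m is only for illustration.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6371 * 1000  # assumed mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


# index 0 and index 1 from the question's data
d01 = haversine_m(5.667982, -0.014495, 5.668632, -0.015791)
print(round(d01, 1))
```

This should print roughly 160.6, in line with the value in the output table.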


Michael Gruner On

First, you can't compute Euclidean distances directly in a geographic coordinate system (latitude and longitude); you need to convert the points to Cartesian coordinates. Also, are you sure you're looking for the Euclidean distance? Something like the geodesic distance seems more natural for this problem. The Euclidean distance gives you the straight-line distance "through" the Earth, while the geodesic distance gives you the distance as if you were walking over the curved surface of the Earth.
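
To get a feel for how much the two differ, here is a rough sketch assuming a perfectly spherical Earth of radius 6371 km (both the radius and the helper name chord_from_arc are assumptions for illustration). On a sphere, a surface distance s corresponds to a straight-line chord of length 2R * sin(s / 2R):

```python
from math import sin

EARTH_RADIUS_M = 6_371_000  # assumed mean Earth radius in meters


def chord_from_arc(arc_m, radius_m=EARTH_RADIUS_M):
    """Straight-line ("through the Earth") distance for a given surface distance on a sphere."""
    return 2 * radius_m * sin(arc_m / (2 * radius_m))


for arc in (160.6, 10_000.0, 1_000_000.0):
    diff = arc - chord_from_arc(arc)
    print(f"surface {arc:>11.1f} m -> chord is shorter by {diff:.6f} m")
```

For points a few hundred meters apart, as in the question, the difference is far below a millimeter, so both metrics give practically the same nearest-neighbor distance. It only becomes noticeable (about 1 km) at separations of around 1000 km.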

Distance to Nearest Neighbor with Euclidean Distance

  1. Convert to Cartesian coordinates
import pandas as pd
import numpy as np

df = pd.read_csv('path_to_your_csv.csv')

earth_radius = 6371000  # mean Earth radius in meters

# NumPy's trigonometric functions expect radians, so convert from degrees first
lat = np.deg2rad(df['var_lat_fact'])
lon = np.deg2rad(df['var_lon_fact'])

df['x'] = earth_radius * np.cos(lat) * np.cos(lon)
df['y'] = earth_radius * np.cos(lat) * np.sin(lon)
df['z'] = earth_radius * np.sin(lat)
  2. Compute the distance between all the points
from scipy.spatial import distance_matrix

# Create a matrix of all points
points = df[['x', 'y', 'z']].to_numpy()

# Compute the distance matrix
dist_matrix = distance_matrix(points, points)

# Set the diagonal to infinity to ignore zero distance to self
np.fill_diagonal(dist_matrix, np.inf)

# Find the minimum distance for each point
df['nearest_neighbor_dist'] = np.min(dist_matrix, axis=1)

# Drop the Cartesian coordinates as they are no longer needed
df = df.drop(['x', 'y', 'z'], axis=1)
  3. The df['nearest_neighbor_dist'] column now contains the Euclidean distance to the nearest neighbor, in meters.
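
The full distance matrix needs memory proportional to n², which becomes a problem with many points. As one possible alternative (not part of the approach above), a k-d tree finds the same nearest-neighbor distances without building the matrix. A minimal sketch with scipy.spatial.cKDTree, using the five points from the question and an assumed Earth radius of 6371 km:

```python
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

# The five points from the question
df = pd.DataFrame({
    'var_lat_fact': [5.667982, 5.668632, 5.665353, 5.670097, 5.668981],
    'var_lon_fact': [-0.014495, -0.015791, -0.018198, -0.019140, -0.010404],
})

earth_radius = 6371000  # assumed mean Earth radius in meters
lat = np.deg2rad(df['var_lat_fact'].to_numpy())
lon = np.deg2rad(df['var_lon_fact'].to_numpy())

# Cartesian coordinates on a sphere
points = np.column_stack([
    earth_radius * np.cos(lat) * np.cos(lon),
    earth_radius * np.cos(lat) * np.sin(lon),
    earth_radius * np.sin(lat),
])

# k=2: the nearest hit is the point itself (distance 0), the second is the true neighbor
tree = cKDTree(points)
dists, _ = tree.query(points, k=2)
df['nearest_neighbor_dist'] = dists[:, 1]
print(df)
```

Note that this relies on the first hit being the query point itself, which may not hold if the data contains exact duplicate coordinates.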

Distance to Nearest Neighbor with Geodesic Distance

  1. Compute the nearest neighbor distance to each point:
import pandas as pd
from geopy.distance import geodesic

# Convert the latitude and longitude from your DataFrame to a list of (lat, lon) tuples
coordinates = list(zip(df['var_lat_fact'], df['var_lon_fact']))

# Initialize a list to hold the nearest neighbor distances
nearest_neighbor_dists = []

# Calculate the geodesic distance from each point to every other point
for i in range(len(coordinates)):
    distances = [geodesic(coordinates[i], coordinates[j]).meters for j in range(len(coordinates)) if i != j]
    # Keep the smallest one
    nearest_neighbor_dists.append(min(distances))

# Store the result in a new column
df['nearest_neighbor_dist'] = nearest_neighbor_dists
  2. The df['nearest_neighbor_dist'] column now contains the geodesic distance to the nearest neighbor, in meters.