Thank you for reading this. I currently have latitude and longitude coordinates for many locations, and I need to create a matrix of distances between locations that are within 10 km of each other. (It's okay to fill the matrix with 0 for pairs of locations more than 10 km apart.)
Data looks like:
place_coordinates=[[lat1, lon1],[lat2, lon2],...]
I'm currently using the code below to calculate it, but it takes a very long time.
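import pandas as pd
from scipy.spatial.distance import pdist, squareform

# haversine: a function taking two (lat, lon) points (implementation not shown)
place_correlation = pd.DataFrame(
    squareform(pdist(place_coordinates, metric=haversine)),
    index=place_coordinates,
    columns=place_coordinates
)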
With squareform, I do not know how to skip storing or calculating distances that are beyond 10 km.
What is the fastest way?
Thank you in advance!
First of all, do you need to use the haversine metric for distance calculation? Which implementation do you use? If you used e.g. the euclidean metric, your calculation would be faster, but I guess you have good reasons for choosing this metric.

In that case it may be better to use a more optimized implementation of haversine (but I do not know which implementation you use). Check e.g. this SO question.

I guess you are using pdist and squareform from scipy.spatial.distance. When you look at the implementation behind them (here), you will find that they use a for loop. In that case you could rather use a vectorized implementation (e.g. this one from the linked question above).

When you compare times (absolute numbers will differ depending on the machine used):
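A minimal sketch of such a comparison, assuming a plain-Python haversine passed to pdist on one side and a NumPy-broadcast version on the other. The names haversine_scalar and haversine_matrix, the Earth radius constant, and the random test coordinates are illustrative assumptions, not the implementations from the linked question, and the printed timings will vary from machine to machine.

```python
import math
import timeit

import numpy as np
from scipy.spatial.distance import pdist, squareform

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_scalar(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def haversine_matrix(coords):
    """Full pairwise distance matrix in km, computed with NumPy broadcasting."""
    rad = np.radians(np.asarray(coords, dtype=float))
    lat, lon = rad[:, 0], rad[:, 1]
    dlat = lat[:, None] - lat[None, :]
    dlon = lon[:, None] - lon[None, :]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))


# Made-up random coordinates, only for timing purposes.
rng = np.random.default_rng(0)
coords = np.column_stack([rng.uniform(-60, 60, 200), rng.uniform(-180, 180, 200)])

loop_result = squareform(pdist(coords, metric=haversine_scalar))
vector_result = haversine_matrix(coords)

loop_time = timeit.timeit(
    lambda: squareform(pdist(coords, metric=haversine_scalar)), number=3)
vector_time = timeit.timeit(lambda: haversine_matrix(coords), number=3)
print(f"pdist with Python callable: {loop_time:.3f} s, vectorized: {vector_time:.3f} s")
```

Note that the broadcast version builds several n x n temporary arrays, so memory use grows quadratically with the number of coordinates.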
That's quite a lot (~60x faster). When you have a really long array (how many coordinates are you using?), this can help you a lot.
Finally, you can combine it using your code:
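One way this combination might look, as a sketch: compute the full matrix with a vectorized haversine, set everything beyond 10 km to 0 as the question allows, and wrap the result in a DataFrame. The function haversine_matrix, the three example coordinates, and the use of (lat, lon) tuples as labels are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_matrix(coords):
    """Full pairwise distance matrix in km, computed with NumPy broadcasting."""
    rad = np.radians(np.asarray(coords, dtype=float))
    lat, lon = rad[:, 0], rad[:, 1]
    dlat = lat[:, None] - lat[None, :]
    dlon = lon[:, None] - lon[None, :]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))


# Made-up example data: two nearby points and one far away.
place_coordinates = [[52.5200, 13.4050], [52.5300, 13.4100], [48.8566, 2.3522]]

distances = haversine_matrix(place_coordinates)
distances[distances > 10.0] = 0.0  # pairs farther apart than 10 km are stored as 0

# Tuples are hashable, so they can serve as row and column labels.
labels = pd.MultiIndex.from_tuples(
    [tuple(p) for p in place_coordinates], names=["lat", "lon"])
place_correlation = pd.DataFrame(distances, index=labels, columns=labels)
print(place_correlation)
```

This still computes every pairwise distance before discarding the far ones; it only speeds up the calculation itself.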
An additional improvement could be to use another metric (e.g. euclidean, which will be faster) to quickly determine which distances are outside 10 km, and then calculate haversine only for the rest.
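A possible sketch of that prefilter, under assumptions not stated in the answer: the points are mapped to 3D coordinates on a unit sphere, where the straight-line (euclidean) chord length grows monotonically with the great-circle distance, and scipy's cKDTree finds the pairs whose chord is short enough. Haversine is then evaluated only for those candidate pairs, and the result is kept in a scipy.sparse matrix so that far-apart pairs are never stored. The function name nearby_distance_matrix and the example coordinates are hypothetical.

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.spatial import cKDTree

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def nearby_distance_matrix(coords, max_km=10.0):
    """Sparse matrix of haversine distances (km) for pairs within max_km."""
    rad = np.radians(np.asarray(coords, dtype=float))
    lat, lon = rad[:, 0], rad[:, 1]

    # Points on the unit sphere; euclidean distance here is the chord length.
    xyz = np.column_stack([np.cos(lat) * np.cos(lon),
                           np.cos(lat) * np.sin(lon),
                           np.sin(lat)])

    # Chord length corresponding to max_km, slightly enlarged against rounding.
    chord = 2.0 * np.sin(max_km / (2.0 * EARTH_RADIUS_KM)) * (1.0 + 1e-9)
    pairs = cKDTree(xyz).query_pairs(chord, output_type="ndarray")
    i, j = pairs[:, 0], pairs[:, 1]

    # Haversine only for the candidate pairs.
    a = (np.sin((lat[j] - lat[i]) / 2) ** 2
         + np.cos(lat[i]) * np.cos(lat[j]) * np.sin((lon[j] - lon[i]) / 2) ** 2)
    d = 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))

    keep = d <= max_km
    i, j, d = i[keep], j[keep], d[keep]

    # Store each pair in both directions to get a symmetric matrix.
    n = len(rad)
    rows = np.concatenate([i, j])
    cols = np.concatenate([j, i])
    return coo_matrix((np.concatenate([d, d]), (rows, cols)), shape=(n, n)).tocsr()


# Made-up example data: two nearby points and one far away.
place_coordinates = [[52.5200, 13.4050], [52.5300, 13.4100], [48.8566, 2.3522]]
result = nearby_distance_matrix(place_coordinates)
print(result.toarray())
```

With this layout the work scales with the number of nearby pairs rather than with all n x n pairs, which matters most when the locations are spread over a large area.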