Compute similarity of two GPS track lines


I have ~2500 outdoor activities that I've recorded with a GPS tracker. I'd like to group similar activities automatically. Let's take a look at two related but not exactly similar activities:

Map with GPS track to the beach

Map with GPS track to and then along the beach

One fundamental issue is that I cannot just “zip” the points of the two activities together and expect corresponding points to lie adjacent. Different speeds quickly break the alignment. So I have created a similarity measure based on the minimum distance from a given point of one activity to any point of the other activity. One can plot this as a quantile distribution over all the points, here from the short activity to the longer activity:

Quantile distribution

One can see here that over 90 % of the points of the short activity lie within 7.5 m of the longer activity. Given the spacing between GPS measurements and the general measurement uncertainty, one can say that these parts overlap. Some 3 % of points have a distance of up to 20 m, so still not very far away. This makes sense: the shorter activity is practically a subset of the longer one.
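This directed metric can be sketched as follows. This is a minimal brute-force version, assuming each track is already projected to planar (x, y) coordinates in meters; the function name and quantile choices are illustrative, not from the original post.

```python
import numpy as np

def directed_quantile_distances(track_a, track_b, quantiles=(0.5, 0.9, 0.97)):
    """For each point of track_a, find the distance to the nearest point
    of track_b, then summarize those distances as quantiles.
    Brute force: O(|A| * |B|) distance computations."""
    a = np.asarray(track_a, dtype=float)  # shape (n, 2), meters
    b = np.asarray(track_b, dtype=float)  # shape (m, 2), meters
    # Pairwise differences give an (n, m) distance matrix; take the
    # minimum over track_b for each point of track_a.
    diffs = a[:, None, :] - b[None, :, :]
    nearest = np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)
    return np.quantile(nearest, quantiles)
```

Note that the metric is directional: `directed_quantile_distances(a, b)` and `directed_quantile_distances(b, a)` generally differ, which is exactly the subset-versus-superset asymmetry described above.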

Then one can do this the other way around. Since the longer activity has unique points, the quantile distribution rises much more quickly:

Quantile distribution rising faster

We see that around 30 % of the points have a close correspondence in the short activity. But then the distance only grows, with some points being 1000 m away from the closest point of the shorter activity. This is clearly a different route.

This metric seems to work. The problem is that even if I only sample 50 or 100 points from one activity, I still need to check them against all the points of the reference activity. Sampling both would leave large gaps and doesn't work. Each comparison takes a significant fraction of a second.

With ~2500 activities, there are around 6,250,000 ordered pairwise comparisons to make (the metric is directional, so each pair is computed both ways). This would take around 180 hours, which is over a week. When a new activity is added, the incremental work would also be many minutes, which I find way too long.

I have tried to reduce the number of comparisons by first checking whether the start and end points are close, but that also takes rather long.

The fundamental problem is that this approach to similarity detection is O(N²) and something like O(N log N) would be much nicer. For this to work I would need to be able to either find a way for transitive comparisons such that I could arrange them in a tree-like structure or perhaps even find a bucketing technique.

One thing one could do is to tile the latitude/longitude space into a fine grid and create two-dimensional buckets according to the starting point, then compare only against activities within the same bucket. However, this will fail when similar activities land in different buckets, e.g. when their start points lie near a cell boundary. Also it will become O(N²) again, because a constant fraction of my activities are trips to the grocery store along the same route and therefore all end up in the same bucket.
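The bucketing idea above could look like this. This is a sketch under stated assumptions: the cell size of 0.001° (roughly 100 m in latitude) is an arbitrary choice, and the function names are hypothetical. As noted, start points near a cell boundary would also require checking the eight neighboring cells.

```python
import collections
import math

def bucket_key(lat, lon, cell_deg=0.001):
    # Quantize a coordinate onto a grid of cell_deg-sized cells.
    # math.floor (not int) so negative coordinates bucket consistently.
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))

def group_by_start(activities, cell_deg=0.001):
    """activities: list of tracks, each a list of (lat, lon) points.
    Returns a dict mapping grid cell -> indices of activities
    whose first point falls into that cell."""
    buckets = collections.defaultdict(list)
    for i, track in enumerate(activities):
        lat, lon = track[0]
        buckets[bucket_key(lat, lon, cell_deg)].append(i)
    return buckets
```

Only activities sharing a bucket (plus, in practice, its neighbors) would then be compared pairwise, which helps unless one bucket dominates.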

Is there some method to get this similarity finding down to O(N log N) or even O(N) in the number of activities N = ~2500?

1 Answer

Answered by btilly

The real question here is, "How do I efficiently find, for each point in one set, the closest point in another set without using brute force?"

The answer to that is to use a data structure designed for it, such as a quadtree or a k-d tree, and then run a nearest-neighbor search against it.

You can either program this yourself using a library that implements such a data structure, or store the data in a database that knows how to do it. For example, PostgreSQL has good support for nearest-neighbor searches: assuming you have PostGIS installed, a correlated query will let you achieve O(n log n) for your exact approach to this problem.