How to cluster people who live close (but not too close) to each other?


What I have:

I have a pandas DataFrame with the columns latitude and longitude, which hold the spatial coordinates of each person's home.

This could be an example:

import pandas as pd

data = {
"latitude": [49.5659508, 49.568089, 49.5686342, 49.5687609, 49.5695834, 49.5706579, 49.5711228, 49.5716422, 49.5717749, 49.5619579, 49.5619579, 49.5628938, 49.5628938, 49.5630028, 49.5633175, 49.56397639999999, 49.566359, 49.56643220000001, 49.56643220000001, 49.5672061, 49.567729, 49.5677449, 49.5679685, 49.5679685, 49.5688543, 49.5690616, 49.5713705],
"longitude": [10.9873409, 10.9894035, 10.9896749, 10.9887881, 10.9851579, 10.9853273, 10.9912959, 10.9910182, 10.9867083, 10.9995758, 10.9995758, 11.000319, 11.000319, 10.9990996, 10.9993819, 11.004145, 11.0003023, 10.9999593, 10.9999593, 10.9935709, 11.0011213, 10.9954016, 10.9982288, 10.9982288, 10.9975928, 10.9931367, 10.9939141],
}
df = pd.DataFrame(data)

df.head(11)

    latitude    longitude
0   49.565951   10.987341 
1   49.568089   10.989403 
2   49.568634   10.989675
3   49.568761   10.988788
4   49.569583   10.985158
5   49.570658   10.985327 
6   49.571123   10.991296
7   49.571642   10.991018
8   49.571775   10.986708
9   49.561958   10.999576
10  49.561958   10.999576

What I need:

I need to group the people into clusters of exactly 9, so that each cluster consists of neighbors. However, I do not want people with exactly the same spatial coordinates to be in the same cluster. My dataset has more than 3000 people, and a few hundred of them share their exact coordinates with someone else.

How to cluster the people: k-means-constrained is a good fit for this job. As explained in this article, the algorithm lets you fix the cluster size at 9, and it took me only a couple of lines to cluster the people.

Problem:

People who live in the same building (same spatial coordinates) always end up in the same cluster, because the algorithm groups people who live close to each other. I therefore need an automatic way to put these people into different clusters. Not just any different cluster, though: it should be one whose members still live relatively close by (see figure below).

This figure summarizes my problem: [image not included]

Background info:

This is how I cluster the people:

import numpy as np
from k_means_constrained import KMeansConstrained

coordinates = np.column_stack((df["latitude"], df["longitude"]))

# Define the number of clusters and the number of points per cluster
n_clusters = len(df) // 9
n_points_per_cluster = 9

# Perform k-means-constrained clustering
kmc = KMeansConstrained(n_clusters=n_clusters, size_min=n_points_per_cluster, size_max=n_points_per_cluster, random_state=0)
kmc.fit(coordinates)

# Get cluster assignments
df["cluster"] = kmc.labels_

# Print the clusters
for cluster_num in range(n_clusters):
    # Select both columns with a list (double brackets); a bare tuple key raises a KeyError
    cluster_data = df[df["cluster"] == cluster_num][["latitude", "longitude"]]
    print(f"Cluster {cluster_num + 1}:")
    print(cluster_data)
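
One caveat with the snippet above: setting size_min and size_max both to 9 can only be satisfied when the number of rows is an exact multiple of 9, as it is for the 27-row example. Here is a minimal sketch of how the bounds could be relaxed otherwise, assuming slightly smaller clusters are acceptable. The helper name feasible_size_bounds and the row count 3005 are made up for illustration and are not part of k-means-constrained.

```python
import math

def feasible_size_bounds(n_points, target_size=9):
    """Hypothetical helper: a cluster count and size bounds that can always be met."""
    # Use enough clusters that none has to hold more than target_size points
    n_clusters = math.ceil(n_points / target_size)
    # Smallest size a cluster can end up with when points are spread evenly
    size_min = n_points // n_clusters
    return n_clusters, size_min, target_size

print(feasible_size_bounds(27))    # the 27-row example: (3, 9, 9)
print(feasible_size_bounds(3005))  # made-up row count: (334, 8, 9)
```

The three returned values could then be passed as n_clusters, size_min and size_max to KMeansConstrained.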

Answer by Suraj Shourie:

As I mentioned in the comments, you can add a new feature that takes a different value for each row sharing the same lat/long. K-means assigns points to clusters by their distance to the cluster centres, so the extra feature increases the distance between duplicate rows, which was previously zero.

In this example, I'm just incrementing the third feature by 1 for each additional duplicate. You might need to try a different scaling factor if you have lots of data with lots of duplicates, because the scale controls how far apart multiple duplicates of the same location are pushed:

# add a new feature
df['feature'] = df.groupby(['latitude', 'longitude']).cumcount()
# just for visually checking prints (can remove)
df['IsDuplicate'] = df.groupby(['latitude', 'longitude'])['feature'].transform('count') > 1
coordinates = np.column_stack((df["latitude"], df["longitude"], df['feature']))
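
To illustrate why the extra column separates duplicates, and what a scaling factor changes, here is a small sketch on three rows taken from the example data. The helper add_duplicate_rank and its scale parameter are assumptions for illustration, not part of the code above. Note that the coordinates in the example span only about 0.01 to 0.02 degrees, so an offset of 1 is far larger than any real distance in the data.

```python
import numpy as np
import pandas as pd

def add_duplicate_rank(df, scale=1.0, cols=("latitude", "longitude")):
    """Hypothetical helper: number the rows within each location and scale that rank."""
    out = df.copy()
    out["feature"] = out.groupby(list(cols)).cumcount() * scale
    return out

# Rows 0 and 1 share a building; row 2 is a nearby building
toy = pd.DataFrame({
    "latitude":  [49.5619579, 49.5619579, 49.5628938],
    "longitude": [10.9995758, 10.9995758, 11.000319],
})

distances = {}
for scale in (1.0, 0.01):
    X = add_duplicate_rank(toy, scale)[["latitude", "longitude", "feature"]].to_numpy()
    d_same_building = np.linalg.norm(X[0] - X[1])  # equals scale: only the feature differs
    d_neighbour = np.linalg.norm(X[0] - X[2])      # unchanged: both rows have feature 0
    distances[scale] = (d_same_building, d_neighbour)
    print(f"scale={scale}: same building {d_same_building:.5f}, neighbour {d_neighbour:.5f}")
```

In both cases the two people in the same building end up farther apart than the neighbouring building, but a smaller scale keeps the offset closer to the real distances, so it is worth trying a few values on the full data.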

So when you rerun your clustering with these coordinates and print all columns, you can see that the duplicates are assigned to different clusters:

Cluster 1:
    latitude  longitude  feature  IsDuplicate  cluster
0  49.565951  10.987341        0        False        0
1  49.568089  10.989403        0        False        0
2  49.568634  10.989675        0        False        0
3  49.568761  10.988788        0        False        0
4  49.569583  10.985158        0        False        0
5  49.570658  10.985327        0        False        0
6  49.571123  10.991296        0        False        0
7  49.571642  10.991018        0        False        0
8  49.571775  10.986708        0        False        0
Cluster 2:
     latitude  longitude  feature  IsDuplicate  cluster
10  49.561958  10.999576        1         True        1
12  49.562894  11.000319        1         True        1
18  49.566432  10.999959        1         True        1
19  49.567206  10.993571        0        False        1
21  49.567745  10.995402        0        False        1
23  49.567968  10.998229        1         True        1
24  49.568854  10.997593        0        False        1
25  49.569062  10.993137        0        False        1
26  49.571371  10.993914        0        False        1
Cluster 3:
     latitude  longitude  feature  IsDuplicate  cluster
9   49.561958  10.999576        0         True        2
11  49.562894  11.000319        0         True        2
13  49.563003  10.999100        0        False        2
14  49.563317  10.999382        0        False        2
15  49.563976  11.004145        0        False        2
16  49.566359  11.000302        0        False        2
17  49.566432  10.999959        0         True        2
20  49.567729  11.001121        0        False        2
22  49.567968  10.998229        0         True        2
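
With more than 3000 rows, checking the printout by eye is not practical. A minimal sketch of an automated check follows, assuming the cluster labels are stored in a column named cluster as in the question. The helper name shared_coordinates_per_cluster and the cluster labels in the toy data are made up for illustration.

```python
import pandas as pd

def shared_coordinates_per_cluster(df, cols=("latitude", "longitude"), label="cluster"):
    """Hypothetical helper: per cluster, count rows whose coordinates repeat inside it."""
    # keep=False marks every row that has a twin with the same cluster and coordinates
    repeated = df.duplicated(subset=[label, *cols], keep=False)
    return df[repeated].groupby(label).size()

# Toy data: the first two rows share a building and, here, also a cluster
toy = pd.DataFrame({
    "latitude":  [49.5619579, 49.5619579, 49.5628938, 49.5630028],
    "longitude": [10.9995758, 10.9995758, 11.000319, 10.9990996],
    "cluster":   [0, 0, 0, 1],
})
conflicts = shared_coordinates_per_cluster(toy)
print(conflicts)
```

An empty result means no cluster contains two people with identical coordinates, which is what the three clusters printed above should give.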