What I have:
I have a pandas dataframe with columns latitude and longitude, which represent the spatial coordinates of people's homes.
This could be an example:
import pandas as pd
data = {
"latitude": [49.5659508, 49.568089, 49.5686342, 49.5687609, 49.5695834, 49.5706579, 49.5711228, 49.5716422, 49.5717749, 49.5619579, 49.5619579, 49.5628938, 49.5628938, 49.5630028, 49.5633175, 49.56397639999999, 49.566359, 49.56643220000001, 49.56643220000001, 49.5672061, 49.567729, 49.5677449, 49.5679685, 49.5679685, 49.5688543, 49.5690616, 49.5713705],
"longitude": [10.9873409, 10.9894035, 10.9896749, 10.9887881, 10.9851579, 10.9853273, 10.9912959, 10.9910182, 10.9867083, 10.9995758, 10.9995758, 11.000319, 11.000319, 10.9990996, 10.9993819, 11.004145, 11.0003023, 10.9999593, 10.9999593, 10.9935709, 11.0011213, 10.9954016, 10.9982288, 10.9982288, 10.9975928, 10.9931367, 10.9939141],
}
df = pd.DataFrame(data)
df.head(11)
latitude longitude
0 49.565951 10.987341
1 49.568089 10.989403
2 49.568634 10.989675
3 49.568761 10.988788
4 49.569583 10.985158
5 49.570658 10.985327
6 49.571123 10.991296
7 49.571642 10.991018
8 49.571775 10.986708
9 49.561958 10.999576
10 49.561958 10.999576
What I need:
I need to group the people into clusters of exactly 9 people each. This way I get clusters of neighbors. However, I do not want people with the exact same spatial coordinates to be in the same cluster. Since I have more than 3000 people in my dataset, there are many people (a few hundred) with the exact same spatial coordinates.
How to cluster the people?: A great algorithm for the clustering job is k-means-constrained. As explained in this article, the algorithm allows setting the cluster size to 9. It took me only a couple of lines to cluster the people.
Problem:
People who live in the same building (with the same spatial coordinates) always end up in the same cluster, since the goal is to cluster people who live close to each other. Therefore I have to find an automatic way to put these people into different clusters. Not just any different cluster, though, but one which contains people who still live relatively close (see figure below).
This figure summarizes my problem:

Background infos:
This is how I cluster the people:
import numpy as np
from k_means_constrained import KMeansConstrained
coordinates = np.column_stack((df["latitude"], df["longitude"]))
# Define the number of clusters and the number of points per cluster
n_clusters = len(df) // 9
n_points_per_cluster = 9
# Perform k-means-constrained clustering
kmc = KMeansConstrained(n_clusters=n_clusters, size_min=n_points_per_cluster, size_max=n_points_per_cluster, random_state=0)
kmc.fit(coordinates)
# Get cluster assignments
df["cluster"] = kmc.labels_
# Print the clusters
for cluster_num in range(n_clusters):
    # Select both coordinate columns for the rows in this cluster
    cluster_data = df.loc[df["cluster"] == cluster_num, ["latitude", "longitude"]]
    print(f"Cluster {cluster_num + 1}:")
    print(cluster_data)
As I mentioned in the comments, you can add a new feature whose value differs between rows that have duplicate lat/long values. Because k-means assigns each point to a cluster based on its distance to the cluster centres, adding another feature increases the distance between the duplicate rows (whereas earlier the distance would have been zero).
In this example, I'm just incrementing the third feature by 1 for each additional duplicate. If you have lots of data and lots of duplicates, you might need to try a different scaling factor, since that factor determines how far apart multiple duplicates end up.
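A minimal sketch of that idea, assuming the extra feature is a running count per coordinate pair (0 for the first person at a location, 1 for the second, and so on) multiplied by a scaling factor. The helper names `build_features` and `cluster_with_offset` and the `scale` parameter are illustrative and not part of the original code. Note that the coordinates in the example data only vary by roughly 0.01 to 0.02 degrees, so a step of 1 is large in comparison; a smaller `scale` may be needed to keep duplicates in nearby clusters. The extra feature only pushes duplicates apart; it does not strictly guarantee that they land in different clusters.

```python
import numpy as np
import pandas as pd


def build_features(df, scale=1.0):
    """Return an (n, 3) array: latitude, longitude, duplicate offset.

    The offset is 0 for the first row at a coordinate pair, `scale` for
    the second row at the same pair, 2 * `scale` for the third, etc.
    """
    dup_rank = df.groupby(["latitude", "longitude"]).cumcount()
    return np.column_stack((df["latitude"], df["longitude"], dup_rank * scale))


def cluster_with_offset(df, size=9, scale=1.0, random_state=0):
    """Run k-means-constrained on the three-column feature matrix."""
    # Imported here so build_features can be used without the package.
    from k_means_constrained import KMeansConstrained

    kmc = KMeansConstrained(
        n_clusters=len(df) // size,
        size_min=size,
        size_max=size,
        random_state=random_state,
    )
    kmc.fit(build_features(df, scale))
    return kmc.labels_


# Small illustration: two buildings with two residents each, plus one more home.
example = pd.DataFrame({
    "latitude": [49.5619579, 49.5619579, 49.5628938, 49.5628938, 49.5630028],
    "longitude": [10.9995758, 10.9995758, 11.000319, 11.000319, 10.9990996],
})
features = build_features(example)
print(features[:, 2])  # offsets per row: 0, 1, 0, 1, 0
```

With the question's dataframe, `df["cluster"] = cluster_with_offset(df, size=9, scale=0.01)` would be one way to try a smaller scaling factor; the value 0.01 is only a starting guess.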
So when you run your function and print all columns, you can see that the duplicates are assigned to different clusters.
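Rather than checking the printed output by eye, a small helper can report any cluster that still contains two people with identical coordinates. This is a sketch that assumes the dataframe has the `latitude`, `longitude` and `cluster` columns from the question; the function name `clusters_with_shared_coordinates` is made up for illustration.

```python
import pandas as pd


def clusters_with_shared_coordinates(df):
    """Return the sorted cluster labels that contain duplicate coordinates."""
    # A row is flagged when another row has the same cluster AND the same
    # latitude/longitude, i.e. two co-located people share a cluster.
    flagged = df.duplicated(subset=["cluster", "latitude", "longitude"], keep=False)
    return sorted(df.loc[flagged, "cluster"].unique().tolist())


# Illustration with hand-assigned labels: the first pair of duplicates is
# split across clusters 0 and 1, the second pair is still together in 2.
labelled = pd.DataFrame({
    "latitude": [1.0, 1.0, 2.0, 2.0],
    "longitude": [5.0, 5.0, 6.0, 6.0],
    "cluster": [0, 1, 2, 2],
})
offending = clusters_with_shared_coordinates(labelled)
print(offending)  # [2]
```

An empty list means no cluster contains two people from the same coordinates; a non-empty list suggests trying a different scaling factor.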