I have a dataframe which contains:
cust_id|phone|email
1|678|a
2|NaN|c
3|987|b
4|456|NaN
5|NaN|d
7|456|c
All cust_ids with either a matching phone or a matching email are directly related. E.g. cust_id 2 is directly related to 7 (same email c), and 7 is directly related to 4 (same phone 456).
cust_id 2 is indirectly related to 4: they don't share a phone or an email, but they are related through 7.
I want to group the customers and assign a unique number to each group whose members are directly or indirectly related.
Desired output:
cust_id|phone|email|group_no
1|678|a|1
2|NaN|c|2
3|987|b|3
4|456|NaN|2
5|NaN|d|4
7|456|c|2
Obtained output:
cust_id|phone|email|group_no
1|678|a|1
2|NaN|c|2
3|987|b|3
4|456|NaN|2
5|NaN|d|2
7|456|c|2
How do I do this for a dataset that has 7.5 million rows without compromising on speed?
I used the code shown in the picture.
This looks like a perfect case for a graph database. If you are interested in that, download Neo4j Desktop and we will take it from there. You could google
With your database size, I expect it will take about 1 min.
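
If you would rather stay in Python, the grouping you describe is a connected-components problem: every customer is a node, and a shared phone or email is an edge. Below is a minimal sketch using a union-find (disjoint-set) structure. It is my own illustration, not the Neo4j approach, and the names `assign_groups` and `is_missing` are made up for the example. It assumes each row is a `(cust_id, phone, email)` tuple and that a missing value is either `None` or NaN. I have not timed it on 7.5 million rows.

```python
def is_missing(value):
    """True for None or a float NaN (NaN is the only value not equal to itself)."""
    return value is None or value != value


def assign_groups(rows):
    """rows: iterable of (cust_id, phone, email). Returns {cust_id: group_no}."""
    rows = list(rows)
    parent = list(range(len(rows)))  # union-find forest over row positions

    def find(i):
        # Walk up to the root, shortening the path on the way (path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            # Keep the earlier row as the root of the merged group.
            parent[max(ri, rj)] = min(ri, rj)

    first_seen = {}  # (column name, value) -> first row position holding it
    for pos, (_, phone, email) in enumerate(rows):
        for key in (("phone", phone), ("email", email)):
            if is_missing(key[1]):
                continue  # a missing value never links two customers
            if key in first_seen:
                union(pos, first_seen[key])
            else:
                first_seen[key] = pos

    # Number the groups in order of first appearance, starting at 1.
    group_of_root = {}
    result = {}
    for pos, (cust_id, _, _) in enumerate(rows):
        root = find(pos)
        result[cust_id] = group_of_root.setdefault(root, len(group_of_root) + 1)
    return result


nan = float("nan")
sample = [
    (1, 678, "a"), (2, nan, "c"), (3, 987, "b"),
    (4, 456, nan), (5, nan, "d"), (7, 456, "c"),
]
print(assign_groups(sample))
# {1: 1, 2: 2, 3: 3, 4: 2, 5: 4, 7: 2}
```

On the sample data this reproduces the desired output: cust_id 5 gets its own group because NaN is skipped instead of being treated as a shared value. With pandas, and assuming the columns are in the order cust_id, phone, email, the rows could come from `df.itertuples(index=False, name=None)` and the result could be written back with `df["cust_id"].map(groups)`.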