Group together individuals based on commonalities(direct or indirect) in two other columns for 7.5 million dataset in python

137 Views Asked by At

I have a dataframe which contains

cust_id|phone|email 
1   678   a
2   NaN   c
3   987   b
4   456   NaN
5   NaN   d
7   456   c

All the cust_ids with either matching phone or email are directly related.eg.cust_id 1 is directly related to 2 and 2 is directly related to 3.

Cust_id 1 is indirectly related to 3- they don't have same phone or email but are related through 2.

I want to club and give a unique number to a group which is directly related or is indirectly related

Desired output:

Cust_id|phone|email | group_no
1     678   a   1
2     NaN   c   2
3     987   b   3
4     456   NaN 2
5     NaN   d   4
7     456   c   2

Obtained output:

Cust_id|phone|email | group_no

1     678   a    1
2     NaN   c    2
3     987   b    3
4     456   NaN  2
5     NaN   d    2
7     456   c    2

how do i do this for a dataset that has 7.5 million rows without compromising on speed.

I used the following code in the picture.enter image description here

2

There are 2 best solutions below

5
Sanjay Singh On

This looks like the perfect case for using a graph database. If you are interested in that, download Neo4j desktop and we will take it from there. You could google

neo4j-admin bulk import tool

With your database size, I expect it will take about 1 min.

1
Sanjay Singh On

Here you go Chaitanya

# custom class sub-classed from Graph for low-memory requirements as you don't need edge weights
    class ThinGraph(nx.Graph):
        all_edge_dict = {"weight": 1}
        def single_edge_dict(self):
            return self.all_edge_dict
            edge_attr_dict_factory = single_edge_dict
    # create a blank Graph object from the Networkx library
    G = ThinGraph()
    # add nodes to G
    nodes_array =[]
    customer_id_phone_edges_array = []
    customer_id_email_edges_array = []
    for _,row in df.iterrows():
        nodes_array.append((row["cust_id"],{"label":"Customer ID"}))
        nodes_array.append((row["phone"],{"label":"Phone"}))
        nodes_array.append((row["email"],{"label":"Email"}))
        # Create your customer_ID, phone graph edges
        customer_id_phone_edges_array.append((row["cust_id"],row["phone"]))
        # Create your customer_ID, email graph edges
        customer_id_email_edges_array.append((row["cust_id"],row["email"]))
    #Add the nodes and edges
    G.add_nodes_from(nodes_array)
    G.add_edges_from(customer_id_phone_edges_array)
    G.add_edges_from(customer_id_email_edges_array)
    # delete objects to free up memory
    del nodes_array
    del customer_id_phone_edges_array
    del customer_id_email_edges_array
    # run the connected components algorithm
    components = nx.connected_components(G)
    comp_dict = {idx: comp for idx, comp in enumerate(components)}
    attr = {n: comp_id for comp_id, nodes in comp_dict.items() for n in nodes}
    nx.set_node_attributes(G,  attr,"component")
    # Examine the result
    G.nodes(data=True)

For your given data, you will get the output

NodeDataView({1: {'label': 'Customer ID', 'component': 0}, 678: {'label': 'Phone', 'component': 0}, 'a': {'label': 'Email', 'component': 0}, 2: {'label': 'Customer ID', 'component': 0}, 'b': {'label': 'Email', 'component': 0}, 3: {'label': 'Customer ID', 'component': 0}, 987: {'label': 'Phone', 'component': 0}, 4: {'label': 'Customer ID', 'component': 1}, 456: {'label': 'Phone', 'component': 1}, 'd': {'label': 'Email', 'component': 1}, 5: {'label': 'Customer ID', 'component': 1}, 654: {'label': 'Phone', 'component': 1}, 7: {'label': 'Customer ID', 'component': 1}, 'f': {'label': 'Email', 'component': 1}})