Group together individuals based on commonalities(direct or indirect) in two other columns for 7.5 million dataset in python

Question

Group together individuals based on commonalities(direct or indirect) in two other columns for 7.5 million dataset in python

137 Views Asked by chaitanya At 08 February 2022 at 06:59

I have a dataframe which contains

cust_id|phone|email 
1   678   a
2   NaN   c
3   987   b
4   456   NaN
5   NaN   d
7   456   c

All the cust_ids with either matching phone or email are directly related.eg.cust_id 1 is directly related to 2 and 2 is directly related to 3.

Cust_id 1 is indirectly related to 3- they don't have same phone or email but are related through 2.

I want to club and give a unique number to a group which is directly related or is indirectly related

Desired output:

Cust_id|phone|email | group_no
1     678   a   1
2     NaN   c   2
3     987   b   3
4     456   NaN 2
5     NaN   d   4
7     456   c   2

Obtained output:

Cust_id|phone|email | group_no

1     678   a    1
2     NaN   c    2
3     987   b    3
4     456   NaN  2
5     NaN   d    2
7     456   c    2

how do i do this for a dataset that has 7.5 million rows without compromising on speed.

I used the following code in the picture. enter image description here

Original Q&A

There are 2 best solutions below

**Sanjay Singh** · Answer 1 · 2022-02-08T16:20:11.110000

This looks like the perfect case for using a graph database. If you are interested in that, download Neo4j desktop and we will take it from there. You could google

neo4j-admin bulk import tool

With your database size, I expect it will take about 1 min.

**Sanjay Singh** · Answer 2 · 2022-02-11T07:29:54.833000

Here you go Chaitanya

# custom class sub-classed from Graph for low-memory requirements as you don't need edge weights
    class ThinGraph(nx.Graph):
        all_edge_dict = {"weight": 1}
        def single_edge_dict(self):
            return self.all_edge_dict
            edge_attr_dict_factory = single_edge_dict
    # create a blank Graph object from the Networkx library
    G = ThinGraph()
    # add nodes to G
    nodes_array =[]
    customer_id_phone_edges_array = []
    customer_id_email_edges_array = []
    for _,row in df.iterrows():
        nodes_array.append((row["cust_id"],{"label":"Customer ID"}))
        nodes_array.append((row["phone"],{"label":"Phone"}))
        nodes_array.append((row["email"],{"label":"Email"}))
        # Create your customer_ID, phone graph edges
        customer_id_phone_edges_array.append((row["cust_id"],row["phone"]))
        # Create your customer_ID, email graph edges
        customer_id_email_edges_array.append((row["cust_id"],row["email"]))
    #Add the nodes and edges
    G.add_nodes_from(nodes_array)
    G.add_edges_from(customer_id_phone_edges_array)
    G.add_edges_from(customer_id_email_edges_array)
    # delete objects to free up memory
    del nodes_array
    del customer_id_phone_edges_array
    del customer_id_email_edges_array
    # run the connected components algorithm
    components = nx.connected_components(G)
    comp_dict = {idx: comp for idx, comp in enumerate(components)}
    attr = {n: comp_id for comp_id, nodes in comp_dict.items() for n in nodes}
    nx.set_node_attributes(G,  attr,"component")
    # Examine the result
    G.nodes(data=True)

For your given data, you will get the output

NodeDataView({1: {'label': 'Customer ID', 'component': 0}, 678: {'label': 'Phone', 'component': 0}, 'a': {'label': 'Email', 'component': 0}, 2: {'label': 'Customer ID', 'component': 0}, 'b': {'label': 'Email', 'component': 0}, 3: {'label': 'Customer ID', 'component': 0}, 987: {'label': 'Phone', 'component': 0}, 4: {'label': 'Customer ID', 'component': 1}, 456: {'label': 'Phone', 'component': 1}, 'd': {'label': 'Email', 'component': 1}, 5: {'label': 'Customer ID', 'component': 1}, 654: {'label': 'Phone', 'component': 1}, 7: {'label': 'Customer ID', 'component': 1}, 'f': {'label': 'Email', 'component': 1}})

Group together individuals based on commonalities(direct or indirect) in two other columns for 7.5 million dataset in python

There are 2 best solutions below

Related Questions in PYTHON

Related Questions in MATCH

Related Questions in GROUPING

Related Questions in RELATIONSHIP

Related Questions in SCALABLE

Trending Questions

Popular # Hahtags

Popular Questions