Pandas: Assign same cluster id to records based on common groups in different columns

104 Views Asked by Alex_Y At 27 September 2023 at 08:01

I have a dataframe like this:

index	A	B	C
1	111	222	111
2	111	222	222
3	111	111	555
4	222	222	444
5	222	333	111
6	222	444	333
7	333	555	777
8	444	666	777

df = pd.DataFrame({
  'A': [111,111,111,222,222,222,333,444], 
  'B': [222,222,111,222,333,444,555,666],
  'C': [111,222,555,444,111,333,777,777]
})

I want to create new column 'cluster' and assign same id to records which are connected directly or through common group in one of the columns.
Meaning, here for example, we see that first 3 elements connected by same group in 'A', but they also connected to other records which have same groups '222', '111' in column 'B'. And all records which have '111', '222', '555' in column 'C'.
So basically, all first 6 elements should have same cluster Id.

index	A	B	C	cluster
1	111	222	111	1
2	111	222	222	1
3	111	111	555	1
4	222	222	444	1
5	222	333	111	1
6	222	444	333	1
7	333	555	777	2
8	444	666	777	2

Records 4-6 are connected to 1-3 as they form a group in column A and they are connected to previous records through columns B and C.

I was playing with multiple consequent apply functions on pairs of columns, but now thinking of applying connected components here, but can't figure out how to do that.

Also, the main problem is that this dataset is huge, > 30 000 000 records.

Appreciate any help.

Original Q&A

There are 1 best solutions below

mozway On 27 September 2023 at 08:18 BEST ANSWER

You can indeed achieve this using networkx.connected_components, after reshaping your DataFrame to build the consecutive pairs of edges. This won't be so fast with 33M rows though:

import networkx as nx

tmp = (df
   .melt(ignore_index=False, value_name='source')
   .assign(target=lambda d: d.groupby(level=0)['source'].shift())
   .dropna(subset=['source', 'target'])
   .drop_duplicates(subset=['source', 'target'])
)

G = nx.from_pandas_edgelist(tmp)

groups = {n: next(iter(g)) for g in nx.connected_components(G) for n in g}

df['cluster'] = df['A'].map(groups)

Output:

     A   B    C cluster
0  111  02  001     002
1  111  02  002     002
2  111  01  005     002
3  222  02  004     002
4  222  03  001     002
5  222  04  003     002

Graph:

considering identical values in different columns as different nodes:

There are several options, an easy one is to use strings and to prefix the column name in the node

tmp1 = df.astype(str).radd([f'{c}|' for c in df])

tmp = (tmp1
   .melt(ignore_index=False, value_name='source')
   .assign(target=lambda d: d.groupby(level=0)['source'].shift())
   .dropna(subset=['source', 'target'])
   .drop_duplicates(subset=['source', 'target'])
)

G = nx.from_pandas_edgelist(tmp)

groups = {n: i for i, g in enumerate(nx.connected_components(G), start=1)
          for n in g}

df['cluster'] = tmp1['A'].map(groups)

Output:

     A    B    C  cluster
0  111  222  111        1
1  111  222  222        1
2  111  111  555        1
3  222  222  444        1
4  222  333  111        1
5  222  444  333        1
6  333  555  777        2
7  444  666  777        2

Graph:

Pandas: Assign same cluster id to records based on common groups in different columns

There are 1 best solutions below

considering identical values in different columns as different nodes:

Related Questions in PYTHON

Related Questions in PANDAS

Related Questions in GRAPH

Related Questions in GROUP-BY

Related Questions in CONNECTED-COMPONENTS

Trending Questions

Popular # Hahtags

Popular Questions

index	A	B	C
1	111	222	111
2	111	222	222
3	111	111	555
4	222	222	444
5	222	333	111
6	222	444	333
7	333	555	777
8	444	666	777

index	A	B	C	cluster
1	111	222	111	1
2	111	222	222	1
3	111	111	555	1
4	222	222	444	1
5	222	333	111	1
6	222	444	333	1
7	333	555	777	2
8	444	666	777	2

index	A	B	C
1	111	222	111
2	111	222	222
3	111	111	555
4	222	222	444
5	222	333	111
6	222	444	333
7	333	555	777
8	444	666	777

index	A	B	C	cluster
1	111	222	111	1
2	111	222	222	1
3	111	111	555	1
4	222	222	444	1
5	222	333	111	1
6	222	444	333	1
7	333	555	777	2
8	444	666	777	2

index	A	B	C
1	111	222	111
2	111	222	222
3	111	111	555
4	222	222	444
5	222	333	111
6	222	444	333
7	333	555	777
8	444	666	777

index	A	B	C	cluster
1	111	222	111	1
2	111	222	222	1
3	111	111	555	1
4	222	222	444	1
5	222	333	111	1
6	222	444	333	1
7	333	555	777	2
8	444	666	777	2