I have a dataframe like this:
| index | A | B | C |
|---|---|---|---|
| 1 | 111 | 222 | 111 |
| 2 | 111 | 222 | 222 |
| 3 | 111 | 111 | 555 |
| 4 | 222 | 222 | 444 |
| 5 | 222 | 333 | 111 |
| 6 | 222 | 444 | 333 |
| 7 | 333 | 555 | 777 |
| 8 | 444 | 666 | 777 |
df = pd.DataFrame({
'A': [111,111,111,222,222,222,333,444],
'B': [222,222,111,222,333,444,555,666],
'C': [111,222,555,444,111,333,777,777]
})
I want to create new column 'cluster' and assign same id to records which are connected directly or through common group in one of the columns.
Meaning, here for example, we see that first 3 elements connected by same group in 'A', but they also connected to other records which have same groups '222', '111' in column 'B'. And all records which have '111', '222', '555' in column 'C'.
So basically, all first 6 elements should have same cluster Id.
| index | A | B | C | cluster |
|---|---|---|---|---|
| 1 | 111 | 222 | 111 | 1 |
| 2 | 111 | 222 | 222 | 1 |
| 3 | 111 | 111 | 555 | 1 |
| 4 | 222 | 222 | 444 | 1 |
| 5 | 222 | 333 | 111 | 1 |
| 6 | 222 | 444 | 333 | 1 |
| 7 | 333 | 555 | 777 | 2 |
| 8 | 444 | 666 | 777 | 2 |
Records 4-6 are connected to 1-3 as they form a group in column A and they are connected to previous records through columns B and C.
I was playing with multiple consequent apply functions on pairs of columns, but now thinking of applying connected components here, but can't figure out how to do that.
Also, the main problem is that this dataset is huge, > 30 000 000 records.
Appreciate any help.
You can indeed achieve this using
networkx.connected_components, after reshaping your DataFrame to build the consecutive pairs of edges. This won't be so fast with 33M rows though:Output:
Graph:
considering identical values in different columns as different nodes:
There are several options, an easy one is to use strings and to prefix the column name in the node
Output:
Graph: