I'm trying to assign a flow ID to each network 5-tuple. The original DataFrame looks like this:
import pandas as pd

tup = [['192.168.0.1', '1032', '192.168.0.2', '443'],
['192.168.0.1', '1032', '192.168.0.2', '443'],
['192.168.0.1', '1034', '192.168.0.2', '443'],
['192.168.0.2', '443', '192.168.0.1', '1034'],
['192.168.0.1', '1034', '192.168.0.2', '443'],
['192.168.0.1', '1034', '192.168.0.2', '443'],
['192.168.0.2', '443', '192.168.0.1', '1034'],
['192.168.0.2', '443', '192.168.0.1', '1034'],
['192.168.0.1', '1032', '192.168.0.2', '443'],
['192.168.0.2', '443', '192.168.0.1', '1032']]
df = pd.DataFrame(tup, columns=['src', 'src_port', 'dst', 'dst_port'])
Packets that belong to the same flow, in either direction (inbound or outbound), should get the same flow ID, like this:
           src src_port          dst dst_port  flow_id
0  192.168.0.1     1032  192.168.0.2      443        1
1  192.168.0.1     1032  192.168.0.2      443        1
2  192.168.0.1     1034  192.168.0.2      443        2
3  192.168.0.2      443  192.168.0.1     1034        2
4  192.168.0.1     1034  192.168.0.2      443        2
5  192.168.0.1     1034  192.168.0.2      443        2
6  192.168.0.2      443  192.168.0.1     1034        2
7  192.168.0.2      443  192.168.0.1     1034        2
8  192.168.0.1     1032  192.168.0.2      443        1
9  192.168.0.2      443  192.168.0.1     1032        1
I converted the DataFrame to its values and sorted them together, but I'm stuck on assigning the correct flow index.
Is there a faster or more elegant way?
One idea is to sort each row's two (address, port) pairs into a nested tuple, so that both directions of a flow produce the same key, and then call
factorize:
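
As a minimal sketch of that idea, assuming the four columns from the question are all strings: the names `cols` and `keys` are only illustrative, and the keys are wrapped in a Series before being passed to `pd.factorize`, since it may be safer to hand it a Series than a plain list of tuples.

```python
import pandas as pd

tup = [['192.168.0.1', '1032', '192.168.0.2', '443'],
       ['192.168.0.1', '1032', '192.168.0.2', '443'],
       ['192.168.0.1', '1034', '192.168.0.2', '443'],
       ['192.168.0.2', '443', '192.168.0.1', '1034'],
       ['192.168.0.1', '1034', '192.168.0.2', '443'],
       ['192.168.0.1', '1034', '192.168.0.2', '443'],
       ['192.168.0.2', '443', '192.168.0.1', '1034'],
       ['192.168.0.2', '443', '192.168.0.1', '1034'],
       ['192.168.0.1', '1032', '192.168.0.2', '443'],
       ['192.168.0.2', '443', '192.168.0.1', '1032']]
df = pd.DataFrame(tup, columns=['src', 'src_port', 'dst', 'dst_port'])

cols = ['src', 'src_port', 'dst', 'dst_port']  # illustrative name

# Build a direction-independent key per row: sort the two (address, port)
# endpoints so that A -> B and B -> A yield the same nested tuple.
keys = [
    tuple(sorted([(src, sport), (dst, dport)]))
    for src, sport, dst, dport in df[cols].itertuples(index=False, name=None)
]

# factorize numbers the distinct keys from 0 in order of first appearance;
# adding 1 makes the flow IDs start at 1 as in the question.
codes, uniques = pd.factorize(pd.Series(keys, index=df.index))
df['flow_id'] = codes + 1

print(df)
```

Because the IDs follow the order of first appearance, the first flow seen in the data gets ID 1, the next new flow gets ID 2, and so on.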