Rowise compare two pandas dataframes

77 Views Asked by At

I have two pandas dataframes

flows:
------
sourceIPAddress     destinationIPAddress    flowStartMicroseconds       flowEndMicroseconds 
163.193.204.92      40.8.121.226            2021-05-01 07:00:00.113     2021-05-01 07:00:00.113962
104.247.103.181     163.193.124.92          2021-05-01 07:00:00.074     2021-05-01 07:00:00.101026
17.254.170.53       163.193.124.133         2021-05-01 07:00:00.077     2021-05-01 07:00:00.083874
18.179.96.152       203.179.250.96          2021-05-01 07:00:00.112     2021-05-01 07:00:00.098296
133.103.144.34      13.154.212.11           2021-05-01 07:00:00.101     2021-05-01 07:00:00.112013

attacks:
--------
datetime                    srcIP           dstIP
2021-05-01 07:00:00.055210  188.67.130.72   133.92.239.153   
2021-05-01 07:00:00.055500  45.100.34.74    203.179.180.153   
2021-05-01 07:00:00.055351  103.113.29.26   163.193.242.75   
2021-05-01 07:00:00.056209  128.215.229.101 163.193.94.194   
2021-05-01 07:00:00.055258  45.111.22.11    163.193.138.139   

I want to check for each row of flows if it matches any row of attacks where

attacks[srcIP] == flows[srcIP] || attacks[srcIP] == flows[destIP]
&&
attacks[destIP] == flows[srcIP] || attacks[destIP] == flows[destIP]
&&
attacks[datetime] between flows[flowStartMicroseconds] and flows[flowEndMicroseconds]

Is there any more efficient way to do this than just iterating over it?

EDIT: The dataframes are quite large. I included the head() of each.

flows = {'sourceIPAddress': {510: '163.193.204.92',
  564: '104.247.103.181',
  590: '17.254.170.53',
  599: '18.179.96.152',
  1149: '133.103.144.34'},
 'destinationIPAddress': {510: '40.8.121.226',
  564: '163.193.124.92',
  590: '163.193.124.133',
  599: '203.179.250.96',
  1149: '13.154.212.11'},
 'flowStartMicroseconds': {510: Timestamp('2021-05-01 07:00:00.113000'),
  564: Timestamp('2021-05-01 07:00:00.074000'),
  590: Timestamp('2021-05-01 07:00:00.077000'),
  599: Timestamp('2021-05-01 07:00:00.112000'),
  1149: Timestamp('2021-05-01 07:00:00.101000')},
 'flowEndMicroseconds': {510: Timestamp('2021-05-01 07:00:00.113962'),
  564: Timestamp('2021-05-01 07:00:00.083874'),
  590: Timestamp('2021-05-01 07:00:00.098296'),
  599: Timestamp('2021-05-01 07:00:00.112013'),
  1149: Timestamp('2021-05-01 07:00:00.101026')}}

attacks = {'datetime': {0: Timestamp('2021-05-01 07:00:00.055210'),
  1: Timestamp('2021-05-01 07:00:00.055500'),
  2: Timestamp('2021-05-01 07:00:00.055351'),
  3: Timestamp('2021-05-01 07:00:00.056209'),
  4: Timestamp('2021-05-01 07:00:00.055258')},
 'srcIP': {0: '188.67.130.72',
  1: '45.100.34.74',
  2: '103.113.29.26',
  3: '128.215.229.101',
  4: '45.111.22.11'},
 'dstIP': {0: '133.92.239.153',
  1: '203.179.180.153',
  2: '163.193.242.75',
  3: '163.193.94.194',
  4: '163.193.138.139'}}
3

There are 3 best solutions below

0
Martin Pichler On BEST ANSWER

Solution: Database

My solution was to import the two dataframes into PostgreSQL and create two new tables for forward and backward IP matches, then UNION ALL them together.

The two single joins are significantly faster than if you would do one giant join.

create table attacks_forward as 
SELECT
flows.*, attacks."label", attacks."sublabel"
FROM
    flows
JOIN attacks 
    ON flows."sourceIPAddress" = attacks."srcIP" 
    and flows."destinationIPAddress" = attacks."dstIP"
    and attacks."datetime" between flows."flowStartMicroseconds" and flows."flowEndMicroseconds";
    
   
create table attacks_backward as 
SELECT
flows.*, attacks."label", attacks."sublabel"
FROM
    flows
JOIN attacks 
    ON flows."sourceIPAddress" = attacks."dstIP" 
    and flows."destinationIPAddress" = attacks."srcIP"
    and attacks."datetime" between flows."flowStartMicroseconds" and flows."flowEndMicroseconds";

create table attacks_flows as 
SELECT * FROM attacks_forward
UNION ALL
SELECT * FROM attacks_backward;
0
Golden Lion On

use a left join merge between the two data frames then look for intersections of data.

2
Y K On

I am not sure about performance but I would proceed as follows.

  1. for this purpose there are only two IP types attack IP and flow IP. so I would reindex the two DFs to have the following format

    flow_df : (flow_IPAddress, flowStartMicroseconds, flowEndMicroseconds)

    attack_df: (attack_IP, datetime)

  2. then I would merge them using inner join (left_on = "flow_IPAddress", right_on = "attack_IP")

  3. then I would query the result to filter only valid timestamps (e.g. using the statement you wrote above.)

then the resulting df would look something like below:


flowIPAddress            attack_IP            flowStartMicroseconds            flowEndMicroseconds            datetime  
163.193.204.92      40.8.121.226            2021-05-01 07:00:00.113     2021-05-01 07:00:00.113962 2021-05-01 07:00:00.055210
104.247.103.181     163.193.124.92          2021-05-01 07:00:00.074     2021-05-01 07:00:00.101026 2021-05-01 07:00:00.055210

Note: If you want to maintain src and dst IPs you can proceed with the above method but considering each pair individually.