Python: Program running Slooooow

93 Views Asked by At

I've got an issue with my Pandas data analysis running very slowly and wondered if someone could help please.

I have 2 data frames I'm trying to compare if the IP address in one frame is part of the network in another frame:

  • VFFrame contains 10 columns, 4798 rows: 'CIDR' column contains the network
  • AssetFrame contains 3368 columns, 115816 rows: 'IPv4s' contains the IP addresses to match in 'CIDR' (as 'IPv4s' can have 1 to 3 possible IP addresses, I've had to split it out and clean them up)

I'm adding the additional 7 columns to the AssetFrame ready to take the new values once matched.

for index, row in AssetFrame.iterrows():
    IPstore = AssetFrame.loc[index, 'ipv4s'].split(',')
    
    for index, row in VFFrame.iterrows():
        net = ipaddress.IPv4Network(VFFrame.loc[index, 'CIDR'])
    
        for i in range(len(IPstore)):
            IPstore[i] = IPstore[i].strip()
            IP = ipaddress.IPv4Address(IPstore[i])
        
            if IP in net:
                row = [IP, net]       # Used to check list of matches to export as CSV,
                TheList.append(row)   # to check my assumption below is correct.

                # All IPs will be in the same network 'CIDR' or there will be no match
                # The columns have already been added to the AssetFrame ready to take the new values
                AssetFrame.loc[index, 'comment'] = VFFrame.loc[index, 'comment']
                AssetFrame.loc[index, 'country'] = VFFrame.loc[index, 'country']
                AssetFrame.loc[index, 'city'] = VFFrame.loc[index, 'city']
                AssetFrame.loc[index, 'site-name'] = VFFrame.loc[index, 'site-name']
                AssetFrame.loc[index, 'site-id'] = VFFrame.loc[index, 'site-id']
                AssetFrame.loc[index, 'vf-device'] = VFFrame.loc[index, 'vf-device']
                AssetFrame.loc[index, 'vlan'] = VFFrame.loc[index, 'vlan']



AssetFrame:
    id            ipv4s             fqdn
0   b564a4        192.168.20.4      [email protected]
1   e454a4        192.168.20.74     [email protected]
2   a454a4        192.168.20.84     [email protected]

VFFrame:
    subnet          mask                CIDR                Comment         vlan
0   192.168.20.0    255.255.255.224     192.168.20.0/26     Blah Blah       101
1   192.168.20.64   255.255.255.240     192.168.20.64/28    Ditto Blah      201

Result Should be:
AssetFrame:
    id        ipv4s             fqdn                    Comment         vlan
0   b564a4    192.168.20.4      [email protected]    Blah Blah       101
1   e454a4    192.168.20.74     [email protected]    Ditto Blah      201
2   a454a4    192.168.20.84     [email protected]   No Match        No Match

Example of the data frames and wanted output:

1

There are 1 best solutions below

3
Glauco On

"divide and conquer" is your friends. First of all we have O(n³) complexity and this is no a big deal. The first step in these cases is to understand where time is lost, i can guess the .loc is the bottleneck. But for a clear analisys i suggest to use a performing tools. I wrote perf_tool that can guide you to the solution:

Some hits

  • Internal loop do scalar assignment when you can do it in vectorial mode. The AssetFrame.loc is very expensive and called too many times.

  • Middle loop do again elaboration for each record, this can be done in vectorial mode using masks

  • Top Loop again iter on each row.

  • Probably the Top and Middle loop can be removed doing a single merge operation between the DataFrames.