I've got an issue with my Pandas data analysis running very slowly and wondered if someone could help please.
I have 2 data frames I'm trying to compare if the IP address in one frame is part of the network in another frame:
- VFFrame contains 10 columns, 4798 rows: 'CIDR' column contains the network
- AssetFrame contains 3368 columns, 115816 rows: 'IPv4s' contains the IP addresses to match in 'CIDR' (as 'IPv4s' can have 1 to 3 possible IP addresses, I've had to split it out and clean them up)
I'm adding the additional 7 columns to the AssetFrame ready to take the new values once matched.
for index, row in AssetFrame.iterrows():
IPstore = AssetFrame.loc[index, 'ipv4s'].split(',')
for index, row in VFFrame.iterrows():
net = ipaddress.IPv4Network(VFFrame.loc[index, 'CIDR'])
for i in range(len(IPstore)):
IPstore[i] = IPstore[i].strip()
IP = ipaddress.IPv4Address(IPstore[i])
if IP in net:
row = [IP, net] # Used to check list of matches to export as CSV,
TheList.append(row) # to check my assumption below is correct.
# All IPs will be in the same network 'CIDR' or there will be no match
# The columns have already been added to the AssetFrame ready to take the new values
AssetFrame.loc[index, 'comment'] = VFFrame.loc[index, 'comment']
AssetFrame.loc[index, 'country'] = VFFrame.loc[index, 'country']
AssetFrame.loc[index, 'city'] = VFFrame.loc[index, 'city']
AssetFrame.loc[index, 'site-name'] = VFFrame.loc[index, 'site-name']
AssetFrame.loc[index, 'site-id'] = VFFrame.loc[index, 'site-id']
AssetFrame.loc[index, 'vf-device'] = VFFrame.loc[index, 'vf-device']
AssetFrame.loc[index, 'vlan'] = VFFrame.loc[index, 'vlan']
AssetFrame:
id ipv4s fqdn
0 b564a4 192.168.20.4 [email protected]
1 e454a4 192.168.20.74 [email protected]
2 a454a4 192.168.20.84 [email protected]
VFFrame:
subnet mask CIDR Comment vlan
0 192.168.20.0 255.255.255.224 192.168.20.0/26 Blah Blah 101
1 192.168.20.64 255.255.255.240 192.168.20.64/28 Ditto Blah 201
Result Should be:
AssetFrame:
id ipv4s fqdn Comment vlan
0 b564a4 192.168.20.4 [email protected] Blah Blah 101
1 e454a4 192.168.20.74 [email protected] Ditto Blah 201
2 a454a4 192.168.20.84 [email protected] No Match No Match
"divide and conquer" is your friends. First of all we have O(n³) complexity and this is no a big deal. The first step in these cases is to understand where time is lost, i can guess the .loc is the bottleneck. But for a clear analisys i suggest to use a performing tools. I wrote perf_tool that can guide you to the solution:
Some hits
Internal loop do scalar assignment when you can do it in vectorial mode. The AssetFrame.loc is very expensive and called too many times.
Middle loop do again elaboration for each record, this can be done in vectorial mode using masks
Top Loop again iter on each row.
Probably the Top and Middle loop can be removed doing a single merge operation between the DataFrames.