How to make the script below more efficient? This is a follow-up to my previous post, Python nested loop issue.
It currently takes the best part of two hours to process input tables consisting of about 15000 and 1500 rows. Manually processing my data in Excel takes me an order of magnitude less time - not ideal!
I understand that iterrows is a bad approach to the problem, and that vectorisation is the way forward, but I am a bit dumbfounded as to how it would work for the second for loop.
The following script extract takes two dataframes, qinsy_file_2 and segy_vlookup (ignore the naming on that one).
For every row in qinsy_file_2, it iterates through segy_vlookup to calculate distances between coordinates in each file. If this distance is less than a pre-given value (here named buffer), the row gets transcribed to a new dataframe out_df (otherwise it passes over the row).
# Loop through Qinsy file
for index_qinsy, row_qinsy in qinsy_file_2.iterrows():
    # Loop through SEGY navigation
    for index_segy, row_segy in segy_vlookup.iterrows():
        # Calculate the Euclidean distance between the two points
        if (((segy_vlookup["CDP_X"][index_segy] - qinsy_file_2["CMP Easting"][index_qinsy])**2
             + (segy_vlookup["CDP_Y"][index_segy] - qinsy_file_2["CMP Northing"][index_qinsy])**2)**0.5) <= buffer:
            # Append rows within the buffer distance to the new dataframe
            out_df = pd.concat([out_df, row_qinsy])
            break
        else:
            pass
So far I have read through the following:
- How to iterate over rows in a Pandas DataFrame? (and others of a similar name)
- Looking for faster way to iterate over pandas dataframe
- What is the most efficient way to loop through dataframes with pandas?
- https://www.learndatasci.com/solutions/how-iterate-over-rows-pandas/
- https://towardsdatascience.com/you-dont-always-have-to-loop-through-rows-in-pandas-22a970b347ac
Extending from the loop-vectorisation material you have already looked into, try searching for pairwise distance calculation, e.g.: How to efficiently calculate euclidean distance matrix for several timeseries
The method below is fully vectorized (no for loops at all). It relies on NumPy for the intensive 2D part, then returns a pandas dataframe as requested. Calculating the distances and filtering rows by <= buffer, with your input data sizes of 1500 * 15000 rows, is accomplished in a fraction of a second.

Mock-up data
Let
- dfA represent qinsy_file_2, with nA = 15000 points of 2D coordinates ('xA', 'yA')
- dfB represent segy_vlookup, with nB = 1500 points of 2D coordinates ('xB', 'yB')

With the given seeds, if you want to reproduce the process:
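For instance (the exact seed and coordinate ranges below are assumptions, so the row counts quoted later may differ):

import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # assumed seed
nA, nB = 15000, 1500

# Random 2D coordinates; the 0-100 range is an assumption
dfA = pd.DataFrame({'xA': rng.uniform(0, 100, nA),
                    'yA': rng.uniform(0, 100, nA)})
dfB = pd.DataFrame({'xB': rng.uniform(0, 100, nB),
                    'yB': rng.uniform(0, 100, nB)})

buffer = 1.0  # assumed threshold value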
One-liner summary
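Using the mock-up names above, the whole selection can be condensed into a single expression (a sketch; the broadcasting is unpacked step by step in the detailed answer below):

df_out = dfA[(np.sqrt((dfA['xA'].to_numpy() - dfB['xB'].to_numpy()[:, None])**2
                      + (dfA['yA'].to_numpy() - dfB['yB'].to_numpy()[:, None])**2)
              <= buffer).any(axis=0)]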
Detailed answer
Each point from dfA will be tested for distance <= buffer against each point from dfB, to determine whether the corresponding row of dfA should be selected into df_out.

Distance matrix
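A sketch of this step, continuing from the mock-up data, with dfB's points broadcast down the rows and dfA's points across the columns:

# D has shape (nB, nA): D[i, j] is the Euclidean distance
# from point i of dfB to point j of dfA
xA = dfA['xA'].to_numpy()            # shape (nA,)
yA = dfA['yA'].to_numpy()
xB = dfB['xB'].to_numpy()[:, None]   # shape (nB, 1)
yB = dfB['yB'].to_numpy()[:, None]
D = np.sqrt((xA - xB)**2 + (yA - yB)**2)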
Every column of D stands for a row in dfA.

Now filter by threshold (distance <= buffer)
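As a sketch, the comparison against buffer yields a boolean matrix:

# Same shape as D: True where a dfB point lies within
# the buffer distance of the corresponding dfA point
M = D <= buffer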
At least one value per column of D within the buffer is enough to select that column for df_out, i.e. that row in dfA.

Finally, operate the row-wise selection on dfA:
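A sketch, continuing from the boolean matrix above (the intermediate names M and mask are illustrative):

# One boolean per column of D, i.e. per row of dfA:
# True if at least one dfB point falls within the buffer
mask = M.any(axis=0)
df_out = dfA[mask]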
In this mock-up example, 5620 rows out of 15000 made it into the selection.
Naturally, any additional columns that your actual dfA might have will also be transcribed into df_out. To go further (though this already represents a dramatic improvement over the nested loops): could you reduce the number of distances actually calculated?