Fuzzy matching millions of records


I am trying to do fuzzy matching between two columns: each value in col="co_zip22" is compared against all the rows in col="co_zip23" to find the best match, along with a match score. co_zip is a unique key which I created by combining the company name and zip columns, and I am trying to find out whether a company from 2022 is present in our 2023 records or not.

I have made a file consisting of two columns, co_zip22 and co_zip23, to do the fuzzy match. We don't have any unique identifiers, so I created a string from the company name and zip. Below is my code. It works fine for a small number of records, but it keeps running on such a big data set and has been running for 2 days now.

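import pandas as pd
from fuzzywuzzy import process  # assumed source of process.extract; thefuzz exposes the same call

similarity = []
for i in df.co_zip22:  # full column
    # best single match in co_zip23; the score is the second element of the result
    ratio = process.extract(i, df.co_zip23, limit=1)
    similarity.append(ratio[0][1])

df['similarity'] = pd.Series(similarity)
df.head(3)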
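
The loop above compares every 2022 key with every 2023 key, so the work grows with the product of the two column lengths. Because each key already contains the zip, one possible way to cut this down is to compare names only within the same zip ("blocking"). The following is a minimal sketch of that idea, not a drop-in replacement: it assumes the key is the company name followed by a space and the zip as the last token, the helper names `split_key` and `best_scores` are hypothetical, and it uses the standard library's `difflib.SequenceMatcher` rather than the scorer used in the loop above.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def split_key(key):
    """Split a key into (name, zip). Assumes the zip is the last space-separated token."""
    name, _, zip_code = key.strip().rpartition(" ")
    return name.lower(), zip_code


def best_scores(keys_22, keys_23):
    """For each 2022 key, return the best 0-100 score among 2023 keys with the same zip."""
    # Group the 2023 names by zip so each 2022 key is compared with a small bucket only.
    by_zip = defaultdict(list)
    for key in keys_23:
        name, zip_code = split_key(key)
        by_zip[zip_code].append(name)

    scores = []
    for key in keys_22:
        name, zip_code = split_key(key)
        candidates = by_zip.get(zip_code, [])
        # default=0.0 covers zips that do not occur in the 2023 data at all.
        best = max(
            (SequenceMatcher(None, name, other).ratio() for other in candidates),
            default=0.0,
        )
        scores.append(round(best * 100))
    return scores


# Made-up example keys, for illustration only.
keys_22 = ["Acme Corp 10001", "Globex 94105", "Initech 73301"]
keys_23 = ["ACME Corp 10001", "Globex Inc 94105", "Umbrella 60601"]
scores = best_scores(keys_22, keys_23)
```

The trade-off is that a company whose zip changed between the two years gets a score of 0 under this scheme, and the `difflib` scores are not identical to those from `process.extract`, so any threshold would need to be rechecked.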