Is there a way to use the fuzzywuzzy Python package to exactly match one part of a string and partially match the other?

I'm trying to determine if one specific business exists in two different datasets. There are a few things that are making this more complicated than simply doing a join in SQL to only return identical rows in the two datasets:

  1. There is no shared ID
  2. The names of the same business are often formatted slightly differently in the two datasets (example: ds1 says JT's Kitchen and ds2 says J.T.'s Kitchen)
  3. There could be two different businesses with identical names

First Stack Overflow post, so forgive me if I miss some etiquette.

I'm attempting to use fuzzywuzzy to return a score for how similar the two business names are. However, I was thinking that to get a more accurate score, I could create a string by concatenating the business name and zip code, then try to get an exact match on the zip code part and a fuzzy match on the rest. I am just not sure how to specify which type of match I want on which part. Alternatively, I could separate zip code and business name into two columns. Here is a version of what I have so far, where ds1.ds1_id is the name+zip from one database and ds2.ds2_id is the name+zip from the second database.

I did want to take a look at all the scores to decide which one I ultimately want to use.

Is there a way to make this exact match on the last 5 digits of strings ds1_id and ds2_id and then fuzzy match on the rest, or am I wasting my time?
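To make the "last 5 digits" idea concrete, here is a minimal sketch of splitting a concatenated name+zip string back into its two parts, so each part can be compared differently. The helper name `split_name_zip` is hypothetical, and it assumes the zip is always exactly five digits at the very end of the string (no ZIP+4).

```python
import re

# Assumption: the combined string ends in exactly five digits (the zip code).
_TRAILING_ZIP = re.compile(r"^(.*?)\s*(\d{5})$")


def split_name_zip(combined):
    """Split 'name + zip' into (name, zip); zip is None if no trailing 5 digits."""
    text = combined.strip()
    m = _TRAILING_ZIP.match(text)
    if m is None:
        return text, None
    return m.group(1), m.group(2)


print(split_name_zip("JT's Kitchen 30301"))  # ("JT's Kitchen", '30301')
print(split_name_zip("JT's Kitchen30301"))   # ("JT's Kitchen", '30301')
print(split_name_zip("No Zip Here"))         # ('No Zip Here', None)
```

Once the parts are separate, the zip can be compared with plain `==` and only the name needs a fuzzy score.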

```
import pandas as pd
from fuzzywuzzy import fuzz, process

accountid = []
ds1_id = []
actual_name = []
similarity = []
similarity_partial = []
similarity_set = []
similarity_sort = []

# For each name+zip string in ds1, take the single best match in ds2 under each scorer
for i in ds1.biz_ID:
    ratio = process.extract(i, ds2.ds2_ID, limit=1)  # default scorer (fuzz.WRatio)
    partial_ratio = process.extract(i, ds2.ds2_ID, limit=1, scorer=fuzz.partial_ratio)
    token_set_ratio = process.extract(i, ds2.ds2_ID, limit=1, scorer=fuzz.token_set_ratio)
    token_sort_ratio = process.extract(i, ds2.ds2_ID, limit=1, scorer=fuzz.token_sort_ratio)
    actual_name.append(ratio[0][0])  # matched string from the default scorer
    similarity.append(ratio[0][1])
    similarity_partial.append(partial_ratio[0][1])
    similarity_set.append(token_set_ratio[0][1])
    similarity_sort.append(token_sort_ratio[0][1])

# One column per scorer so the scores can be compared side by side
ds1['actual_name'] = pd.Series(actual_name)
ds1['similarity'] = pd.Series(similarity)
ds1['similarity_partial'] = pd.Series(similarity_partial)
ds1['similarity_set'] = pd.Series(similarity_set)
ds1['similarity_sort'] = pd.Series(similarity_sort)
```
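One possible way to get an exact match on zip and a fuzzy match on name is to keep the two fields separate, group the second dataset by zip, and only score names that share a zip. The sketch below is an illustration under assumptions, not a tested solution for the real data: the function `match_within_zip` and the sample rows are made up, and rows are plain `(name, zip)` tuples rather than DataFrame columns. It uses `fuzz.token_set_ratio` when fuzzywuzzy is installed and falls back to a `difflib`-based score otherwise, so the numbers differ between the two.

```python
from collections import defaultdict
from difflib import SequenceMatcher

try:
    from fuzzywuzzy import fuzz
    score_names = fuzz.token_set_ratio
except ImportError:
    # Fallback so the sketch runs without fuzzywuzzy; scores will not be identical.
    def score_names(a, b):
        return int(round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()))


def match_within_zip(ds1_rows, ds2_rows, scorer=score_names):
    """For each (name, zip) in ds1_rows, find the best-scoring ds2 name with the same zip."""
    # Exact part: bucket ds2 names by zip so only same-zip names are ever compared.
    by_zip = defaultdict(list)
    for name, zip_code in ds2_rows:
        by_zip[zip_code].append(name)

    results = []
    for name, zip_code in ds1_rows:
        candidates = by_zip.get(zip_code, [])
        if not candidates:
            results.append((name, zip_code, None, 0))  # no business in that zip
            continue
        # Fuzzy part: pick the highest-scoring name among same-zip candidates.
        best = max(candidates, key=lambda c: scorer(name, c))
        results.append((name, zip_code, best, scorer(name, best)))
    return results


# Hypothetical sample data: the same name appears in two different zips in ds2.
ds1_rows = [("JT's Kitchen", "30301"), ("Corner Deli", "10001")]
ds2_rows = [("J.T.'s Kitchen", "30301"), ("J.T.'s Kitchen", "94110"),
            ("Red Barn Grill", "30301")]

matches = match_within_zip(ds1_rows, ds2_rows)
for row in matches:
    print(row)
```

Because the best same-zip candidate is always returned, even when it is a poor match, a score threshold still has to be chosen afterwards. Passing a different `scorer` (for example `fuzz.partial_ratio`) makes it possible to compare scorers the same way as in the loop above.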