Grouping fuzzy matches within same column in a data frame

586 Views Asked by At

I am trying to group the similar names of companies basis fuzzy matching ( within same column ). But neither they are grouping correctly nor do I have the same number of rows in the resulting dataset. As a result of one to many match, the number of rows are more than what is there in original data.

Input File sample with more records

enter image description here

Input File/Column

  • **Code **
df.loc[:,'Account Name Copy'] = df['Account Name']

compare = pd.MultiIndex.from_product([df['Account Name'],
                                      df['Account Name Copy']]).to_series()

def metrics(tup):
    return pd.Series([fuzz.ratio(*tup),
                      fuzz.token_sort_ratio(*tup)],
                     ['ratio', 'token'])

compare.apply(metrics)

Current Output

- Current Output

P.S. The number of rows should remain the same in final output as it is in original data with similar company names being grouped.

Desired Output

Desired Output

Referred to below topics, but didn't get the desired output

https://stackoverflow.com/questions/54865890/fuzzy-match-strings-in-one-column-and-create-new-dataframe-using-fuzzywuzzy

https://stackoverflow.com/questions/71427827/fuzzy-matching-and-grouping

https://stackoverflow.com/questions/60987641/check-if-there-is-a-similar-string-in-the-same-column

https://stackoverflow.com/questions/62085777/fuzzy-match-within-the-same-column-python

Please help !!

1

There are 1 best solutions below

3
Timeless On

Here is a naïve approach with partial_ratio that (may?) lead to your expected output.

from itertools import product
from fuzzywuzzy import fuzz

df = pd.read_excel("file.xlsx")

RATIO = 80 # <-- adjust the ratio here

tups = list(product(df["Account Name"].unique(),
                    df["Account Name"].str.split(r"[-\s]").str[0].unique()))

matches = [(pair[1].title(), pair[0]) for pair in tuples_list
           if fuzz.partial_ratio(pair[1].lower(), pair[0].lower()) >= RATIO]
    
out = pd.DataFrame(index=pd.MultiIndex.from_tuples(set(matches),
                   names=["Grouped", "Account Name"])).sort_index()

Output :

enter image description here