Grouping fuzzy matches within same column in a data frame

586 Views Asked by ss_0708 At 20 April 2023 at 08:48

I am trying to group the similar names of companies basis fuzzy matching ( within same column ). But neither they are grouping correctly nor do I have the same number of rows in the resulting dataset. As a result of one to many match, the number of rows are more than what is there in original data.

Input File sample with more records

**Code **

df.loc[:,'Account Name Copy'] = df['Account Name']

compare = pd.MultiIndex.from_product([df['Account Name'],
                                      df['Account Name Copy']]).to_series()

def metrics(tup):
    return pd.Series([fuzz.ratio(*tup),
                      fuzz.token_sort_ratio(*tup)],
                     ['ratio', 'token'])

compare.apply(metrics)

Current Output

P.S. The number of rows should remain the same in final output as it is in original data with similar company names being grouped.

Desired Output

Referred to below topics, but didn't get the desired output

https://stackoverflow.com/questions/54865890/fuzzy-match-strings-in-one-column-and-create-new-dataframe-using-fuzzywuzzy

https://stackoverflow.com/questions/71427827/fuzzy-matching-and-grouping

https://stackoverflow.com/questions/60987641/check-if-there-is-a-similar-string-in-the-same-column

https://stackoverflow.com/questions/62085777/fuzzy-match-within-the-same-column-python

Please help !!

Original Q&A

There are 1 best solutions below

Timeless On 20 April 2023 at 10:18

Here is a naïve approach with partial_ratio that (may?) lead to your expected output.

from itertools import product
from fuzzywuzzy import fuzz

df = pd.read_excel("file.xlsx")

RATIO = 80 # <-- adjust the ratio here

tups = list(product(df["Account Name"].unique(),
                    df["Account Name"].str.split(r"[-\s]").str[0].unique()))

matches = [(pair[1].title(), pair[0]) for pair in tuples_list
           if fuzz.partial_ratio(pair[1].lower(), pair[0].lower()) >= RATIO]
    
out = pd.DataFrame(index=pd.MultiIndex.from_tuples(set(matches),
                   names=["Grouped", "Account Name"])).sort_index()

Output :

Grouping fuzzy matches within same column in a data frame

There are 1 best solutions below

Related Questions in PYTHON

Related Questions in PYTHON-3.X

Related Questions in PYTHON-3.7

Related Questions in FUZZYWUZZY

Related Questions in FUZZY-COMPARISON

Trending Questions

Popular # Hahtags

Popular Questions