Merging similar categories in a DataFrame column

This is my first question, so please bear with me. I'll try to be as concise as possible.

I have a DataFrame with a sector column. It contains a text description of the industry a company is in, e.g. Fintech, Retail, Sales, Healthcare. The problem is that these descriptions are written in varying letter case, for example FinTEch, Fintech, fintTech. What I want to do is combine the similar labels and rename them to a single label, for example all variants of Fintech should become FinTech.
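For the variants that differ only in letter case, a minimal sketch of one possible approach is below. It uses only the standard library. The function name `unify_case` and the rule of keeping the most frequent spelling are assumptions made for illustration, not something the data dictates.

```python
from collections import Counter

def unify_case(labels):
    """Replace each label with the most common spelling among its case-insensitive matches."""
    # Count how often each exact spelling occurs.
    counts = Counter(labels)
    # For each case-folded key, keep the first (most frequent) spelling seen.
    canonical = {}
    for spelling, _ in counts.most_common():
        canonical.setdefault(spelling.strip().casefold(), spelling)
    # Look every label up by its case-folded form.
    return [canonical[label.strip().casefold()] for label in labels]

sectors = ["Fintech", "FinTEch", "Fintech", "Retail", "retail", "Retail"]
print(unify_case(sectors))
```

This only catches labels that are identical once case is ignored. A spelling with an extra or missing letter, such as fintTech, would still need some form of fuzzy matching.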

Screenshot of some of the sectors I've grouped

I have tried writing a function using thefuzz to group them and played around with the threshold. I expected it to combine all similar labels, and for the most part it does, but I realized I also need to rename each group to a single label: the function currently returns the whole comma-separated list of similar labels, which is very messy to work with. That's where I'm stuck; I haven't been able to figure out how to do that.

Here's the function:

from thefuzz import process

def combine_similar_labels(labels_series, threshold=80):
    """
    Combines similar labels in a pandas Series based on text similarities.

    Args:
        labels_series (pd.Series): The Series containing labels.
        threshold (int, optional): Similarity threshold (default is 80).

    Returns:
        pd.Series: A new Series with combined labels.
    """
    # Create an empty dictionary to store combined labels
    combined_labels = {}

    # Compute the distinct labels once instead of on every iteration
    unique_labels = labels_series.unique()

    # Iterate through each distinct label
    for label in unique_labels:
        # Score the current label against every distinct label
        matches = process.extract(label, unique_labels, limit=None)

        # Keep only the matches at or above the similarity threshold
        close_matches = [match[0] for match in matches if match[1] >= threshold]

        # Join the similar labels into one comma-separated string
        # (this is the messy part: it keeps the whole list, not one name)
        combined_label = ', '.join(close_matches)

        # Store the combined label
        combined_labels[label] = combined_label

    # Map the original labels to their combined versions
    combined_series = labels_series.map(combined_labels)

    return combined_series

Maybe there's a better way of doing this. I'm open to suggestions, and happy to answer further questions if I haven't been clear enough. Thanks.
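One way to end up with a single name per group is to pick one representative label for each group and map every similar label to it, rather than joining the matches. The sketch below is a rough illustration under stated assumptions. It swaps thefuzz for the standard library's `difflib.SequenceMatcher` so that it runs without extra dependencies. The names `similarity` and `build_label_mapping` are hypothetical, and taking the first label seen as the representative is an arbitrary choice.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Case-insensitive similarity on a 0-100 scale (stand-in for a thefuzz scorer)."""
    return 100 * SequenceMatcher(None, a.casefold(), b.casefold()).ratio()

def build_label_mapping(labels, threshold=80):
    """Map every distinct label to one representative of its group of similar labels."""
    representatives = []  # one chosen label per group, in order of first appearance
    mapping = {}
    for label in dict.fromkeys(labels):  # distinct labels, original order preserved
        for rep in representatives:
            if similarity(label, rep) >= threshold:
                # Close enough to an existing group: reuse its representative.
                mapping[label] = rep
                break
        else:
            # No existing group matched: this label starts a new group.
            representatives.append(label)
            mapping[label] = label
    return mapping

labels = ["FinTech", "Fintech", "fintTech", "Retail", "retail", "Healthcare"]
mapping = build_label_mapping(labels)
print(mapping)
```

With the Series from the question, the result would presumably be applied as `labels_series.map(mapping)`. Replacing `similarity` with `fuzz.ratio` from thefuzz should slot in the same way, although the threshold may need retuning because the two scorers do not produce identical scores. Note that the outcome depends on the order of the labels, since each label is only compared against the representatives chosen so far.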
