I have a Python implementation of fuzzy matching using Levenshtein similarity. I'm pretty happy with it, but I feel I'm leaving a lot on the table by not considering the structure of the strings.
Here are some examples of matches that are clearly good but not captured well by Levenshtein:
- The Hobbit / Hobbit, The
- Charlies Angles / Charlie's Angels
- Apples & Pairs / Apples and Pairs
I think some normalization ahead of Levenshtein would be good, e.g. replace all "&" with "and", remove punctuation, etc. I'm not sure I want to jump straight to stop-word removal and lemmatization, but something along those lines, sketched below.
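Roughly the kind of helper I have in mind (just a sketch; `normalize` is a hypothetical name covering only the replacements mentioned above):

```python
import re
import string

def normalize(s: str) -> str:
    # hypothetical helper: just the normalization steps described above
    s = s.lower()
    s = s.replace("&", " and ")  # "Apples & Pairs" -> "apples and pairs"
    s = s.translate(str.maketrans("", "", string.punctuation))  # drop punctuation
    return re.sub(r"\s+", " ", s).strip()  # collapse repeated whitespace
```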
To avoid re-inventing the wheel, is there an easy way to do this? Or an alternative to Levenshtein that addresses these issues (short of something like BERT embeddings)?
rapidfuzz.utils.default_process might be an option to consider for preprocessing: https://maxbachmann.github.io/RapidFuzz/Usage/utils.html
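If I read the docs correctly, it lowercases, replaces non-alphanumeric characters with whitespace, and trims the result, and you can pass it as the processor argument to the scorers. Roughly:

```python
from rapidfuzz import fuzz, utils

# default_process lowercases, replaces non-alphanumeric characters
# with spaces, and trims the result
print(utils.default_process("Charlie's Angels!"))   # "charlie s angels"

# pass it as the processor so both inputs are normalized before scoring
score = fuzz.ratio("Charlies Angles", "Charlie's Angels",
                   processor=utils.default_process)
print(score)  # noticeably higher than the unprocessed comparison
```

Note it won't map "&" to "and" or handle the reordered "Hobbit, The" case; for word order, something like fuzz.token_sort_ratio may also be worth a look.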