Python: Get substring of a string with a closest match to another string


A nice bit of algorithmic trivia for you today. :)

I have two strings: one is a longer sentence and the other is a shorter sentence that an LLM discovered within the longer one. Let's see an example:

  • Long sentence: "If you're a coder you should consider buying a MacBook Pro 15inch with an M2 from Apple that will provide you with a plenty of computing power for your AI use-cases."
  • Short sentence: "Apple MacBook Pro 15" M2"

I need to mark the part of the long sentence that is closest to the short string. The result should be the character indexes [start:end] of that part.

Any one of the following outcomes would be acceptable:

If you're a coder you should consider buying a MacBook Pro 15inch with an M2 from Apple that will provide you with a plenty of computing power for your AI use-cases.
                                               ^^^^^^^^^^^^^^^^^^ [47:65]
/or/
If you're a coder you should consider buying a MacBook Pro 15inch with an M2 from Apple that will provide you with a plenty of computing power for your AI use-cases.
                                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [47:76]
/or/
If you're a coder you should consider buying a MacBook Pro 15inch with an M2 from Apple that will provide you with a plenty of computing power for your AI use-cases.
                                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [47:87]

I considered:

  • membership operator,
  • difflib methods,
  • regex,
  • Levenshtein,

but nothing really suits the case.

The closest to what I can think of is this:

  1. Get length = the number of words in short_string.
  2. Split long_string by whitespace and build every substring of length consecutive words.
  3. Calculate the Levenshtein distance between short_string and each substring.
  4. The substring with the smallest distance wins.
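import Levenshtein  # third-party package: pip install Levenshtein

short_string = "four five eight"
long_string = "one two three four five six seven eight nine"

# Window size = number of words in the short string
length = len(short_string.split())  # 3

# Slide a window of `length` words over the long string:
# "one two three", "two three four", ..., "seven eight nine"
words = long_string.split()
substrings = [
    " ".join(words[i:i + length])
    for i in range(len(words) - length + 1)
]

# The candidate with the smallest edit distance wins
winner = min(substrings, key=lambda s: Levenshtein.distance(s, short_string))
print(winner)  # four five six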
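
Because the goal is character positions rather than the substring itself, the same sliding window could also keep track of offsets. This is a rough sketch only: closest_window is a hypothetical name, and difflib.SequenceMatcher.ratio is swapped in for the Levenshtein distance purely to stay within the standard library, so the ranking may differ from a true edit distance.

```python
import re
from difflib import SequenceMatcher

def closest_window(long_string, short_string):
    """Return (start, end) of the word window in long_string most similar to short_string."""
    size = len(short_string.split())
    # Each \S+ match is one word and remembers its character offsets
    words = list(re.finditer(r"\S+", long_string))
    if not words or size == 0:
        return None
    size = min(size, len(words))  # never ask for more words than exist
    best = None
    for i in range(len(words) - size + 1):
        start, end = words[i].start(), words[i + size - 1].end()
        # Assumption: similarity ratio stands in for Levenshtein distance
        score = SequenceMatcher(None, long_string[start:end], short_string).ratio()
        if best is None or score > best[0]:
            best = (score, start, end)
    return best[1], best[2]

long_string = "one two three four five six seven eight nine"
start, end = closest_window(long_string, "four five eight")
print((start, end), long_string[start:end])
```

A fixed window of whole words cannot stretch over extra words in between (such as "15inch with an"), which is the main limitation of this idea.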

Any other ideas or open-source tools that you can think of?

Answer by blhsing:

The following approach should work reasonably well for your purpose:

  1. Split the short sentence into words and join them with | into a regex alternation pattern.
  2. Find all matches of the regex within the long sentence with re.finditer, which yields re.Match objects with the starting and ending indices of each match of a word in the short sentence.
  3. Use itertools.combinations to generate all combinations of pairs of Match objects. Each pair of Match objects will be used to slice the long sentence with the starting index of the first Match object and the ending index of the second.
  4. Use the max function to pick the pair of Match objects whose slice of the long sentence has the highest similarity ratio to the short sentence, as calculated by difflib.SequenceMatcher.ratio.

So with:

import re
from difflib import SequenceMatcher
from itertools import combinations

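def closest_substring(long, short):
    # Match any word of the short sentence; re.escape guards against regex metacharacters
    pattern = '|'.join(map(re.escape, short.split()))
    # Try every pair of word matches and keep the span most similar to the short sentence
    a, b = max(
        combinations(re.finditer(pattern, long), 2),
        key=lambda c: SequenceMatcher(None, long[c[0].start():c[1].end()], short).ratio()
    )
    return long[a.start():b.end()]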

The following code:

long_string = "If you're a coder you should consider buying a MacBook Pro 15inch with an M2 from Apple that will provide you with a plenty of computing power for your AI use-cases."
short_string = 'Apple MacBook Pro 15" M2'
print(closest_substring(long_string, short_string))

will output:

MacBook Pro 15inch with an M2
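
The question asks for [start:end] indexes rather than the substring itself. The variant below is a minimal sketch and not part of the function above: closest_span is a hypothetical name, and two choices are my own assumptions about the desired behaviour. It returns None when no word matches, and it uses combinations_with_replacement so that a single matched word can also win.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations_with_replacement

def closest_span(long, short):
    """Return (start, end) indexes into `long`, or None if no word of `short` occurs."""
    # Escape each word so characters like "+" or "(" are matched literally
    pattern = '|'.join(map(re.escape, short.split()))
    if not pattern:
        return None
    matches = list(re.finditer(pattern, long))
    if not matches:
        return None
    # Pairs (m, m) are included, so a span may consist of one matched word
    a, b = max(
        combinations_with_replacement(matches, 2),
        key=lambda c: SequenceMatcher(None, long[c[0].start():c[1].end()], short).ratio(),
    )
    return a.start(), b.end()

long_string = "If you're a coder you should consider buying a MacBook Pro 15inch with an M2 from Apple that will provide you with a plenty of computing power for your AI use-cases."
short_string = 'Apple MacBook Pro 15" M2'
start, end = closest_span(long_string, short_string)
print((start, end), long_string[start:end])
```

For the example sentences this should give (47, 76), which is the second of the acceptable outcomes listed in the question.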
