A nice algorithmic trivia for you today. :)
I have two strings – one is a longer sentence and the another one is shorter sentence that was discovered by LLM within the longer one. Let's see an example:
- Long sentence: "If you're a coder you should consider buying a MacBook Pro 15inch with an M2 from Apple that will provide you with a plenty of computing power for your AI use-cases."
- Short sentence: "Apple MacBook Pro 15" M2"
I need to mark the long sentence string with what is closest to the short string. The outcome would be the char [start:end] position indexes.
Acceptable outcomes could be like this (one of):
If you're a coder you should consider buying a MacBook Pro 15inch with an M2 from Apple that will provide you with a plenty of computing power for your AI use-cases.
^^^^^^^^^^^^^^^^^^ [47:65]
/or/
If you're a coder you should consider buying a MacBook Pro 15inch with an M2 from Apple that will provide you with a plenty of computing power for your AI use-cases.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [47:76]
/or/
If you're a coder you should consider buying a MacBook Pro 15inch with an M2 from Apple that will provide you with a plenty of computing power for your AI use-cases.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [47:87]
I considered:
- membership operator,
difflibmethods,- regex,
- Levenshtein,
but nothing really suits the case.
The closes to what I can think of is this:
- Get the
length = len(short_string). - Split
long_stringby whitespace into a set of substrings oflengthlength. - Calculate Levenshtein difference between
short_stringand each substring. - The closest distance wins it.
short_string = "four five eight"
long_string = "one two three four five six seven eight nine"
length = 3
substrings = [
"one two three",
"two three four",
"three four five",
"four five six",
"five six seven",
"six seven eight",
"seven eight nine"
]
for sentence in substrings:
Levenshtein.distance(sentence, short_string)
winner = "four five six"
Any other ideas or open-source tools that you can think of?
The following approach should work reasonably well for your purpose:
|s.re.finditer, which yieldsre.Matchobjects with the starting and ending indices of each match of a word in the short sentence.itertools.combinationsto generate all combinations of pairs ofMatchobjects. Each pair ofMatchobjects will be used to slice the long sentence with the starting index of the firstMatchobject and the ending index of the second.maxfunction to pick from the combinations ofMatchobject pairs with the highest similarity ratio of the sliced long sentence and the short sentence, as calculated bydifflib.SequenceMatcher.ratio:So with:
The following code:
will output:
Demo: Try it online!