Regex to replace word not working for Indic languages


I'm trying to replace a substring in a larger string with a replacement text. The substring should only be replaced where it occurs as a whole word, so I'm using a regex with word boundaries (\b). The Python code works for English text but fails for Hindi text.

I've tried the following code:

import re

def replace_str(text, substring_to_replace, replacement_text):
    # \b on both sides is meant to restrict the match to whole words
    modified_text = re.sub(
        rf"\b{substring_to_replace}\b", replacement_text,
        text, flags=re.IGNORECASE
    )
    return modified_text

When given the English input text:

text = "This is a dummy english text."
substring_to_replace = "is"
replacement_text = "##"

modified_text = replace_str(text, substring_to_replace, replacement_text)
print(modified_text)

it prints: This ## a dummy english text.

But for the Hindi text:

text = "आपको किन विषयों का अध्ययन करने की आवश्यकता है।"
substring_to_replace = "विषय"
replacement_text = "##"

modified_text = replace_str(text, substring_to_replace, replacement_text)
print(modified_text)

it prints: आपको किन ##ों का अध्ययन करने की आवश्यकता है।

The Hindi substring विषय does not occur in the text as a standalone word (it only appears inside विषयों), so it should not have matched, but it was still replaced.
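A likely cause, sketched here as a diagnostic rather than a confirmed explanation: Python's built-in re module appears to treat \w as "alphanumeric or underscore", and Devanagari vowel signs and the anusvara are combining marks (Unicode categories Mc/Mn), not letters. If so, \b sees a word boundary between य and ो. The helper name non_word_chars below is made up for this illustration.

```python
import re
import unicodedata

def non_word_chars(s):
    """Return (code point, Unicode category) for each character that \\w does not match."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch))
        for ch in s
        if re.fullmatch(r"\w", ch) is None
    ]

# List the characters of the inflected word that re does not count as word characters.
for code_point, category in non_word_chars("विषयों"):
    print(code_point, category)
```

Any character listed here creates a \b position next to an adjacent letter, which would explain why the pattern matches inside विषयों.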

I've also tried passing the re.UNICODE flag, with no luck.
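One possible workaround, as a minimal sketch using only the standard library: match the substring literally and check the neighbouring characters by hand, counting combining marks (Unicode categories starting with "M") as word characters. The names replace_word and _is_word_char are hypothetical, and the replacement is inserted literally (backslash escapes in it are not interpreted, unlike a string replacement passed to re.sub).

```python
import re
import unicodedata

def _is_word_char(ch):
    # Letters, digits, underscore, and combining marks (Mn, Mc, Me) all count as part of a word.
    return ch == "_" or ch.isalnum() or unicodedata.category(ch).startswith("M")

def replace_word(text, word, replacement):
    # re.escape keeps any regex metacharacters in `word` literal.
    pattern = re.compile(re.escape(word), flags=re.IGNORECASE)

    def _substitute(match):
        start, end = match.span()
        before_ok = start == 0 or not _is_word_char(text[start - 1])
        after_ok = end == len(text) or not _is_word_char(text[end])
        # Replace only when both neighbours are non-word characters (or string edges).
        return replacement if before_ok and after_ok else match.group(0)

    return pattern.sub(_substitute, text)

print(replace_word("This is a dummy english text.", "is", "##"))
print(replace_word("आपको किन विषयों का अध्ययन करने की आवश्यकता है।", "विषय", "##"))
print(replace_word("यह विषय कठिन है", "विषय", "##"))
```

With this check the Hindi sentence from the question should come back unchanged, while a standalone विषय is still replaced. The third-party regex package may offer a more direct route, but that is not verified here.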
