I have text that mixes English and Hindi, and I want to remove every character except Hindi and English letters, numbers, and punctuation. That way, I can get rid of "(", ")", "@", etc. Please consider the text below.
text = नई दिल्ली। Navjot Singh Sidhu Resigns. पंजाब विधानसभा चुनाव से पहले पंजाब कांग्रेस में कई बड़े बदलाव देखने को मिल रहे हैं। पहले कैप्टन अमरिंदर सिंह का पंजाब के मुख्यमंत्री पद से इस्तीफा दिया, उसके बाद चरणजीत सिंह चन्नी को राज्य का नया मुख्यमंत्री बनाया गया। वहीं अब नवजोत सिंह सिद्धू ने पंजाब कांग्रेस अध्यक्ष पद से इस्तीफा दे दिया है।I told you so…he is not a stable man and not fit for the border state of punjab.— Capt.Amarinder Singh (@capt_amarinder) September 28, 2021सोनिया गांधी को लिखा पत्र बता दें कि नवजोत सिंह सिद्धू ने काग्रेस अध्यक्ष सोनिया गांधी को एक पत्र लिखकर इस संबंध में जानकारी दी है। पत्र में सिद्धू ने यह भी कह कि वे कांग्रेस का हिस्सा बने रहेंगे।pic.twitter.com/L5wdRql5t3— Navjot Singh Sidhu (@sherryontopp) September 28, 2021
You can use Python's regex function `re.sub(pattern, substring, string)`, which substitutes every match of `pattern` in `string` with `substring`. In your case, since you want to delete the characters, you would substitute matches of the pattern `[^\w\s]+` with an empty string `''`.

Explanation
- `[^...]` is used to indicate a set of characters NOT to be matched (any character in that set will NOT be matched);
- `\w` for Unicode (str) patterns matches Unicode word characters; this includes alphanumeric characters (as defined by `str.isalnum()`) as well as the underscore (`_`);
- `\s` for Unicode (str) patterns matches Unicode whitespace characters (which includes `[ \t\n\r\f\v]`, and also many other characters, for example the non-breaking spaces mandated by typography rules in many languages);
- `+` causes the resulting regex to match 1 or more repetitions of the preceding regex.

(From the Python regex documentation.)

So this means anything that is not a Unicode alphanumeric character or a Unicode space will get deleted (replaced with an empty string).
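As a minimal sketch of the substitution described above (the function name `strip_symbols` and the short sample string are only illustrative, not part of the original answer):

```python
import re


def strip_symbols(text: str) -> str:
    """Delete every run of characters that is neither a word character nor whitespace."""
    # [^\w\s]+ matches one or more characters that are NOT in \w or \s.
    return re.sub(r"[^\w\s]+", "", text)


print(strip_symbols("Hello (world) @user!"))  # Hello world user
```

One caveat worth checking on the Hindi sample: Devanagari vowel signs (matras) and the virama are combining marks, which `str.isalnum()` does not count as alphanumeric, so `\w` does not appear to cover them and this pattern may strip them out of Hindi words. Listing the Devanagari range explicitly in the character set avoids that.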
You can modify the matching pattern in case you want to include/exclude other characters. For example, if you want to keep punctuation, you would have:
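A possible version that keeps some punctuation is sketched below. Which marks count as "punctuation" is an assumption here: `KEEP_PUNCT` is a hypothetical constant you would adjust, and it deliberately leaves out "(", ")" and "@" so those still get deleted.

```python
import re

# Hypothetical set of punctuation marks to keep; edit to taste.
# The danda (।) is included because it ends sentences in Hindi.
KEEP_PUNCT = ".,!?;:'\"-।"

# re.escape makes characters such as "-" and "." safe inside the set.
KEEP_PATTERN = re.compile(r"[^\w\s" + re.escape(KEEP_PUNCT) + r"]+")


def strip_symbols_keep_punct(text: str) -> str:
    """Delete everything except word characters, whitespace and KEEP_PUNCT."""
    return KEEP_PATTERN.sub("", text)


print(strip_symbols_keep_punct("Hello, (world)! @user."))  # Hello, world! user.
```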
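If, instead, you also want to delete underscores (`_`), you will have to change `\w` into an explicit set of the characters you want to keep, in your case the range `\u0900-\u0954` together with the English letters and digits. `\u0900-\u0954` are the Unicode characters from U+0900 to U+0954, which cover the Devanagari letters, vowel signs, and virama used in Hindi. The full Devanagari Unicode block runs from U+0900 to U+097F and also contains the danda (।, U+0964) and the Devanagari digits (U+0966 to U+096F).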
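A sketch of the explicit-set variant follows. It is written under one assumption that differs from the range quoted above: it uses the full Devanagari block, U+0900 to U+097F, so that the danda (।) and Devanagari digits also survive. Swap in `\u0900-\u0954` if you want the narrower range. The function name is illustrative only.

```python
import re

# Keep: Devanagari block (assumed U+0900-U+097F), ASCII letters, ASCII digits,
# and whitespace. Everything else, including "_", "(", ")" and "@", is deleted.
EXPLICIT_PATTERN = re.compile(r"[^\u0900-\u097Fa-zA-Z0-9\s]+")


def keep_hindi_english(text: str) -> str:
    """Delete every character outside the explicitly listed ranges."""
    return EXPLICIT_PATTERN.sub("", text)


sample = "नई दिल्ली। Navjot_Singh (@capt) 2021"
print(keep_hindi_english(sample))  # नई दिल्ली। NavjotSingh capt 2021
```

Because the set is spelled out, the vowel signs inside Hindi words are kept intact, and any other punctuation you want to preserve can be added to the character set in the same way.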