Replace emoji with other text

I need to replace every emoji in a text with the form ["emoji here"](emoji/1234567890). I wrote this code:

entities = [. . .] # ids for my emojis

emoji_pattern = re.compile(r"[\U0001F300-\U0001F64F\U0001F680-\U0001F6FF\u2702-\u27B0\u27BF-\u27FF\u2930-\u293F\u2980-\u29FF]")
emojis = [match.group() for match in re.finditer(emoji_pattern, text)]
emoji_dict = {emoji: [] for emoji in set(emojis)}
for i, emoji in enumerate(emojis):
    emoji_dict[emoji].append(i)
new_text = replace_emoji(emoji_dict, entities, text)


def replace_emoji(emoji_dict, entities, text):
    for emoji, indices in emoji_dict.items():
        for index in indices:
            text = re.sub(fr"{emoji}", f"[{emoji}](emoji/{entities[index]})", text)
    return text

emoji_dict looks something like this: {'': [0], '': [1, 2, 3, 4, 5]}, where the numbers are indices into the entities list.

If an emoji occurs in the text only once (as in the case of ), everything is displayed correctly: [](emoji/1234567890). But if an emoji occurs several times (as in the case of ), the result looks like this: [[](emoji/5235873473821159415)](emoji/5235851187235861094)[[](emoji/5235873473821159415)](emoji/5235851187235861094)

How can I fix this?

Example:

example text

text = '''Hello, #️⃣ user #️⃣ How's your day going?  I hope everything is going great for you!  If you have any questions, feel free to ask. I'm here to help! '''

. . .

new_text = '''Hello, [#️⃣](emoji/12352352340) user [#️⃣](emoji/12352352340) How's your day going? [](emoji/1245531421) I hope everything is going great for you! [](emoji/523424120) If you have any questions, feel free to ask. I'm here to help! [](emoji/90752893562)'''

Barmar On BEST ANSWER

When you do

for index in indices:
    text = re.sub(..., text)

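the first iteration replaces the emoji with f'[{emoji}](emoji/{entities[indices[0]]})'. The second iteration then matches the emoji inside that [] and replaces it with f'[{emoji}](emoji/{entities[indices[1]]})', and so on. Each re.sub call replaces every occurrence of the emoji, including the ones inserted by earlier iterations, so you get a series of nested replacements. You don't want to replace inside a previous replacement.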
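To see the nesting in isolation, here is a minimal sketch that uses the same loop shape as the question. The sample text and the two ids are made up for illustration; only the re.sub call mirrors the original code.

```python
import re

emoji = "\U0001F600"            # one emoji that occurs twice in the text
text = f"a {emoji} b {emoji}"
entities = [111, 222]           # hypothetical ids, one per occurrence

# Same loop shape as in the question: one re.sub call per occurrence index.
for index in [0, 1]:
    # re.sub rewrites *every* occurrence, including the emoji that an
    # earlier iteration already wrapped in [...](emoji/...).
    text = re.sub(emoji, f"[{emoji}](emoji/{entities[index]})", text)

print(text)
```

After the second pass, each occurrence has the shape [[emoji](emoji/222)](emoji/111), which is the same pattern as the broken output in the question.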

In your desired output, you use the same entity for all the repetitions of an emoji. So there's no need to make a list of indices for each emoji, or loop over them when making the replacements. emoji_dict should just have one index for each emoji, and you can replace all of them with the corresponding entity.

import re

text = "He llo, #️⃣ user #️⃣ How's your day going?  I hope everything is going great for you!  If you have any questions, feel free to ask. I'm here to help! "
entities = [12345, 67890, 23456, 78901] # ids for my emojis

def replace_emoji(emoji_dict, entities, text):
    for emoji, index in emoji_dict.items():
        text = re.sub(fr"{emoji}", f"[{emoji}](emoji/{entities[index]})", text)
    return text

emoji_pattern = re.compile(r"[\U0001F300-\U0001F64F\U0001F680-\U0001F6FF\u2702-\u27B0\u27BF-\u27FF\u2930-\u293F\u2980-\u29FF]")
emojis = re.findall(emoji_pattern, text)
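emoji_dict = {}
# One index per distinct emoji. set() order is arbitrary, so which emoji
# gets which entity can vary between runs.
for i, emoji in enumerate(set(emojis)):
    emoji_dict[emoji] = i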
new_text = replace_emoji(emoji_dict, entities, text)

print(new_text)

output:

He[](emoji/67890) llo, #️⃣ user #️⃣ How's your day going? [](emoji/67890) I hope everything is going great for you! [](emoji/12345) If you have any questions, feel free to ask. I'm here to help! 

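#️⃣ and the emoji at the end of the text are not replaced because the regexp doesn't match them. #️⃣, for instance, is a keycap sequence made of U+0023, U+FE0F and U+20E3, and none of those code points fall in the pattern's ranges.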
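As a possible variation on the code above, the replacement can be done in a single re.sub pass with a callback, so text that has already been replaced is never scanned again. This is only a sketch under a few assumptions: ids are handed out in order of first appearance (rather than set() order), the entities list has at least as many ids as there are distinct emojis, and the pattern gains an extra branch for keycap sequences that is not part of the original regexp. The name wrap and the sample ids are hypothetical.

```python
import re

# The character class from the question, plus one added alternative for
# keycap sequences (base character, optional U+FE0F, then U+20E3).
# The keycap branch is an addition for illustration only.
EMOJI_PATTERN = re.compile(
    "[#*0-9]\uFE0F?\u20E3"
    "|[\U0001F300-\U0001F64F\U0001F680-\U0001F6FF\u2702-\u27B0"
    "\u27BF-\u27FF\u2930-\u293F\u2980-\u29FF]"
)


def replace_emoji(text, entities):
    """Wrap each matched emoji as [emoji](emoji/<id>) in one pass."""
    ids = {}  # emoji -> entity id, assigned in order of first appearance

    def wrap(match):
        emoji = match.group()
        if emoji not in ids:
            # Raises IndexError if there are more distinct emojis than ids.
            ids[emoji] = entities[len(ids)]
        return f"[{emoji}](emoji/{ids[emoji]})"

    return EMOJI_PATTERN.sub(wrap, text)


text = "#\uFE0F\u20E3 hi \U0001F44B #\uFE0F\u20E3"
print(replace_emoji(text, [12345, 67890]))
```

Because the callback runs once per match in the original text, repeated emojis reuse the same id and nothing gets nested. The pattern still covers only part of the emoji range, so other sequences may need further branches.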