Replace accented character with html entity


I'm trying to automate a series of queries, but I need to replace accented characters with the corresponding HTML entities. It needs to be in Python 3.

Example:

vèlit 
[needs to become] 
v&egrave;lit

The thing is, whenever I try to do a word.replace, it doesn't find it.

This:

if u'è' in sentence:
    print(u'Found è')

Works and finds "è", but doing:

word.replace('è','&egrave;')

Doesn't do anything.

There are 3 best solutions below

0
Anthony On

Replace word.replace('è','&egrave;') with word = word.replace('è','&egrave;') and print the result to check.

word.replace('è','&egrave;') does work, but strings are immutable, so it returns a new string and leaves word itself unchanged.

See the documentation for str.replace().
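
A minimal sketch of the difference, assuming the replacement text you want is the literal `&egrave;`. The name `word` follows the question; `unchanged` is only there for illustration:

```python
word = 'vèlit'

# str.replace returns a new string; the result is discarded here,
# so word still holds the original text
word.replace('è', '&egrave;')
unchanged = word

# Rebinding the name keeps the result
word = word.replace('è', '&egrave;')

print(unchanged)  # vèlit
print(word)       # v&egrave;lit
```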

1
snakecharmerb On

You can use the str.translate method and the data in python's html package to convert characters to the equivalent html entity.

To do this, str.translate needs a dictionary that maps characters (technically the character's integer representation, or ordinal) to html entities.

html.entities.codepoint2name contains the required data, but the entity names are not wrapped in '&' and ';'. You can use a dict comprehension to create a table with the values you need.

Once the table has been created, call your string's translate method with the table as the argument and the result will be a new string in which any characters with an html entity equivalent will have been converted.

>>> import html.entities
>>> s = 'vèlit'

>>> # Create the translation table
>>> table = {k: '&{};'.format(v) for k, v in html.entities.codepoint2name.items()}

>>> s.translate(table)
'v&egrave;lit'

>>> 'Voilà'.translate(table)
'Voil&agrave;'

Be aware that accented Latin characters may be represented by a combination of Unicode code points: 'è' can be represented by the single code point - LATIN SMALL LETTER E WITH GRAVE - or two code points - LATIN SMALL LETTER E followed by COMBINING GRAVE ACCENT. In the latter case (known as the decomposed form), the translation leaves the character unchanged, because the table has no entry for the plain 'e' or for the combining accent.

To get around this, you can convert the two-codepoint decomposed form to the single codepoint composed form using the normalize function from the unicodedata module in Python's standard library.

>>> import unicodedata
>>> decomposed = unicodedata.normalize('NFD', s)
>>> decomposed
'vèlit'
>>> decomposed == s
False
>>> len(decomposed)    # decomposed is longer than composed
6
>>> decomposed.translate(table)    # no entity is substituted
'vèlit'
>>> composed = unicodedata.normalize('NFC', decomposed)
>>> composed == s
True
>>> composed.translate(table)
'v&egrave;lit'
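
The two steps above (normalize, then translate) could be combined in one small helper. This is only a sketch: the names `ENTITY_TABLE` and `to_named_entities` are made up for illustration and are not part of the standard library.

```python
import html.entities
import unicodedata

# Map each code point to its named entity, wrapped in '&' and ';'
ENTITY_TABLE = {
    codepoint: '&{};'.format(name)
    for codepoint, name in html.entities.codepoint2name.items()
}

def to_named_entities(text):
    """Compose the text (NFC), then swap known characters for entities."""
    composed = unicodedata.normalize('NFC', text)
    return composed.translate(ENTITY_TABLE)

decomposed = 've\u0300lit'  # 'e' followed by COMBINING GRAVE ACCENT
print(to_named_entities(decomposed))  # v&egrave;lit
```

Note that codepoint2name also has entries for &, <, > and ", so those characters are converted as well.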
0
Philip Colmer On

As an update to the answer provided by snakecharmerb, it may be helpful to know that Python 3.3 introduced html.entities.html5, which maps HTML5 named character references to the equivalent Unicode characters and covers more characters than codepoint2name.

For me, I needed that dictionary because codepoint2name didn't include ł.

So, the code to create the translation table is slightly changed to this:

table = {get_wide_ordinal(v): '&{}'.format(k) for k, v in html.entities.html5.items()}

where get_wide_ordinal I got from https://stackoverflow.com/a/7291240/1233830:

def get_wide_ordinal(char):
    if len(char) != 2:
        return ord(char)
    return 0x10000 + (ord(char[0]) - 0xD800) * 0x400 + (ord(char[1]) - 0xDC00)

because some of the values in the html5 lookup are two characters long rather than one.

Note that the HTML5 entity names in this table already end with a ; (a few legacy names are also listed without one), which is why it is removed from the format string.
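
One possible variation on this table, offered as a sketch rather than as part of the original answer. In Python 3 a str is a sequence of code points, so the two-character values in html.entities.html5 are pairs of code points, which str.translate cannot use as single keys. The version below skips them, keeps only names ending in ;, and leaves ASCII characters alone so that punctuation and whitespace are not rewritten (html5 has names for many of them). The names `build_html5_table` and `HTML5_TABLE` are hypothetical.

```python
import html.entities

def build_html5_table():
    """Build a str.translate table from html.entities.html5 (illustrative helper)."""
    table = {}
    for name, value in html.entities.html5.items():
        if not name.endswith(';'):
            continue  # skip legacy names listed without the semicolon
        if len(value) != 1:
            continue  # skip entities that stand for two code points
        if ord(value) < 128:
            continue  # leave ASCII (punctuation, tabs, newlines) alone
        # Several names can share a character; keep the first one seen
        table.setdefault(ord(value), '&' + name)
    return table

HTML5_TABLE = build_html5_table()
print('ł'.translate(HTML5_TABLE))  # &lstrok;
```

Because ASCII is skipped, & and < are not escaped by this table; drop the third check if you want them converted too.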