Replace accented character with html entity

2.9k Views Asked by Jordi At 10 May 2018 at 18:56

I'm trying to automate a series of queries, but I need to replace accented characters with the corresponding HTML entities. It needs to be in Python 3.

Example:

vèlit 
[needs to become] 
v&egrave;lit

The thing is, whenever I try to do a word.replace, it doesn't find it.

This:

if u'è' in sentence:
    print(u'Found è')

Works and finds "è", but doing:

word.replace('è','&egrave;')

Doesn't do anything.


There are 3 best solutions below

Anthony On 10 May 2018 at 19:03

Replace word.replace('è','&egrave;') with word = word.replace('è','&egrave;') and print the result to check.

word.replace('è','&egrave;') does work, but it returns a new string rather than changing word itself, because Python strings are immutable.

Check str.replace()
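A minimal sketch of the point above, reusing the variable name word from the question (the name unchanged is only there for illustration). It shows that calling str.replace without assigning the result leaves the original string as it was:

```python
word = 'v\u00e8lit'  # 'vèlit', with è as the single code point U+00E8

# str.replace returns a new string; without an assignment the result is discarded
word.replace('\u00e8', '&egrave;')
unchanged = word

# Assigning the result back to the name keeps the change
word = word.replace('\u00e8', '&egrave;')

print(unchanged)  # vèlit
print(word)       # v&egrave;lit
```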

snakecharmerb On 12 May 2018 at 09:37

You can use the str.translate method and the data in Python's html package to convert characters to the equivalent HTML entities.

To do this, str.translate needs a dictionary that maps characters (technically the character's integer representation, or ordinal) to html entities.

html.entities.codepoint2name contains the required data, but the entity names are not wrapped in '&' and ';'. You can use a dict comprehension to create a table with the values you need.

Once the table has been created, call your string's translate method with the table as the argument and the result will be a new string in which any characters with an html entity equivalent will have been converted.

>>> import html.entities
>>> s = 'vèlit'

>>> # Create the translation table
>>> table = {k: '&{};'.format(v) for k, v in html.entities.codepoint2name.items()}

>>> s.translate(table)
'v&egrave;lit'

>>> 'Voilà'.translate(table)
'Voil&agrave;'

Be aware that accented Latin characters may be represented by a combination of Unicode code points: 'è' can be represented by the single code point - LATIN SMALL LETTER E WITH GRAVE - or by two code points - LATIN SMALL LETTER E followed by COMBINING GRAVE ACCENT. In the latter case (known as the decomposed form), the translation will not work as expected: neither code point has an entry in the table, so the string comes back unchanged.

To get around this, you can convert the two-codepoint decomposed form to the single codepoint composed form using the normalize function from the unicodedata module in Python's standard library.

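>>> import unicodedata
>>> # Build the decomposed (NFD) form of s
>>> decomposed = unicodedata.normalize('NFD', s)
>>> decomposed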
'vèlit'
>>> decomposed == s
False
>>> len(decomposed)    # decomposed is longer than composed
6
>>> decomposed.translate(table)
'vèlit'
>>> composed = unicodedata.normalize('NFC', decomposed)
>>> composed == s
True
>>> composed.translate(table)
'v&egrave;lit'
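The two steps above (normalize to the composed form, then translate) could be combined into one helper. This is only a sketch: the names ENTITY_TABLE and to_named_entities are hypothetical and not part of the answer or the standard library.

```python
import html.entities
import unicodedata

# Same table as in the session above: code point -> '&name;'
ENTITY_TABLE = {
    codepoint: '&{};'.format(name)
    for codepoint, name in html.entities.codepoint2name.items()
}


def to_named_entities(text):
    """Hypothetical helper: compose to NFC, then swap in named entities."""
    composed = unicodedata.normalize('NFC', text)
    return composed.translate(ENTITY_TABLE)


decomposed = 've\u0300lit'  # 'e' followed by COMBINING GRAVE ACCENT
print(to_named_entities(decomposed))     # v&egrave;lit
print(to_named_entities('Voil\u00e0'))   # Voil&agrave;
```

Note that codepoint2name also has entries for &, <, > and ", so text that already contains entities would have its ampersands escaped a second time.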

Philip Colmer On 04 May 2021 at 13:20

As an update to the answer provided by snakecharmerb, it may be helpful to know that Python 3.3 introduced html.entities.html5, a dictionary that maps HTML5 named character references to the equivalent Unicode characters and covers more characters than codepoint2name.

For me, I needed that dictionary because codepoint2name didn't include ł.

So, the code to create the translation table is slightly changed to this:

table = {get_wide_ordinal(v): '&{}'.format(k) for k, v in html.entities.html5.items()}

where get_wide_ordinal I got from https://stackoverflow.com/a/7291240/1233830:

def get_wide_ordinal(char):
    if len(char) != 2:
        return ord(char)
    return 0x10000 + (ord(char[0]) - 0xD800) * 0x400 + (ord(char[1]) - 0xDC00)

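because some of the values in the html5 lookup are two characters long. Note that in Python 3.3 and later those values are two separate code points rather than a surrogate pair, and str.translate only matches single code points, so they will not be translated.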

Note that the entity names in html5 already end with a ';' (a few legacy names are also listed without it), which is why the ';' is removed from the format string.
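One possible variation on the html5 table, offered as a sketch rather than as the answer's own method. html.entities.html5 also contains names for plain ASCII characters (for example period; and NewLine;) and some names without the trailing ';', so this version makes three assumptions that are not in the original answer: only single code points are mapped, only non-ASCII characters are mapped, and only names ending in ';' are used. The name html5_table is hypothetical.

```python
import html.entities

# Assumptions in this sketch: map only single, non-ASCII code points,
# and use only names that end in ';'.
html5_table = {}
for name, chars in html.entities.html5.items():
    if len(chars) == 1 and ord(chars) > 127 and name.endswith(';'):
        # Several names can share a character; keep the first one seen
        # (which one that is depends on the dictionary's order).
        html5_table.setdefault(ord(chars), '&' + name)

text = 'v\u00e8lit, \u0142\n'
converted = text.translate(html5_table)
print(converted)

# Round trip: html.unescape turns the entities back into the characters
print(html.unescape(converted) == text)
```

With these filters, ordinary punctuation and newlines pass through untouched, and html.unescape can be used to check that the conversion is reversible.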
