How can I find all hex codes for a Unicode character?


I have to find some special characters using a Python regular expression. For example, for the character 'à', I found some hex codes to build a regex pattern:

r'\x61\x300|\xe0|\x61\x300'

But I am afraid that I may miss some other hex codes. How can I find all possible hex codes for a character?

2 Answers

Answer by tripleee

This is a bit of an XY problem. You want something like

import unicodedata as u
import re

result = re.findall(
    u.normalize("NFC", "à"),
    u.normalize("NFC", inputstring))

If your regex needs to contain more than a static literal string, compose it out of normalized Unicode strings and other regex constructs (though in simple cases, individual regex primitives like ^ and . and * are robust under Unicode normalization).

example = re.match(
    r'^' + u.normalize("NFC", "à") + r'$',
    u.normalize("NFC", inputstring))
similarly = re.finditer(
    f'^{u.normalize("NFC", "à")}$',
    u.normalize("NFC", inputstring))
yolo = re.search(
    u.normalize("NFC", r"^à$"),
    u.normalize("NFC", inputstring))

You can use another normalization if you prefer; the crucial requirement is to use the same normalization for both inputs. But NFC is recommended for this type of scenario.
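A minimal demonstration of why the same normalization must be applied to both sides (standard library only):

```python
import re
import unicodedata as u

decomposed = "a\u0300"                     # 'à' as base letter + combining grave (NFD)
composed = u.normalize("NFC", decomposed)  # single code point U+00E0

# Without normalization, the two spellings of 'à' do not match each other...
assert re.search(composed, decomposed) is None
# ...but after normalizing both sides to NFC, they do.
assert re.search(u.normalize("NFC", composed),
                 u.normalize("NFC", decomposed)) is not None
```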

See also https://www.unicode.org/faq/normalization.html and Normalizing Unicode

To really solve the stated problem, you have to understand how normalization works. For example, if there are multiple combining diacritics, you have to generate all possible orderings of them. See also How does Zalgo text work?
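For instance, marks from different combining classes can appear in either order yet remain canonically equivalent; a small standard-library check:

```python
import unicodedata as u

# U+1EAD 'ậ' decomposes to 'a' + circumflex (combining class 230)
# + dot below (combining class 220). Because the two marks belong to
# different combining classes, both orderings are canonically
# equivalent and normalize to the same NFC form.
order1 = "a\u0302\u0323"   # circumflex first, then dot below
order2 = "a\u0323\u0302"   # dot below first, then circumflex

assert u.normalize("NFC", order1) == u.normalize("NFC", order2) == "\u1ead"
```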

Answer by Andj

What the OP requires is a list of canonically equivalent codepoint sequences. The easiest solution is to use PyICU, which can provide canonical equivalents, and then convert them to a regex pattern. This is a variation of code we use in our projects. The core is the get_ce_pattern function, which:

  1. normalises the character
  2. creates an ICU iterator of canonical equivalents
  3. uses a list comprehension over the iterator to build a regex pattern

The function can be modified to process a string, or exclude deprecated characters.

é has three canonically equivalent forms, while ộ has five.

If there is only one combining mark (diacritic), or if there are several that all belong to the same combining class, all you need is the NFC and NFD equivalents of the character (assuming deprecated marks are excluded from consideration).
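For that simple case, a standard-library sketch is enough (simple_equivalents_pattern is a hypothetical helper name, not part of any library):

```python
import re
import unicodedata as ud

def simple_equivalents_pattern(char):
    # Alternation of the NFC (precomposed) and NFD (decomposed) forms.
    forms = {ud.normalize("NFC", char), ud.normalize("NFD", char)}
    return "|".join(forms)

line = "La sottise, l'erreur, le péché, la le\u0301sine,"
print(re.findall(simple_equivalents_pattern("é"), line))
# three matches: two precomposed 'é' and one decomposed 'e' + U+0301
```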

If there is more than one diacritic, and some belong to different combining classes, then canonically equivalent variations beyond the NFC and NFD forms also need to be considered.

import icu
import regex as re

def get_ce_pattern(char, caseless=False):
    # Normalize to NFC, then let ICU enumerate every canonically
    # equivalent codepoint sequence.
    char = icu.Normalizer2.getNFCInstance().normalize(char)
    ci = icu.CanonicalIterator(char)
    pattern = "|".join(c for c in ci)
    if caseless:
        return rf'(?i){pattern}'
    return pattern

line = "La sottise, l'erreur, le péché, la le\u0301sine,"
result = re.findall(get_ce_pattern('é'), line)
print(f"Pattern: {get_ce_pattern('é')}")
# Pattern: é|é|é
print(f"Result: {result}")
# Result: ['é', 'é', 'é']

line2 = "Trải qua mo\u0302\u0323t cuộc bể dâu"
result2 = re.findall(get_ce_pattern('ộ'), line2)
print(f"Pattern2: {get_ce_pattern('ộ')}")
# Pattern2: ộ|ộ|ộ|ộ|ộ
print(f"Result2: {result2}")
# Result2: ['ộ', 'ộ']

Edit:

Often when working with text from older PDF files, I tend to find presentation forms in the text, in which case canonical equivalents are insufficient and it is necessary to work with compatibility decompositions as well. Building on get_ce_pattern:

import unicodedata as ud

def get_pf_pattern(char, caseless=False):
    # The character itself, its NFKC form, and (if distinct) its NFKD form.
    results = [char, ud.normalize("NFKC", char)]
    if ud.normalize("NFKD", char) != ud.normalize("NFKC", char):
        results.append(ud.normalize("NFKD", char))
    pattern = "|".join(results)
    if caseless:
        return rf'(?i){pattern}'
    return pattern

def equivalents_to_pattern(char, caseless=False):
    block_pattern = r'^[\p{Block=Alphabetic_Presentation_Forms}\p{Block=Arabic_Presentation_Forms_A}\p{Block=Arabic_Presentation_Forms_B}]$'
    if re.match(block_pattern, char):
        return get_pf_pattern(char, caseless=caseless)
    return get_ce_pattern(char, caseless=caseless)

equivalents_to_pattern('å')
# 'å|å'
equivalents_to_pattern('\uFE82')
# 'ﺂ|آ|آ'
equivalents_to_pattern('á')
# 'á|á|á'
equivalents_to_pattern('á', caseless=True)
# '(?i)á|á|á'
equivalents_to_pattern('\uFB00')
# 'ff|ff'

N.B. I am using the regex module instead of re since it supports Perl and POSIX notation for Unicode blocks.
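A quick sanity check of that difference (assuming the third-party regex module is installed):

```python
import regex
import re

pat = r'\p{Block=Alphabetic_Presentation_Forms}'

# regex understands Unicode block properties...
assert regex.match(pat, '\uFB00') is not None   # ﬀ (U+FB00) is in that block
# ...while the standard-library re module rejects the \p escape.
try:
    re.compile(pat)
except re.error:
    print("re rejects \\p{...}")
```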