Chinese characters with multiple unicode representations

Question

Asked by Tim Mak on 15 July 2022 at 09:24 (600 views)

Some Chinese characters have multiple Unicode representations. For example, although the characters 金 (U+91D1) and 金 (U+F90A) are usually rendered identically, they have different underlying code points (see https://www.compart.com/en/unicode/U+91D1 and https://www.compart.com/en/unicode/U+F90A#UNC_DB). Evidently, www.compart.com knows about the link between these two code points, as the U+F90A page links to the U+91D1 page.

Is there a public database where I can query these kinds of correspondences between Unicode characters that are actually the "same"?

**JosefZ** · Accepted Answer · 2022-07-17T13:30:06.133000

The solution is Unicode Normalization.

Is there a public database where I can query these kinds of correspondences between Unicode characters that are actually the "same"?

Yes, there is: the Unicode Character Database. Pay attention to UnicodeData.txt, which is technically a semicolon-delimited CSV file without a header line (fields described here). We are interested in field 5, counting from 0 (Decomposition_Type and Decomposition_Mapping):

(5) This field contains both values, with the type in angle brackets. The decomposition mappings exactly match the decomposition mappings published with the character names in the Unicode Standard. For more information, see Character Decomposition Mappings.

Search the file manually or semi-automatically for your code point 91d1, using findstr on Windows or the equivalent grep command in bash (Linux):

findstr /I /R "\<91d1\>" "\Utils\CodePages\UnicodeData.txt"

2FA6;KANGXI RADICAL GOLD;So;0;ON;<compat> 91D1;;;;N;;;;;
322E;PARENTHESIZED IDEOGRAPH METAL;So;0;L;<compat> 0028 91D1 0029;;;;N;;;;;
328E;CIRCLED IDEOGRAPH METAL;So;0;L;<circle> 91D1;;;;N;;;;;
F90A;CJK COMPATIBILITY IDEOGRAPH-F90A;Lo;0;L;91D1;;;;N;;;;;
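
The same lookup can be sketched in Python by splitting each row on semicolons and inspecting field 5. This is a minimal sketch, not part of the original answer: the function name find_decompositions is hypothetical, and the sample rows are the four lines shown above plus one row without a decomposition. With a local copy of UnicodeData.txt you would pass the open file instead of the sample.

```python
# Sample rows copied from the findstr output above, plus one row whose
# decomposition field is empty. In practice, iterate over a local copy
# of UnicodeData.txt instead (its location is up to you).
SAMPLE = """\
2FA6;KANGXI RADICAL GOLD;So;0;ON;<compat> 91D1;;;;N;;;;;
322E;PARENTHESIZED IDEOGRAPH METAL;So;0;L;<compat> 0028 91D1 0029;;;;N;;;;;
328E;CIRCLED IDEOGRAPH METAL;So;0;L;<circle> 91D1;;;;N;;;;;
F90A;CJK COMPATIBILITY IDEOGRAPH-F90A;Lo;0;L;91D1;;;;N;;;;;
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
"""


def find_decompositions(lines, target):
    """Yield (code_point, name, kind, mapping) for every row whose
    decomposition mapping (field 5) mentions the hex code point `target`."""
    target = target.upper()
    for line in lines:
        fields = line.rstrip("\n").split(";")
        if len(fields) < 6 or not fields[5]:
            continue  # malformed row or no decomposition
        parts = fields[5].split()
        # A leading <tag> marks a compatibility decomposition;
        # no tag means a canonical decomposition.
        if parts[0].startswith("<"):
            kind, mapping = parts[0].strip("<>"), parts[1:]
        else:
            kind, mapping = "canonical", parts
        if target in mapping:
            yield fields[0], fields[1], kind, mapping


matches = list(find_decompositions(SAMPLE.splitlines(), "91d1"))
for code, name, kind, mapping in matches:
    print(f"U+{code}  {kind:<9}  {' '.join(mapping):<14}  {name}")
```

Note that only F90A has an untagged (canonical) mapping; the other three rows carry a compatibility tag, which is why they behave differently under normalization.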

The characters found above are

⾦ (U+2FA6, Kangxi Radical Gold)
㊎ (U+328E, Circled Ideograph Metal)
金 (U+F90A, CJK Compatibility Ideograph-F90a)
㈮ (U+322E, Parenthesized Ideograph Metal)
金 (U+91D1, CJK Ideograph) (missing from the findstr output above because UnicodeData.txt does not list the unified ideographs individually; they are covered by the CJK Ideograph range, code points 4E00..9FFF).
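
If downloading UnicodeData.txt is inconvenient, the same reverse lookup can be approximated with Python's standard unicodedata module, which exposes the decomposition field through unicodedata.decomposition(). The sketch below is an illustration rather than the original author's method; the helper name decomposes_to is hypothetical, and the results depend on the Unicode version bundled with your Python (see unicodedata.unidata_version).

```python
import sys
import unicodedata


def decomposes_to(target):
    """Return {char: decomposition string} for every code point whose
    decomposition mapping mentions `target` (a single character)."""
    wanted = f"{ord(target):04X}"
    found = {}
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        decomp = unicodedata.decomposition(ch)  # '' when there is none
        # Ignore the optional <tag>; compare only the hex code points.
        if wanted in [p for p in decomp.split() if not p.startswith("<")]:
            found[ch] = decomp
    return found


related = decomposes_to("\u91d1")
for ch, decomp in related.items():
    print(f"U+{ord(ch):04X}  {decomp:<24}  {unicodedata.name(ch, '?')}")
```

The scan visits every code point once, so it takes a moment to run; for repeated queries it would make sense to build the reverse index a single time and reuse it.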

The following Python script illustrates some aspects of normalization:

import sys
from unicodedata import normalize

def encodeuni(s):
    '''
    Returns input string encoded to escape sequences as in a string literal.
    Output is similar to
      str(s.encode('unicode_escape')).lstrip('b').strip("'").replace('\\\\','\\');
    but even every ASCII character is encoded as a \\xNN escape sequence
    (except a space character). For instance: 
    
    s = 'A á ř 🌈';
    encodeuni(s);       # '\\x41 \\xe1 \\u0159 \\U0001f308'     while 
    str(s.encode('unicode_escape')).lstrip('b').strip("'").replace('\\\\','\\');
    #                   #    'A \\xe1 \\u0159 \\U0001f308'
    '''
    def encodechar(ch):
        ordch = ord(ch)
        return ( ch                if ordch == 0x20   else 
                 f"\\x{ordch:02x}" if ordch <= 0xFF   else
                 f"\\u{ordch:04x}" if ordch <= 0xFFFF else
                 f"\\U{ordch:08x}" )
                 
    return ''.join([encodechar(ch) for ch in s]) 

if len(sys.argv) >= 2 and sys.argv[1] != '':
    letters = (' '.join(
    [sys.argv[i] for i in range(1,len(sys.argv))])).strip()
    # .\SO\59979037.py  ÅÅÅ
else:
    letters = '\u212B \u00C5 \u0041\u030A \U0001f308'
    #          \u212B                     Å Angstrom Sign
    #                 \u00C5              Å Latin Capital Letter A With Ring Above
    #                        \u0041       A Latin Capital Letter A
    #                              \u030A ̊  Combining Ring Above
    #                                     \U0001f308  Rainbow

print('\t'.join( ['raw' ,
                  letters.ljust(10),
                  str(len(letters)),
                  encodeuni(letters),'\n']))
for form in ['NFC','NFKC','NFD','NFKD']:
    letnorm = normalize(form, letters)
    print( '\t'.join( [form,
                      letnorm.ljust(10),
                      str(len(letnorm)),
                      encodeuni(letnorm)]))

Output: encodeuni.py ⾦㊎金金㈮

raw     ⾦㊎金金㈮      5       \u2fa6\u328e\uf90a\u91d1\u322e

NFC     ⾦㊎金金㈮      5       \u2fa6\u328e\u91d1\u91d1\u322e
NFKC    金金金金(金)    7       \u91d1\u91d1\u91d1\u91d1\x28\u91d1\x29
NFD     ⾦㊎金金㈮      5       \u2fa6\u328e\u91d1\u91d1\u322e
NFKD    金金金金(金)    7       \u91d1\u91d1\u91d1\u91d1\x28\u91d1\x29
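
The output suggests a simple way to test whether two strings are the "same": normalize both and compare. The sketch below assumes that "same" means canonical equivalence (NFC) or, more loosely, compatibility equivalence (NFKC); the helper name same_under is hypothetical.

```python
from unicodedata import normalize


def same_under(form, a, b):
    """True if strings a and b are identical after normalization to `form`
    ('NFC', 'NFD', 'NFKC' or 'NFKD')."""
    return normalize(form, a) == normalize(form, b)


IDEOGRAPH = "\u91d1"  # U+91D1, the unified ideograph

for label, ch in [("U+F90A", "\uf90a"), ("U+2FA6", "\u2fa6"),
                  ("U+328E", "\u328e"), ("U+322E", "\u322e")]:
    print(label,
          "canonical:", same_under("NFC", ch, IDEOGRAPH),
          "compatibility:", same_under("NFKC", ch, IDEOGRAPH))
```

Consistent with the table above, only U+F90A is canonically equivalent to U+91D1. U+2FA6 and U+328E match only under the compatibility forms, and U+322E matches under neither, because its compatibility mapping adds parentheses.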

Further resources (required reading): Unicode® Technical Reports
