Some Chinese characters have multiple Unicode representations. For example, although the characters 金 (U+91D1) and 金 (U+F90A) are usually rendered the same, they actually have different code points (see https://www.compart.com/en/unicode/U+91D1 and https://www.compart.com/en/unicode/U+F90A#UNC_DB). Evidently, www.compart.com knows that these two code points are linked, since the U+F90A page links to the U+91D1 page.
Is there a public database where I can query these kinds of correspondences between Unicode characters that are actually the "same"?
The solution is Unicode Normalization.
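As a minimal sketch of what normalization does to the code points from the question, the snippet below uses Python's standard unicodedata module. The helper name same_after_normalization is made up for this illustration, and the characters are written as escapes so that the example does not depend on how a page or editor stores them.

```python
import unicodedata

unified = "\u91d1"        # CJK unified ideograph (U+91D1)
compat = "\uf90a"         # CJK Compatibility Ideograph-F90A
radical = "\u2fa6"        # Kangxi Radical Gold
parenthesized = "\u322e"  # Parenthesized Ideograph Metal


def same_after_normalization(a: str, b: str, form: str = "NFC") -> bool:
    """Return True if a and b are equal once both are normalized to `form`.

    Hypothetical helper for this example, not part of any library.
    """
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# The raw strings differ, but U+F90A is canonically equivalent to U+91D1,
# so any normalization form makes them compare equal.
print(unified == compat)
print(same_after_normalization(unified, compat, "NFC"))

# U+2FA6 is only a compatibility equivalent: NFC keeps it, NFKC folds it.
print(same_after_normalization(unified, radical, "NFC"))
print(same_after_normalization(unified, radical, "NFKC"))

# Compatibility mappings may expand to several characters.
print([f"U+{ord(c):04X}" for c in unicodedata.normalize("NFKC", parenthesized)])
```

Whether NFC or NFKC is the right choice depends on how loosely "same" is meant: the canonical forms (NFC, NFD) only merge U+F90A with U+91D1, while the compatibility forms (NFKC, NFKD) also fold the radical, circled and parenthesized variants.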
Yes, there is a public database for these correspondences: the Unicode Character Database (UCD). Look at UnicodeData.txt, which is technically a semicolon-separated CSV file without a header line (the fields are described in UAX #44). The field of interest is field 5, counting from 0 (Decomposition_Type and Decomposition_Mapping).

Search the file manually or semi-automatically: use findstr on Windows, or the equivalent command on Linux (grep), to look for your code point 91d1.

The characters found this way are:

- ⾦ (U+2FA6, Kangxi Radical Gold)
- ㊎ (U+328E, Circled Ideograph Metal)
- 金 (U+F90A, CJK Compatibility Ideograph-F90A)
- ㈮ (U+322E, Parenthesized Ideograph Metal)
- 金 (U+91D1, CJK Ideograph); this one is missing from the findstr output because it is hidden in the CJK Ideograph block (code points 4E00..9FFF), which UnicodeData.txt lists only as a first/last range.

The following Python script illustrates some aspects of normalization…
Output:
encodeuni.py ⾦㊎金金㈮

Further resources (required reading): Unicode® Technical Reports
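The findstr/grep search of UnicodeData.txt described above can also be sketched in Python. This is only an illustration, not the script from this answer: the function name find_decompositions is hypothetical, and the three SAMPLE lines are a tiny stand-in for the real file so that the snippet runs without a download.

```python
from typing import Iterable, Iterator, Tuple


def find_decompositions(
    lines: Iterable[str], target: str
) -> Iterator[Tuple[str, str, str]]:
    """Yield (code point, name, decomposition field) for every entry whose
    decomposition mapping (field 5, counting from 0) contains `target`."""
    target = target.upper()
    for line in lines:
        fields = line.rstrip("\n").split(";")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        # Field 5 looks like "91D1" or "<compat> 0028 91D1 0029";
        # drop the optional <tag> and compare whole code points only.
        mapping = [tok for tok in fields[5].split() if not tok.startswith("<")]
        if target in mapping:
            yield fields[0], fields[1], fields[5]


# A few lines in the UnicodeData.txt format, used here instead of the file.
SAMPLE = [
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "2FA6;KANGXI RADICAL GOLD;So;0;ON;<compat> 91D1;;;;N;;;;;",
    "F90A;CJK COMPATIBILITY IDEOGRAPH-F90A;Lo;0;L;91D1;;;;N;;;;;",
]

for code, name, decomposition in find_decompositions(SAMPLE, "91d1"):
    print(code, name, decomposition)
```

To search the real data, pass an open file instead of SAMPLE, for example the handle returned by open("UnicodeData.txt", encoding="utf-8"), assuming the file has been downloaded to the current directory. Unlike a plain text search, this only looks at the decomposition field, so a code point that merely appears in a case-mapping column does not produce a hit.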