Same text in UTF-8 but different in ANSI

345 Views Asked by user2877989 At 10 April 2023 at 03:23

I am using Notepad++ with the encoding set to UTF-8. The file contains two lines of text:

Thành
Thành

But when I use the Find dialog to search for "Thành", it finds only one match. When I change the Notepad++ encoding to ANSI, it shows:

ThÃ nh
ThaÌ€nh

Why are they different in ANSI? What should I do to make them the same?



JosefZ On 10 April 2023 at 18:38

Your strings differ in their Unicode normalization form (shown here only for the relevant characters):

Form   String Unicode                        Length
----   ------ -------                        ------
(raw)  à à    \u00e0 \u0061\u0300            4
FormC  à à    \u00e0 \u00e0                  3
FormD  à à    \u0061\u0300 \u0061\u0300      5
FormKC à à    \u00e0 \u00e0                  3
FormKD à à    \u0061\u0300 \u0061\u0300      5
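
The rows above can be reproduced with Python's standard `unicodedata` module. This is only a minimal sketch: the names `pair`, `escaped` and `rows` are illustrative, and Python calls the forms NFC, NFD, NFKC and NFKD rather than FormC, FormD, FormKC and FormKD. The input is written with escapes so that an editor cannot silently renormalize it.

```python
import unicodedata

# Precomposed à, a space, then a + combining grave accent (the "(raw)" row).
pair = "\u00e0 \u0061\u0300"

def escaped(text):
    # Show every non-space character as a backslash-u escape, like the table does.
    return "".join(c if c == " " else f"\\u{ord(c):04x}" for c in text)

rows = [("(raw)", pair)]
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    rows.append((form, unicodedata.normalize(form, pair)))

for label, text in rows:
    # The last column is the length in code points.
    print(f"{label:<6} {escaped(text):<26} {len(text)}")
```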

The former string is

T (U+0054, Latin Capital Letter T)
h (U+0068, Latin Small Letter H)
à (U+00E0, Latin Small Letter A With Grave)
n (U+006E, Latin Small Letter N)
h (U+0068, Latin Small Letter H)

while the latter one is

T (U+0054, Latin Capital Letter T)
h (U+0068, Latin Small Letter H)
a (U+0061, Latin Small Letter A)
̀ (U+0300, Combining Grave Accent)
n (U+006E, Latin Small Letter N)
h (U+0068, Latin Small Letter H)
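
A similar listing can be produced for any string, which helps when two lines look identical on screen. The sketch below assumes Python's standard `unicodedata` module; `describe`, `precomposed` and `decomposed` are names chosen for illustration.

```python
import unicodedata

precomposed = "Th\u00e0nh"   # à as a single code point
decomposed = "Tha\u0300nh"   # a followed by a combining grave accent

def describe(text):
    # One "U+XXXX NAME" entry per code point in text.
    return [f"U+{ord(c):04X} {unicodedata.name(c)}" for c in text]

for label, text in (("precomposed", precomposed), ("decomposed", decomposed)):
    print(label, len(text))
    for entry in describe(text):
        print("   ", entry)
```

The two strings have different lengths (5 and 6 code points), which is why a literal search for one does not match the other.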

Switching the encoding to ANSI produces mojibake, because the UTF-8 bytes are reinterpreted as Windows-1252 (example in Python, since it is widely understood):

print('Thành\nThành'.encode('utf-8').decode('cp1252'))

ThÃ nh
ThaÌ€nh
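
To make the two lines the same, one option is to normalize the text to a single form before saving or searching. The following is a minimal sketch, assuming the text can be processed in Python outside Notepad++; nothing here is a Notepad++ feature, and the variable names are illustrative.

```python
import unicodedata

first = "Th\u00e0nh"     # precomposed à
second = "Tha\u0300nh"   # a + combining grave accent

# Before normalization the strings compare unequal, so a literal search misses one.
print(first == second)

# NFC composes a + U+0300 into the single code point U+00E0.
first_nfc = unicodedata.normalize("NFC", first)
second_nfc = unicodedata.normalize("NFC", second)
print(first_nfc == second_nfc)

# After normalization both lines share the same UTF-8 bytes, so
# reinterpreting them as cp1252 yields the same mojibake for both.
print(first_nfc.encode("utf-8") == second_nfc.encode("utf-8"))
```

NFC is used here because it yields the shorter, precomposed form; NFD would work equally well as long as both the text and the search term use the same form.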
