Same text in UTF-8 but different in ANSI


I am using Notepad++ with the encoding set to UTF-8. The file contains two lines of text:

Thành
Thành

But when I use the Find dialog to search for "Thành", it finds only one match. When I change the Notepad++ encoding to ANSI, it shows

Thành
Thành

Why are they different in ANSI? What should I do to make them the same?

JosefZ On

Your strings differ in their Unicode normalization form (shown here only for the relevant characters):

Form   String Unicode                        Length
----   ------ -------                        ------
(raw)  à à    \u00e0 \u0061\u0300            4
FormC  à à    \u00e0 \u00e0                  3
FormD  à à    \u0061\u0300 \u0061\u0300      5
FormKC à à    \u00e0 \u00e0                  3
FormKD à à    \u0061\u0300 \u0061\u0300      5
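The lengths in the table can be reproduced with Python's standard `unicodedata` module. This is a minimal sketch: the table's FormC, FormD, FormKC and FormKD correspond to the forms Python calls NFC, NFD, NFKC and NFKD, and the variable names are only illustrative.

```python
import unicodedata

# Precomposed à, a space, then decomposed a + combining grave accent,
# written with escapes so the two forms cannot be confused in the source.
raw = "\u00e0 \u0061\u0300"

lengths = {"raw": len(raw)}
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    normalized = unicodedata.normalize(form, raw)
    lengths[form] = len(normalized)
    # Show the code points as ASCII escapes next to the length.
    print(form, normalized.encode("unicode_escape").decode("ascii"), len(normalized))
```

The composed forms (NFC, NFKC) collapse both variants to a single code point each, while the decomposed forms (NFD, NFKD) expand both to two code points.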

The former string is

  • T (U+0054, Latin Capital Letter T)
  • h (U+0068, Latin Small Letter H)
  • à (U+00E0, Latin Small Letter A With Grave)
  • n (U+006E, Latin Small Letter N)
  • h (U+0068, Latin Small Letter H)

while the latter one is

  • T (U+0054, Latin Capital Letter T)
  • h (U+0068, Latin Small Letter H)
  • a (U+0061, Latin Small Letter A)
  • ̀ (U+0300, Combining Grave Accent)
  • n (U+006E, Latin Small Letter N)
  • h (U+0068, Latin Small Letter H)
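The two code point listings above can be produced for any string with a few lines of Python. In this sketch the helper name `describe` is hypothetical, and the two strings are written with escapes so that each keeps its own form.

```python
import unicodedata

def describe(text):
    """Return one 'U+XXXX NAME' entry per code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

former = "Th\u00e0nh"    # precomposed à
latter = "Tha\u0300nh"   # a followed by a combining grave accent

for label, text in (("former", former), ("latter", latter)):
    print(label, len(text))
    for entry in describe(text):
        print("  ", entry)
```

The strings render identically but compare unequal, which is why a plain text search for one form does not match the other.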

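What you see in ANSI is a case of mojibake: the UTF-8 bytes are reinterpreted as a single-byte ANSI code page (cp1252 in this example, written in Python for its universal intelligibility):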

print('Thành\nThành'.encode('utf-8').decode('cp1252'))
Thành
Thành
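To make the two lines the same, one option is to normalize the text to a single form before saving or searching. The sketch below assumes NFC (the precomposed form) is the target and that cp1252 stands in for the ANSI code page; it shows the differing UTF-8 bytes behind each line and that they become identical after normalization.

```python
import unicodedata

former = "Th\u00e0nh"    # precomposed à
latter = "Tha\u0300nh"   # a + combining grave accent

# The UTF-8 bytes differ, so reading them as cp1252 gives different mojibake.
for text in (former, latter):
    utf8_bytes = text.encode("utf-8")
    print(utf8_bytes.hex(" "), "->", utf8_bytes.decode("cp1252"))

# After NFC normalization both strings are the same code points and bytes.
fixed = unicodedata.normalize("NFC", latter)
print(fixed == former)
```

Normalizing to NFD would also make them equal; what matters is that both strings use the same form.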