Is an ASCII-only Unicode string always normalized?

136 Views Asked by Netch At 26 August 2022 at 11:57

Imagine a string consisting of the single ASCII character i (U+0069). Turkish and related writing systems also have ı (U+0131). Can Unicode normalization split U+0069 (i) into U+0131 U+0307 (ı̇)? Is normalization locale-dependent, so that the result might vary with the environment?

IMSoP On 26 August 2022 at 12:27 BEST ANSWER

The normalization forms defined by Unicode are not locale-specific; they have no input other than the sequence of code points to be normalized.

The Unicode website has a user-friendly chart of all characters which differ between the standardized normalization forms.

Unfortunately, it is grouped by script, not by block, so we can't quickly check all the characters in the "Basic Latin" block (which matches the 128 characters of ASCII).

Searching for "0069" specifically, we see that it appears as the result of normalising certain code points - either as part of a "decomposition" in NFD, or as a compatibility replacement in forms NFKC and NFKD. However, it doesn't appear in the input column, because it doesn't change when converted to any of the normalization forms.
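To make that concrete, here is a small Python sketch using the standard-library unicodedata module. The two sample code points (U+00EC and U+2170) are my own picks for illustration rather than anything taken from the chart, and the results depend on the Unicode data bundled with your Python build, although these particular mappings are long-standing.

```python
import unicodedata


def code_points(s):
    """Format a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


# U+00EC (i with grave) canonically decomposes to i + combining grave accent,
# so U+0069 shows up in the NFD *output*.
nfd_i_grave = unicodedata.normalize("NFD", "\u00EC")
print(code_points(nfd_i_grave))  # U+0069 U+0300

# U+2170 (small roman numeral one) has a compatibility mapping to plain i,
# so U+0069 shows up in the NFKC *output* as well.
nfkc_roman_one = unicodedata.normalize("NFKC", "\u2170")
print(code_points(nfkc_roman_one))  # U+0069

# U+0069 as *input* comes back unchanged from every form.
i_results = {
    form: unicodedata.normalize(form, "i")
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}
print(i_results)
```

Note that normalize() takes only the form name and the string; there is no locale parameter to pass.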

I have not checked the other Basic Latin characters, but would be extremely surprised if any of them normalize to anything other than themselves. So to answer your original question: yes, I believe a string that only uses code points U+0000 to U+007F (the code points inherited from the 7-bit ASCII standard) will not change in any of the normalization forms defined by Unicode.
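If you would rather check than take my word for it, a brute-force test is short. This is only a sketch: the helper name changed_ascii is made up for this example, and it tests whatever Unicode version your Python's unicodedata module ships with, not the standard itself.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def changed_ascii():
    """Return (code point, form) pairs where a single ASCII character
    is altered by normalization. Hypothetical helper for illustration."""
    return [
        (cp, form)
        for cp in range(0x80)
        for form in FORMS
        if unicodedata.normalize(form, chr(cp)) != chr(cp)
    ]


changed = changed_ascii()
print(changed)  # an empty list means no ASCII character changed

# Per-character checks do not cover interactions between neighbours,
# so also normalize one string containing all 128 code points in order.
all_ascii = "".join(chr(cp) for cp in range(0x80))
whole_string_stable = all(
    unicodedata.normalize(form, all_ascii) == all_ascii for form in FORMS
)
print(whole_string_stable)
```

On the Python versions I am aware of, the list is empty and the whole-string check passes, which matches the expectation above.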
