I have a small piece of code in which I am checking the code point of the character Ü.
Locale lc = Locale.getDefault();
System.out.println(lc.toString());
System.out.println(Charset.defaultCharset());
System.out.println(System.getProperty("file.encoding"));
String inUnicode = "\u00dc";
String glyph = "Ü";
System.out.println("inUnicode " + inUnicode + " code point " + inUnicode.codePointAt(0));
System.out.println("glyph " + glyph + " code point " + glyph.codePointAt(0));
I get a different code point value when I run this code on macOS and on Windows 10; see the output below.
Output on MacOS
en_US
UTF-8
UTF-8
inUnicode Ü code point 220
glyph Ü code point 220
Output on Windows
en_US
windows-1252
Cp1252
inUnicode Ü code point 220
glyph ?? code point 195
I checked the code page for windows-1252 at https://en.wikipedia.org/wiki/Windows-1252#Character_set, where the code point for Ü is 220.
For String glyph = "Ü"; why do I get a code point of 195 on Windows? As I understand it, glyph should have been rendered properly and the code point should have been 220, since Ü is defined in Windows-1252.
If I replace String glyph = "Ü"; with String glyph = new String("Ü".getBytes(), Charset.forName("UTF-8")); then glyph is rendered correctly and codepoint value is 220.
Is this the correct and efficient way to standardize behavior of String on any OS irrespective of locale and charset?
195 is 0xC3 in hex.
In UTF-8, Ü is encoded as bytes 0xC3 0x9C.
System.getProperty("file.encoding") says the default file encoding on Windows is not UTF-8, but clearly your Java file is actually encoded in UTF-8. The fact that println() is outputting glyph ?? (note 2 ?s, meaning 2 chars are present), and that you are able to decode the raw string bytes using the UTF-8 Charset, proves this.
glyph should have a single char whose value is 0x00DC, not 2 chars whose values are 0x00C3 0x0153.
codePointAt(0) is returning 0x00C3 (195) on Windows because your Java file is encoded in UTF-8 but is being loaded as if it were encoded in Windows-1252 instead, so the 2 bytes 0xC3 0x9C get decoded as the characters 0x00C3 (Ã) and 0x0153 (œ, which is what Windows-1252 maps byte 0x9C to) instead of as the single character 0x00DC.
You need to specify the actual file encoding when running Java, e.g.:
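The mis-decoding described above can be reproduced without touching any compiler settings. The following is a minimal sketch, not the asker's actual build: the class name EncodingDemo and the helpers misdecode and repair are hypothetical, and it assumes the Java runtime provides the windows-1252 charset. It writes Ü as the escape \u00dc so that the demo itself does not depend on how the source file is decoded.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    // Assumption: the runtime ships the windows-1252 charset.
    private static final Charset CP1252 = Charset.forName("windows-1252");

    // Simulates a UTF-8 source file being read as windows-1252:
    // encode the text to UTF-8 bytes, then decode those bytes with the wrong charset.
    public static String misdecode(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, CP1252);
    }

    // Mirrors the workaround from the question, but with explicit charsets:
    // recover the original bytes via windows-1252, then decode them as UTF-8.
    public static String repair(String garbled) {
        return new String(garbled.getBytes(CP1252), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String garbled = misdecode("\u00dc");
        System.out.println(garbled.length());        // 2 chars instead of 1
        System.out.println(garbled.codePointAt(0));  // 195 (0xC3)
        System.out.println(garbled.codePointAt(1));  // 339 (0x153)
        System.out.println(repair(garbled).codePointAt(0)); // 220 (0xDC)
    }
}
```

The round trip in repair only works because every byte involved happens to be mapped in windows-1252, so it is better treated as a diagnostic than as a fix. If the source is compiled with javac directly, its -encoding option (for example -encoding UTF-8) tells the compiler how to decode the file, which removes the need for any repair step.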