I am developing a Java application where I get a value of type char[] from an external C++ dll.
There are cases in which non-ASCII values are expected to be input.
In such cases, everything works when I construct a String directly from a byte[] that was converted from the hex-string representation of the input value.
On the other hand, I get a wrong result when I construct a String from a char array built in a for-loop that casts each byte to char, one by one.
In the example below, a char[] variable is obtained from the aforementioned dll where the input is a string with the value "çap" but comes with a hex-string value of C3A76170.
// the StringUtil.toByteArray function converts hex-string to a byte array
byte[] byteArray = StringUtil.toByteArray("C3A76170");
The example below yields the expected result:
String s1 = new String(byteArray);
// print
System.out.println(s1);
çap
The example below does not yield the expected result:
char[] chars = new char[byteArray.length];
for (int i = 0; i < chars.length; i++) {
chars[i] = (char) byteArray[i];
}
String s2 = new String(chars);
// print
System.out.println(s2);
ᅢᄃap
In the second example, the output is "ᅢᄃap" (where the character "ç" is apparently misinterpreted as two different characters).
What can cause this discrepancy between outputs? What is the reasoning behind this behavior?
C and C++ use the char type to represent a single byte. However, byte and char are not the same thing in Java. Unicode has over 100,000 codepoints, so obviously a single byte is not capable of representing all characters. There is no choice other than using multiple bytes to represent some characters.
The exact method for using multiple bytes to represent a single character is known as a Charset, also known as a character encoding (or sometimes just “encoding”).
The most popular charset is UTF-8, because it is a compact representation of Latin languages and because it is compatible with ASCII. Your C++ library returned "çap" as a UTF-8 byte sequence.
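To check that claim for yourself, here is a minimal sketch that encodes "çap" as UTF-8 and prints the resulting bytes as hex. The class name Utf8BytesDemo and the toHex helper are made up for illustration; they are not part of your StringUtil.

```java
import java.nio.charset.StandardCharsets;

class Utf8BytesDemo {
    // Hypothetical helper: formats each byte as two uppercase hex digits.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // \u00e7 is "ç"; the escape keeps this source file ASCII-only.
        byte[] utf8 = "\u00e7ap".getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length);  // 4 bytes for 3 characters
        System.out.println(toHex(utf8));  // C3A76170
    }
}
```

Three characters become four bytes because "ç" needs two of them, which matches the hex string in the question.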
When your code does new String(byteArray), it is using a Charset to translate the bytes to characters. In Java 18 and later, that Charset is UTF-8 by default. (Older versions of Java will use the system’s default charset, which is typically UTF-8 on systems other than Windows.)
When your code does (char) byteArray[i], it is forcing each byte to act as its own character, ignoring the possibility of multi-byte sequences. ç is represented in UTF-8 as the two bytes 0xc3 0xa7. The two bytes are not separate characters; together they represent a single char. Because byte is signed in Java, the cast also sign-extends each of those bytes: 0xc3 becomes the char U+FFC3 and 0xa7 becomes U+FFA7, which are halfwidth Hangul letters, and that is why "ᅢᄃ" shows up in your output.
It is almost never correct to assume one byte is equivalent to one character.
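As a rough sketch of the difference, the code below decodes the same four bytes both ways and prints lengths and code units in hex, so the result does not depend on your console's encoding. The class and method names are invented for this example. Passing StandardCharsets.UTF_8 explicitly is the safer habit, since it does not rely on whatever the default charset happens to be.

```java
import java.nio.charset.StandardCharsets;

class DecodeDemo {
    // Decodes the bytes as UTF-8, so multi-byte sequences become one character.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Reproduces the loop from the question: one char per byte.
    static String castEachByte(byte[] bytes) {
        char[] chars = new char[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            chars[i] = (char) bytes[i]; // negative bytes are sign-extended
        }
        return new String(chars);
    }

    public static void main(String[] args) {
        byte[] bytes = {(byte) 0xC3, (byte) 0xA7, 0x61, 0x70};

        String decoded = decodeUtf8(bytes);
        System.out.println(decoded.length());                        // 3
        System.out.println(Integer.toHexString(decoded.charAt(0)));  // e7

        String cast = castEachByte(bytes);
        System.out.println(cast.length());                           // 4
        System.out.println(Integer.toHexString(cast.charAt(0)));     // ffc3
        System.out.println(Integer.toHexString(cast.charAt(1)));     // ffa7
    }
}
```

The decoded string has three characters starting with U+00E7 (ç), while the cast version has four, starting with U+FFC3 and U+FFA7.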
(Also, feel free to read the obligatory Joel blog on the subject.)