Is this format: "U+043E;U+006F,U+004D" some sort of encoding standard and does java offer a standard library method to convert it to char?

151 Views Asked by DraxDomax At 01 February 2021 at 00:32

I am investigating some mess that has been done to our languages-support (it is used in our IDN functionality, if that rings a bell)...

I used an SQL GUI client to quickly see the structure of our language definitions. When I run select charcodes from ourCharCodesTable where language = 'myLanguage';, I get results like these for some values of 'myLanguage':

myLanguage = "ASCII":
result = "-0123456789abcdefghijklmnopqrstuvwxyz"

myLanguage = "Russian":
result = "-0123456789абвгдежзийклмнопрстуфхцчшщъьюяѐѝ"
(BTW: can already see a language mistake here, if you are a polyglot like me!)

I thought: "OK, I can work with this! Let's write a Java program and put some logic to find mistakes..."
I need my logic to receive one char at a time from the 'result' and, according to the current table context, apply my logic to flag if it should or should not be there...

However! When I am at:
myLanguage = "Belarusian" :
One would think this language is rather similar to Russian, but the very format of the result, as coming from the database is totally different: result = "U+002D\nU+0030\nU+0030..." !

And, there's another format! myLanguage = "Chinese" :
result = "#\nU+002D;U+002D;U+003D,U+004D,U+002D\nU+0030;U+0030;U+0030"

FWIW: charcodes column is of CLOB type.

I know U+002D is '-' and U+0030 is '0'...

My current idea is to:
1] Check if the entire response is in 'щ' format or 'U+0449' format (whether the 'U+****'s are separated with ';', ',' or '\n' - I am just going to treat them as standalone chars)
a. If it is the "easy one", just send the char on to my testing method
b. If it is the "hard one", get the hex part (0449), convert to decimal (1097) and cast to char (щ)
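
The plan above could be sketched roughly as follows. This is only an illustration: the class name CharCodeReader and the method codePoints are made up for this example, and it assumes a value is in the "hard" format whenever it contains at least one U+XXXX token. It keeps int code points rather than casting to char, because a cast to char only works for values up to U+FFFF.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class CharCodeReader {
    // "U+" followed by 4 to 6 hex digits; the digits are captured in group 1.
    private static final Pattern CODE_POINT = Pattern.compile("U\\+([0-9A-Fa-f]{4,6})");

    /** Returns the code points of a charcodes value in either of the two formats. */
    static List<Integer> codePoints(String charcodes) {
        List<Integer> result = new ArrayList<>();
        Matcher m = CODE_POINT.matcher(charcodes);
        if (m.find()) {
            // "Hard" format (assumption: any U+XXXX token means the whole value uses it).
            // Separators such as ';', ',' and '\n' are simply skipped.
            do {
                result.add(Integer.parseInt(m.group(1), 16));
            } while (m.find());
        } else {
            // "Easy" format: the value already holds the literal characters.
            charcodes.codePoints().boxed().forEach(result::add);
        }
        return result;
    }

    public static void main(String[] args) {
        for (int cp : codePoints("U+002D\nU+0030;U+0449")) {
            // Character.toChars also handles code points above U+FFFF.
            System.out.println(new String(Character.toChars(cp)));
        }
    }
}
```

Each returned int can then be passed to the checking logic one at a time, regardless of which format the row was stored in.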

So, again, my questions are:

What is this "U+043E;U+006F,U+004D" format?
If it is a widely-used standard, does Java offer any methods to convert a whole String of these into a char array?

Original Q&A

There are 3 best solutions below

**Andreas** · Answer 1 · 2021-02-01T03:52:10.423000

UPDATED

What is this "U+043E;U+006F,U+004D" format?

In a comment, OP provided a link to https://www.iana.org/domains/idn-tables/tables/academy_zh_1.0.txt, which has the following text:

This table conforms to the format specified in RFC 3743.

RFC 3743 can be found at https://www.rfc-editor.org/rfc/rfc3743
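
If the structure of each line matters (and not just the individual characters), a rough sketch like the one below could split a table line into its ';'-separated fields and ','-separated code points. It is based only on the sample value in the question and is not a full RFC 3743 parser: the class and method names are invented here, treating lines that start with '#' as comments is an assumption, and the meaning of each field should be checked against the RFC.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only.
class VariantTableSketch {
    /** Splits one line into fields (';') and each field into code points (','). */
    static List<List<Integer>> parseLine(String line) {
        List<List<Integer>> fields = new ArrayList<>();
        for (String field : line.split(";")) {
            List<Integer> codePoints = new ArrayList<>();
            for (String token : field.split(",")) {
                String t = token.trim();
                if (t.startsWith("U+")) {
                    // Hex digits after "U+" give the code point value.
                    codePoints.add(Integer.parseInt(t.substring(2), 16));
                }
            }
            fields.add(codePoints);
        }
        return fields;
    }

    public static void main(String[] args) {
        String table = "#\nU+002D;U+002D;U+003D,U+004D,U+002D\nU+0030;U+0030;U+0030";
        for (String line : table.split("\n")) {
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // assumed to be a comment or blank line
            }
            System.out.println(parseLine(line));
        }
    }
}
```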

If it is a widely-used standard, does Java offer any methods to convert a whole String of these into a char array?

It is not a widely-used standard, so Java does not offer a conversion natively, but it is easy to convert the text to a regular String using regex, so you can then process the string normally.

// All variants need: import java.util.regex.Matcher; import java.util.regex.Pattern;
// Matcher.quoteReplacement stops a decoded '$' or '\' (U+0024, U+005C) from being
// treated as a group reference or escape in the replacement string.

// Java 11+
static String decodeUnicode(String input) {
    return Pattern.compile("U\\+[0-9A-F]{4,6}").matcher(input).replaceAll(mr ->
            Matcher.quoteReplacement(
                    Character.toString(Integer.parseInt(mr.group().substring(2), 16))));
}

// Java 9+
static String decodeUnicode(String input) {
    return Pattern.compile("U\\+[0-9A-F]{4,6}").matcher(input).replaceAll(mr ->
            Matcher.quoteReplacement(
                    new String(new int[] { Integer.parseInt(mr.group().substring(2), 16) }, 0, 1)));
}

// Java 1.5+
static String decodeUnicode(String input) {
    StringBuffer buf = new StringBuffer();
    Matcher m = Pattern.compile("U\\+[0-9A-F]{4,6}").matcher(input);
    while (m.find()) {
        String hexString = m.group().substring(2);
        int codePoint = Integer.parseInt(hexString, 16);
        String unicodeCharacter = new String(new int[] { codePoint }, 0, 1);
        m.appendReplacement(buf, Matcher.quoteReplacement(unicodeCharacter));
    }
    return m.appendTail(buf).toString();
}

Test

System.out.println(decodeUnicode("#\nU+002D;U+002D;U+003D,U+004D,U+002D\nU+0030;U+0030;U+0030"));

Output

#
-;-;=,M,-
0;0;0

**Joachim Sauer** · Answer 2 · 2021-02-01T11:40:59.230000

U+0000 is a representation of a Unicode code point, and the format is defined in Appendix A of the Unicode Standard. The numbers are simply the hex-encoded value of the represented code point. For historical reasons they are always left-padded with 0 to at least 4 digits, but can be up to 6 digits long.

It is not primarily meant as a machine-readable encoding, but rather as a human-readable representation of Unicode code points for use in running text (i.e. paragraphs such as this one). Note especially that this format has no way to distinguish a four-digit number followed by some digits from a 5- or 6-digit number. So U+123456 could be interpreted in 3 different ways: U+1234 followed by the text 56, U+12345 followed by the text 6, or U+123456. This makes it unsuited for automatic replacement and use as a general-purpose encoding.
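
The ambiguity is easy to see with a regex. In this small sketch (the class AmbiguityDemo and its helper firstMatch are invented for illustration), the same text yields a different "code point" depending only on whether the quantifier is greedy or reluctant, and the greedy reading is not even a valid code point:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class.
class AmbiguityDemo {
    /** Returns the first match of regex in text, or null if there is none. */
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "U+123456";
        // Greedy quantifier takes as many digits as allowed: prints U+123456
        System.out.println(firstMatch("U\\+[0-9A-F]{4,6}", text));
        // Reluctant quantifier stops at the minimum: prints U+1234
        System.out.println(firstMatch("U\\+[0-9A-F]{4,6}?", text));
        // 0x123456 exceeds Character.MAX_CODE_POINT (0x10FFFF): prints true
        System.out.println(0x123456 > Character.MAX_CODE_POINT);
    }
}
```

When the tokens are always delimited, as they are in the question's data (';', ',' or a newline after each one), this problem does not arise in practice.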

As such there is no built-in functionality to parse this into its equivalent String or similar in Java.

The following code can be used to parse a single Unicode codepoint reference into the appropriate codepoint in a String:

  public static String codePointToString(String input) {
    if (!input.startsWith("U+")) {
      throw new IllegalArgumentException("Malformed input, doesn't start with U+");
    }
    int codepoint = Integer.parseInt(input.substring(2), 16);
    if (codepoint < 0 || codepoint > Character.MAX_CODE_POINT) {
      throw new IllegalArgumentException("Malformed input, codepoint value out of valid range: " + codepoint);
    }
    return Character.toString(codepoint);
  }

(Before Java 11 the return line needs to use new String(new int[] { codepoint }, 0, 1) instead).

And if you want to replace all Unicode codepoints represented in a text by their actual text (which might render it unreadable in some cases) you can use this (together with the method above):

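  // Hex digits only: a non-hex letter such as 'Z' would make Integer.parseInt throw.
  private static final Pattern PATTERN = Pattern.compile("U\\+[0-9A-Fa-f]{4,6}");
  
  public static String decodeCodePoints(String input) {
    // quoteReplacement keeps a decoded '$' or '\' from being read as replacement syntax.
    return PATTERN
        .matcher(input)
        .replaceAll(result -> java.util.regex.Matcher.quoteReplacement(codePointToString(result.group())));
  }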

**Michael Gantman** · Answer 3 · 2021-02-01T12:10:08.487000

Actually, I wrote an open-source library called MgntUtils that has a utility that may help you. The codes that you see are Unicode sequences where each U+XXXX represents a character. The utility in the library can convert any string in any language (including special characters) into Unicode sequences and vice versa. Here is a sample of how it works:

String result = "Hello World";
result = StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence(result);
System.out.println(result);
result = StringUnicodeEncoderDecoder.decodeUnicodeSequenceToString(result);
System.out.println(result);

The output of this code is:

\u0048\u0065\u006c\u006c\u006f\u0020\u0057\u006f\u0072\u006c\u0064
Hello World

The library can be found on Maven Central or on GitHub. It comes as a Maven artifact, with sources and Javadoc.

Here is the Javadoc for the class StringUnicodeEncoderDecoder
