I am investigating a mess that was made of our language support (it is used in our IDN functionality, if that rings a bell)...
I used an SQL GUI client to quickly see the structure of our language definitions. When I run select charcodes from ourCharCodesTable where language = 'myLanguage';, I get results for some values of 'myLanguage', e.g.:
myLanguage = "ASCII":
result = "-0123456789abcdefghijklmnopqrstuvwxyz"
myLanguage = "Russian":
result = "-0123456789абвгдежзийклмнопрстуфхцчшщъьюяѐѝ"
(BTW: you can already see a language mistake here, if you are a polyglot like me!)
I thought: "OK, I can work with this! Let's write a Java program and put in some logic to find mistakes..."
I need it to receive one char at a time from the 'result' and, according to the current table context, flag whether that char should or should not be there...
However! When I am at:
myLanguage = "Belarusian" :
One would think this language is rather similar to Russian, but the very format of the result coming from the database is totally different: result = "U+002D\nU+0030\nU+0030..."!
And, there's another format!
myLanguage = "Chinese" :
result = "#\nU+002D;U+002D;U+003D,U+004D,U+002D\nU+0030;U+0030;U+0030"
FWIW: charcodes column is of CLOB type.
I know U+002D is '-' and U+0030 is '0'...
My current idea is to:
1] Check if the entire response is in 'щ' format or 'U+0449' format (whether the 'U+****'s are separated with ';', ',' or '\n' - I am just going to treat them as standalone chars)
a. If it is the "easy one", just send the char on to my testing method
b. If it is the "hard one", get the hex part (0449), convert to decimal (1097) and cast to char (щ)
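The plan above could be sketched roughly as follows. This is only a minimal sketch: the class name CharCodeParser and both method names are made up for illustration. It assumes that a value holding literal characters never contains a "U+XXXX" token, and that anything that is not a "U+XXXX" token (such as the leading '#') can be skipped. It returns int code points rather than char, because a plain cast to char only works for values up to U+FFFF.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of any library.
class CharCodeParser {

    // True if the value contains at least one "U+XXXX" token (the "hard" format).
    static boolean isCodePointFormat(String value) {
        return value.matches("(?s).*U\\+[0-9A-Fa-f]{4,6}.*");
    }

    // Returns the code points of a charcodes value in either format.
    static List<Integer> toCodePoints(String value) {
        List<Integer> result = new ArrayList<>();
        if (!isCodePointFormat(value)) {
            // "Easy" format: every character in the string stands for itself.
            value.codePoints().forEach(result::add);
            return result;
        }
        // "Hard" format: split on ';', ',' and whitespace (including '\n').
        for (String token : value.split("[;,\\s]+")) {
            if (token.startsWith("U+")) {
                // Hex part -> int, e.g. "0449" -> 1097.
                result.add(Integer.parseInt(token.substring(2), 16));
            }
            // Other tokens (e.g. "#") are ignored in this sketch.
        }
        return result;
    }

    public static void main(String[] args) {
        for (int cp : toCodePoints("U+002D\nU+0030\nU+0449")) {
            // Character.toChars also handles code points above U+FFFF.
            System.out.println(new String(Character.toChars(cp)));
        }
    }
}
```

Each returned code point can then be passed to the testing method one at a time, regardless of which format the row was stored in.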
So, again, my questions are:
- What is this "U+043E;U+006F,U+004D" format?
- If it is a widely-used standard, does Java offer any methods to convert a whole String of these into a char array?
UPDATED
In a comment, OP provided a link to https://www.iana.org/domains/idn-tables/tables/academy_zh_1.0.txt, which has the following text:
RFC 3743 can be found at https://www.rfc-editor.org/rfc/rfc3743
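The table layout is the language variant table format described in RFC 3743, not a widely used general-purpose standard, so Java does not offer a parser for it natively. It is easy, however, to convert the U+XXXX tokens to a regular String using a regex, so you can then process the string normally.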
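A minimal sketch of such a regex conversion might look like the code below. The class name CodePointDecoder and the method name decode are assumptions made for illustration, and the pattern assumes every code point is written as "U+" followed by 4 to 6 hex digits. Separators (';', ',', '\n') and any other text are left untouched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
class CodePointDecoder {

    // Assumed token shape: "U+" followed by 4 to 6 hex digits.
    private static final Pattern CODE_POINT =
            Pattern.compile("U\\+([0-9A-Fa-f]{4,6})");

    // Replaces every "U+XXXX" token with the character it denotes.
    static String decode(String input) {
        Matcher m = CODE_POINT.matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            int codePoint = Integer.parseInt(m.group(1), 16);
            // toChars handles supplementary characters (above U+FFFF);
            // it throws IllegalArgumentException for values above U+10FFFF.
            String replacement = new String(Character.toChars(codePoint));
            // quoteReplacement guards against '$' and '\' in the replacement.
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints: о;o,M  (Cyrillic о, Latin o, Latin M)
        System.out.println(decode("U+043E;U+006F,U+004D"));
    }
}
```

After decoding, the separators can be stripped or used for splitting, depending on whether the variant columns matter for the check.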