Java -- How to unescape unicode private-use characters?


I have a program that reads a list of escaped Unicode strings (u/XXXX) and converts each one into the character it encodes, writing the result to both the terminal and a text file.

I'm using org.apache.commons.text.StringEscapeUtils.unescapeJava(String), from the Apache Commons Text library, to unescape the escaped Unicode code points.

I'm referring to these Unicode entries to get my private-use characters: https://jrgraphix.net/r/Unicode/E000-F8FF (I prepend u/ to the hex digits shown there.)

Here's an example of what the output should look like: If you paste that into a Ctrl+F box on the website above, you'll see that it points to E022.

Now, here is my question, and by extension the problem I am having:

It's not working. For some reason, the program doesn't output the character itself; it just outputs a generic question mark that does not represent the private-use character in question. If someone can help me with this, it'd be much appreciated.

So far, I have had no luck.

Answer by Basil Bourque

tl;dr

  • Use correct Java syntax in your input string for a Unicode hexadecimal: \uXXXX
  • If you have no font providing a glyph for that code point, your OS indicates the lack by displaying an empty box, a question mark, or some other fallback replacement.

To get an officially sanctioned Red Heart:

org.apache.commons.text.StringEscapeUtils.unescapeJava( "\\" + "u2764" + "\\" + "uFE0F" )  // Simulating some textual input of Java-syntax escaped Unicode code point numbers in hexadecimal.

❤️

Example code

You did not show your exact code. But your Question mentions u/XXXX which is incorrect. Correct syntax in Java for a Unicode hexadecimal is \uXXXX.
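If the input file really does contain the u/XXXX form mentioned in the Question, one option is to rewrite it into the backslash form before unescaping. The following is only a minimal sketch using plain JDK calls: the class EscapeNormalizer and its methods are hypothetical, the decode method is a simplified stand-in for unescapeJava rather than its actual implementation, and the pattern assumes every escape has exactly four hex digits.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Apache Commons Text.
class EscapeNormalizer {

    // The form used in the Question: 'u', a slash, then four hex digits.
    private static final Pattern SLASH_FORM = Pattern.compile( "u/([0-9A-Fa-f]{4})" );

    // The Java-syntax form: a backslash, 'u', then four hex digits.
    private static final Pattern JAVA_FORM = Pattern.compile( "\\\\u([0-9A-Fa-f]{4})" );

    // Rewrites every "u/E022"-style token as a backslash followed by "uE022".
    static String toJavaEscape( String raw ) {
        return SLASH_FORM.matcher( raw ).replaceAll( "\\\\u$1" );
    }

    // Decodes Java-syntax escapes into the UTF-16 code units they name.
    static String decode( String escaped ) {
        Matcher matcher = JAVA_FORM.matcher( escaped );
        StringBuffer result = new StringBuffer();
        while ( matcher.find() ) {
            char unit = (char) Integer.parseInt( matcher.group( 1 ) , 16 );
            matcher.appendReplacement( result , Matcher.quoteReplacement( String.valueOf( unit ) ) );
        }
        matcher.appendTail( result );
        return result.toString();
    }

    public static void main( String[] args ) {
        String fixed = toJavaEscape( "u/E022" );
        String output = decode( fixed );
        System.out.println( "fixed = " + fixed );
        System.out.println( "codePoint hex = " + Integer.toHexString( output.codePointAt( 0 ) ) );
    }
}
```

In a real program, the string returned by toJavaEscape could be passed to StringEscapeUtils.unescapeJava instead of the decode method shown here.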

You can verify your hexadecimal literal by asking for the code point, as shown below.

Here is some example code.

System.out.println( "Demo of Private Use Area" );

String input = "\\" + "uE022";
String output = org.apache.commons.text.StringEscapeUtils.unescapeJava( input );
int codePoint = output.codePointAt( 0 );
String name = Character.getName( codePoint );

Dump to console.

System.out.println( "input = " + input );
System.out.println( "output = " + output );
System.out.println( "codePoint = " + codePoint + " (we expect 57378 for \\uE022)." );
System.out.println( "Name = " + name );

When run:

Demo of Private Use Area
input = \uE022
output = 
codePoint = 57378 (we expect 57378 for \uE022).
Name = PRIVATE USE AREA E022
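The code point value confirms that the unescaping worked, whatever glyph the console happens to draw. Because the Question also writes the result to a text file, a sketch like the following may help confirm that the character survives the trip to disk. It assumes UTF-8 is the intended file encoding; the class name, method name, and temporary file are illustrative only.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative sketch; the class and method names are hypothetical.
class PuaFileRoundTrip {

    // Writes the text as UTF-8 to a temporary file, then returns the file's raw bytes as hex.
    static String writeAndDumpHex( String text ) {
        try {
            Path file = Files.createTempFile( "pua-demo" , ".txt" );
            try {
                Files.write( file , text.getBytes( StandardCharsets.UTF_8 ) );
                StringBuilder hex = new StringBuilder();
                for ( byte b : Files.readAllBytes( file ) ) {
                    hex.append( String.format( "%02X " , b & 0xFF ) );
                }
                return hex.toString().trim();
            } finally {
                Files.deleteIfExists( file );
            }
        } catch ( IOException e ) {
            throw new UncheckedIOException( e );
        }
    }

    public static void main( String[] args ) {
        // Build the private-use character directly from its code point number.
        String output = new String( Character.toChars( 0xE022 ) );
        System.out.println( "bytes = " + writeAndDumpHex( output ) );
    }
}
```

In UTF-8, the code point E022 is encoded as the three bytes EE 80 A2. If the file holds those bytes but a viewer still shows a box or question mark, the data is intact and the remaining issue is the font. A single 3F byte (a literal question mark) would instead suggest the character was lost when encoding to a charset that cannot represent it.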

Red heart emoji

If you really want a red heart, Unicode does define an emoji.

But accessing this emoji requires two code points. Unicode 1.1 in 1993 defined “Heavy Black Heart” at decimal 10,084 (U+2764). Later versions of Unicode added Emoji 1.0 definitions in 2015, defining Red Heart as HEAVY BLACK HEART followed by VARIATION SELECTOR-16 at decimal 65,039 (U+FE0F).

See red heart row of Full Emoji List at the Unicode Consortium web site. However, that row appears to me to be incorrect in that it fails to mention the required U+FE0F code point.

// HEAVY BLACK HEART + VARIATION SELECTOR-16 = Red Heart.
String input = "\\" + "u2764" + "\\" + "uFE0F";
String output = org.apache.commons.text.StringEscapeUtils.unescapeJava( input );

❤️

Full example code:

System.out.println( "Demo of Red Heart" );

// HEAVY BLACK HEART + VARIATION SELECTOR-16 = Red Heart.
String input = "\\" + "u2764" + "\\" + "uFE0F";
String output = org.apache.commons.text.StringEscapeUtils.unescapeJava( input );

System.out.println( "input = " + input );
System.out.println( "output = " + output );

output.codePoints().forEachOrdered( ( int codePoint ) -> {
    String message =
            "Code point decimal " + codePoint
                    + " = hex " + Integer.toHexString( codePoint )
                    + " = name " + Character.getName( codePoint );
    System.out.println( message );
} );

When run:

Demo of Red Heart
input = \u2764\uFE0F
output = ❤️
Code point decimal 10084 = hex 2764 = name HEAVY BLACK HEART
Code point decimal 65039 = hex fe0f = name VARIATION SELECTOR-16

A PUA has no officially assigned characters

By definition, a Private Use Area (PUA) has no characters assigned by the Unicode Consortium. All the code point numbers in that range are promised by the Unicode Consortium to never be officially assigned any character.

This leaves all of us free to create a font that assigns any glyph we want to any of those code points.
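Whether a given code point falls within a Private Use Area can be checked in Java without any font at all. Here is a minimal sketch, assuming the JDK's built-in Unicode data is sufficient for your purposes; the class PrivateUseCheck and its method are hypothetical names. The samples cover the Basic Multilingual Plane range (U+E000 to U+F8FF) and the two supplementary private-use planes.

```java
// Illustrative sketch; the class and method names are hypothetical.
class PrivateUseCheck {

    // True when the JDK's Unicode data classifies the code point as private use.
    static boolean isPrivateUse( int codePoint ) {
        return Character.getType( codePoint ) == Character.PRIVATE_USE;
    }

    public static void main( String[] args ) {
        // E022 from the Question, the heart at 2764, and one code point from each supplementary PUA.
        int[] samples = { 0xE022 , 0x2764 , 0xF0000 , 0x10FFFD };
        for ( int codePoint : samples ) {
            System.out.println(
                    "U+" + Integer.toHexString( codePoint ).toUpperCase()
                            + " private use? " + isPrivateUse( codePoint )
                            + " , block = " + Character.UnicodeBlock.of( codePoint ) );
        }
    }
}
```

A check like this tells you only that the code point is reserved for private use. It says nothing about whether any installed font supplies a glyph for it.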

You may want to create a font with a red heart cartoon at code point E022. Meanwhile, I may choose to make a font that has a drawing of a cockatiel at that same code point. And some guy named Bob creates his own font with a picture of a Microlino car at E022. All three of us (you, me, and Bob) are happy knowing that our custom fonts will never be stomped on by a future officially sanctioned character at that code point.

If Alice likes your red heart, and wants to use it, she needs to obtain a copy of your font. She needs to install that font on her computer. And she needs to either:

  • Ensure that no other font provides a glyph at code point E022, or
  • Use an app that allows her to specify the use of your font rather than any other font that may also coincidentally provide a glyph at E022.

If Alice has installed no fonts at all with a glyph at E022, then the operating system of her computer will fall back to displaying some kind of substitute glyph such as an empty box or question mark or nothing to indicate the lack of a glyph.

The three PUAs defined in Unicode have turned out to be rather popular. People use them to create fonts for characters that do not meet the requirements of the Unicode Consortium and so will not be considered for future inclusion in Unicode. Examples include fictional languages such as Klingon in Star Trek or the elves’ languages from novels.

This popularity has prompted volunteers outside the Unicode Consortium to devise a public registry of the PUA code points, in an attempt to avoid conflicts among various fonts over particular code points.