Description of the problem:
Apache commons-text version 1.10.0 throws an IllegalArgumentException when it encounters invalid input. Is this the expected behaviour? If so, the documentation doesn't mention it.
By contrast, org.apache.commons.lang.StringEscapeUtils.unescapeHtml returns a result, as described in the documentation.
String test = "test �";
org.apache.commons.text.StringEscapeUtils.unescapeHtml4(test);
throws java.lang.IllegalArgumentException
at java.lang.Character.toChars(Character.java:5172)
at org.apache.commons.text.translate.NumericEntityUnescaper.translate(NumericEntityUnescaper.java:146)
The Javadoc for StringEscapeUtils states that an unrecognised entity is left alone:
public static final String unescapeHtml4(String input)
Unescapes a string containing entity escapes to a string containing the actual Unicode characters corresponding to the escapes. Supports HTML 4.0 entities.
For example, the string "&lt;Fran&ccedil;ais&gt;" will become "<Français>"
If an entity is unrecognized, it is left alone, and inserted verbatim into the result string. e.g. "&gt;&zzzz;x" will become ">&zzzz;x".
Parameters:
input - the String to unescape, may be null
Returns:
a new unescaped String, null if null string input
There is a bug filed for this: https://issues.apache.org/jira/browse/LANG-1056
I am considering the following workaround to avoid the exception; the result would be similar to that of org.apache.commons.lang.StringEscapeUtils.unescapeHtml. Could this cause any issues? Any suggestions are welcome.
String translate = org.apache.commons.text.StringEscapeUtils.ESCAPE_HTML4.translate(test);
String commonTextUnescapedHtml = org.apache.commons.text.StringEscapeUtils.unescapeHtml4(translate);
// test �
String commonLangUnescapedHtml = org.apache.commons.lang.StringEscapeUtils.unescapeHtml(test);
System.out.println(commonTextUnescapedHtml.equals(commonLangUnescapedHtml));
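One thing to watch with the workaround above: ESCAPE_HTML4 turns every & into &amp;, so valid entities in the input would also come back unchanged rather than unescaped. A narrower alternative might be to escape only the ampersand of numeric entities whose value is not a valid code point, and leave everything else for unescapeHtml4. The sketch below uses only the JDK. The class NumericEntityGuard and its method neutralizeInvalidEntities are hypothetical names, not part of commons-text, and it assumes the exception is caused solely by out-of-range numeric entities terminated by a semicolon.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NumericEntityGuard {
    // Matches decimal (&#123;) and hexadecimal (&#x1F;) numeric entities.
    private static final Pattern NUMERIC_ENTITY =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));");

    // Hypothetical helper: rewrites "&#N;" to "&amp;#N;" when N is not a valid
    // Unicode code point, so a later unescape step leaves the entity verbatim.
    static String neutralizeInvalidEntities(String input) {
        if (input == null) {
            return null;
        }
        Matcher m = NUMERIC_ENTITY.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(input, last, m.start());
            String entity = m.group();
            boolean valid = isValid(m.group(1), m.group(2));
            out.append(valid ? entity : "&amp;" + entity.substring(1));
            last = m.end();
        }
        out.append(input, last, input.length());
        return out.toString();
    }

    // Exactly one of hex/dec is non-null, depending on which branch matched.
    private static boolean isValid(String hex, String dec) {
        try {
            long value = hex != null ? Long.parseLong(hex, 16) : Long.parseLong(dec, 10);
            return value <= Character.MAX_CODE_POINT;
        } catch (NumberFormatException e) {
            return false; // too many digits to fit in a long
        }
    }
}
```

The intended use would be unescapeHtml4(NumericEntityGuard.neutralizeInvalidEntities(test)). I have not run this against commons-text 1.10.0, so treat it as a starting point rather than a verified fix.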