Issue with Apache commons-text throwing IllegalArgumentException on StringEscapeUtils::unescapeHtml4


Description of the problem:

Apache commons-text version 1.10.0 throws an IllegalArgumentException when unescapeHtml4 encounters input that is not valid. Is this the expected behaviour? If so, the documentation doesn't mention it.

By contrast, org.apache.commons.lang.StringEscapeUtils.unescapeHtml returns a result, as described in the documentation.

String test = "test &#39511154;";
org.apache.commons.text.StringEscapeUtils.unescapeHtml4(test);

throws java.lang.IllegalArgumentException
    at java.lang.Character.toChars(Character.java:5172)
    at org.apache.commons.text.translate.NumericEntityUnescaper.translate(NumericEntityUnescaper.java:146)
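The top frame of the stack trace is Character.toChars, which rejects any value above Character.MAX_CODE_POINT (0x10FFFF). A minimal JDK-only sketch of that behaviour follows; the class name CodePointDemo and the helper describe are made up for illustration and are not part of commons-text.

```java
class CodePointDemo {
    // Hypothetical helper: converts a code point to a String, or reports why it cannot.
    static String describe(int codePoint) {
        try {
            // Character.toChars throws IllegalArgumentException for invalid code points.
            return new String(Character.toChars(codePoint));
        } catch (IllegalArgumentException e) {
            return "invalid code point: " + codePoint;
        }
    }

    public static void main(String[] args) {
        System.out.println(Character.MAX_CODE_POINT);             // prints 1114111
        System.out.println(Character.isValidCodePoint(39511154)); // prints false
        System.out.println(describe(65));                         // prints A
        System.out.println(describe(39511154));
    }
}
```

The value 39511154 from the numeric entity is far above 1114111, which would explain why the conversion fails once the entity has been parsed as a number.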

The Javadoc for StringEscapeUtils states that an unrecognised entity is left alone:

public static final String unescapeHtml4(String input)

Unescapes a string containing entity escapes to a string containing the actual Unicode characters corresponding to the escapes. Supports HTML 4.0 entities.

For example, the string "&lt;Fran&ccedil;ais&gt;" will become "<Français>"

If an entity is unrecognized, it is left alone, and inserted verbatim into the result string. e.g. "&gt;&zzzz;x" will become ">&zzzz;x".

Parameters:
    input - the String to unescape, may be null
Returns:
    a new unescaped String, null if null string input 

A bug has been filed for this: https://issues.apache.org/jira/browse/LANG-1056

I am considering the following workaround to avoid the exception; the result would be similar to that of org.apache.commons.lang.StringEscapeUtils.unescapeHtml. Could this cause any issues? Any suggestions are welcome.

String translate = org.apache.commons.text.StringEscapeUtils.ESCAPE_HTML4.translate(test);

String commonTextUnescapedHtml = org.apache.commons.text.StringEscapeUtils.unescapeHtml4(translate);

// test &#39511154;

String commonLangUnescapedHtml = org.apache.commons.lang.StringEscapeUtils.unescapeHtml(test);
System.out.println(commonTextUnescapedHtml.equals(commonLangUnescapedHtml));
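One thing to watch with the workaround above: ESCAPE_HTML4 escapes every "&", so valid entities such as "&lt;" would also be double-escaped and come back unchanged rather than unescaped. A narrower alternative, sketched here with the JDK only, is to neutralize just the numeric entities whose value is not a valid code point before calling unescapeHtml4. The names NumericEntityGuard and neutralizeInvalidEntities are hypothetical, and the sketch assumes numeric entities end with a semicolon.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NumericEntityGuard {
    // Matches decimal (&#123;) and hexadecimal (&#x1F;) numeric entities ending in ';'.
    private static final Pattern NUMERIC_ENTITY =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));");

    // Rewrites the leading '&' of out-of-range numeric entities as "&amp;".
    static String neutralizeInvalidEntities(String input) {
        if (input == null) {
            return null;
        }
        Matcher m = NUMERIC_ENTITY.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String entity = m.group();
            String replacement = isValid(m.group(1), m.group(2))
                    ? entity
                    : "&amp;" + entity.substring(1);
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Exactly one of hex/dec is non-null, depending on which alternative matched.
    private static boolean isValid(String hex, String dec) {
        try {
            int codePoint = hex != null ? Integer.parseInt(hex, 16) : Integer.parseInt(dec, 10);
            return Character.isValidCodePoint(codePoint);
        } catch (NumberFormatException e) {
            return false; // value does not fit in an int, so it cannot be a code point
        }
    }

    public static void main(String[] args) {
        // Prints: test &amp;#39511154; and &lt;
        System.out.println(neutralizeInvalidEntities("test &#39511154; and &lt;"));
    }
}
```

The output of neutralizeInvalidEntities(test) could then be passed to org.apache.commons.text.StringEscapeUtils.unescapeHtml4, which should turn "&amp;" back into "&" and leave the invalid entity text in place while still unescaping the valid entities. This has not been verified against every commons-text version, so treat it as a starting point.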