jsoup outputs wrong HTML when < appears inside text


The input HTML is:

<p>猫<虎</p>

Chrome displays this as 猫<虎

But when jsoup parses this HTML, the output HTML is:

<p>猫
  <虎 < p>
  </虎<></p>

How can I fix this problem without changing < to &lt; in the input?

tucuxi On BEST ANSWER

Why do you think that jsoup is "wrong" and Chrome is "right"? A < that is not part of a tag should always be escaped as &lt; (because it will otherwise be interpreted as opening a tag). Fix that, and all standards-compliant HTML tools will agree on the same parsing. Leave it unescaped, and some may disagree. In this case, jsoup is accepting non-alphanumeric characters as a tag name, which is invalid. But then, it was handed an unescaped < that was never meant to start a tag!
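If you control the code that generates the HTML, escaping at the source could look roughly like the sketch below. The class `HtmlTextEscaper` and the method `escapeHtmlText` are hypothetical names made up for this illustration (they are not part of jsoup), and the sketch only covers text placed between tags, not attribute values.

```java
// Hypothetical helper for illustration only; not a jsoup API.
final class HtmlTextEscaper {

    /** Escapes the characters that are significant in HTML text content. */
    static String escapeHtmlText(String text) {
        return text
                .replace("&", "&amp;") // first, so the entities added below are not escaped again
                .replace("<", "&lt;")
                .replace(">", "&gt;");
    }

    public static void main(String[] args) {
        String body = escapeHtmlText("猫<虎");
        // The text is escaped before it is wrapped in markup.
        System.out.println("<p>" + body + "</p>"); // prints <p>猫&lt;虎</p>
    }
}
```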

If you insist on not changing the source HTML, you can simply pre-process it before feeding it into jsoup:

 // before 
 Document doc = Jsoup.parse(html);

 // with pre-processing
 Document doc = Jsoup.parse(fixOutOfTagLessThan(html));

where

 /**
  * Replaces not-in-tag `<` by `&lt;`, but WILL FAIL in 
  * many cases, because it is unaware of:
  * - comments (<!--)
  * - javascript
  * - the fact that you should NOT PARSE HTML WITH REGEX
  */
 public static String fixOutOfTagLessThan(String html) {
    return html.replaceAll("<([^</>]+)<", "&lt;$1<");
 }
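
One thing to watch for: because the pattern above consumes the trailing <, two stray < characters in a row (as in `1 < 2 < 3`) only get the first one escaped. A possible variation, sketched here as an assumption rather than as part of the original answer, uses a lookahead so that nothing after the stray < is consumed. The names `LessThanPreprocessor` and `escapeStrayLessThan` are made up for this example, and every caveat from the comment above (comments, JavaScript, regex on HTML) still applies.

```java
import java.util.regex.Pattern;

// Hypothetical variation for illustration; same heuristic, same caveats.
final class LessThanPreprocessor {

    // A '<' followed by one or more characters that are none of '<', '/', '>',
    // and then another '<'. The lookahead checks this without consuming it.
    private static final Pattern STRAY_LT = Pattern.compile("<(?=[^</>]+<)");

    static String escapeStrayLessThan(String html) {
        return STRAY_LT.matcher(html).replaceAll("&lt;");
    }

    public static void main(String[] args) {
        System.out.println(escapeStrayLessThan("<p>猫<虎</p>"));      // prints <p>猫&lt;虎</p>
        System.out.println(escapeStrayLessThan("<p>1 < 2 < 3</p>"));  // prints <p>1 &lt; 2 &lt; 3</p>
        // Known miss: a '>' before the next '<' makes the stray '<' look like a tag.
        System.out.println(escapeStrayLessThan("<p>a < b > c</p>"));  // prints <p>a < b > c</p>
    }
}
```

The result of `escapeStrayLessThan(html)` would then be passed to `Jsoup.parse` in place of the raw input, exactly as in the pre-processing call shown earlier.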

Chrome appears to be applying HTML5 parse logic to treat the < as text (since it is not followed by a valid tag name). However, as I understand it, it should reject everything up to the >, and then report a missing </p>. So, to my eyes, it does not appear to follow the standard fully either.