The input HTML is
<p>猫<虎</p>
which Chrome displays as 猫<虎
But when I parse this HTML with jsoup, the output HTML is
<p>猫
<虎 < p>
</虎<></p>
How can I fix this problem without changing the < to &lt; in the source?
Why do you think that jsoup is "wrong" and Chrome is "right"? A < that is not part of a tag should always be escaped as &lt; (because it will otherwise be interpreted as opening a tag). Fix that, and all standards-compliant HTML tools will agree on the same parsing. Do not fix it, and some may disagree. In this case, jsoup is accepting non-alphanumerics as a tag name, which is invalid. But it did encounter an unescaped < that was not part of a tag name!
If you insist on not changing the source HTML, you can simply pre-process it before feeding it into jsoup, escaping each < that does not open a tag.
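The pre-processing step could look something like the sketch below. This is not the code from the original answer, only one possible approach: it assumes that any < not followed by an ASCII letter, /, ! or ? cannot begin a tag, end tag, comment, or doctype, and escapes it. The class name LtEscaper and the method escapeStrayLt are made up for illustration. It will miss a stray < that happens to be followed by an ASCII letter (for example x<y), and it will also touch < characters inside script or style blocks, so treat it as a starting point.

```java
import java.util.regex.Pattern;

public class LtEscaper {
    // Assumption: a '<' that is NOT followed by an ASCII letter, '/', '!' or '?'
    // cannot start a tag, end tag, comment, doctype, or processing instruction.
    private static final Pattern STRAY_LT = Pattern.compile("<(?![A-Za-z/!?])");

    // Hypothetical helper: replaces every such stray '<' with the entity &lt;
    public static String escapeStrayLt(String html) {
        return STRAY_LT.matcher(html).replaceAll("&lt;");
    }

    public static void main(String[] args) {
        // The '<' before 虎 is escaped; the '<' of <p> and </p> are left alone.
        System.out.println(escapeStrayLt("<p>猫<虎</p>"));
    }
}
```

The escaped string can then be passed to jsoup as usual, for example Jsoup.parse(LtEscaper.escapeStrayLt(html)), so that the parser sees &lt;虎 as text rather than as the start of a tag.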
Chrome appears to be applying HTML5 parse logic to treat the < as text (since it is not part of a valid tag name). However, as I understand it, it should reject everything up to the >, and then issue a missing </p>. So, to my eyes, it does not appear to follow the standard fully either.