In our application, users can enter HTML that is later displayed to other users.
For security reasons, we sanitize this HTML and do not allow users to save HTML that might be unsafe to run in other users' browsers.
To do so, we have these two methods (simplified; they do much more, but this is the essence of the problem):
    public static String sanitizeHTML(String html) {
        return Jsoup.clean(html,
                "",
                Safelist.relaxed()
                        .addAttributes(":all", "style")
                        .addAttributes(":all", "class"),
                new Document.OutputSettings().prettyPrint(false));
    }
and
    public static boolean isHTMLStringPolluted(String html) {
        return !sanitizeHTML(html).equals(html);
    }
We use isHTMLStringPolluted to validate the user input.
If a user enters <a href="https://www.stackoverflow.com">Link</a>, the input is accepted.
If the user enters <a href='https://www.stackoverflow.com'>Link</a>, isHTMLStringPolluted returns true and the input is rejected, because sanitizeHTML returns <a href="https://www.stackoverflow.com">Link</a>, which no longer equals the original string.
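The mismatch can be reproduced in isolation. The sketch below wraps the two methods from above in a class named QuoteMismatchDemo; the class name and the main method are assumptions added for illustration, and jsoup 1.14.2 or later is assumed so that Safelist is available.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Safelist;

// Hypothetical wrapper class; only the two static methods come from the question.
public class QuoteMismatchDemo {

    public static String sanitizeHTML(String html) {
        return Jsoup.clean(html,
                "",
                Safelist.relaxed()
                        .addAttributes(":all", "style")
                        .addAttributes(":all", "class"),
                new Document.OutputSettings().prettyPrint(false));
    }

    public static boolean isHTMLStringPolluted(String html) {
        return !sanitizeHTML(html).equals(html);
    }

    public static void main(String[] args) {
        String doubleQuoted = "<a href=\"https://www.stackoverflow.com\">Link</a>";
        String singleQuoted = "<a href='https://www.stackoverflow.com'>Link</a>";

        // jsoup serializes attribute values with double quotes, so only the
        // double-quoted input survives the parse-and-serialize round trip unchanged.
        System.out.println("double-quoted polluted? " + isHTMLStringPolluted(doubleQuoted));
        System.out.println("single-quoted polluted? " + isHTMLStringPolluted(singleQuoted));
        System.out.println("sanitized single-quoted: " + sanitizeHTML(singleQuoted));
    }
}
```

Both inputs describe the same link; the only difference the equality check reacts to is the quoting chosen by the serializer.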
This is just one of the simplest examples of the issue. Users can add HTML that is much more complex. Furthermore, they do not just have an HTML editor: these HTML snippets can be created, calculated and concatenated through a very complex mathematical language (comparable to Excel formulas) that uses data, variables and other HTML output from throughout the whole application. This HTML is just the result.
We do not want to force users to use only double quotes, and we also do not want to replace single quotes with double quotes, because the original user input should be either accepted exactly as entered or refused.
Is there a way to configure jsoup to keep the quotation marks as they are?
I also tried other libraries, such as OWASP Java HTML Sanitizer, but it had many more restrictions and flaws that didn't fit our requirements.
There is no setting in jsoup that will retain single quotes around attribute values.
I would approach the problem differently: rather than trying to compare exact input and output equality, use the Cleaner.isValidBodyHtml() method. That checks that the input HTML did not contain any parse errors, and that the Cleaner did not remove any elements or attributes for not being in the Safelist.
Regardless of that method's output, you should only use/persist the cleaned output, not the original input.
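A minimal sketch of that approach, reusing the Safelist configuration from the question. The class name SanitizerSketch and the helper names isAcceptable and cleaned are hypothetical; Cleaner.isValidBodyHtml and Jsoup.clean are existing jsoup calls, and jsoup 1.14.2 or later is assumed for the Safelist name.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Cleaner;
import org.jsoup.safety.Safelist;

// Hypothetical helper class illustrating "validate, but always persist the cleaned output".
public class SanitizerSketch {

    // Same safelist as in the question: relaxed, plus style and class on all elements.
    private static Safelist safelist() {
        return Safelist.relaxed()
                .addAttributes(":all", "style")
                .addAttributes(":all", "class");
    }

    // True when jsoup reports no parse errors and the Cleaner would not
    // drop any element or attribute from the input.
    public static boolean isAcceptable(String html) {
        return new Cleaner(safelist()).isValidBodyHtml(html);
    }

    // The string that should be stored and rendered, whatever isAcceptable says.
    public static String cleaned(String html) {
        return Jsoup.clean(html, "", safelist(),
                new Document.OutputSettings().prettyPrint(false));
    }

    public static void main(String[] args) {
        String singleQuoted = "<a href='https://www.stackoverflow.com'>Link</a>";
        String withScript = "<p>Hi</p><script>alert(1)</script>";

        for (String input : new String[] {singleQuoted, withScript}) {
            System.out.println(isAcceptable(input) + " -> " + cleaned(input));
        }
    }
}
```

Because the validity check works on the parsed tree rather than on the serialized string, the quoting style of attribute values does not influence its result.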
Personally, I would generally use isValid() as a UI indicator, and not block on its basis, as the cleaned output will be safe and with balanced tags, regardless of the input. No reason to block e.g. just because the user missed a closing tag. But that behaviour obviously depends on your exact use case.
Here's a worked example:
Gives:
Also with the input of
Gives: