jsoup.clean replaces single quotes with double quotes. How can I prevent that?


In our application, users can enter HTML that is later displayed to other users.

For security reasons, we sanitize this HTML and do not allow the user to save HTML that might not be safe to run in other users' browsers.

To do so, we have these two methods (simplified, they do much more, but this is the essence of this problem):

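    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.safety.Safelist;

    public static String sanitizeHTML(String html) {
        // Clean against the relaxed safelist, additionally allowing style and class on every tag
        return Jsoup.clean(html,
                "",
                Safelist.relaxed()
                        .addAttributes(":all", "style")
                        .addAttributes(":all", "class"),
                new Document.OutputSettings().prettyPrint(false));
    }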

and

    public static boolean isHTMLStringPolluted(String html) {
         return !sanitizeHTML(html).equals(html);
    }

We use isHTMLStringPolluted to validate the user input.

If a user enters <a href="https://www.stackoverflow.com">Link</a>, everything is fine. If the user enters <a href='https://www.stackoverflow.com'>Link</a>, the input is rejected (isHTMLStringPolluted returns true), because sanitizeHTML returns <a href="https://www.stackoverflow.com">Link</a>, which no longer equals the input.

This is only the simplest example of the issue. Users can add much more complex HTML. Furthermore, they do not just have an HTML editor: these HTML snippets can be created, calculated and concatenated through a very complex mathematical language (comparable to Excel formulas), which uses data, variables and other HTML output from throughout the whole application. This HTML is just the result.

We do not want to force users to use only double quotes, and we also do not want to replace single quotes with double quotes ourselves: the original user input should be either accepted completely or refused.

Is there a way to configure jsoup to keep the quotes as they are?

I also tried other libraries, such as the OWASP Java HTML Sanitizer, but they had many more restrictions and flaws that didn't fit our requirements.

Answer by Jonathan Hedley:

There is no setting in jsoup that will retain single quotes around attribute values.

I would approach the problem differently - rather than trying to compare exact input and output equality, use the Cleaner.isValidBodyHtml() method. That checks that the input HTML did not contain any parse errors, and that the Cleaner did not remove any elements or attributes for not being in the Safelist.

Regardless of that method's output, you should only use/persist the cleaned output, not the original input.

Personally, I would generally use isValid() as a UI indicator and not block on it, since the cleaned output will be safe and have balanced tags regardless of the input. There is no reason to block just because, for example, the user missed a closing tag. But that behaviour obviously depends on your exact use case.

Here's a worked example:

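import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Cleaner;
import org.jsoup.safety.Safelist;

Safelist safelist = Safelist.relaxed()
    .addAttributes(":all", "style")
    .addAttributes(":all", "class");
Cleaner cleaner = new Cleaner(safelist);

String input = "<a href='https://www.stackoverflow.com'>Link</a>";
// true only if parsing produced no errors and the Cleaner removed nothing
boolean isValid = cleaner.isValidBodyHtml(input);
System.out.println("isValid?: " + isValid);

// Always use the cleaned document, never the raw input
Document cleanDoc = cleaner.clean(Jsoup.parse(input));
System.out.println("Cleaned: " + cleanDoc.body().html());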

Gives:

isValid?: true
Cleaned: <a href="https://www.stackoverflow.com">Link</a>

And with this input:

String input = 
 "<a href='https://www.stackoverflow.com' class=ok>Link<script>xss()</script>";

Gives:

isValid?: false
Cleaned: <a href="https://www.stackoverflow.com" class="ok">Link</a>
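
To combine this advice with the question's setup, here is a minimal sketch of a helper that returns both the cleaned HTML and the validity flag in one call. The names HtmlSanitizer, Result and sanitize are hypothetical and not part of jsoup. It assumes a jsoup version that provides Safelist (1.14.2 or later) on the classpath, and it turns off pretty-printing on the cleaned document to mirror the OutputSettings used in the question.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Cleaner;
import org.jsoup.safety.Safelist;

// Hypothetical helper, not part of jsoup: bundles the cleaned HTML with a validity flag.
final class HtmlSanitizer {

    /** Result holder: the HTML to persist plus a flag for the UI. */
    static final class Result {
        final String cleaned;
        final boolean valid;

        Result(String cleaned, boolean valid) {
            this.cleaned = cleaned;
            this.valid = valid;
        }
    }

    // Same safelist as in the question.
    private static final Cleaner CLEANER = new Cleaner(
            Safelist.relaxed()
                    .addAttributes(":all", "style")
                    .addAttributes(":all", "class"));

    static Result sanitize(String html) {
        // Validity is only an indicator: false if the Cleaner dropped something
        // or the parser reported errors. Quote style does not affect it.
        boolean valid = CLEANER.isValidBodyHtml(html);

        // Parse as a body fragment, clean, and serialize without pretty-printing.
        Document clean = CLEANER.clean(Jsoup.parseBodyFragment(html));
        clean.outputSettings().prettyPrint(false);
        return new Result(clean.body().html(), valid);
    }

    public static void main(String[] args) {
        Result r = sanitize("<a href='https://www.stackoverflow.com'>Link</a>");
        System.out.println("valid:   " + r.valid);
        System.out.println("cleaned: " + r.cleaned);
    }
}
```

With a helper like this, the application can persist only the cleaned field and use the valid field to warn the user (or to reject the input, if the use case requires it), without depending on whether the user wrote single or double quotes.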