Jsoup does not escape special characters (<, >) inside quoted Element attribute values


I got a surprising result when I tried to programmatically build an Element with an attribute whose value contains < and >.

Am I missing something? https://www.w3.org/TR/xml/#syntax

The test:

Element e = new Element("E");
e.attr("key", "value");
e.attr("code", "<X>");
assertEquals("<E key=\"value\" code=\"&lt;X&gt;\"></E>", e.outerHtml());

fails:

expected: <<E key="value" code="&lt;X&gt;"></E>> but was: <<E key="value" code="<X>"></E>>

Jonathan Hedley

jsoup defaults to HTML output syntax, not XML. In HTML, neither < nor > needs to be escaped inside a quoted attribute value, so jsoup leaves them as they are.

If you want XML output, set the syntax in the document's OutputSettings to XML.

Because you're creating the element with new Element(), it has no parent or owner document until you insert it into the DOM, so it is serialized with the default (HTML) output settings.

So for example:

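import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// A detached element has no owner document, so it is written with the default HTML syntax.
Element e = new Element("E");
e.attr("key", "value");
e.attr("code", "<X>");
System.out.println("HTML: " + e.outerHtml());

// Attach the element to a document whose output syntax is set to XML.
Document doc = Document.createShell("https://example.com");
doc.outputSettings().syntax(Document.OutputSettings.Syntax.xml);
doc.body().appendChild(e);

System.out.println("XML:  " + e.outerHtml());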

Gives:

HTML: <E key="value" code="<X>"></E>
XML:  <E key="value" code="&lt;X>"></E>

Note that in neither HTML nor XML does the > character in a quoted attribute need to be escaped, so jsoup doesn't do that.
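
To make the rule concrete, here is a minimal standalone sketch of the escaping described above for a double-quoted attribute value. It is not jsoup's implementation: the class name AttrEscapeSketch, the method escape, and the boolean xml flag are all hypothetical, and the sketch assumes only that & and " are escaped in both syntaxes, that < is escaped only for XML, and that > is never escaped.

```java
// Hypothetical helper for illustration only; this is NOT jsoup's implementation.
final class AttrEscapeSketch {
    private AttrEscapeSketch() {}

    /**
     * Escapes a value for use inside a double-quoted attribute.
     * xml = true applies the XML rule (also escape '<'); false applies the HTML rule.
     */
    static String escape(String value, boolean xml) {
        StringBuilder out = new StringBuilder(value.length() + 8);
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '&':
                    out.append("&amp;");   // escaped in both HTML and XML
                    break;
                case '"':
                    out.append("&quot;");  // would otherwise end the quoted value
                    break;
                case '<':
                    if (xml) {
                        out.append("&lt;"); // XML forbids a raw '<' in attribute values
                    } else {
                        out.append(c);      // HTML allows it inside quotes
                    }
                    break;
                default:
                    out.append(c);          // '>' falls through here in both modes
            }
        }
        return out.toString();
    }
}
```

With this sketch, escape("<X>", false) returns <X> and escape("<X>", true) returns &lt;X>, which matches the HTML and XML outputs shown above.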