W3C Document.parse() produces unwanted elements


I just made up a simple XML file:

<main>
    <foo>
        <bar>1</bar>
        <bar>2</bar>
        <bar>3</bar>
    </foo>
</main>

I just tried to count the bars inside foo with this code:

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setIgnoringElementContentWhitespace(true);

InputStream is = new FileInputStream(FILE_FORMATTED);
DocumentBuilder db = dbf.newDocumentBuilder();

Document doc = db.parse(is);
NodeList bars = doc.getElementsByTagName("foo").item(0).getChildNodes();

System.out.println(bars.getLength());

To my confusion, I get 7 printed in the console. I then checked in the debugger what is inside the NodeList (screenshot appended). The parser apparently creates nodes from the line breaks and whitespace.

When I swapped the XML for a file without any formatting (and therefore less readable), I got the expected result, 3. Here's the other XML:

<main><foo><bar>1</bar><bar>2</bar><bar>3</bar></foo></main>

I couldn't find anything about that behaviour. Is the parser not meant to be used on formatted XML files? Even dbf.setIgnoringElementContentWhitespace(true) made no difference.


Michael Kay On

Whitespace-only text nodes in general are significant: consider

<para>A <adj>great</adj> <adj>green</adj> <noun>dragon</noun>.</para>

So if your application wants to treat whitespace as insignificant, you generally have to tell the parser explicitly to throw it away. Different XML APIs have different ways of doing this.
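For the DOM API used in the question, one common way to do this is to skip everything that is not an element when walking the child list. Note also that, according to the Javadoc, setIgnoringElementContentWhitespace(true) relies on the content model and so requires the parser to be in validating mode, which would explain why it had no effect here without a DTD. Below is a minimal sketch of the filtering approach, not the only way to do it; the class name ElementChildCounter, the constant FORMATTED and both helper methods are made up for illustration, and the XML is inlined as a string instead of being read from FILE_FORMATTED.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class ElementChildCounter {

    // The formatted XML from the question, inlined for the example.
    static final String FORMATTED =
            "<main>\n"
          + "    <foo>\n"
          + "        <bar>1</bar>\n"
          + "        <bar>2</bar>\n"
          + "        <bar>3</bar>\n"
          + "    </foo>\n"
          + "</main>\n";

    // Hypothetical helper: returns the child nodes of the first <foo> element.
    static NodeList fooChildren(String xml) {
        try {
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = db.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getElementsByTagName("foo").item(0).getChildNodes();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Counts every child node, including whitespace-only text nodes.
    static int countAllChildren(String xml) {
        return fooChildren(xml).getLength();
    }

    // Counts only element children, ignoring text nodes such as indentation.
    static int countElementChildren(String xml) {
        NodeList children = fooChildren(xml);
        int count = 0;
        for (int i = 0; i < children.getLength(); i++) {
            if (children.item(i).getNodeType() == Node.ELEMENT_NODE) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countAllChildren(FORMATTED));     // 7, as in the question
        System.out.println(countElementChildren(FORMATTED)); // 3
    }
}
```

The 7 comes from four whitespace-only text nodes interleaved with the three bar elements. If only the bar elements matter, calling getElementsByTagName("bar") on the foo element is another option, with the caveat that it also returns nested descendants rather than direct children only.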