W3C Document.parse() produces unwanted elements

45 Views Asked by Mario P. Waxenegger At 24 August 2023 at 15:55

I just made up a simple XML file:

<main>
    <foo>
        <bar>1</bar>
        <bar>2</bar>
        <bar>3</bar>
    </foo>
</main>

I just tried to count the bars inside foo with this code:

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setIgnoringElementContentWhitespace(true);

InputStream is = new FileInputStream(FILE_FORMATTED);
DocumentBuilder db = dbf.newDocumentBuilder();

Document doc = db.parse(is);
NodeList bars = doc.getElementsByTagName("foo").item(0).getChildNodes();

System.out.println(bars.getLength());

To my confusion, I get 7 printed in the console. I then checked in the debugger what's inside the NodeList (screenshot appended). The parser apparently produces text nodes from the line breaks and whitespace between the elements.

When I swapped the XML for a less readable file without any formatting, the expected result, 3, was printed. Here's the other XML:

<main><foo><bar>1</bar><bar>2</bar><bar>3</bar></foo></main>

I couldn't find anything about that behaviour. Is the parser not meant to be used on formatted XML files? Even dbf.setIgnoringElementContentWhitespace(true) didn't help.


There is 1 best solution below

Michael Kay On 24 August 2023 at 17:31

Whitespace-only text nodes in general are significant: consider

<para>A <adj>great</adj> <adj>green</adj> <noun>dragon</noun>.</para>

So if your application wants to treat whitespace as insignificant, you generally have to tell the parser explicitly to throw it away. Different XML APIs have different ways of doing this.
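For the DOM API used in the question, one way to do this is to skip the whitespace text nodes yourself and count only element children. The following is a minimal sketch, not the only approach: the class name BarCounter and the helper countElementChildren are made up for illustration, and the XML is inlined as a string instead of being read from FILE_FORMATTED.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper class, named for illustration only.
class BarCounter {

    // Counts the element children of the first <parentTag> element,
    // ignoring text nodes (including whitespace-only ones) and comments.
    static int countElementChildren(String xml, String parentTag) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        DocumentBuilder db = dbf.newDocumentBuilder();
        Document doc = db.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        NodeList children = doc.getElementsByTagName(parentTag).item(0).getChildNodes();
        int count = 0;
        for (int i = 0; i < children.getLength(); i++) {
            // Keep only real elements; indentation shows up as TEXT_NODE entries.
            if (children.item(i).getNodeType() == Node.ELEMENT_NODE) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        // Same formatted document as in the question.
        String xml = "<main>\n"
                + "    <foo>\n"
                + "        <bar>1</bar>\n"
                + "        <bar>2</bar>\n"
                + "        <bar>3</bar>\n"
                + "    </foo>\n"
                + "</main>";
        System.out.println(countElementChildren(xml, "foo")); // prints 3
    }
}
```

As for why dbf.setIgnoringElementContentWhitespace(true) had no effect: according to its Javadoc, that setting relies on the content model and requires the parser to be in validating mode. Without a DTD the parser cannot tell which whitespace is ignorable, so it keeps all of it.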
