Differentiating between an empty text node and a text node containing only whitespace


While validating an XML file, I want to log any text node with empty content. A bare newline \n between elements also counts as a text node, but those are not the ones I want to check. In the following example, 'parent' has two text nodes with content '\n' that are of no interest to me. The content of 'elem1' is '\n\n', which is an error and must be reported. 'elem2' has valid content. The content of 'book' is empty and must be reported.

In my first attempt I checked each text node against [\n\t\r] and ignored the ones that matched. But that way I also ignore elem1, which should have been reported as an error.

What am I doing wrong? (Note: I have to solve this without XSD validation.)

Update 1): I have added more \n between the elements. Now the first 'parent' node has 5 textnodes with content: \n

<root>

    <parent>

        <elem1>

        </elem1> 

        <elem2>good content of el2</elem2>

        <elem3> half so good
               contentof el3</elem3>
    </parent>
    
    <parent>
        <elem1>
        </elem1> 

        <elem2>good content</elem2>
        <elem3>good</elem3>

        <elem4></elem4>

    </parent>

    <book></book>
    

</root>

Update 2) For clarity: when a caller calls, say, validate("//parent/*"), I gather all nodes matching the given path and get a node list back. Then I start the validation for each node and its children.

// XPath.evaluate() returns Object, so the NODESET result must be cast to NodeList
NodeList result = (NodeList) xpathinstance.evaluate(path, currentNode, XPathConstants.NODESET);

for (int n = 0; n < result.getLength(); n++) {
    validateThereAreNoGaps(result.item(n));
}

When I arrive at the first 'parent' element, it shows 7 children (after the update of the example). Each \n between the element tags is considered a text node.

As a next solution I am now trying to replace all \n with "" to get rid of them...

Johnson On

Here's a short expression that might help you:

<(\w+?>)[^\S]*<\/\1

This matches any element whose content is empty or consists only of whitespace, including its opening and closing tags.
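A minimal Java sketch of how the first expression could be applied to the raw XML string. The class name, method name and sample XML are made up for illustration. It assumes simple tags without attributes, namespace prefixes or hyphens, since \w+ does not cover those.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names and the sample XML are illustrative only.
class BlankElementRegexDemo {

    // The expression from above: group 1 captures the tag name plus the closing '>'.
    private static final Pattern BLANK_ELEMENT =
            Pattern.compile("<(\\w+?>)[^\\S]*<\\/\\1");

    /** Returns the tag names of all elements whose content is empty or whitespace-only. */
    static List<String> findBlankElements(String xml) {
        List<String> names = new ArrayList<>();
        Matcher m = BLANK_ELEMENT.matcher(xml);
        while (m.find()) {
            String captured = m.group(1); // for example "elem1>"
            // Drop the trailing '>' to get the bare tag name.
            names.add(captured.substring(0, captured.length() - 1));
        }
        return names;
    }

    public static void main(String[] args) {
        String xml = "<root><elem1>\n\n</elem1><elem2>good</elem2><book></book></root>";
        System.out.println(findBlankElements(xml));
    }
}
```

For this sample the method reports elem1 and book. It skips elem2 and root because their content is not purely whitespace.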

If you don't want to select the tags, use lookarounds instead (this needs a regex engine that supports variable-length lookbehind):

(?<=<(\w+?>))[^\S]*(?=<\/\1)

However, this second expression cannot identify an element with no content at all, such as:

<books></books>

In that case I suggest simply using:

><

as your expression to find those separately.
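As an alternative to matching on the raw XML text, the same check could be sketched on the DOM itself. This is not part of the expressions above, only an assumption about what validateThereAreNoGaps might do: treat an element as blank when it has no child elements and its text content is empty after trimming. The whitespace text nodes between the children of 'parent' are then never reported, because 'parent' has child elements. The class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper; names are illustrative and not taken from the question's code.
class BlankLeafChecker {

    /** True if the node is an element with no child elements and only whitespace (or nothing) as text. */
    static boolean isBlankLeaf(Node node) {
        if (node.getNodeType() != Node.ELEMENT_NODE) {
            return false;
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            if (children.item(i).getNodeType() == Node.ELEMENT_NODE) {
                return false; // has child elements, so surrounding whitespace is only formatting
            }
        }
        return node.getTextContent().trim().isEmpty();
    }

    /** Walks the tree depth-first and records the names of all blank leaf elements. */
    static void collectBlankLeaves(Node node, List<String> found) {
        if (isBlankLeaf(node)) {
            found.add(node.getNodeName());
            return;
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collectBlankLeaves(children.item(i), found);
        }
    }

    static List<String> findBlankLeaves(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            List<String> found = new ArrayList<>();
            collectBlankLeaves(doc.getDocumentElement(), found);
            return found;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<root>\n<parent>\n<elem1>\n\n</elem1>\n<elem2>good</elem2>\n"
                + "<elem4></elem4>\n</parent>\n<book></book>\n</root>";
        System.out.println(findBlankLeaves(xml));
    }
}
```

With an XPath result, isBlankLeaf could be called on each result.item(n) instead of parsing a string. Because it works on the parsed tree, this variant also copes with attributes, which the regular expressions above do not.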