Should consecutive CDATA sections in XML be merged in a way transparent to the application?

I am writing a quick-and-dirty XML parser and generator with a high-level interface, in C++.

Node content can be escaped in basically two ways: with character escapes for [<>'"&] or with a CDATA section. As I understand it, the two options are equivalent, so the following two should mean exactly the same thing:

<foo>It&#39;s fun &lt;&gt; &amp;h?</foo>
and
<foo><![CDATA[It's fun <> &h?]]></foo>

The latter is often better, since it is usually more compact and more readable. However, it cannot contain a ']]>' sequence unless you use the well-known trick:

<foo>This contains a ]]&gt;, sadly</foo>
is the same as
<foo><![CDATA[This contains a ]]]]><![CDATA[>, sadly]]></foo>

But this, formally, is TWO CDATA sections:

<![CDATA[This contains a ]]]]>

and

<![CDATA[>, sadly]]>

Most applications actually "merge" them so that

<![CDATA[XXX]]><![CDATA[YYY]]>

is completely indistinguishable from

<![CDATA[XXXYYY]]>

This is not the case, of course, if some actual XML element is in the middle, like:

<![CDATA[bar]]><br /><![CDATA[baz]]>

My question is: as long as there are no parsed XML elements in between, would you consider it legitimate to merge consecutive CDATA sections transparently in the parser, completely hiding from the application the fact that they were not a single CDATA section?

Please notice that this is the only way to "revert" the form of escaping done by many for ]]> in CDATA:

]]>     ->     ]]]]><![CDATA[>

which actually splits the section.
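On the generator side, the split described above could look roughly like the following. This is only a minimal sketch assuming C++17; the function name to_cdata is made up for this example and is not taken from any existing library.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: wrap arbitrary text in one or more CDATA sections.
// Every "]]>" in the input is emitted as "]]" + "]]><![CDATA[" + ">",
// i.e. the current section is closed after "]]" and a new one opened before ">".
std::string to_cdata(std::string_view text) {
    static constexpr std::string_view terminator = "]]>";
    std::string out = "<![CDATA[";
    std::size_t pos = 0;
    while (true) {
        const std::size_t hit = text.find(terminator, pos);
        if (hit == std::string_view::npos) {
            out.append(text.substr(pos));  // no terminator left: copy the rest
            break;
        }
        // Copy everything up to and including the "]]" of the terminator.
        out.append(text.substr(pos, hit - pos + 2));
        // Close the current section and open the next one.
        out.append("]]><![CDATA[");
        pos = hit + 2;  // resume at the '>' so it starts the new section
    }
    out.append("]]>");
    return out;
}
```

Calling to_cdata("This contains a ]]>, sadly") produces the two-section form shown earlier, which is why a parser that wants to give the original string back has to join the two sections again.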

Would you prefer a parser which, when the XML is

<foo><![CDATA[XXX]]><![CDATA[YYY]]></foo>

reports "The node foo contains CDATA 'XXX' and then CDATA 'YYY'", or one which reports "The node foo contains data 'XXXYYY'"?

Note that the second report no longer matches the literal content of the XML.

I am wondering if there is some official reference which can help, because things get even more complicated with comments and other constructs.

Take this XML:

<foo>
  one
  <br />
  two<!--Comment-->three<br />
  <![CDATA[four]]><![CDATA[five]]>
  <![CDATA[six]]><br /><![CDATA[seven]]>
  <![CDATA[eight ]]]]><![CDATA[> nine]]>
</foo>

Would you expect, at the application level, to get that the XML contains:

data("one"), tag("br"), data("two"), data("three"), data("four"), data("five"), data("six"), tag("br"), data("seven"), data("eight ]]"), data("> nine")

or

data("one"), tag("br"), data("twothree"), data("fourfivesix"), tag("br"), data("seveneight ]]> nine")

or... something in between?

Note that I have already decided that, since the documentation states that they are EQUIVALENT, my library will not report upstream in any way whether the XML was <foo>&lt;&gt;&lt;&gt;&lt;&gt;</foo> or <foo><![CDATA[<><><>]]></foo>.

My opinion is that splitting the data on comments is reasonable (because whoever inserted the comment knew what they were doing), and splitting it when XML tags are in the middle is obviously required (because for foo<br />bar the application MUST know that the break was between "foo" and "bar"). Splitting the data for consecutive CDATA sections with nothing in between, however, is wrong, as that is most likely the result of "escaping" a CDATA section that contains the infamous ]]> sequence.
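To make that policy concrete, here is a minimal C++ sketch of a pass over a flat list of parse events. The Event and Kind types and the coalesce function are invented for this illustration, and it assumes that comments are passed through to the application rather than dropped.

```cpp
#include <string>
#include <vector>

// Hypothetical event model, not taken from any real parser.
enum class Kind { Text, CData, Comment, Element };

struct Event {
    Kind kind;
    std::string value;  // character data, comment text, or tag name
};

// Merge runs of adjacent Text/CData events into a single Text event.
// Comment and Element events are passed through unchanged and end the run,
// so data on either side of them stays separate.
std::vector<Event> coalesce(const std::vector<Event>& in) {
    std::vector<Event> out;
    bool run_open = false;  // true while out.back() is a mergeable Text event
    for (const Event& e : in) {
        const bool is_chars = e.kind == Kind::Text || e.kind == Kind::CData;
        if (is_chars && run_open) {
            out.back().value += e.value;  // extend the current run
        } else if (is_chars) {
            out.push_back({Kind::Text, e.value});  // start a new run
            run_open = true;
        } else {
            out.push_back(e);  // comment or element: breaks the run
            run_open = false;
        }
    }
    return out;
}
```

With this, the two CDATA events "eight ]]" and "> nine" come out as one data event, while "two", the comment, and "three" remain three separate events.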

A further practical solution could be to provide two ways to access the parser's result: one suited for document processing, using the "split everything" approach above (the first one), and one suited for data processing (e.g. SOAP), in which I merge ALL the data content of a node, even ignoring the fact that there was an actual parsed tag in the middle. This would be very effective, but seems quite against the XML philosophy to me.
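The "data processing" accessor could be sketched as a recursive concatenation over the node tree. This is a minimal C++17 sketch under the assumption that the tree holds only elements and character data; the Node type and the text_content name are made up for the example.

```cpp
#include <string>
#include <vector>

// Hypothetical tree node: either an element (with children) or character data.
struct Node {
    bool is_element;
    std::string value;           // tag name for elements, characters otherwise
    std::vector<Node> children;  // empty for character data
};

// Concatenate all character data below a node, ignoring element boundaries.
std::string text_content(const Node& node) {
    if (!node.is_element) {
        return node.value;
    }
    std::string out;
    for (const Node& child : node.children) {
        out += text_content(child);
    }
    return out;
}
```

For a foo element containing "bar", an empty br element, and "baz", this yields "barbaz", which is exactly the information loss that makes this mode unsuitable for document processing.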

I know that the application using the library should treat consecutive CDATA or data sections in the same manner, whether they are merged or not; but I am wondering what is better to do in the parsing library.

Opinions and references to documentation I am missing are welcome.

Michael Kay answered:

The closest to a definitive statement about what the XML parser must report to the application is Appendix B of the Infoset specification at https://www.w3.org/TR/xml-infoset/#reporting, which doesn't mention CDATA sections.

The Infoset is a data model for XML, and it provides one view as to which lexical distinctions in an XML document are significant and which aren't. In the Infoset model, characters within CDATA sections are indistinguishable from characters outside CDATA sections. So the answer to the question here (do you report START-TEXT-END, or START-TEXT-END-START-TEXT-END?) is that you report neither: all you have to report is the sequence of characters, regardless of the CDATA boundaries.

The XDM model used by XPath, XSLT, and XQuery shares that view; CDATA boundaries are of no relevance.

There are people who care about CDATA boundaries and want to be told about them; but I don't think their view is supported by any official specification. So I would say, do what you like and see if your users complain.

The SAX specification in Java doesn't report CDATA sections to a ContentHandler, but if you're interested in low-level detail then you can nominate a LexicalHandler which will receive notification of the start and end of CDATA sections. That's a reasonable design compromise. I would assume it reports the boundaries as they actually appear.
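For a C++ library, a similar split between required content callbacks and optional lexical callbacks might be sketched as follows. The Handlers struct and report_cdata function are hypothetical names for this example; only the idea of an optional lexical handler notified at the start and end of CDATA sections comes from SAX.

```cpp
#include <functional>
#include <string>

// Hypothetical callback set. characters must be set; the CDATA callbacks
// are optional and play the role of SAX's LexicalHandler notifications.
struct Handlers {
    std::function<void(const std::string&)> characters;  // required
    std::function<void()> start_cdata;                   // optional
    std::function<void()> end_cdata;                     // optional
};

// How a parser could report one CDATA section: the character data always
// goes to characters, and the boundaries are reported only if requested.
void report_cdata(const Handlers& h, const std::string& text) {
    if (h.start_cdata) {
        h.start_cdata();
    }
    h.characters(text);
    if (h.end_cdata) {
        h.end_cdata();
    }
}
```

An application that leaves the two optional callbacks empty sees only character data, while one that sets them can reconstruct the section boundaries as they appeared in the source.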