Merging XML document hierarchies


Background

I'm designing a Perl application that uses XML files as inputs for config and settings information. There will be a hierarchy of documents, with global data overridden by more local information.

My program will be invoked with the most local settings file, which will contain paths to the more general files. Some local settings will be absolute, and the list of which ones are will be hard-coded in the program.

The initialization task is to gather the settings for an invocation: start at the highest (most general) level, read it in, then work down through each more local level, merging them all into a single XML document.
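To make the initialization order concrete, here is a minimal sketch of resolving the load order from the PARENT_SETTINGS references. It assumes the parent names have already been extracted from each file; the `%parents_of` hash and the `load_order` function are hypothetical names standing in for that step, not part of any library.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical stand-in for reading <PARENT_SETTINGS> out of each file:
# maps a settings name to the parents it lists, in document order.
my %parents_of = (
    local                 => ['Global_70'],
    Global_70             => ['Global_layouts_100', 'Global_properties_100'],
    Global_layouts_100    => [],
    Global_properties_100 => [],
);

# Return settings names ordered most general first, so that later
# entries can override earlier ones when they are merged.
sub load_order {
    my ($name, $seen) = @_;
    $seen //= {};
    return () if $seen->{$name}++;    # skip repeats and guard against cycles
    my @order;
    push @order, load_order($_, $seen) for @{ $parents_of{$name} // [] };
    return (@order, $name);
}

my @order = load_order('local');
print join(' -> ', @order), "\n";
```

Each file's parents are visited before the file itself, so merging the documents in the returned order lets the most local file have the last word.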

Sample Data

Global_layouts_100.xml

<CONFIG>
    <GRP1>
        <FIELD foo="abs" format="%.4f">QTY</FIELD>
        <FIELD default="" format="%.2f">COST</FIELD>
        <FIELD default="0" format="%.2f">AMT</FIELD>
        <FIELD default="1960-01-01" format="YYYMMDD">TRANDATE</FIELD>
        <FIELD>ACCOUNT</FIELD>
        <FIELD default="0">ACCT_TYPE</FIELD>
    </GRP1>
    <GRP2>
        <FIELD> 1 </FIELD>
        <FIELD> 2 </FIELD>
        <FIELD> 3 </FIELD>
    </GRP2>
</CONFIG>

Global_properties_100.xml

<CONFIG>
    <CUS>
        <GRP>GRP1</GRP>
        <HDR>CUSTOMER</HDR>
        <TLR>TLR${cnt}</TLR>
    </CUS>
    <XYZ>
        <GRP>GRP2</GRP>
        <HDR>ACCOUNTS</HDR>
        <TLR>TLR${cnt}</TLR>
    </XYZ>
</CONFIG>

Global_70.xml

<CONFIG>
<PARENT_SETTINGS>Global_layouts_100</PARENT_SETTINGS>
<PARENT_SETTINGS>Global_properties_100</PARENT_SETTINGS>
    <LOOKUPS>
        <MAP type="file">
            <NAME>ACCT_TYPE_LOOKUP</NAME>
            <PATH>${PATH}acct_type.csv</PATH>
            <HEADERS>
                <COLUMN>ACCT_TYPE</COLUMN>
                <COLUMN>SOURCE_VALUE</COLUMN>
            </HEADERS>
            <KEYS>
                <COLUMN>SOURCE_VALUE</COLUMN>
            </KEYS>
        </MAP>
    </LOOKUPS>
</CONFIG>

local.xml

<CONFIG>
    <PARENT_SETTINGS>Global_70</PARENT_SETTINGS>
    <BATCH>
        <CUS>
            <SRCFILE type="csv" delimiter="|">/path/to/src_file</SRCFILE>
            <OUTFILE>/path/to/out_file</OUTFILE>
            <FIELDS>
                <CUSTOMER>&CUSTOMER;</CUSTOMER>
                <QTY default="0.0" col="23"></QTY>
                <COST format="%.4f" col="21"></COST>
                <FEE col="18"></FEE>
            </FIELDS>
        </CUS>
        <XYZ>
            <SRCFILE />
            <OUTFILE />
            <FIELDS>
                <FIELD_1 />
                <FIELD_2 />
                <FIELD_3 />
                <FIELD_4 />
                <FIELD_5 />
            </FIELDS>
        </XYZ>
    </BATCH>
</CONFIG>

Now, if the program were given local.xml to start with and CUS as the argument to process, I'd like to see this XML (or an equivalent Perl data structure):

<CONFIG>
    <HDR>CUSTOMER</HDR>
    <TLR>TLR${cnt}</TLR>
    <SRCFILE type="csv" delimiter="|">/path/to/src_file</SRCFILE>
    <OUTFILE>/path/to/out_file</OUTFILE>
    <LOOKUPS>
        <MAP type="file">
            <NAME>ACCT_TYPE_LOOKUP</NAME>
            <PATH>${PATH}acct_type.csv</PATH>
            <HEADERS>
                <COLUMN>ACCT_TYPE</COLUMN>
                <COLUMN>SOURCE_VALUE</COLUMN>
            </HEADERS>
            <KEYS>
                <COLUMN>SOURCE_VALUE</COLUMN>
            </KEYS>
        </MAP>
    </LOOKUPS>
    <CUS>
        <FIELD foo="abs" format="%.4f" default="0.0" col="23">QTY</FIELD>
        <FIELD default="" format="%.4f" col="21">COST</FIELD>
        <FIELD default="0" format="%.2f">AMT</FIELD>
        <FIELD default="1960-01-01" format="YYYMMDD">TRANDATE</FIELD>
        <FIELD>ACCOUNT</FIELD>
        <FIELD default="0">ACCT_TYPE</FIELD>
        <FIELDS>
            <CUSTOMER>&CUSTOMER;</CUSTOMER>
            <QTY default="0.0" col="23"></QTY>
            <COST format="%.4f" col="21"></COST>
            <FEE col="18"></FEE>
        </FIELDS>
    </CUS>
</CONFIG>

And if the program were given local.xml to start with and XYZ as the argument to process, I'd like to see this XML (or an equivalent Perl data structure):

<CONFIG>
    <HDR>ACCOUNTS</HDR>
    <TLR>TLR${cnt}</TLR>
    <SRCFILE />
    <OUTFILE />
    <LOOKUPS>
        <MAP type="file">
            <NAME>ACCT_TYPE_LOOKUP</NAME>
            <PATH>${PATH}acct_type.csv</PATH>
            <HEADERS>
                <COLUMN>ACCT_TYPE</COLUMN>
                <COLUMN>SOURCE_VALUE</COLUMN>
            </HEADERS>
            <KEYS>
                <COLUMN>SOURCE_VALUE</COLUMN>
            </KEYS>
        </MAP>
    </LOOKUPS>
    <XYZ>
        <FIELD> 1 </FIELD>
        <FIELD> 2 </FIELD>
        <FIELD> 3 </FIELD>
        <FIELDS>
            <FIELD_1 />
            <FIELD_2 />
            <FIELD_3 />
            <FIELD_4 />
            <FIELD_5 />
        </FIELDS>
    </XYZ>
</CONFIG>

Question

What is the most efficient way of merging these XML documents?

I can do it myself with the data structures returned by XML::Simple, or maybe there are some other XML tools I should use?
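For reference, doing it by hand on nested hashes might look roughly like the sketch below. The hashes are hand-written stand-ins for parsed documents (their shape is an assumption, not what XML::Simple would necessarily return for the files above), and `merge_settings` is a hypothetical helper name.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Merge $local over $global without modifying either input. Hashes are
# merged key by key; anything else (scalars, arrays) is replaced by the
# local value. Subtrees that only one side has are shared, not copied.
sub merge_settings {
    my ($global, $local) = @_;
    return $local unless ref($global) eq 'HASH' && ref($local) eq 'HASH';
    my %merged = %$global;
    for my $key (keys %$local) {
        $merged{$key} = exists $merged{$key}
            ? merge_settings($merged{$key}, $local->{$key})
            : $local->{$key};
    }
    return \%merged;
}

# Assumed shape: field name => attribute hash.
my $global = {
    QTY  => { foo => 'abs', format => '%.4f' },
    COST => { default => '', format => '%.2f' },
};
my $local = {
    QTY  => { default => '0.0', col => 23 },
    COST => { format => '%.4f', col => 21 },
    FEE  => { col => 18 },
};

my $merged = merge_settings($global, $local);
for my $field (sort keys %$merged) {
    my $attrs = $merged->{$field};
    print "$field: ", join(' ', map { qq($_="$attrs->{$_}") } sort keys %$attrs), "\n";
}
```

With these inputs QTY keeps its global `foo` and `format` and gains the local `default` and `col`, while the local `format` for COST replaces the global one.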

I hope my question is clear enough and does not need sample XML data. If you need to see something then I can post some sample stuff.

The question in brief is, what is the best way to merge a hierarchy of individual XML documents?


There are 2 best solutions below

8
Parfait On

Preface: I know nothing of Perl, but one option is XSLT, a declarative, special-purpose language for transforming XML documents.

Most general-purpose languages, such as PHP, Python, Java, and C#, have XML libraries that include an XSLT processor. So consider running an XSLT processor from Perl with a stylesheet that merges the documents; in the stylesheet you can select exactly the nodes you want.

Using your sample data, the stylesheets below render the overall XML structures for CUS and XYZ when run with local.xml as the input document. They copy the FIELD elements as they are and do not merge in the per-field attribute overrides from local.xml. Keep all of the XML files in the same directory as the stylesheet, since document() resolves relative paths against the stylesheet's location.

CUS VERSION

<?xml version="1.0" ?> 
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"> 

 <xsl:template match="CONFIG">

    <xsl:copy> 
         <xsl:copy-of select="document('Global_properties_100.xml')/CONFIG/CUS/HDR" />
         <xsl:copy-of select="document('Global_properties_100.xml')/CONFIG/CUS/TLR" />
         <xsl:copy-of select="BATCH/CUS/SRCFILE" />
         <xsl:copy-of select="BATCH/CUS/OUTFILE" />
         <xsl:copy-of select="document('Global_70.xml')/CONFIG/LOOKUPS" />
         <CUS>
            <xsl:copy-of select="document('Global_layouts_100.xml')/CONFIG/GRP1/*" />
            <xsl:copy-of select="BATCH/CUS/FIELDS" />
         </CUS>
    </xsl:copy>

 </xsl:template> 

</xsl:transform>

XYZ VERSION

<?xml version="1.0" ?> 
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"> 

 <xsl:template match="CONFIG">

    <xsl:copy> 
         <xsl:copy-of select="document('Global_properties_100.xml')/CONFIG/XYZ/HDR" />
         <xsl:copy-of select="document('Global_properties_100.xml')/CONFIG/XYZ/TLR" />
         <xsl:copy-of select="BATCH/XYZ/SRCFILE" />
         <xsl:copy-of select="BATCH/XYZ/OUTFILE" />
         <xsl:copy-of select="document('Global_70.xml')/CONFIG/LOOKUPS" />
         <XYZ>
            <xsl:copy-of select="document('Global_layouts_100.xml')/CONFIG/GRP2/*" />
            <xsl:copy-of select="BATCH/XYZ/FIELDS" />
         </XYZ>
    </xsl:copy>

 </xsl:template> 

</xsl:transform>
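
From Perl, a stylesheet like the ones above could be applied roughly as follows. This is a sketch that assumes the CPAN modules XML::LibXML and XML::LibXSLT are installed; it uses small inline strings instead of the real files so that it runs on its own, and the `batch` stylesheet parameter is a hypothetical addition that would let one stylesheet serve both CUS and XYZ.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXSLT;

# Inline stand-in for local.xml. Real code would use
# load_xml(location => $path) so that document() can find sibling files.
my $source = XML::LibXML->load_xml(string => <<'XML');
<CONFIG><BATCH><CUS><OUTFILE>/path/to/out_file</OUTFILE></CUS></BATCH></CONFIG>
XML

# Cut-down stylesheet; $batch (hypothetical) selects which group to copy.
my $style_doc = XML::LibXML->load_xml(string => <<'XSL');
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:param name="batch"/>
  <xsl:template match="CONFIG">
    <xsl:copy>
      <xsl:copy-of select="BATCH/*[name() = $batch]/OUTFILE"/>
    </xsl:copy>
  </xsl:template>
</xsl:transform>
XSL

my $stylesheet = XML::LibXSLT->new->parse_stylesheet($style_doc);

# xpath_to_string quotes the value so it is passed as a string parameter.
my $result = $stylesheet->transform(
    $source, XML::LibXSLT::xpath_to_string(batch => 'CUS'),
);
my $output = $stylesheet->output_as_chars($result);
print $output;
```

Passing the group name as a parameter is only one option; keeping two separate stylesheets, as above, works just as well.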
0
Sobrique On

I can give you a more specific example with some sample data, but when approaching this I tend to use XML::Twig.

Specifically, XML::Twig has built-in support for cut and paste, so you can build a new document tree and preserve the elements you want, in the order you want.

Something like this:

#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig -> parse ( \*DATA );

my $newdoc = XML::Twig -> new ('pretty_print' => 'indented_a');
$newdoc -> set_root ( XML::Twig::Elt -> new ( 'new_root_here' ) );
$newdoc -> set_xml_version ('1.0');
$newdoc -> set_encoding('utf-8'); 

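# Move every <value> element out of the parsed document and into the new one.
foreach my $value_elt ( $twig -> findnodes ( '//value' ) ) {
    $value_elt -> cut;
    # 'last_child' appends, so elements keep their original document order.
    $value_elt -> paste ( last_child => $newdoc -> root );
}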


$newdoc -> print;

__DATA__
<root>
   <value>fish</value>
   <dont_copy>this thing</dont_copy>
</root>

(There's another example: How to I combine data from two XML files into the same structure?)