How to split a single XML file into multiple XML files based on a given tag, naming each file after that tag's attribute?


388 Views Asked by Dani At 18 July 2023 at 20:14

I have several large XML archives. I need to split each one at its main node, writing every node to its own file named after the node's title attribute, e.g.:

<book title="ABC123" year="2000">
  <description>Some sentences...</description>
  <img src="image/cover_ABC123" />
</book>

And export it as ABC123.xml

I found a script that partially solves my problem, but it numbers the output files sequentially (01.xml, 02.xml, etc.). I need to adapt it to my case, but I can't figure out how:

(source: https://stackoverflow.com/a/56889282/17486393)

#!/usr/bin/env bash
xmlfile=file.xml

n=$(xmlstarlet sel -t -v 'count(//ORDER)' file.xml)
for i in $(seq 1 $n); do
   xmlstarlet sel -t -m "//ORDER[${i}]" -c . $xmlfile > "File${i}.xml"
done

I tried adding the option -e "title" to extract the name as well:

#!/usr/bin/env bash
xmlfile=file.xml

n=$(xmlstarlet sel -t -v 'count(//book -e "title")' file.xml)
for i in $(seq 1 $n); do
   xmlstarlet sel -t -m "//ORDER book -e "title"[${i}]" -c . $xmlfile > "File${i}.xml"
done

But I get:

xsl:for-each : could not compile select expression '//product -e title [1]

I tried this one instead, but I didn't understand it either:

(https://stackoverflow.com/a/36156617/17486393)

$ for ((i=1; i<=`xmlstarlet sel -t -v 'count(/root/row)'  1.xml`; i++)); do \
          echo '<?xml version="1.0" encoding="UTF-8"?><root>' > NAME.xml;
          NAME=$(xmlstarlet sel -t -m '/root/row[position()='$i']' -v './NAME' 1.xml); \
          xmlstarlet sel -t -m '/root/row[position()='$i']' -c . -n 1.xml >> $NAME.xml; \
          echo '</root>' >> NAME.xml
       done

And I changed it into:

$ for ((i=1; i<=`xmlstarlet sel -t -v 'count(/root/book)'  books01.xml`; i++)); do \
          NAME=$(xmlstarlet sel -t -m '/root/book[position()='$i']' -v './NAME' books01.xml); \
          xmlstarlet sel -t -m '/root/book[position()='$i']' -c . -n books01.xml >> $NAME.xml;
       done

It doesn't produce any files...

If possible I'd like to use xmlstarlet.


There are 2 best solutions below

urznow On 19 July 2023 at 15:58

If possible I'd like to use xmlstarlet

xmlstarlet supports the EXSLT exsl:document element, which writes multiple result documents into an existing directory. For example:

# shellcheck  shell=sh
cat <<'HERE' | xmlstarlet transform \
  /dev/stdin -s outDir="${TMPDIR:-/tmp}" input.xml
<xsl:transform version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform" 
  xmlns:exsl="http://exslt.org/common"
  extension-element-prefixes="exsl"
>
  <xsl:param name="outDir" select="'/tmp'"/>
  <xsl:template match="/">
    <xsl:for-each select="//book">
      <exsl:document 
        href="{concat($outDir,'/',translate(@title,'/','_'),'.xml')}" 
        method="xml" 
        omit-xml-declaration="yes"
        indent="no"
      >
        <xsl:copy-of select="."/>
      </exsl:document>
    </xsl:for-each>
  </xsl:template>
</xsl:transform>
HERE

where

xmlstarlet transform is an XSLT processor
the XSLT stylesheet is read from stdin
the output directory can be passed on the command line, e.g. ${TMPDIR} or file://${TMPDIR}
any / (slash) characters in the book title are replaced with _ (underscore)
if book titles are not unique, only the last book with a given title survives, because later files overwrite earlier ones
concat and translate are standard XPath 1.0 functions
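
The numbered loop from the question can also be adapted directly, at the cost of re-reading the input for every book. The sketch below is one possible way to do that, not a definitive implementation. It reads the title attribute with -v instead of looking for a NAME child element; the sample <book> has no such child, so in the question's version the name would come back empty and, if the path matches at all, output would land in a hidden file called .xml. The sample file, its root element name, the second book title, and the temporary directory are assumptions made so the example is self-contained.

```shell
#!/usr/bin/env bash
# Sketch: one xmlstarlet call per <book>, output named after its title attribute.
# The sample input and the scratch directory are assumptions made for the demo.
workdir=$(mktemp -d)
xmlfile="$workdir/books01.xml"
cat > "$xmlfile" <<'XML'
<root>
  <book title="ABC123" year="2000"><description>Some sentences...</description></book>
  <book title="DEF/456" year="2001"><description>More sentences...</description></book>
</root>
XML

# Number of <book> elements anywhere in the document
n=$(xmlstarlet sel -t -v 'count(//book)' "$xmlfile")

for ((i = 1; i <= n; i++)); do
  # (//book)[i] is the i-th book in document order; "/" in a title becomes "_"
  name=$(xmlstarlet sel -t -v "translate((//book)[$i]/@title,'/','_')" "$xmlfile")
  [ -n "$name" ] || continue   # skip books that have no title
  # Copy the whole element, followed by a newline, into <title>.xml
  xmlstarlet sel -t -c "(//book)[$i]" -n "$xmlfile" > "$workdir/$name.xml"
done

ls "$workdir"
```

Because (//book)[i] is re-evaluated on every pass, this scales poorly on very large files; the exsl:document approach reads the input only once.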

Or, if you're confident that lines in the input XML file are relatively short and therefore handled by standard text tools (hint: getconf LINE_MAX), you could have xmlstarlet select add a delimiter line before each XML <book> section and use awk to split the output into separate files:

# shellcheck  shell=sh
delim=$(uuidgen)
xmlstarlet select -t \
  -m '//book' \
    -o "${delim}$(printf '\t')${TMPDIR:-/tmp}/" -v 'translate(@title,"/","_")' -o '.xml' -n \
    -c '.' -n \
input.xml |
awk -F'[\t\n]' -v sep="${delim}" '$1==sep{close(out);out=$2;next;}{print>out;}'
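
To see what the awk step does in isolation, the sketch below feeds it a hand-written stream in the same shape that xmlstarlet select would emit: a delimiter line carrying the target path after a tab, followed by the lines of the copied <book> element. The fixed delimiter string, the scratch directory, and the sample titles are assumptions made for illustration; the awk program is a slightly more defensive variant of the one above, not a replacement for it.

```shell
#!/bin/sh
# Stand-in for the xmlstarlet output: "delimiter<TAB>path", then the XML lines.
delim='DELIM-0000'     # hypothetical fixed delimiter (the pipeline above uses uuidgen)
outdir=$(mktemp -d)    # scratch directory for the demo

{
  printf '%s\t%s\n' "$delim" "$outdir/ABC123.xml"
  printf '%s\n' '<book title="ABC123" year="2000">' \
                '  <description>Some sentences...</description>' \
                '</book>'
  printf '%s\t%s\n' "$delim" "$outdir/XYZ789.xml"
  printf '%s\n' '<book title="XYZ789" year="2001"/>'
} |
awk -F'\t' -v sep="$delim" '
  # Delimiter line: close the previous file, switch to the path in field 2
  $1 == sep { if (out != "") close(out); out = $2; next }
  # Any other line: write it to the file currently selected
  out != "" { print > out }
'

ls "$outdir"
```

Closing each file before opening the next keeps the number of open file descriptors at one, which matters when the input contains thousands of books.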

**Michael Kay** · Accepted Answer · 2023-07-18T22:23:18.000000

In XSLT 2.0 or later:

<xsl:transform version="2.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      
   <xsl:template match="/">
      <xsl:for-each select="//book">
        <xsl:result-document href="{@title}.xml">
          <xsl:copy-of select="."/>
        </xsl:result-document>
      </xsl:for-each>
    </xsl:template>

</xsl:transform>

To execute this with SaxonJ from the command line (on one line):

java -jar dir/SaxonHE12-3J/saxon-he-12.3.jar -s:input.xml -xsl:stylesheet.xsl -o:out/output.xml -t

The resulting output.xml file will be essentially empty; the multiple files produced by xsl:result-document will be in the same directory as output.xml. The -t option logs each output file as it is written.

If you prefer a GUI tool, many popular XML editors have Saxon (or another XSLT 2.0 processor) integrated.
