R - Remove some nodes from XML file


I am stuck on the following problem in R: I would like to remove the nodes "uid", "seanceRef" and "sessionRef" from the XML file below.

I tried remove_node(), but it does not seem to work. How can I do that?

<?xml version='1.0' encoding='UTF-8'?>
<compteRendu xmlns="http://schemas.assemblee-nationale.fr/referentiel">
  <uid>CRSANR5L16S2022E1N003</uid>
  <seanceRef>RUANR5L16S2022IDS26199</seanceRef>
  <sessionRef>SCR5A2022E1</sessionRef>
  <metadonnees> 
    I want to keep metadonnees
  </metadonnees>
</compteRendu>


Answer by Allan Cameron (best answer)

You are not including the xml namespace prefix in your xpath.

If we look at your original document:

library(xml2)

doc <- read_xml('test.xml')

doc
#> {xml_document}
#> <compteRendu xmlns="http://schemas.assemblee-nationale.fr/referentiel">
#> [1] <uid>CRSANR5L16S2022E1N003</uid>
#> [2] <seanceRef>RUANR5L16S2022IDS26199</seanceRef>
#> [3] <sessionRef>SCR5A2022E1</sessionRef>
#> [4] <metadonnees>\n    I want to keep metadonnees\n  </metadonnees>

Then we see that the default XML namespace is declared on the root element (the second line of the printed output). All nodes belonging to this namespace have to be referred to by a namespace prefix in XPath, otherwise they will not be found:

xml_find_all(doc, '//uid')
#> {xml_nodeset (0)}

xml2 assigns the prefix d1 to a default namespace, and we can confirm which prefix is in use by doing:

xml_ns(doc)
#> d1 <-> http://schemas.assemblee-nationale.fr/referentiel

So we can get the node(s) we want by doing:

remove_me <- xml_find_all(doc, '//d1:uid')

remove_me
#> {xml_nodeset (1)}
#> [1] <uid>CRSANR5L16S2022E1N003</uid>

And to remove this node we can do:

xml_remove(remove_me)

doc
#> {xml_document}
#> <compteRendu xmlns="http://schemas.assemblee-nationale.fr/referentiel">
#> [1] <seanceRef>RUANR5L16S2022IDS26199</seanceRef>
#> [2] <sessionRef>SCR5A2022E1</sessionRef>
#> [3] <metadonnees>\n    I want to keep metadonnees\n  </metadonnees>
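
The question asks for three nodes to be removed, not only uid. As a minimal sketch, assuming the same document structure as in the question (inlined here as a string so the example does not depend on test.xml), the three names can be combined in one XPath union and removed in a single call:

```r
library(xml2)

# Sample document copied from the question, inlined for a self-contained example
xml_string <- '<?xml version="1.0" encoding="UTF-8"?>
<compteRendu xmlns="http://schemas.assemblee-nationale.fr/referentiel">
  <uid>CRSANR5L16S2022E1N003</uid>
  <seanceRef>RUANR5L16S2022IDS26199</seanceRef>
  <sessionRef>SCR5A2022E1</sessionRef>
  <metadonnees>I want to keep metadonnees</metadonnees>
</compteRendu>'

doc <- read_xml(xml_string)

# XPath union of the three element names, each qualified with the d1 prefix
unwanted <- xml_find_all(doc, "//d1:uid | //d1:seanceRef | //d1:sessionRef")

# xml_remove() modifies doc in place
xml_remove(unwanted)

# Names of the child elements that are left under the root
remaining <- xml_name(xml_children(doc))
```

After this, write_xml(doc, "output.xml") would save the edited document; the output file name is only illustrative.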

Depending on your use case, you may find it easier to strip the namespace from your xml altogether to make the xpath easier to work with:

doc <- read_xml('test.xml')
xml_ns_strip(doc)
xml_find_all(doc, '//uid')
#> {xml_nodeset (1)}
#> [1] <uid>CRSANR5L16S2022E1N003</uid>
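
Note that xml_ns_strip() removes the default namespace from the document itself, so a file written afterwards no longer carries the xmlns declaration. If that matters, one alternative is to keep the namespace and only give it a more readable prefix with xml_ns_rename(). The sketch below assumes a cut-down version of the question's document, and the prefix "an" is an arbitrary name chosen for illustration:

```r
library(xml2)

# Cut-down sample document, inlined so the example is self-contained
doc <- read_xml('<compteRendu xmlns="http://schemas.assemblee-nationale.fr/referentiel">
  <uid>CRSANR5L16S2022E1N003</uid>
  <metadonnees>I want to keep metadonnees</metadonnees>
</compteRendu>')

# Rename the auto-generated d1 prefix to "an" (hypothetical name, pick any)
ns <- xml_ns_rename(xml_ns(doc), d1 = "an")

# Pass the renamed namespace map so the XPath can use the new prefix
hits <- xml_find_all(doc, "//an:uid", ns = ns)
found <- xml_text(hits)

# The document is untouched, so the namespace is still on the root element
root_ns <- unname(xml_attr(doc, "xmlns"))
```

The renamed map only affects how XPath expressions are written; it does not change the document.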

Answer by Parfait

For OP or future readers, consider also XSLT, the special-purpose language designed to transform XML files, especially if the logic to remove nodes becomes complex and conditional. You can run XSLT 1.0 scripts with the xslt package (sister to xml2). Do note: XSLT is an industry-standard language, so the same stylesheet is portable to other languages (Java, Python, PHP) and to command-line processors.

Specifically, run the Identity Transform template to copy the XML as is, and then write an empty template that matches the nodes to remove. Because the document declares a default namespace, bind that namespace URI to a prefix in the stylesheet (ref below) so the empty template can reference those nodes.

XSLT (save as .xsl file, a special .xml file)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
           xmlns:ref="http://schemas.assemblee-nationale.fr/referentiel">
    <xsl:output method="xml" encoding="utf-8" indent="yes"/>
    <xsl:strip-space elements="*"/>
    
    <!-- IDENTITY TRANSFORM -->
    <xsl:template match="node()|@*">
     <xsl:copy>
       <xsl:apply-templates select="node()|@*"/>
     </xsl:copy>
    </xsl:template>
    
    <!-- EMPTY TEMPLATE TO REMOVE ALL SUCH NODES -->
    <xsl:template match="ref:uid|ref:seanceRef|ref:sessionRef"/>
    
</xsl:stylesheet>

R

library(xml2)
library(xslt)

# READ XML AND XSLT
doc <- read_xml("Input.xml")
style <- read_xml("Script.xsl")

# RUN TRANSFORMATION
doc <- xml_xslt(doc, style)

# SAVE EDITED DOC TO FILE
write_xml(doc, "Output.xml")
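
To try the approach without creating any files, the document and the stylesheet can both be passed to read_xml() as strings. This is only a sketch under that assumption: it inlines the question's sample document and an identity-transform stylesheet whose empty template matches all three nodes, then checks which children survive.

```r
library(xml2)
library(xslt)

# Sample document from the question, inlined for a self-contained example
doc <- read_xml('<compteRendu xmlns="http://schemas.assemblee-nationale.fr/referentiel">
  <uid>CRSANR5L16S2022E1N003</uid>
  <seanceRef>RUANR5L16S2022IDS26199</seanceRef>
  <sessionRef>SCR5A2022E1</sessionRef>
  <metadonnees>I want to keep metadonnees</metadonnees>
</compteRendu>')

# Identity transform plus an empty template for the nodes to drop;
# the ref prefix is bound to the document's default namespace URI
style <- read_xml('<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:ref="http://schemas.assemblee-nationale.fr/referentiel">
  <xsl:output method="xml" encoding="utf-8" indent="yes"/>
  <xsl:template match="node()|@*">
    <xsl:copy>
      <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="ref:uid|ref:seanceRef|ref:sessionRef"/>
</xsl:stylesheet>')

# xml_xslt() returns a new document; the input doc is left unchanged
result <- xml_xslt(doc, style)

# Names of the child elements under the root of the transformed document
kept <- xml_name(xml_children(result))
```

Unlike xml_remove(), the transformation does not modify doc, so the original remains available for comparison.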