Extract XML data in tabular format

Question

Extract XML data in tabular format

190 Views Asked by M-- At 10 March 2023 at 17:36

I have an xml file which I want to extract data from. Ultimately, what I need is a table that shows the name of the node (i.e. NODE36 and NODE44) with the information in the tables (see the desired output below).

Is there a way to use regex or an XML parser to extract the data as a table?

<?xml version="1.0" encoding="UTF-8"?>
<Document>
    <name>culverts.XML</name>
    <StyleMap id="m_ylw-pushpin29">
        <Pair>
            <key>normal</key>
            <styleUrl>#s_ylw-pushpin00</styleUrl>
        </Pair>
        <Pair>
            <key>highlight</key>
            <styleUrl>#s_ylw-pushpin_hl25</styleUrl>
        </Pair>
    </StyleMap>
<Folder>
        <name>culverts.XML</name>
        <open>1</open>
        <description>Culvert</description>
        <Placemark>
            <name>NODE36</name>
            <description><![CDATA[<br><br><br>
    <table border="1" padding="0">
    <tr><td>Objectid</td><td>1</td></tr>
    <tr><td>On_route</td><td>Mid Turnpike</td></tr>
    <tr><td>Road_numbe</td><td>54</td></tr>
    <tr><td>Recommenda</td><td>Continue to monitor.</td></tr>]]></description>
            <styleUrl>#m_ylw-pushpin29</styleUrl>
            <Point>
                <extrude>1</extrude>
                <altitudeMode>relativeToGround</altitudeMode>
                <coordinates>-74.249045,45.997986,0</coordinates>
            </Point>
        </Placemark>
        <Placemark>
            <name>NODE44</name>
            <description><![CDATA[<br><br><br>
    <table border="1" padding="0">
    <tr><td>Objectid</td><td>2</td></tr>
    <tr><td>On_route</td><td>Mid Turnpike</td></tr>
    <tr><td>Road_numbe</td><td>54</td></tr>
    <tr><td>Recommenda</td><td>Not Available.</td></tr>]]></description>
            <styleUrl>#m_ylw-pushpin29</styleUrl>
            <Point>
                <extrude>1</extrude>
                <altitudeMode>relativeToGround</altitudeMode>
                <coordinates>-74.24906300000001,45.998057,0</coordinates>
            </Point>
        </Placemark>
    </Folder>
</Document>

Desired Output:

Name	Objectid	On_route	Road_numbe	Recommenda
NODE36	1	Mid Turnpike	54	Continue to monitor.
NODE44	2	Mid Turnpike	54	Not Available.

I tried regex to extract the data between <Placemark> and </Placemark> to no avail;

library(qdapRegex)
my_tbl <- rm_between(file_str, 'Placemark', '/Placemark', extract=TRUE)[[1]]

or

my_tbl <- str_extract_all(file_str, "Placemark((.|\n)*)/Placemark")
Error in stri_extract_all_regex(string, pattern, simplify = simplify,  : 
  Regular expression backtrack stack overflow. (U_REGEX_STACK_OVERFLOW)

I cannot get this one to work in R. Although even if I could, it matches the first occurrence of <Placemark> with the last occurrence of </Placemark>; see here: https://regex101.com/r/bQOdDJ/1

Original Q&A

There are 3 best solutions below

Onyambu On 10 March 2023 at 18:01

library(rvest)
library(tidyverse)

read_html(your_page, options = "HUGE")%>%
   html_node('table')%>%
   html_table(fill = TRUE) %>%
   mutate(row = cumsum(X1 =='Objectid'))%>%
   pivot_wider(names_from = X1, values_from = X2)%>%
   type.convert(as.is =TRUE)

# A tibble: 3 × 5
    row Objectid On_route     Road_numbe Recommenda          
  <int>    <int> <chr>             <int> <chr>               
1     1        1 Mid Turnpike         54 Continue to monitor.
2     2        2 Mid Turnpike         54 Not Available.

Michael Kay On 11 March 2023 at 00:23

Quite tricky because of the nested HTML, but here is an XSLT solution that uses the parse-html() function proposed for XSLT 4.0 and implemented in Saxon 12:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   version="4.0" expand-text="yes">
  
   <xsl:output method="html" indent="yes"/>  
   <xsl:variable name="tables">
      <xsl:for-each select="//Placemark">
         <data>
            <xsl:copy-of select="name"/>
            <description>
               <xsl:sequence select="parse-html(description)//*:table"/>
            </description>
         </data>
      </xsl:for-each>
   </xsl:variable>
   
   <xsl:template match="/">
      <table>
         <thead>
            <tr>
               <th>Name</th>
               <xsl:for-each select="$tables/data[1]//*:tr">
                  <th>{*:td[1]}</th>                 
               </xsl:for-each>
            </tr>
         </thead>
         <tbody>
            <xsl:for-each select="$tables/data">
               <tr>
                  <td>{name}</td>
                  <xsl:for-each select="description//*:tr">
                     <td>{*:td[2]}</td>
                  </xsl:for-each>
               </tr>
            </xsl:for-each>
         </tbody>
      </table>
   </xsl:template>
   
</xsl:stylesheet>

The output is:

<table>
   <thead>
      <tr>
         <th>Name</th>
         <th>Objectid</th>
         <th>On_route</th>
         <th>Road_numbe</th>
         <th>Recommenda</th>
      </tr>
   </thead>
   <tbody>
      <tr>
         <td>NODE36</td>
         <td>1</td>
         <td>Mid Turnpike</td>
         <td>54</td>
         <td>Continue to monitor.</td>
      </tr>
      <tr>
         <td>NODE44</td>
         <td>2</td>
         <td>Mid Turnpike</td>
         <td>54</td>
         <td>Not Available.</td>
      </tr>
   </tbody>
</table>

**MrFlick** · Accepted Answer · 2023-03-10T18:17:15.580000

Here's a method with a helper function to turn the HTML table into a data.frame. Basically we need to do a bunch of iterations and parsing of the HTML data.

library(xml2)
library(purrr)
doc <- xml2::read_xml(xx)

table_to_dataframe <- function(x) {
  x |> xml_find_all(".//tr") |> 
    map(function(x) {
      x |> xml_find_all("./td") |> xml_text()
    }) |>
    do.call("rbind", args=_) |>
    (function(x) setNames(x[,2], x[,1]))() |>
    bind_rows()
}

doc |>
  xml_find_all("//Placemark") |>
  map_df(function(p) {
    name <- p |> xml_find_first("./name") |> xml_text()
    sub <- p |> xml_find_first("./description") |> xml_text() |> read_html()
    bind_cols(tibble(name), table_to_dataframe(sub))
  })

Which returns

  name   Objectid On_route     Road_numbe Recommenda          
  <chr>  <chr>    <chr>        <chr>      <chr>               
1 NODE36 1        Mid Turnpike 54         Continue to monitor.
2 NODE44 2        Mid Turnpike 54         Not Available.

Extract XML data in tabular format

Desired Output:

There are 3 best solutions below

Related Questions in R

Related Questions in XML

Related Questions in XML-PARSING

Related Questions in XML2

Trending Questions

Popular # Hahtags

Popular Questions