
Parsing an XML with missing content


I have an XML file like this:

<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
<teiHeader> ... </teiHeader>
<text>
<body>
<head rend="Body A">DOCUMENT_TITLE</head>
<div rend="entry">
<head rend="time">TIME_1</head>
<p rend="Body A"> INFORMATION A</p>
<p rend="content"> <hi rend="italic"> CONTENT1 </hi> </p>
</div>
<div rend="entry">
<head rend="time">TIME_2</head>
<p rend="Body A"> INFORMATION A</p>
<p rend="Body A"> INFORMATION A</p>
</div>
<div rend="entry">
<head rend="time">TIME_3</head>
<p rend="Body A"> INFORMATION A</p>
<p rend="content">
<hi rend="italic"> CONTENT3 </hi>
</p>
<div rend="entry">
<p rend="Body A"> INFORMATION A</p>
<p rend="content">
<hi rend="italic"> CONTENT4 </hi>
</p>
</div>
</body>
</text>
</TEI>

... with many missing elements. I would like to obtain a data.frame with one row for each "div", like the following:

div time content
1 time1 content1
2 time2 NA
3 time3 content3
4 NA content4

with NA where an element is missing.

I tried an approach like this:

data_xml <- read_xml(xmlfile)
div <-xml_find_all(data_xml, xpath = ".//div")
df <- tibble::tibble(
  date = div %>% xml_text(),
  content = div %>% xml_find_first('./p[@rend="content"/hi[@rend="italic"]]') %>% xml_text()
)

but xml_find_all returns an empty list. Following some suggestions, I tried the following, which does work:

doc <- htmlParse(xmlfile)

div <- getNodeSet(doc, '//div')
dates<- xpathSApply(doc,'//div/text()',xmlValue)
abstracts<-unlist(xpathSApply(doc,'//p[@rend="content"]//hi[@rend="italic"]',xmlValue))

I correctly obtained the strings I wanted, BUT I lost the correspondence between them, since many divs have no content or no head with time information (meaning that div, dates and abstracts have different lengths). Any suggestions? TIA

Best answer by G. Grothendieck

1) The input shown is malformed, so read_xml will give an error. Since the question indicates that reading it worked, there must have been a transcription error when the XML was copied into the question. We have added a closing div tag before the 4th opening div tag; the corrected input is in the Note at the end.

Since the XML uses a namespace, first strip it using xml_ns_strip to avoid problems. Then form an xpath expression that selects the needed nodes and convert those to dcf format in the variable dcf. (dcf is a name:value format in which each field is on a separate line and a blank line separates records; see ?read.dcf for details.) Read that using read.dcf, convert the resulting character matrix to a data frame and replace the empty div entries with row numbers.

library(dplyr)
library(xml2)

doc <- read_xml(Lines) %>% xml_ns_strip() # Lines in Note below

nodes <- doc %>%
  xml_find_all('//div | //head[@rend="time"] | //hi[@rend="italic"]')

# each div starts a new record; head nodes hold the time, the remaining hi nodes the content
dcf <- case_match(xml_name(nodes),
  "div" ~ "\ndiv:",
  "head" ~ paste0("time:", xml_text(nodes)),
  .default = paste0("content:", xml_text(nodes))
)

dcf %>%
  textConnection() %>%
  read.dcf() %>%
  as.data.frame() %>%
  mutate(div = row_number())

giving

  div   time  content
1   1 TIME_1 CONTENT1
2   2 TIME_2     <NA>
3   3 TIME_3 CONTENT3
4   4   <NA> CONTENT4
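
As a small illustration of the dcf format that approach 1 relies on, the sketch below parses a few hand-written records instead of ones generated from the XML. The values in dcf_lines are made up for the example; the point is only that read.dcf fills a field with NA when a record lacks it.

```r
# Hand-written dcf text (illustrative values, not produced from the XML):
# one "name: value" field per line, a blank line between records
dcf_lines <- c(
  "time: TIME_1",
  "content: CONTENT1",
  "",
  "time: TIME_2",
  "",
  "content: CONTENT4"
)

con <- textConnection(dcf_lines)
m <- read.dcf(con)   # character matrix, one row per record
close(con)

# Fields absent from a record come back as NA
df_dcf <- as.data.frame(m)
df_dcf
```

This is why the missing time or content of an entry ends up as NA without any explicit bookkeeping.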

2) Another way is to use xml_find_all twice. The first call creates a node set and the second creates a list of node sets, with one component per record because flatten=FALSE. These are then reshaped into a data frame.

library(purrr)
doc %>%
  xml_find_all('//div') %>%
  xml_find_all(".//head | .//hi", flatten = FALSE) %>%
  map_df(~ setNames(xml_text(.x, TRUE), xml_name(.x))) %>%
  reframe(div = row_number(), time = head, content = hi)
## # A tibble: 4 × 3
##     div time   content 
##   <int> <chr>  <chr>   
## 1     1 TIME_1 CONTENT1
## 2     2 TIME_2 <NA>    
## 3     3 TIME_3 CONTENT3
## 4     4 <NA>   CONTENT4
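
To see what flatten = FALSE changes, here is a minimal sketch on a made-up two-entry document without a namespace (the xml_small string is an assumption for illustration, not the question's data). It compares the default flattened result with the per-div list.

```r
library(xml2)

# Made-up input: the second div has no <hi> node
xml_small <- '<body>
<div><head>TIME_1</head><p><hi>CONTENT1</hi></p></div>
<div><head>TIME_2</head></div>
</body>'

divs <- xml_find_all(read_xml(xml_small), "//div")

# Default: one flat node set, so the link between node and div is lost
flat <- xml_find_all(divs, ".//head | .//hi")

# flatten = FALSE: a list with one node set per div
nested <- xml_find_all(divs, ".//head | .//hi", flatten = FALSE)

length(flat)
sapply(nested, length)
```

Keeping one node set per div is what lets each record be turned into its own row, with absent nodes simply missing from that row.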

3) This third alternative is a bit closer to the attempt in the question except it uses xml_find_first separately for each column.

column <- function(start, xpath) {
  start %>% xml_find_first(xpath) %>% xml_text(TRUE)
}

div_nodes <- doc %>% xml_find_all('//div')
tibble(div = seq_along(div_nodes),
       time = column(div_nodes, ".//head"),
       content = column(div_nodes, ".//hi")
) 
## # A tibble: 4 × 3
##     div time   content 
##   <int> <chr>  <chr>   
## 1     1 TIME_1 CONTENT1
## 2     2 TIME_2 <NA>    
## 3     3 TIME_3 CONTENT3
## 4     4 <NA>   CONTENT4
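
All three approaches rely on xml_ns_strip. If the namespace should be kept, a possible alternative is to use the prefix that xml_ns assigns to the default namespace (d1) in every xpath step. This also suggests why a plain ".//div" finds nothing in a namespaced document. The following is a minimal sketch on a made-up two-entry document; the xml_ns_demo string is an assumption, not the data from the question.

```r
library(xml2)

# Made-up namespaced input: the second div has no <head>
xml_ns_demo <- '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<div><head rend="time">TIME_1</head></div>
<div><p>no head here</p></div>
</body></text></TEI>'

doc_ns <- read_xml(xml_ns_demo)

# An unprefixed name matches only elements in no namespace, so this is empty
n_plain <- length(xml_find_all(doc_ns, "//div"))

# xml_ns() maps the default namespace to the prefix d1
ns <- xml_ns(doc_ns)
divs_ns <- xml_find_all(doc_ns, "//d1:div", ns)

# xml_find_first returns one result per div; a missing head becomes NA
times <- xml_text(xml_find_first(divs_ns, "./d1:head", ns))
times
```

Stripping the namespace is simpler when the document has only one; the prefixed form may be preferable when several namespaces must be told apart.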

Note

Lines <- '<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
<teiHeader> ... </teiHeader>
<text>
<body>
<head rend="Body A">DOCUMENT_TITLE</head>
<div rend="entry">
<head rend="time">TIME_1</head>
<p rend="Body A"> INFORMATION A</p>
<p rend="content"> <hi rend="italic"> CONTENT1 </hi> </p>
</div>
<div rend="entry">
<head rend="time">TIME_2</head>
<p rend="Body A"> INFORMATION A</p>
<p rend="Body A"> INFORMATION A</p>
</div>
<div rend="entry">
<head rend="time">TIME_3</head>
<p rend="Body A"> INFORMATION A</p>
<p rend="content">
<hi rend="italic"> CONTENT3 </hi>
</p>
</div>
<div rend="entry">
<p rend="Body A"> INFORMATION A</p>
<p rend="content">
<hi rend="italic"> CONTENT4 </hi>
</p>
</div>
</body>
</text>
</TEI>'