I have a locally downloaded .htm file that I cannot parse more than one level deep in R. I think this is because everything past the first level sits behind an external pointer. I cannot share the file because it contains sensitive information, but my code looks essentially like this:
library(rvest)
data_file <- "myfile.htm"
data_html <- read_html(data_file)
data_html is a list with two elements, $node and $doc. Both appear to be external pointers rather than ordinary R objects I can index into.
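To show what I mean without sharing the real file, here is a small sketch that inspects the same kind of object built from an invented HTML string. It assumes the xml2 package (which supplies read_html); the toy_* names and the markup are made up for illustration.

```r
library(xml2)

# Invented one-table page standing in for the real file.
toy_html <- read_html(
  "<html><body><table><tr><td>1</td></tr></table></body></html>"
)

# Classes that S3 method dispatch sees for this object.
toy_class <- class(toy_html)

# Drop the class attribute to look at the underlying list,
# then record the storage type of each element.
toy_parts <- unclass(toy_html)
toy_types <- vapply(toy_parts, typeof, character(1))

print(toy_class)
print(toy_types)
```

The element types are what make me think the content is not reachable with ordinary list indexing.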
Here are some things I have tried:
getNodeSet(data_html)
Error in UseMethod("xpathApply") : no applicable method for 'xpathApply' applied to an object of class "c('xml_document', 'xml_node')"
I have also tried reading the file with readLines and keeping only the lines that contain the word 'table' at least twice, but the same issue arises: the methods I apply reject the object because it is an XML document, an XML node, an internal XML document, an internal XML node, and so on. What I want is to extract the data tables from this saved local document (really, a saved web page).
library(stringr)
data_readLines <- readLines("myfile.htm")
data_readLines <- data_readLines[which(str_count(data_readLines, 'table') >= 2)]
toHTML(data_readLines[1]) %>% html_table()
Error in UseMethod("html_table") : no applicable method for 'html_table' applied to an object of class "c('XMLInternalElementNode', 'XMLInternalNode', 'XMLAbstractNode')"
I've tried some rather crazy things like:
xml_children(xml_children(xml_children(data_html)))
Which returns:
{xml_nodeset (4)}
[1] Report Help
[2] Updated 10-02-2023 13:31:29
[3] Summary Card
[4] \n\nDescr ...
But there has to be something out there that will do this. Any ideas on how to get the tables out of this mess would be great. Sorry there is no reproducible example here; I can't share the data and I don't know how to build a faithful fake one.
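The closest I can offer is a rough guess at the structure. The sketch below assumes the file holds ordinary table elements nested a few levels below the body. The markup, values, and fake_* names are invented and are not taken from my data. It searches the whole document with an XPath expression (xml2::xml_find_all) instead of descending one level at a time, then passes the matches to rvest::html_table.

```r
library(xml2)
library(rvest)

# Hypothetical stand-in for the sensitive file: invented markup and values.
fake_page <- "<html><body>
  <div><span>Report Help</span></div>
  <div><div>
    <table>
      <tr><th>Descr</th><th>Value</th></tr>
      <tr><td>alpha</td><td>1</td></tr>
      <tr><td>beta</td><td>2</td></tr>
    </table>
  </div></div>
</body></html>"

fake_html <- read_html(fake_page)

# Find every <table> element at any depth in the document.
fake_table_nodes <- xml2::xml_find_all(fake_html, "//table")

# Convert each matched node to a data frame; the result is a list.
fake_tables <- rvest::html_table(fake_table_nodes)

print(fake_tables[[1]])
```

If the real page builds its tables some other way (for example with nested div elements instead of table tags), this will not find anything, which may be part of my problem.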