Web scraping issue in R (xml_nodeset)


I am using the rvest and RSelenium packages in R to extract data from a website.

My current code is the following:

install.packages("RSelenium")
library(RSelenium)
library(rvest)   # needed later for read_html(), html_nodes() and the %>% pipe

rD <- rsDriver(browser = "chrome", port = 4447L, geckover = NULL,
               chromever = "latest", iedrver = NULL,
               phantomver = NULL)
remDr <- rD[["client"]] 

remDr$navigate("https://data.anbima.com.br/fundos?page=1&size=100&classe_anbima=A%C3%A7%C3%B5es&tipo_anbima=&benchmark=")

remDr$findElements("id", "item-title-1")[[1]]$clickElement() 

# Extracting data from HTML
Sys.sleep(5) # give the page time to fully load
html <- remDr$getPageSource()[[1]]

primeiro_aporte <- read_html(html) %>% # parse HTML
  html_nodes(xpath='//*[@id="output__container--primeiroAporte"]/div/span')  
primeiro_aporte

The code above outputs the following:

> {xml_nodeset (1)}
[1] <span class="anbima-ui-output__value">27/11/2020</span>

However, what I actually need is the value itself (in this case, 27/11/2020), not the node. Nothing I have tried so far has worked. I would appreciate any help! Thanks!


1 Answer

Answered by Russ

Not sure if you're still looking to solve this, but you can extract the text of an HTML element like this:

primeiro_aporte %>% html_text()

[1] "27/11/2020"

This also works for attributes; for example, you can extract the href attribute of every link on the page like this:

html %>%
  read_html() %>%
  html_nodes("a") %>%
  html_attr("href")
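Judging by the span in your output, the other labelled values on that fund page likely share the anbima-ui-output__value class, so a CSS selector could pull them all in one pass. A sketch under that assumption (the class name is taken from your printed node):

read_html(html) %>%
  html_nodes(".anbima-ui-output__value") %>%   # every output value on the page, assuming they share this class
  html_text(trim = TRUE)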