How to extract the inner text or inner HTML from an HTML node in R


I am trying to extract the date from an HTML node using R. This script used to work fine, but I think the webpage has changed somewhat, and it now returns N/A.

library(rvest)  # provides read_html(), html_nodes(), html_text() and %>%

webpage = read_html('https://www.longpaddock.qld.gov.au/aussiegrass')
results = webpage %>% html_nodes("#graph-out-last-updated")
webpage_date_chr = html_text(results)
webpage_date_chr  ## This should print the date!!!

The image shows the node and the date, but I cannot extract the date. Any help would be amazing!!

Cheers


Answer by timnus

It looks like the page is dynamic: its content is filled in by JavaScript after the page loads, so read_html() on its own only sees the page before the date is there. You'll need a headless browser, such as the one provided by the chromote package, to load the page before scraping it.

I've edited your code to something that returned the date "21 June 2023" for me.

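library(chromote)
library(rvest)  # also re-exports %>%, so tidyverse is not required

# Load `url` in the headless browser session `b`, give the page's JavaScript
# time to run, then return the rendered <body> as an rvest document.
chromote_scrape <- function(b, url) {
  b$Page$navigate(url)
  Sys.sleep(2)  # crude fixed wait; increase it if the date comes back empty
  doc <- b$DOM$getDocument()
  body <- b$DOM$querySelector(doc$root$nodeId, "body")
  read_html(b$DOM$getOuterHTML(body$nodeId)$outerHTML)
}

b <- ChromoteSession$new()
webpage <- chromote_scrape(b, 'https://www.longpaddock.qld.gov.au/aussiegrass')
results <- webpage %>% html_nodes("#graph-out-last-updated")
webpage_date_chr <- html_text(results)
webpage_date_chr  ## This should print the date!!!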

For more info, read section 25.7 of R4DS here: https://r4ds.hadley.nz/webscraping.html
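
A fixed Sys.sleep(2) can be too short on a slow connection and wasteful on a fast one. As a rough sketch of an alternative, the helper below polls until some text shows up or a timeout expires. wait_for_text() and fake_page() are hypothetical names made up for this illustration, not part of chromote or rvest, and fake_page() only stands in for a real page so the example runs without a browser.

```r
# Hypothetical helper (not from chromote/rvest): call `get_text()` repeatedly
# until it returns a single non-empty string, or stop after `timeout` seconds.
wait_for_text <- function(get_text, timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    txt <- get_text()
    if (length(txt) == 1 && !is.na(txt) && nzchar(trimws(txt))) {
      return(trimws(txt))
    }
    if (Sys.time() >= deadline) {
      stop("Timed out waiting for non-empty text")
    }
    Sys.sleep(interval)
  }
}

# Stand-in for a page that is still loading: returns "" on the first two
# calls and the date string on the third.
calls <- 0
fake_page <- function() {
  calls <<- calls + 1
  if (calls < 3) "" else "21 June 2023"
}

result <- wait_for_text(fake_page, timeout = 5, interval = 0.1)
result
```

With a real session, get_text would presumably be a function that re-reads the body HTML from the browser (without navigating again) and returns html_text() for the "#graph-out-last-updated" node.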

Answer by margusl

Those details are sourced from a single JSON file and, guessing from the URL, that endpoint should be fairly stable: https://www.longpaddock.qld.gov.au/data/aussie-grass-graphs.json

The structure itself is nothing special, yet it is not ideal for automatic rectangling to a data.frame. Here's one example of how one might approach this with purrr, tibble and tidyr:

library(dplyr)
library(tidyr)
library(purrr)
library(lubridate)

grass_graphs <- jsonlite::fromJSON("https://www.longpaddock.qld.gov.au/data/aussie-grass-graphs.json")

lga <- grass_graphs %>% 
  # turn list inside out, top level transforms from 
  # (ACT, NSW, NT, ...) to (lga, subibra)
  list_transpose() %>% 
  # work only with lga items
  pluck("lga") %>%
  # turn list into 2 column name-value tibble, values being lists
  tibble::enframe(name = "region", value = "lga_list") %>% 
  # unnest list columns to longer or wider, one level at a time
  unnest_longer(lga_list, indices_to = "lga") %>% 
  unnest_wider(lga_list) %>% 
  unnest_wider(gif:txt, names_sep = ".") %>% 
  # convert unix timestamps
  mutate(across(ends_with("date"), ~ as_datetime(.x) %>% as_date()))
lga
#> # A tibble: 555 × 8
#>    region gif.size gif.date   pdf.size pdf.date   txt.size txt.date   lga       
#>    <chr>     <int> <date>        <int> <date>        <int> <date>     <chr>     
#>  1 ACT       44134 2018-07-31   174970 2023-06-20   286953 2023-06-20 Act       
#>  2 NSW       43763 2018-07-31   176189 2023-06-20   286953 2023-06-20 AlburyCit…
#>  3 NSW       44483 2018-07-31   176482 2023-06-20   286953 2023-06-20 ArmidaleR…
#>  4 NSW       43769 2018-07-31   175293 2023-06-20   286953 2023-06-20 BallinaSh…
#>  5 NSW       44956 2018-07-31   177756 2023-06-20   286953 2023-06-20 Balranald…
#>  6 NSW       44267 2018-07-31   176878 2023-06-20   286953 2023-06-20 BathurstR…
#>  7 NSW       43701 2018-07-31   173621 2023-06-20   286953 2023-06-20 BaysideCo…
#>  8 NSW       44216 2018-07-31   176428 2023-06-20   286953 2023-06-20 BegaValle…
#>  9 NSW       43996 2018-07-31   175817 2023-06-20   286953 2023-06-20 Bellingen…
#> 10 NSW       43992 2018-07-31   176204 2023-06-20   286953 2023-06-20 BerriganS…
#> # ℹ 545 more rows

Extract file sizes and dates for a single Shire / LGA:

lga %>% filter(stringr::str_detect(lga, "Aurukun"))
#> # A tibble: 1 × 8
#>   region gif.size gif.date   pdf.size pdf.date   txt.size txt.date   lga        
#>   <chr>     <int> <date>        <int> <date>        <int> <date>     <chr>      
#> 1 QLD       51200 2018-07-31   184589 2023-06-20   286953 2023-06-20 AurukunShi…

Created on 2023-06-28 with reprex v2.0.2
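
One thing to watch when converting Unix timestamps: lubridate's as_datetime() uses UTC by default, so the calendar date can differ from the one shown in local (Queensland) time. The base R sketch below illustrates the effect with a made-up timestamp; the value of ts is an assumption chosen for the example and is not taken from the JSON file, so treat this as one possible reason a converted date might be a day off, not a confirmed one.

```r
# Illustrative Unix timestamp (NOT from the JSON): 2023-06-20 20:00:00 UTC
ts <- 1687291200

stamp <- as.POSIXct(ts, origin = "1970-01-01", tz = "UTC")

# The same instant rendered as a calendar date in two time zones
utc_day      <- format(stamp, "%Y-%m-%d", tz = "UTC")
brisbane_day <- format(stamp, "%Y-%m-%d", tz = "Australia/Brisbane")

utc_day       # "2023-06-20"
brisbane_day  # "2023-06-21"
```

If local dates matter, as_datetime() accepts a tz argument, e.g. tz = "Australia/Brisbane", before the as_date() step.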