How can I scrape a table from the website in the question

Question

How can I scrape a table from the website in the question

52 Views Asked by MCP_infiltrator At 22 February 2024 at 13:02

I am trying to copy a table from a webpage, there are going to be many as I am trying to get the versions of the data for each dataset, I am trying to get at least one table but failing. Scraping is not my thing, maybe it is obvious how to get it but not to me.

Here is my code:

url <- "https://data.cms.gov/provider-characteristics/medicare-provider-supplier-enrollment/medicare-fee-for-service-public-provider-enrollment/api-docs"

html <- rvest::read_html(url)
> html |> rvest::html_node(".table")
{xml_missing}
<NA>

And

> html |>
 rvest::html_node(xpath = "/html/body/div/div/div/div/div/div/div[2]/div[2]/div/div/table/tbody")
{xml_missing}
<NA>

And

html |>
  rvest::html_node("tbody")

The output showed that javascript needed to be enabled. The first answer works for the current dataset and the second answer I posted works to get most of the distributions where accessURL and title are present in the output.

Original Q&A

There are 2 best solutions below

MCP_infiltrator On 22 February 2024 at 16:09

I was also able to get most of what I want by doing the following:

url <- "https://data.cms.gov/data.json"

data_sets <- httr2::request(url) |>
  httr2::req_perform() |>
  httr2::resp_body_json(check_type = FALSE, simplifyVector = TRUE)

data_tbl <- data_sets$dataset |>
  dplyr::tibble() |>
  dplyr::select(title, modified, keyword, description, contactPoint, identifier) |>
  tidyr::unnest(cols = c(keyword, contactPoint)) |>
  dplyr::select(-`@type`) |>
  dplyr::mutate(hasEmail = stringr::str_remove(hasEmail, "mailto:"))

data_distribtution <- function(.data){
  df <- .data
  
  dplyr::tibble(data = df[["dataset"]][["distribution"]]) |>
    dplyr::mutate(cnm = purrr::map(data, \(x) names(x))) |>
    dplyr::mutate(title_in_col = purrr::map(cnm, \(x) "title" %in% x) |> unlist()) |>
    dplyr::mutate(accessURL_in_col = purrr::map(cnm, \(x) "accessURL" %in% x) |> unlist()) |>
    dplyr::filter(title_in_col == TRUE & accessURL_in_col == TRUE) |>
    dplyr::select(data) |>
    dplyr::mutate(data = purrr::map(
      data, \(x) x |>
        dplyr::select(title, modified, temporal, accessURL, 
                      downloadURL, mediaType) |>
        dplyr::mutate(data_link = dplyr::coalesce(accessURL, downloadURL)) |>
        dplyr::mutate(mediaType = ifelse(is.na(mediaType), "api", mediaType)) |>
        dplyr::mutate(modified = as.Date(modified)) |>
        dplyr::select(-accessURL, -downloadURL) |>
        dplyr::arrange(dplyr::desc(modified), mediaType)
    )) |>
    tidyr::unnest(cols = c(data))
}

Sample Output

> data_distribtution(.data = data_sets) |>
+   head(1) |>
+   dplyr::glimpse()
Rows: 1
Columns: 5
$ title     <chr> "Accountable Care Organization Participants : 2024-01-24"
$ modified  <date> 2024-01-29
$ temporal  <chr> "2024-01-01/2024-12-31"
$ mediaType <chr> "api"
$ data_link <chr> "https://data.cms.gov/data-api/v1/dataset/9767cb68-8ea9-4f0b-8179-9431abc89f1…

**datawookie** · Accepted Answer · 2024-02-22T13:18:54.143000

Unfortunately this approach is not going to work. The tables in the page you're looking at are generated via JavaScript. The rvest::read_html(url) call will retrieve the static content on that page but will not execute any (dynamic) JavaScript.

But there is an API behind the site, so you can get the data directly from that. For example:

library(httr)

params = list(
  `path` = "/provider-characteristics/medicare-provider-supplier-enrollment/medicare-fee-for-service-public-provider-enrollment"
)

res <- httr::GET(url = "https://data.cms.gov/data-api/v1/slug", query = params)

cat(content(res, as="text", encoding = "UTF-8"))

Alternatively you can use something like {RSelenium} to evaluate the JavaScript and then scrape the fully rendered page.

How can I scrape a table from the website in the question

There are 2 best solutions below

Related Questions in HTML

Related Questions in R

Related Questions in WEB

Related Questions in RVEST

Trending Questions

Popular # Hahtags

Popular Questions