Extract data from a URL that uses JavaScript (table generated by PHP)


I want to extract the data from this web page: http://old.emmsa.com.pe/emmsa_spv/rpEstadistica/rptVolPreciosDiarios.php. The page uses JavaScript, and so far I have not found a way to extract the daily volume and price data.

I have tried many of the alternatives presented on this site, but none have worked for me because the table is obtained in two steps.

I have tried to adapt the code from https://www.r-bloggers.com/2020/04/an-adventure-in-downloading-books/, but I couldn't download the data.

My version is:

library(Rcrawler)

install_browser() # One time only

br <- run_browser()

page<-LinkExtractor(url="http://old.emmsa.com.pe/emmsa_spv/rpEstadistica/rptVolPreciosDiarios.php",
                    Browser = br, ExternalLInks = TRUE)


el <- page$InternalLinks
sprlnks <- el[grep("emmsa", el, fixed = TRUE)]

for (sprlnk in sprlnks) {
  spr_page <- LinkExtractor(sprlnk)
  il <- spr_page$InternalLinks
  ttl <- spr_page$Info$Title
  ttl <- trimws(strsplit(ttl, "|", fixed = TRUE)[[1]][1])
  chapter_link <- il[grep("chapter", il, fixed = TRUE)][1]
  chp_splits <- strsplit(chapter_link, "/", fixed = TRUE)
  n <- length(chp_splits[[1]])
  suff <- chp_splits[[1]][n]
  suff <- gsub(".{2}$", "", suff)
  pref <- chp_splits[[1]][n-1]
  final_url <- paste0("http://old.emmsa.com.pe/emmsa_spv/rpEstadistica/rptVolPreciosDiarios.php", pref, "/",
                      suff, ".php")
  print(final_url)
  download.file(final_url, paste0(ttl, ".php"), mode = "wb")
  Sys.sleep(5)
}

stop_browser(br)

I get a file, "Empresa Municipal de Mercados S.A.php", that is repeated constantly and in which line 294 appears.

Ultimately, I would like help writing a script that downloads the daily price and volume data from the "emmsa" website.

There is 1 best solution below

QHarr

You could make a POST request, as the page itself does, and parse the table out of the response:

library(httr)
library(rvest)
library(janitor)
library(dplyr)

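# Send the form fields URL-encoded, the same way the page's own request does
headers <- c("Content-Type" = "application/x-www-form-urlencoded; charset=UTF-8")

# vid_tipo=1 requests the price table; vfecha is the date as dd/mm/yyyy
data <- "vid_tipo=1&vprod=&vvari=&vfecha=15/06/2022"

r <- httr::POST(
  url = "http://old.emmsa.com.pe/emmsa_spv/app/reportes/ajax/rpt07_gettable.php",
  httr::add_headers(.headers = headers),
  body = data
)

# Parse the returned HTML, promote the first row to column names,
# drop rows with an empty product, and convert the price columns to numeric.
# The result is named `prices` so that it does not mask base::t().
prices <- content(r) %>%
  html_element(".timecard") %>%
  html_table() %>%
  row_to_names(1) %>%
  clean_names() %>%
  dplyr::filter(producto != "") %>%
  mutate(across(matches("precio"), as.numeric))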

For the volume option (the returned HTML is different, so the selector changes):

library(httr)
library(rvest)
library(janitor)
library(dplyr)

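# Same request as for prices, but with vid_tipo=2 to ask for volumes
headers <- c("Content-Type" = "application/x-www-form-urlencoded; charset=UTF-8")

data <- "vid_tipo=2&vprod=&vvari=&vfecha=17/06/2022"

r <- httr::POST(
  url = "http://old.emmsa.com.pe/emmsa_spv/app/reportes/ajax/rpt07_gettable.php",
  httr::add_headers(.headers = headers),
  body = data
)

# The volume table is identified by its id rather than by the .timecard class.
# The result is named `volumes` so that it does not mask base::t().
volumes <- content(r) %>%
  html_element("#tbReport") %>%
  html_table() %>%
  clean_names()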
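
Since the goal is daily data, the two requests above could be wrapped in a small function and called once per date. The following is only a sketch under some assumptions: the helper names `build_body` and `fetch_emmsa` and the variable `endpoint` are made up for illustration, the date range is arbitrary, and the sketch assumes the endpoint accepts any date in the same dd/mm/yyyy form as the examples above. It returns the raw table for each day and leaves the cleaning steps (`row_to_names()`, `clean_names()`, and so on) to be applied afterwards.

```r
# Requires the httr and rvest packages (called via :: so nothing is attached).

# Endpoint taken from the requests above.
endpoint <- "http://old.emmsa.com.pe/emmsa_spv/app/reportes/ajax/rpt07_gettable.php"

# Hypothetical helper: build the form body for one date.
# tipo = 1 asks for prices, tipo = 2 for volumes (as in the two requests above).
build_body <- function(date, tipo = 1) {
  sprintf("vid_tipo=%d&vprod=&vvari=&vfecha=%s",
          as.integer(tipo), format(as.Date(date), "%d/%m/%Y"))
}

# Hypothetical helper: request one day and return the raw table,
# or NULL when the selector matches nothing in the response.
fetch_emmsa <- function(date, tipo = 1, selector = ".timecard") {
  r <- httr::POST(
    url = endpoint,
    httr::add_headers("Content-Type" = "application/x-www-form-urlencoded; charset=UTF-8"),
    body = build_body(date, tipo)
  )
  httr::stop_for_status(r)  # fail loudly on HTTP errors
  node <- rvest::html_element(httr::content(r), selector)
  if (inherits(node, "xml_missing")) return(NULL)  # assumed: no table for that day
  rvest::html_table(node)
}

# Example usage (runs only in an interactive session, since it hits the network).
if (interactive()) {
  dates <- as.character(seq(as.Date("2022-06-13"), as.Date("2022-06-17"), by = "day"))

  prices <- lapply(dates, function(d) {
    Sys.sleep(1)  # pause between requests to go easy on the server
    fetch_emmsa(d, tipo = 1, selector = ".timecard")
  })
  names(prices) <- dates

  volumes <- lapply(dates, function(d) {
    Sys.sleep(1)
    fetch_emmsa(d, tipo = 2, selector = "#tbReport")
  })
  names(volumes) <- dates
}
```

Each result is a list of tables named by date. Keeping the days separate avoids problems if column types differ from one day to the next; they can be combined once each table has been cleaned.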