I would like to download a PDF from this website using R. The problem is that you first have to click the "Maak een pdf" button on the website, and that button is driven by a JavaScript onclick attribute. I'm able to find the attribute, but I have no idea how to download the PDF file from it. Here is a screenshot of the element inspection:
Here is the code I tried:
library(tidyverse)
library(rvest)
link = "https://puc.overheid.nl/natuurvergunningen/doc/PUC_746615_17/1/"
button <- link %>%
  read_html() %>%
  html_nodes(".download-als") %>%
  html_nodes("a") %>%
  html_attr("href")
button
#> [1] "javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions(\"ctl00$cphContent$Main$ctl00$DocumentHeader$ctl00\", \"\", true, \"\", \"\", false, true))"
download.file(button, destfile = "Downloads/test.pdf")
#> Warning in download.file(button, destfile = "Downloads/test.pdf"): URL
#> javascript:WebForm_DoPostBackWithOptions(new
#> WebForm_PostBackOptions("ctl00$cphContent$Main$ctl00$DocumentHeader$ctl00", "",
#> true, "", "", false, true)): cannot open destfile 'Downloads/test.pdf', reason
#> 'No such file or directory'
#> Warning in download.file(button, destfile = "Downloads/test.pdf"): download had
#> nonzero exit status
Created on 2024-02-05 with reprex v2.0.2
I tried to fetch the file with download.file(), but of course that doesn't work. It seems that we need to use RSelenium to create a click action on the button via a browser. I found this question: "How to web-scrape on-click information with R?", but I can't find a way to do this with an "onclick" attribute. So I was wondering if anyone knows how to download a PDF file from an onclick attribute?

To get to the final download link from the document page, we need to play some request/response ping-pong to mimic the JavaScript application: first we submit a request to the backend, then wait for it to finish, and then continue with the download.
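The "wait for it to finish" step is essentially polling. As a rough sketch of that idea only, the helper below keeps calling a status check until it reports success. The names poll_until and make_fake_status are hypothetical and not part of any package, and the status check is a local stand-in rather than a real request to the site; in practice it would wrap whatever status request shows up in the recorded network traffic.

```r
# Generic polling helper: call `check()` until it returns TRUE or we give up.
# `check` stands in for a status request to the backend (hypothetical here).
poll_until <- function(check, interval = 1, max_tries = 30) {
  for (i in seq_len(max_tries)) {
    if (isTRUE(check())) {
      return(i)  # number of attempts it took
    }
    Sys.sleep(interval)
  }
  stop("Gave up after ", max_tries, " attempts")
}

# Stand-in for the backend: reports "ready" once it has been called
# `ready_after` times
make_fake_status <- function(ready_after = 3) {
  calls <- 0
  function() {
    calls <<- calls + 1
    calls >= ready_after
  }
}

attempts <- poll_until(make_fake_status(3), interval = 0)
attempts
```

With a real backend, a non-zero interval and a sensible max_tries keep the loop from hammering the server or running forever.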
To recover that exact flow and the endpoint used (/PUC/Handlers/ManifestatieService.ashx), we should focus on the Network tab of the browser's dev tools (activate it before clicking through the download process to record all relevant requests/responses); if there's too much traffic, search and filter can be quite handy.
To implement a flow that's close enough, we'll mostly rely on httr2; rvest is only used to extract the JavaScript function parameters from the link's onclick attribute. Though in this particular case, we could probably extract the identifier PUC_746615_17 and the kanaal value (natuurvergunningen) directly from the document URL too.
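Pulling the two values straight from the document URL could look roughly like the sketch below. It uses base R only and assumes the path always follows the layout /<kanaal>/doc/<identifier>/<version>/ seen in the question's link; other pages on the site may be structured differently.

```r
# Document URL from the question
link <- "https://puc.overheid.nl/natuurvergunningen/doc/PUC_746615_17/1/"

# Drop scheme and host, then split the path into its non-empty segments
# (assumed layout: /<kanaal>/doc/<identifier>/<version>/)
path <- sub("^https?://[^/]+", "", link)
segments <- strsplit(path, "/", fixed = TRUE)[[1]]
segments <- segments[nzchar(segments)]

# kanaal is the first segment; the identifier follows the "doc" segment
kanaal <- segments[1]
identifier <- segments[which(segments == "doc")[1] + 1]

kanaal
#> [1] "natuurvergunningen"
identifier
#> [1] "PUC_746615_17"
```

These two values would then go into the requests sent to the handler endpoint, using whichever parameter names the recorded network traffic shows.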
Alternative approaches would be based on tools that can handle JavaScript, e.g. chromote or RSelenium, and perhaps webdriver with PhantomJS.