I want to scrape links to ads on this page: https://reality.idnes.cz/s/?page=1 using R with the rvest and httr packages. It returns results which I do not understand.
The code is:
link <- "https://reality.idnes.cz/s/?page=1"
response <- httr::GET(link)
page <- rvest::read_html(response)
In the code above I get the correct return code 200 for the "response" object, but the "page" object after calling read_html() is almost empty; it does not contain the web page content.
When I do:
object.size(response)
the result is something like this:
132464 bytes
So this object contains data and looks correct. But when I do:
object.size(page)
the result is:
784 bytes
The same applies if I call read_html(link) directly: the resulting object size is the same 784 bytes. Why is the "page" object almost empty, and what happens when calling page <- rvest::read_html(response)?
Many thanks in advance for any help.
That's because page is a wrapped pointer to memory. That variable in R doesn't contain all the data. It points to memory where the data is stored.
If you convert to a character you can get all the data:
object.size(as.character(page))
It's all there. It's just that object.size is not a reliable way to know how much data is stored with a particular variable when pointers are involved.
You should be able to extract all the data there without an issue. Like you can find all the <div> tags with:
page |> rvest::html_nodes("div")
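To see the pointer effect without hitting the network, here is a minimal sketch that parses a made-up inline HTML string instead of the real page. The markup and the /ad/... hrefs are invented for illustration and are not the structure of reality.idnes.cz; the real selectors for the ad links will likely differ.

```r
library(rvest)

# Hypothetical stand-in for the downloaded page: 200 fake ad links
html <- paste0(
  "<html><body>",
  paste(sprintf("<div><a href='/ad/%d'>Ad %d</a></div>", 1:200, 1:200),
        collapse = ""),
  "</body></html>"
)

page <- read_html(html)

# Measures only the small R wrapper around the external pointers
object.size(page)

# Serializing the document back to text shows how much content is really there
nchar(as.character(page))

# The parsed content is fully usable, e.g. pulling the href of every <a> tag
links <- page |> html_elements("a") |> html_attr("href")
length(links)
head(links)
```

With the real page you would swap the inline string for the response from httr::GET(link) and replace "a" with a selector that matches only the ad links.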