R rvest read_html() returns almost empty page

Question

Asked by sketman on 08 January 2024 at 21:12

I want to scrape links to ads on this page: https://reality.idnes.cz/s/?page=1 using R with the rvest and httr packages. It returns results which I do not understand.

The code is:

link <- "https://reality.idnes.cz/s/?page=1"
response <- httr::GET(link)
page <- rvest::read_html(response)

The code above returns the correct status code 200 in the "response" object, but the "page" object returned by read_html() is almost empty; it does not contain the web page content.

When I do:

object.size(response)

the result is something like this:

132464 bytes

So this object contains data and looks correct. But when I do:

object.size(page)

the result is:

784 bytes

The same applies if I call read_html(link) directly: the resulting object size is the same 784 bytes. Why is the "page" object almost empty, and what happens when calling page <- rvest::read_html(response)?

Many thanks in advance for any help.

Allan Cameron On 08 January 2024 at 21:36

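object.size(page) is misleading here because rvest uses xml2 under the hood, which provides bindings to the libxml2 C library. page is essentially just a couple of external pointers wrapped in an S3 class, so object.size() measures only that small wrapper and not the parsed document it points to: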

class(page)
#> [1] "xml_document" "xml_node" 

unclass(page)
#> $node
#> <pointer: 0x0000029218e61bd0>
#>  
#>  $doc
#> <pointer: 0x00000292105f5520>

You can still scrape the data from the page:

data.frame(info = page |> 
                  rvest::html_elements(".c-products__info") |> 
                  rvest::html_text() |>
                  trimws(),
           price = page |> 
                   rvest::html_elements(".c-products__price") |> 
                   rvest::html_text() |>
                   trimws() |>
                   strsplit("\n") |> 
                   sapply(getElement, 1))
#>                                                           info            price
#> 1                                 Sokolovská, Praha 8 - Karlín  25 000 Kc/mesíc
#> 2                                Premyslovice, okres Prostejov     3 400 000 Kc
#> 3         Deštné v Orlických horách, okres Rychnov nad Knežnou     6 220 000 Kc
#> 4                         Veleslavínova, Praha 1 - Staré Mesto Cena na vyžádání
#> 5                                           Žatec, okres Louny Cena na vyžádání
#> 6                                                        Decín Cena na vyžádání
#> 7                               K Šafránce, Praha 9 - Strížkov     5 090 000 Kc
#> 8                        Poštovní, Mariánské Lázne, okres Cheb Cena na vyžádání
#> 9                      Mladobucká, Trutnov - Horní Staré Mesto    10 499 000 Kc
#> 10                                       Bavory, okres Breclav Cena na vyžádání
#> 11                                   Karlovy Vary - Stará Role     1 500 000 Kc
#> 12                           Šanovská, Hrabetice, okres Znojmo     3 600 000 Kc
#> 13 Sídlište na Sadech, Ceské Velenice, okres Jindrichuv Hradec     3 150 000 Kc
#> 14                               Mezihorská, Praha 4 - Modrany    13 999 999 Kc
#> 15                                     Stankovice, okres Louny       293 284 Kc
#> 16                                           Praha 10 - Benice     2 210 000 Kc
#> 17                                   Nýrany, okres Plzen-sever       509 296 Kc
#> 18              námestí Tyršovo, Chocen, okres Ústí nad Orlicí Cena na vyžádání
#> 19                                               Okružní, Zlín     3 030 000 Kc
#> 20                              Jana Maluchy, Ostrava - Dubina     2 290 000 Kc
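The two columns above are collected independently, so they line up only if every listing has both an info and a price element. A possible variation, sketched here on made-up inline markup, first selects each listing card and then looks up one element per card with html_element(), which yields NA where a card has no match. The assumption that both elements sit inside a .c-products__inner container is not confirmed for the real page.

```r
library(rvest)

# Hypothetical markup: the second card deliberately has no price element
html <- '<html><body>
  <div class="c-products__inner">
    <p class="c-products__info">Example Street, Praha 1</p>
    <p class="c-products__price">1 000 000 Kc</p>
  </div>
  <div class="c-products__inner">
    <p class="c-products__info">Example Town</p>
  </div>
</body></html>'
page <- read_html(html)

# One node per listing card (assumed container class)
cards <- html_elements(page, ".c-products__inner")

# html_element() returns exactly one result per card, NA when nothing matches,
# so the two columns always have the same length and stay aligned
listings <- data.frame(
  info  = cards |> html_element(".c-products__info")  |> html_text2(),
  price = cards |> html_element(".c-products__price") |> html_text2()
)
print(listings)
```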

sketman On 08 January 2024 at 22:23

@MrFlick, @Allan Cameron,

Many thanks to both of you for your help. I did not realize that the "page" object is just a pointer. As you both mentioned, the web page is nicely scrapable. Because I wanted to get links to ads, the code would be as follows:

library(rvest)  # provides html_elements(), html_attr() and re-exports %>%

page_links <- page %>% html_elements(".c-products__inner a") %>% html_attr("href")

> page_links
 [1] "https://reality.idnes.cz/detail/prodej/dum/jilove-u-prahy/659c63e11570d3d1510f549f/"                   
 [2] "https://reality.idnes.cz/detail/pronajem/byt/praha-6-luzna/659c6386970504c7cc001df8/"                  
 [3] "https://reality.idnes.cz/detail/prodej/byt/jablonec-nad-nisou-na-vysine/659c626d51d3c91d2503969e/"     
 [4] "https://reality.idnes.cz/detail/prodej/byt/cesky-krumlov-polska/651d7c902361bb41b4076985/"             
 [5] "https://reality.idnes.cz/detail/prodej/pozemek/slabcice/655b1420ee38dc5d6a09b6c8/"                     
 [6] "https://reality.idnes.cz/detail/prodej/pozemek/brno-rozmarynova/630c52b3080e61494a733213/"             
 [7] "https://reality.idnes.cz/detail/prodej/dum/srbice/6579f2884e7b1614be0dfb4d/"                           
 [8] "https://reality.idnes.cz/detail/prodej/dum/estepona/6435023225fbc6c8d202ee2a/"                         
 [9] "https://reality.idnes.cz/detail/pronajem/komercni-nemovitost/praha-2/654fb46786faea2ada0e4b87/"        
[10] "https://reality.idnes.cz/detail/prodej/byt/praha-5-radlicka/62fd4efa92f85e440c77418f/"                 
[11] "https://reality.idnes.cz/detail/prodej/komercni-nemovitost/praha-5-na-pomezi/652e6696626465e6e80f6839/"
[12] "https://reality.idnes.cz/detail/pronajem/byt/prerov-kojetinska/65251d8da64b83180202cd08/"              
[13] "https://reality.idnes.cz/detail/prodej/byt/brno-jeneweinova/6454ea0060384f0fd407e432/"                 
[14] "https://reality.idnes.cz/detail/pronajem/byt/praha-8-sokolovska/6581a22a98c1ceb182093091/"             
[15] "https://reality.idnes.cz/detail/prodej/komercni-nemovitost/premyslovice/643fccc1efe33b7f30081ba6/"     
[16] "https://reality.idnes.cz/detail/prodej/byt/destne-v-orlickych-horach/626fc68a68c0cd78ca57e11f/"        
[17] "https://reality.idnes.cz/detail/prodej/byt/praha-1-veleslavinova/62b1be59bc6745024a065cde/"            
[18] "https://reality.idnes.cz/detail/pronajem/komercni-nemovitost/zatec/5f918061a8b8eb509a508c63/"          
[19] "https://reality.idnes.cz/detail/pronajem/komercni-nemovitost/decin/64cc9898aae6faf40109300c/"          
[20] "https://reality.idnes.cz/detail/prodej/byt/praha-9-k-safrance/653951547c984a8bd4005009/"
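The links above already come back as absolute URLs. As a small sketch of the same selector on made-up inline markup (the example hrefs are hypothetical), relative links could be resolved with xml2::url_absolute() and repeated links dropped with unique(), in case a listing contains more than one anchor to the same ad:

```r
library(rvest)

# Hypothetical listing markup; only the class name comes from the selector above
html <- '<html><body>
  <div class="c-products__inner"><a href="/detail/prodej/byt/example-1/">Ad 1</a></div>
  <div class="c-products__inner"><a href="https://reality.idnes.cz/detail/prodej/dum/example-2/">Ad 2</a></div>
  <div class="c-products__inner"><a href="/detail/prodej/byt/example-1/">Ad 1 again</a></div>
</body></html>'
page <- read_html(html)

# Pull the raw href attribute of every anchor inside a listing card
hrefs <- page |>
  html_elements(".c-products__inner a") |>
  html_attr("href")

# Resolve relative links against the site root, then drop duplicates
page_links <- unique(xml2::url_absolute(hrefs, "https://reality.idnes.cz/"))
print(page_links)
```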

Thanks once again.

MrFlick On 08 January 2024 at 21:30 (accepted answer)

That's because page is a wrapped pointer to memory. That variable in R doesn't contain all the data. It points to memory where the data is stored.

str(page)
# List of 2
#  $ node:<externalptr> 
#  $ doc :<externalptr> 
#  - attr(*, "class")= chr [1:2] "xml_document" "xml_node"

If you convert page to a character string, you can see that all the data is there: object.size(as.character(page)). object.size() is simply not a reliable way to measure how much data a variable holds when external pointers are involved.
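To see this without touching the network, here is a minimal sketch that parses two made-up inline documents of very different sizes. The variable names and content are illustrative only. The expectation is that object.size() reports a similarly small figure for both parsed documents, while the serialised text differs greatly.

```r
library(xml2)

# Two inline documents of very different sizes (made-up content)
small_html <- "<html><body><p>hello</p></body></html>"
big_html <- paste0(
  "<html><body>",
  paste(rep("<p>some longer paragraph text</p>", 5000), collapse = ""),
  "</body></html>"
)

small_page <- read_html(small_html)
big_page   <- read_html(big_html)

# Size of the R-side wrapper only; the parsed tree lives in libxml2's memory
print(object.size(small_page))
print(object.size(big_page))

# Serialising each document back to text shows how much markup it really holds
print(nchar(as.character(small_page)))
print(nchar(as.character(big_page)))
```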

You should be able to extract all the data without any issue. For example, you can find all the <div> tags with page |> rvest::html_nodes("div") (html_elements() is the current name for this function as of rvest 1.0.0).
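A quick way to check that a parsed page is not actually empty is to count the nodes a selector matches or to read a known element such as the title. The following is a sketch on made-up inline markup standing in for the downloaded page:

```r
library(rvest)

# Made-up markup standing in for a downloaded page
html <- "<html><head><title>Listing</title></head>
<body><div class='a'><div class='b'>x</div></div><div>y</div></body></html>"
page <- read_html(html)

# Count the <div> nodes, including nested ones
divs <- html_elements(page, "div")
print(length(divs))

# Read the text of the page title
page_title <- html_text2(html_element(page, "title"))
print(page_title)
```

A count of zero for a selector that should match would be a better sign of a genuinely empty or blocked page than object.size().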
