How to ignore line breaks and whitespace when webscraping


When inspecting the tree structure of some HTML with rvest, I noticed that all line breaks and whitespace end up as {text} nodes, which I find slightly annoying when querying the DOM. Is there any way to ignore them, rather than having to clean the data ex post?

In this example, I would like to select only nodes that contain "actual" text. All the {text} entries shown below that are siblings of <h2> and <p> only contain line breaks/whitespace.

Edit: I am looking for a way to select nodes, not to extract text from them.

mini <- rvest::minimal_html(
  '<div>
    <h2>Heading</h2>
    <p>Some text</p>
  </div>'
)

xml2::html_structure(mini)
#> <html>
#>   <head>
#>     <meta [charset]>
#>     <title>
#>   <body>
#>     <div>
#>       {text}
#>       <h2>
#>         {text}
#>       {text}
#>       <p>
#>         {text}
#>       {text}

Created on 2024-02-17 with reprex v2.1.0

The same issue occurs with real-world data, such as https://scrapeme.live/:

library(rvest)
library(xml2)

html <- read_html("https://scrapeme.live/")

html |> 
  html_element("div.page-content") |> 
  html_structure(indent = 4)
#> <div.page-content>
#>     {text}
#>     <p>
#>         {text}
#>     {text}
#>     <form.search-form [role, method, action]>
#>         {text}
#>         <label [for]>
#>             {text}
#>             <span.screen-reader-text>
#>                 {text}
#>             {text}
#>         {text}
#>         <input#search-form-65d0cce0a5dd2 .search-field [type, placeholder, value, name]>
#>         <button.search-submit [type]>
#>             <svg.icon.icon-search [aria-hidden, role]>
#>                 <use [href, xlink:href]>
#>             <span.screen-reader-text>
#>                 {text}
#>         {text}
#>     {text}

html |> 
  html_element("div.page-content") |> 
  html_elements(xpath = ".//text()")
#> {xml_nodeset (11)}
#>  [1] \n\t\t\n\t\t\t
#>  [2] It seems we can’t find what you’re looking for. Perhaps searching can help.
#>  [3] \n\t\t\t\n\n
#>  [4] \n\t
#>  [5] \n\t\t
#>  [6] Search for:
#>  [7] \n\t
#>  [8] \n\t
#>  [9] Search
#> [10] \n
#> [11] \n\t

Created on 2024-02-17 with reprex v2.1.0

There is 1 best solution below

HoelR

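Are you looking for html_text2()? It returns the text of an element with whitespace collapsed, roughly as a browser would render it: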

library(rvest)

"https://scrapeme.live" %>% 
  read_html() %>% 
  html_element("div.page-content") %>% 
  html_text2()

#> [1] "It seems we can’t find what you’re looking for. Perhaps searching can help.\n\nSearch for: Search"
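
Since the question's edit asks for a way to select nodes rather than extract text, one possible alternative is to filter in the XPath query itself. The following is a minimal sketch, not part of the original answer, that reuses the `mini` example from the question. It assumes the XPath 1.0 function `normalize-space()`, which yields an empty string for whitespace-only text nodes, so using it as a predicate keeps only nodes with "actual" text. The variable names `text_nodes` and `elements` are made up for illustration.

```r
library(rvest)

# Same minimal document as in the question
mini <- minimal_html(
  '<div>
    <h2>Heading</h2>
    <p>Some text</p>
  </div>'
)

# Text nodes inside the <div> whose content is not only whitespace:
# normalize-space() returns "" for whitespace-only nodes, and an empty
# string counts as FALSE in an XPath predicate
text_nodes <- html_elements(mini, xpath = "//div//text()[normalize-space()]")

# Elements inside the <div> that have at least one non-whitespace
# text node as a direct child
elements <- html_elements(mini, xpath = "//div//*[text()[normalize-space()]]")

html_text(text_nodes)
html_name(elements)
```

This does not remove the whitespace nodes from the parsed tree, so `html_structure()` will still show them; it only keeps them out of the query result.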