When inspecting the tree structure of some HTML with rvest, I noticed that every line break and run of whitespace ends up as its own text node, which I find slightly annoying when querying the DOM. Is there any way to ignore those nodes when selecting, rather than having to clean the data afterwards?
In this example, I would like to select only nodes that contain "actual" text. All the {text} entries shown below that are siblings of <h2> and <p> only contain line breaks/whitespace.
Edit: I am looking for a way to select nodes, not to extract text from them.
mini <- rvest::minimal_html(
'<div>
<h2>Heading</h2>
<p>Some text</p>
</div>'
)
xml2::html_structure(mini)
#> <html>
#> <head>
#> <meta [charset]>
#> <title>
#> <body>
#> <div>
#> {text}
#> <h2>
#> {text}
#> {text}
#> <p>
#> {text}
#> {text}
Created on 2024-02-17 with reprex v2.1.0
The same issue occurs with real-world data, such as https://scrapeme.live/:
library(rvest)
library(xml2)
html <- read_html("https://scrapeme.live/")
html |>
html_element("div.page-content") |>
html_structure(indent = 4)
#> <div.page-content>
#> {text}
#> <p>
#> {text}
#> {text}
#> <form.search-form [role, method, action]>
#> {text}
#> <label [for]>
#> {text}
#> <span.screen-reader-text>
#> {text}
#> {text}
#> {text}
#> <input#search-form-65d0cce0a5dd2 .search-field [type, placeholder, value, name]>
#> <button.search-submit [type]>
#> <svg.icon.icon-search [aria-hidden, role]>
#> <use [href, xlink:href]>
#> <span.screen-reader-text>
#> {text}
#> {text}
#> {text}
html |>
html_element("div.page-content") |>
html_elements(xpath = ".//text()")
#> {xml_nodeset (11)}
#> [1] \n\t\t\n\t\t\t
#> [2] It seems we can’t find what you’re looking for. Perhaps searching can help.
#> [3] \n\t\t\t\n\n
#> [4] \n\t
#> [5] \n\t\t
#> [6] Search for:
#> [7] \n\t
#> [8] \n\t
#> [9] Search
#> [10] \n
#> [11] \n\t
Are you looking for html_text2()?
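Since html_text2() extracts text rather than selecting nodes, an XPath predicate may be closer to what the question asks for. The following is a minimal sketch, not a definitive answer: it reuses the `mini` document from the question and assumes that "actual" text means any text node that is non-empty after whitespace normalization. The names `all_text`, `real_text`, and `blank` are made up for illustration.

```r
library(rvest)
library(xml2)

# Same toy document as in the question
mini <- minimal_html(
  '<div>
  <h2>Heading</h2>
  <p>Some text</p>
  </div>'
)

# Every text node below <div>, including the whitespace-only ones
all_text <- html_elements(mini, xpath = "//div//text()")

# normalize-space() returns an empty string for whitespace-only nodes,
# and an empty string is falsy inside an XPath predicate, so only
# text nodes with visible content are selected
real_text <- html_elements(mini, xpath = "//div//text()[normalize-space()]")
xml_text(real_text)
#> [1] "Heading"   "Some text"

# Optional: drop the whitespace-only nodes from the document itself,
# so that later queries no longer see them (this modifies `mini` in place)
blank <- xml_find_all(mini, "//text()[not(normalize-space())]")
xml_remove(blank)
```

After the xml_remove() call, html_structure(mini) should no longer list the whitespace-only {text} siblings. Whether to filter at query time or to strip the nodes once up front is a judgment call; the in-place removal is convenient for repeated queries but changes the parsed document.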