Extract text from dynamic Web page using R

Question

Extract text from dynamic Web page using R

393 Views Asked by Dominic Comtois At 30 January 2021 at 14:14

I am working on a data prep tutorial, using data from this article: https://www.nytimes.com/interactive/2021/01/19/upshot/trump-complete-insult-list.html#

None of the text is hard-coded, everything is dynamic and I don't know where to start. I've tried a few things with packages rvest and xml2 but I can't even tell if I'm making progress or not.

I've used copy/paste ang regexes in notepad++ to get a tabular structure like this:

Target	Attack
AAA News	Fake News
AAA News	Fake News
AAA News	A total disgrace
...	...
Mr. ZZZ	A real nut job

but I'd like to show how to do everything programmatically (no copy/paste).

My main question is as follows: is that even possible with reasonable effort? And if so, any clues on how to get started?

PS: I know that this could be a duplicate, I just can't tell of which question since there are totally different approaches out there :\

Original Q&A

There are 2 best solutions below

Dave2e On 30 January 2021 at 14:59

I used my free articles allocation at The NY Times for the month, but here is some guidance. It looks like the web page uses several scripts to create and display the page.

If you uses your browser's developer tools and look at the network tab, you will find 2 CSV files:

tweets-full.csv located here: https://static01.nyt.com/newsgraphics/2021/01/10/trump-insult-complete/8afc02d17b32a573bf1ceed93a0ac21b232fba7a/tweets-full.csv
tweets-reduced.csv located here: https://static01.nyt.com/newsgraphics/2021/01/10/trump-insult-complete/8afc02d17b32a573bf1ceed93a0ac21b232fba7a/tweets-reduced.csv

It looks like the reduced file creates the table quoted above and the tweets-full is the full tweet. You can download these files directly with read.csv() and the process this information as needed.

Be sure to read the term of service before scraping any webpage.

**Ian Campbell** · Accepted Answer · 2021-01-30T16:06:41.740000

Here's a programatic approach with RSelenium and rvest:

library(RSelenium)
library(rvest)
library(tidyverse)
driver <- rsDriver(browser="chrome", port=4234L, chromever ="87.0.4280.87")
client <- driver[["client"]]
client$navigate("https://www.nytimes.com/interactive/2021/01/19/upshot/trump-complete-insult-list.html#")
page.source <- client$getPageSource()[[1]]

#Extract nodes for each letter using XPath
Letters <- read_html(page.source) %>%
  html_nodes(xpath = '//*[@id="mem-wall"]/div[2]/div') 

#Extract Entities using CSS
Entities <- map(Letters, ~ html_nodes(.x, css = 'div.g-entity-name') %>%
                  html_text)

#Extract quotes using CSS
Quotes <- map(Letters, ~ html_nodes(.x, css = 'div.g-twitter-quote-container') %>%
                            map(html_nodes, css = 'div.g-twitter-quote-c') %>%
                            map(html_text))

#Bind the entites and quotes together. There are two letters that are blank, so fall back to NA
map2_dfr(Entities, Quotes,
         ~ map2_dfr(.x, .y,~ {if(length(.x) > 0 & length(.y)){data.frame(Entity = .x, Insult = .y)}else{
                                                        data.frame(Entity = NA, Insult = NA)}})) -> Result

#Strip out the quotes
Result %>%
  mutate(Insult = str_replace_all(Insult,"(^“)|([ .,!?]?”)","") %>% str_trim) -> Result

#Take a look at the result
Result %>%
  slice_sample(n=10)
                   Entity                                                              Insult
1             Mitt Romney                                       failed presidential candidate
2         Hillary Clinton                                                             Crooked
3  The “mainstream” media                                                           Fake News
4               Democrats                                             on a fishing expedition
5           Pete Ricketts                                             illegal late night coup
6  The “mainstream” media                                                   anti-Trump haters
7     The Washington Post do nothing but write bad stories even on very positive achievements
8               Democrats                                                                weak
9             Marco Rubio                                                         Lightweight
10     The Steele Dossier                                                      a Fake Dossier

The xpath was obtained by inspecting the webpage source (F9 in Chrome), hovering over elements until the correct one was highlighted, right clicking, and choosing copy XPath as shown:

Extract text from dynamic Web page using R

There are 2 best solutions below

Related Questions in HTML

Related Questions in R

Related Questions in WEB-SCRIPTING

Trending Questions

Popular # Hahtags

Popular Questions