Different number of rows for headlines and urls when web-scraping Google News using rvest


I am testing different keywords on Google News to scrape headlines and URLs, but for some keywords the number of headlines does not match the number of URLs.

library(rvest)
library(stringr)
library(magrittr)

link = "https://news.google.com/search?q=onn%20hafiz&hl=en-MY&gl=MY&ceid=MY%3Aen"
headline = read_html(link) %>% html_nodes('.DY5T1d') %>% html_text()
url = read_html(link) %>% html_nodes(".VDXfz") %>% html_attr("href") %>% str_sub(2) %>% paste0("https://news.google.com", .)

data.frame(headline, url)

Results:

Error in data.frame(headline, url) : 
  arguments imply differing number of rows: 82, 85

But with other keywords, this seems to work fine.

link = "https://news.google.com/search?q=international%20petroleum&hl=en-MY&gl=MY&ceid=MY%3Aen"
headline = read_html(link) %>% html_nodes('.DY5T1d') %>% html_text()
url = read_html(link) %>% html_nodes(".VDXfz") %>% html_attr("href") %>% str_sub(2) %>% paste0("https://news.google.com", .)

data.frame(headline, url)

Does anyone know what causes this and how to fix it? Thanks.

Answer by margusl:

With those selectors you are extracting headlines from different nodes than the hrefs, and there doesn't seem to be a fixed 1:1 relation between the two. At the time of writing, your first search returns results with some nested headlines, and that's probably why your headline and URL counts do not match.
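To make the mismatch concrete, here is a small offline sketch. The markup and the class names `title` and `link` are made up for illustration and are not Google's real page structure; the only point is that two independent selectors can return different numbers of nodes.

```r
library(rvest)

# Hypothetical markup (not Google's real HTML): the second story has one
# extra link node that has no headline node of its own.
page <- read_html('
  <article>
    <a class="link" href="./articles/a1"></a>
    <h3><a class="title" href="./articles/a1">First headline</a></h3>
  </article>
  <article>
    <a class="link" href="./articles/a2"></a>
    <h3><a class="title" href="./articles/a2">Second headline</a></h3>
    <a class="link" href="./articles/a3"></a>
  </article>
')

# Two separate selectors, as in the question
headline <- page %>% html_nodes(".title") %>% html_text()
url <- page %>% html_nodes(".link") %>% html_attr("href")

length(headline)  # 2
length(url)       # 3, so data.frame(headline, url) would fail
```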

Get the url and text from the same node and you should be covered:

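library(rvest)
library(stringr)

url <- "https://news.google.com/search?q=onn%20hafiz&hl=en-MY&gl=MY&ceid=MY%3Aen"
# Select the headline anchors once; each node carries both the text and the href
headline_links <- read_html(url) %>% html_nodes('a.DY5T1d')
data.frame(
  headline = headline_links %>% html_text(),
  # Drop the first character of the relative href, then prepend the site root
  url = headline_links %>% html_attr("href") %>% str_sub(2) %>% paste0("https://news.google.com", .)
)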
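The same idea can be wrapped in a small function and checked offline. This is only a sketch: `extract_headlines` is a hypothetical helper name, the test markup and the class name `title` are invented, and it assumes the hrefs are relative paths starting with "." as in the code above.

```r
library(rvest)

# Hypothetical helper: read text and href from the same <a> nodes, so both
# columns always have the same length.
extract_headlines <- function(page, selector, base = "https://news.google.com") {
  nodes <- html_nodes(page, selector)
  href <- html_attr(nodes, "href")
  data.frame(
    headline = html_text(nodes),
    # Assumes hrefs look like "./articles/..."; strip the leading "." and
    # keep NA for anchors that have no href at all.
    url = ifelse(is.na(href), NA_character_, paste0(base, sub("^\\.", "", href))),
    stringsAsFactors = FALSE
  )
}

# Offline check with made-up markup
page <- read_html('
  <article><h3><a class="title" href="./articles/a1">First headline</a></h3></article>
  <article><h3><a class="title" href="./articles/a2">Second headline</a></h3></article>
')
result <- extract_headlines(page, "a.title")
```

For the live page you would call it as `extract_headlines(read_html(url), "a.DY5T1d")`, keeping in mind that Google's class names can change at any time.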