Different number of rows for headlines and urls when web-scraping Google News using rvest

57 Views Asked by Hassan Rahim Kamil At 20 July 2022 at 06:34

I'm testing different keywords on Google News to scrape headlines and URLs, but for some keywords the number of headlines does not match the number of URLs.

library(rvest)
library(stringr)
library(magrittr)

link = "https://news.google.com/search?q=onn%20hafiz&hl=en-MY&gl=MY&ceid=MY%3Aen"
headline = read_html(link) %>% html_nodes('.DY5T1d') %>% html_text()
url = read_html(link) %>% html_nodes(".VDXfz") %>% html_attr("href") %>% str_sub(2) %>% paste0("https://news.google.com", .)

data.frame(headline, url)

Results:

Error in data.frame(headline, url) : 
  arguments imply differing number of rows: 82, 85

But with other keywords, this seems to work fine.

link = "https://news.google.com/search?q=international%20petroleum&hl=en-MY&gl=MY&ceid=MY%3Aen"
headline = read_html(link) %>% html_nodes('.DY5T1d') %>% html_text()
url = read_html(link) %>% html_nodes(".VDXfz") %>% html_attr("href") %>% str_sub(2) %>% paste0("https://news.google.com", .)

data.frame(headline, url)

Does anyone know what causes this, and how to fix it? Thanks.

There is 1 best solution below

margusl On 20 July 2022 at 13:46

With those selectors you are extracting headlines from different nodes than the hrefs, and there doesn't seem to be a fixed 1:1 relation between the two. At the time of writing, your first search returns some nested headlines, and that's probably why your headline and url counts do not match.
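To illustrate how two independent selectors can drift apart, here is a minimal sketch on a made-up HTML fragment. The markup and the class names `title` and `link` are invented for the example and are not Google News' actual structure; the point is only that each selector counts its own matches.

```r
library(rvest)

# Hypothetical markup (not Google's): three link nodes, but only two title nodes
page <- read_html('<html><body>
  <div class="item"><a class="link" href="./articles/a1"></a><h3 class="title">First story</h3></div>
  <div class="item"><a class="link" href="./articles/a2"></a><h3 class="title">Second story</h3></div>
  <div class="item"><a class="link" href="./articles/a3"></a></div>
</body></html>')

# Each selector is evaluated on its own, so nothing keeps the two vectors aligned
titles <- page %>% html_nodes(".title") %>% html_text()
hrefs  <- page %>% html_nodes(".link") %>% html_attr("href")

length(titles)  # 2
length(hrefs)   # 3
```

Passing `titles` and `hrefs` to `data.frame()` here fails with the same "differing number of rows" error as in the question.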

Get the url and text from the same node and you should be covered:

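library(rvest)
library(stringr)

url <- "https://news.google.com/search?q=onn%20hafiz&hl=en-MY&gl=MY&ceid=MY%3Aen"
# Select the headline anchors once; each node carries both the text and the href
headline_links <- read_html(url) %>% html_nodes('a.DY5T1d')
data.frame(
  headline = headline_links %>% html_text(),
  # Drop the first character of each href and prepend the host
  url = headline_links %>% html_attr("href") %>% str_sub(2) %>% paste0("https://news.google.com", .)
)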
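If you ever need fields that live in different nodes, a related pattern is to select one container per result first and then use `html_node()` (singular) inside each container. It returns exactly one result per input node, with `NA` where nothing matches, so the columns stay aligned. A rough sketch on invented markup (the `item` and `link` class names are assumptions for illustration, not Google News' classes):

```r
library(rvest)

# Hypothetical markup: the third container has no link at all
page <- read_html('<html><body>
  <div class="item"><a class="link" href="./articles/a1">First story</a></div>
  <div class="item"><a class="link" href="./articles/a2">Second story</a></div>
  <div class="item"><span>No link in this one</span></div>
</body></html>')

# One container per result
items <- page %>% html_nodes(".item")

# html_node() yields one entry per container, missing where there is no match
links <- items %>% html_node("a.link")

result <- data.frame(
  headline = links %>% html_text(),
  url      = links %>% html_attr("href")
)

nrow(result)  # 3, with NA in the third row
```

The hrefs are left relative here; if you prepend the host with `paste0()` as above, handle the `NA` rows first, since `paste0()` would turn them into the literal text "NA".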
