How to scrape Google News headlines from a particular year (e.g. news from 2020)


I've been exploring web scraping techniques using Python and the Google News RSS feed, but I'm not sure how to narrow the search results down to a particular year. Ideally, I'd like to retrieve headlines, publication dates, and possibly summaries for news articles from a specific year (such as 2020). With the code below, I can scrape current news, but I can't get results for a specific year. Even in the Google News search box, the filter only shows results from the past year, although when I scroll down I can see articles from 2013 and 2017. Could someone provide a Python script or pointers on how to solve this?

Here's what I've attempted so far:

import feedparser
import pandas as pd
from datetime import datetime

class GoogleNewsFeedScraper:
    def __init__(self, query):
        self.query = query

    def scrape_google_news_feed(self):
        formatted_query = '%20'.join(self.query.split())
        rss_url = f'https://news.google.com/rss/search?q={formatted_query}&hl=en-IN&gl=IN&ceid=IN%3Aen'
        feed = feedparser.parse(rss_url)
        titles = []
        links = []
        pubdates = []

        if feed.entries:
            for entry in feed.entries:
                # Title
                title = entry.title
                titles.append(title)
                # URL link
                link = entry.link
                links.append(link)
                # Date
                pubdate = entry.published
                date_str = str(pubdate)
                date_obj = datetime.strptime(date_str, "%a, %d %b %Y %H:%M:%S %Z")
                formatted_date = date_obj.strftime("%Y-%m-%d")
                pubdates.append(formatted_date)

        else:
            print("Nothing Found!")

        data = {'URL link': links, 'Title': titles, 'Date': pubdates}
        return data

    def convert_data_to_csv(self):
        d1 = self.scrape_google_news_feed()
        df = pd.DataFrame(d1)
        csv_name = self.query + ".csv"
        csv_name_new = csv_name.replace(" ", "_")
        df.to_csv(csv_name_new, index=False)


if __name__ == "__main__":
    query = 'forex rate news'
    scraper = GoogleNewsFeedScraper(query)
    scraper.convert_data_to_csv()

Answer by PromptCloud (best answer)

You can use date filters in your rss_url. Modify the query part of the URL to follow the format below:

Format: q=query+after:yyyy-mm-dd+before:yyyy-mm-dd

Example: https://news.google.com/rss/search?q=forex%20rate%20news+after:2023-11-01+before:2023-12-01&hl=en-IN&gl=IN&ceid=IN:en

The URL above returns articles related to forex rate news that were published between November 1st, 2023, and December 1st, 2023.
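To cover a whole year such as 2020, the same after:/before: format can be applied to a series of smaller windows. The sketch below is only an illustration: the helper names (build_rss_url, month_windows, published_in_year) are hypothetical, and splitting the year into monthly windows is an assumption, made because a single feed request appears to return only a limited number of entries. Whether the before: date itself is included in the results is not confirmed here, so the last helper re-checks each entry's publication year locally.

```python
from datetime import date
from email.utils import parsedate_to_datetime
from urllib.parse import quote_plus

BASE_URL = "https://news.google.com/rss/search"


def build_rss_url(query, after, before, hl="en-IN", gl="IN", ceid="IN:en"):
    """Build a Google News RSS search URL using the after:/before: operators."""
    # The date operators are part of the search query itself.
    full_query = f"{query} after:{after.isoformat()} before:{before.isoformat()}"
    # quote_plus turns spaces into '+'; ':' is kept literal to match the format above.
    encoded = quote_plus(full_query, safe=":")
    return f"{BASE_URL}?q={encoded}&hl={hl}&gl={gl}&ceid={ceid}"


def month_windows(year):
    """Yield (start, end) date pairs, one per calendar month of the given year."""
    for month in range(1, 13):
        start = date(year, month, 1)
        # The window ends on the first day of the following month.
        end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
        yield start, end


def published_in_year(published, year):
    """Check an RSS date string such as 'Wed, 15 Jul 2020 10:30:00 GMT'."""
    return parsedate_to_datetime(published).year == year


if __name__ == "__main__":
    for start, end in month_windows(2020):
        print(build_rss_url("forex rate news", start, end))
```

Each printed URL can be passed to feedparser.parse() exactly as in the question's scrape_google_news_feed method, and published_in_year(entry.published, 2020) can be used to drop any entry that falls outside the requested year.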

Please refer to this article for more information.