Scraping data from Reddit - No API Sept 2023


I know Reddit recently changed how it handles API access, and it is very restrictive now. I am working on a school project and need Reddit data on stocks (subreddits: wallstreetbets, StockMarket). I am currently trying to scrape the pages from Old Reddit but only get a few records out. I was expecting a lot more data.

I have the following code, and even though I have `num_pages_to_scrape` set to 5000, I only get 138 records out. I thought that maybe the next button was not being found correctly, or that I should change the `time.sleep(2)`, but I still get the same results. Please help!

import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

url = "https://old.reddit.com/r/wallstreetbets"
headers = {'User-Agent': 'Mozilla/5.0'}

data = []  # List to store post data

# Set the desired number of pages
num_pages_to_scrape = 5000

for counter in range(1, num_pages_to_scrape + 1):
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.text, 'html.parser')

    posts = soup.find_all('div', class_='thing', attrs={'data-domain': 'self.wallstreetbets'})

    for post in posts:
        title = post.find('a', class_='title').text
        author = post.find('a', class_='author').text
        comments = post.find('a', class_='comments').text.split()[0]

        if comments == "comment":
            comments = 0

        likes = post.find("div", class_="score likes").text

        if likes == "•":
            likes = "None"
        
        # Extract the date information from the HTML
        date_element = post.find('time', class_='live-timestamp')
        date = date_element['datetime'] if date_element else "N/A"
        formatted_date = pd.to_datetime(date, utc=True).strftime('%Y-%m-%d %H:%M:%S')

        data.append([formatted_date, title, author, comments, likes])

    next_button = soup.find("span", class_="next-button")
    if next_button:
        next_page_link = next_button.find("a").attrs['href']
        url = next_page_link
    else:
        break

    time.sleep(2)

# Create df
columns = ['Date', 'Title', 'Author', 'Comments', 'Likes']
df = pd.DataFrame(data, columns=columns)

# Print the DataFrame
df

1 Answer

Answered by Andrej Kesely

Here is an example skeleton showing how you can use their JSON API to download multiple pages of data (note: to get the data in JSON form, append .json to the end of the URL):

import json
import requests

# api doc: https://old.reddit.com/dev/api/
url = "https://reddit.com/r/wallstreetbets.json"
headers = {"User-Agent": "Mozilla/5.0"}

data = requests.get(url, headers=headers).json()

while True:
    # uncomment this to print all data:
    # print(json.dumps(data, indent=4))
    for c in data["data"]["children"]:
        print(c["data"]["title"])

    # the "after" cursor points at the next page; it is None on the last page
    after = data["data"]["after"]
    if after is None:
        break

    url = "https://reddit.com/r/wallstreetbets.json?after=" + after
    data = requests.get(url, headers=headers).json()

Prints:

...

Remember kids, bankruptcies are good for stonks
Bullish news - Tesla mcflurry cyber spoon
King of dilution strikes again
Unleashing the Power of YouTube for Alpha Gains
When is PayPal going
He's been right so far
FuelCell teaming up with Toyota. First ever Tri-gen plant up and running
Bad day to be an Ape

...
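To get the same columns the question's pandas code builds, the skeleton above can be extended into a couple of helpers. This is a sketch, not a drop-in solution: the JSON field names (`created_utc`, `num_comments`, `score`, `after`) come from Reddit's listing API, but `extract_posts` and `scrape_subreddit` are illustrative names, and the page cap and delay are assumptions you should tune to stay within rate limits:

```python
import time

import requests
import pandas as pd

HEADERS = {"User-Agent": "Mozilla/5.0"}


def extract_posts(payload):
    """Flatten one page of a Reddit listing JSON payload into rows."""
    rows = []
    for child in payload["data"]["children"]:
        d = child["data"]
        rows.append({
            # created_utc is a Unix timestamp in seconds
            "Date": pd.to_datetime(d["created_utc"], unit="s", utc=True)
                      .strftime("%Y-%m-%d %H:%M:%S"),
            "Title": d["title"],
            "Author": d["author"],
            "Comments": d["num_comments"],
            "Likes": d["score"],
        })
    return rows


def scrape_subreddit(name, max_pages=40, delay=2.0):
    """Follow the 'after' cursor until it runs out or max_pages is reached."""
    url = f"https://old.reddit.com/r/{name}.json?limit=100"
    rows = []
    for _ in range(max_pages):
        payload = requests.get(url, headers=HEADERS).json()
        rows.extend(extract_posts(payload))
        after = payload["data"]["after"]
        if after is None:  # last page reached
            break
        url = f"https://old.reddit.com/r/{name}.json?limit=100&after={after}"
        time.sleep(delay)  # be polite between requests
    return pd.DataFrame(rows)
```

Note that even with pagination, a listing only exposes roughly the most recent ~1000 posts per subreddit, so this will not return 5000 pages of history no matter how many iterations you request.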