How to Resume Fetching Posts from a Specific Point (PRAW)?


I am working on a Python script that uses PRAW to fetch posts from a subreddit (my goal is to build a dataset about a specific sub), and I am having trouble with pagination. The script fetches new posts and saves them to a CSV file. However, once it reaches the listing limit of 1000 posts, it does not fetch the remaining posts and the program ends.

The challenge I'm encountering is that I can't seem to start fetching posts from where it left off. I want to implement a solution without using the Pushshift API since I am not a moderator.

import logging
import time

import pandas as pd


def main():
    reddit = create_reddit_instance()
    subreddit = get_subreddit(reddit, 'sub_name')
    df, existing_ids = load_existing_data(FILENAME)

    logging.info('Starting to scrape posts')

    skipped_posts = 0
    loaded_posts = 0

    # Load the last fetched post ID
    last_fetched_post_id = load_last_fetched_post_id()

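    # Fetch posts after the last fetched post. Reddit's 'after' parameter
    # expects a fullname (e.g. 't3_abc123'), not a bare submission ID.
    params = {'after': last_fetched_post_id} if last_fetched_post_id else {}
    new_posts = list(subreddit.new(limit=None, params=params))

    for submission in new_posts: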
        is_image = submission.url.endswith(('.jpg', '.png', '.gif', '.jpeg'))
        if submission.id in existing_ids or is_image or submission.score < 10:
            logging.info(f'Skipped post {submission.id}')
            skipped_posts += 1
            continue
        top_comments = get_top_comments(submission)
        new_row = get_new_row(submission, top_comments)
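        # DataFrame._append is private; pd.concat is the public equivalent
        # (assumes new_row is a dict or a Series).
        df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)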
        save_data(df, FILENAME)
        logging.info(f'Loaded post {submission.id}')
        loaded_posts += 1

        # Save the fullname ('t3_' + ID) of the last fetched post, since
        # that is the form the 'after' parameter expects
        save_last_fetched_post_id(submission.fullname)

        # Sleep for a while to avoid hitting the rate limit
        time.sleep(0.3)

    logging.info(
        f'Finished scraping posts. Total posts loaded: {loaded_posts}. Total posts skipped: {skipped_posts}.')
    print("Data saved to", FILENAME)
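
The two checkpoint helpers called in main() are not shown above. A minimal sketch of how they might look, assuming the checkpoint is a one-line text file: the path CHECKPOINT_FILE and the to_fullname helper are hypothetical names introduced for illustration, not part of PRAW.

```python
from pathlib import Path
from typing import Optional

# Hypothetical checkpoint location; the question does not say where the ID is stored.
CHECKPOINT_FILE = Path("last_fetched_post.txt")


def to_fullname(post_id: str) -> str:
    """Return a submission fullname, adding the 't3_' prefix if it is missing."""
    return post_id if post_id.startswith("t3_") else f"t3_{post_id}"


def save_last_fetched_post_id(post_id: str, path: Path = CHECKPOINT_FILE) -> None:
    """Persist the fullname of the most recently processed submission."""
    path.write_text(to_fullname(post_id), encoding="utf-8")


def load_last_fetched_post_id(path: Path = CHECKPOINT_FILE) -> Optional[str]:
    """Return the stored fullname, or None if no checkpoint exists yet."""
    if not path.exists():
        return None
    stored = path.read_text(encoding="utf-8").strip()
    return to_fullname(stored) if stored else None
```

On a first run load_last_fetched_post_id() returns None, so the caller should leave 'after' out of the request parameters in that case. Note that a checkpoint alone may not get past the 1000-item cap, since the cap applies to the listing itself.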

My code currently works only for the first 1000 posts in the listing. I've considered a few solutions but ruled them out:

Using the Pushshift API is a solution I'd like to avoid. Skipping old posts and running the script daily for new ones is not preferable. I am open to alternative tools or methods to overcome this limitation. Your suggestions and insights are greatly appreciated. Thank you!
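
One direction that stays within PRAW, sketched here under assumptions rather than as a confirmed fix: each listing (new, hot, top, controversial) is capped separately, so merging several of them and deduplicating by submission ID may yield more than 1000 distinct posts, though it will not necessarily cover the whole subreddit. The function name collect_unique is hypothetical.

```python
from typing import Any, Dict, Iterable


def collect_unique(listings: Iterable[Iterable[Any]]) -> Dict[str, Any]:
    """Merge several listings, keeping the first occurrence of each submission ID."""
    unique: Dict[str, Any] = {}
    for listing in listings:
        for submission in listing:
            # setdefault leaves an already-seen submission untouched
            unique.setdefault(submission.id, submission)
    return unique


# Example use with PRAW (assumes `subreddit` is a praw.models.Subreddit):
# listings = [
#     subreddit.new(limit=None),
#     subreddit.hot(limit=None),
#     subreddit.top(time_filter="all", limit=None),
#     subreddit.top(time_filter="year", limit=None),
#     subreddit.controversial(time_filter="all", limit=None),
# ]
# posts = collect_unique(listings)
```

The returned dictionary maps each ID to its submission, so the existing filtering and CSV-writing loop could iterate over posts.values() instead of a single listing.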
