I'm scraping messages from Google Groups, and in general the script works as I intend. It grabs the following fields and writes them to CSV or JSON, depending on my needs:
- name of poster
- topic
- date of post
- message content
The code I'm using counts how many messages there are in a thread, then iterates over them to get the content. The issue is that I can't scrape threads with more than 100 messages. For one example, see this conversation.
Using the dev tools in Firefox, it seems that after the 100th message or so there's a simple message stating "Some nodes were hidden" with a clickable item to reveal these nodes (see image).
[Image: dev tools stating there's hidden content]
These "hidden" rows appear in the browser if I scroll, but I can't scrape them. Is there a way to access these "hidden" rows using only Python?
Each individual message is in its own section (e.g. //section[i]/...). So, I tried manually designating the number of posts I want to scrape (instead of having the script generate the number), but that doesn't work -- it just produces extra blank lines in the csv.
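To illustrate what I think is happening with the blank rows, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree on a made-up two-message fragment, so it is neither the real Google Groups markup nor Scrapy's selector API, and the poster_name helper is hypothetical. Assuming the downloaded response only contains the sections that were rendered at load time, any position past the last one simply matches nothing:

```python
import xml.etree.ElementTree as ET

# Made-up stand-in for the HTML the server returns: only two message
# sections are present, however many the browser reveals on scroll.
PAGE = """
<html><body>
  <section><h3>alice</h3><p>first message</p></section>
  <section><h3>bob</h3><p>second message</p></section>
</body></html>
"""

root = ET.fromstring(PAGE)
body = root.find("body")


def poster_name(index):
    """Return the poster name in the index-th section (1-based), or None."""
    node = body.find(f"section[{index}]/h3")
    return node.text if node is not None else None


section_count = len(body.findall("section"))

# Asking for more positions than exist does not fetch anything new;
# the extra lookups come back empty, which would become blank rows.
names = [poster_name(i) for i in range(1, 5)]
print(section_count, names)
```

If that assumption holds, raising the range can only add empty results, which matches the extra blank lines I see in the CSV.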
Here's the relevant part of the code I'm working with:
    def parse(self, response):
        # Count the number of messages in the thread for iterating
        total_tables = len(response.xpath('//section[@jscontroller="ywEdOe"]'))
        total_reviews = range(1, total_tables + 1)
        # Used this to see if a hard-coded number would get the content I want;
        # when I used it, I commented out the code above
        # total_reviews = range(1, 150)
        # Note: no trailing commas after .get()/.getall(), otherwise each
        # value is wrapped in a one-element tuple
        title = response.xpath('//h1/html-blob/text()').get()
        for i in total_reviews:
            date = response.xpath(
                '//section[' + str(i) + ']/div/div[1]/div[2]/div[1]/div[1]/div[2]/span[1]/text()').get()
            name = response.xpath(
                '//section[' + str(i) + ']/div/div[1]/div[2]/div/div[1]/div[1]/h3//text()').get()
            post_content = response.xpath(
                '//section[' + str(i) + ']/div/div[1]/div[2]/div[2]/descendant-or-self::*/text()').getall()
            yield {
                'date': date,
                'name': name,
                'title': title,
                'post_content': post_content,
            }
I'd appreciate any suggestions. I'm new at this, so I have probably overlooked something.