Scraping "Hidden" Content in Google Group with Scrapy


I'm scraping messages on Google Groups, and in general the script works as I intend. It grabs the following and outputs them to csv or json, depending on my needs/mood:

  • name of poster
  • topic
  • date of post
  • message content
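
In case it matters, the csv/json switch is just Scrapy's built-in feed exports rather than anything custom. Roughly like this (the spider, class, and file names here are placeholders, not my actual project):

    # Rough sketch of the csv/json output switch using Scrapy's feed exports.
    # The spider/class/file names below are placeholders, not my real code.
    # On the command line the format follows from the file extension:
    #   scrapy crawl google_groups -o thread.csv
    #   scrapy crawl google_groups -o thread.json
    # The same choice can be pinned inside the spider:
    import scrapy

    class GoogleGroupsSpider(scrapy.Spider):
        name = "google_groups"

        custom_settings = {
            "FEEDS": {
                "thread.csv": {"format": "csv"},  # or "thread.json": {"format": "json"}
            },
        }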

The code I'm using basically counts how many messages there are, then iterates over them to get the content. The issue I'm having is with threads that contain more than 100 messages: I can't scrape the messages past that point. For one example, see this conversation.

Looking at the page with Firefox's dev tools, it seems that after the 100th message or so there's a placeholder element stating "Some nodes were hidden", with a clickable item to reveal those nodes (see the screenshot below).

[Screenshot: dev tools showing the "Some nodes were hidden" element]

These "hidden" rows appear in the browser if I scroll, but I can't scrape them. My question is if there's a way to access these "hidden" rows using only Python.

Each individual message is in its own section (e.g. //section[i]/...). So I tried manually specifying the number of posts to scrape (instead of having the script count them), but that doesn't work; it just produces extra blank lines in the csv (a short illustration of why follows).
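
As far as I can tell, the blank lines happen because an XPath index past the last <section> in the downloaded HTML simply matches nothing, so .get() returns None and the csv exporter writes an empty field. A tiny self-contained example (made-up HTML, not the Google Groups markup):

    # Minimal illustration: indexing past the last <section> matches nothing,
    # so .get() returns None, which shows up as a blank field in the csv.
    from parsel import Selector

    html = "<body><section>one</section><section>two</section></body>"
    sel = Selector(text=html)

    print(sel.xpath('//section[1]/text()').get())  # 'one'
    print(sel.xpath('//section[5]/text()').get())  # None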

Here's the relevant part of the code I'm working with:

    def parse(self, response):
        # Count the number of messages in the thread so we can iterate over them
        total_tables = len(response.xpath('//section[@jscontroller="ywEdOe"]').getall())
        total_reviews = range(1, total_tables + 1)

        # Used this to see whether a hard-coded number would pull in the hidden
        # messages; when trying it, I commented out the two lines above.
        # total_reviews = range(1, 150)

        title = response.xpath('//h1/html-blob/text()').get()

        for i in total_reviews:
            date = response.xpath(
                f'//section[{i}]/div/div[1]/div[2]/div[1]/div[1]/div[2]/span[1]/text()').get()
            name = response.xpath(
                f'//section[{i}]/div/div[1]/div[2]/div/div[1]/div[1]/h3//text()').get()
            post_content = response.xpath(
                f'//section[{i}]/div/div[1]/div[2]/div[2]/descendant-or-self::*/text()').getall()

            yield {
                'date': date,
                'name': name,
                'title': title,
                'post_content': post_content,
            }

Appreciate any suggestions. I'm new at this, so I've probably overlooked something.
