Scraping "Hidden" Content in Google Group with Scrapy

Asked by simishu on 29 January 2024 at 07:04

I'm scraping messages on Google Groups, and in general the script works as I intend. It grabs the following fields and outputs them to CSV or JSON, depending on my needs:

name of poster
topic
date of post
message content

The code I'm using basically counts how many messages there are, then iterates over them to get the content. The issue I'm having is when I encounter threads with more than 100 messages -- I can't scrape them. For one example, see this conversation

Using the dev tools in Firefox, it seems that after the 100th message or so there's a simple message stating "Some nodes were hidden" with a clickable item to reveal these nodes (see image).

[Image: dev tools stating there's hidden content]

These "hidden" rows appear in the browser if I scroll, but I can't scrape them. My question is whether there's a way to access these "hidden" rows using only Python.

Each individual message is in its own section (e.g. //section[i]/...). So, I tried manually designating the number of posts I want to scrape (instead of having the script generate the number), but that doesn't work -- it just produces extra blank lines in the csv.
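
To illustrate why a hard-coded count only adds blank rows, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors, and SAMPLE_PAGE and rows_by_index are made-up names for illustration, not part of my spider. The assumption is that the downloaded markup simply does not contain the later sections, so looking them up by position finds nothing:

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for a thread page: only two messages
# are present in the markup that was actually downloaded.
SAMPLE_PAGE = """
<html><body>
  <section><h3>alice</h3><p>first post</p></section>
  <section><h3>bob</h3><p>second post</p></section>
</body></html>
"""


def rows_by_index(page, count):
    """Look up sections 1..count by position, like the index-based loop."""
    root = ET.fromstring(page.strip())
    rows = []
    for i in range(1, count + 1):
        # findtext returns None when section[i] is not in the markup
        name = root.findtext(f".//section[{i}]/h3")
        text = root.findtext(f".//section[{i}]/p")
        rows.append({"name": name, "post_content": text})
    return rows


rows = rows_by_index(SAMPLE_PAGE, 4)
for row in rows:
    print(row)
```

Asking for four sections when only two exist still produces four rows, and the last two hold nothing but None values. That matches the empty lines I see in the CSV, which suggests the extra messages are not in the response at all rather than being skipped by my XPath.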

Here's the relevant part of the code I'm working with:

    def parse(self, response):
        # Count the messages in the thread so the loop below can index them.
        total_tables = len(response.xpath('//section[@jscontroller="ywEdOe"]'))
        total_reviews = range(1, total_tables + 1)

        # To see whether a hard-coded number would reach the hidden content,
        # I commented out the two lines above and used this instead:
        # total_reviews = range(1, 150)

        title = response.xpath('//h1/html-blob/text()').get()

        for i in total_reviews:
            # Each message lives in its own <section>, addressed by position.
            date = response.xpath(
                '//section[' + str(i) + ']/div/div[1]/div[2]/div[1]/div[1]/div[2]/span[1]/text()').get()
            name = response.xpath(
                '//section[' + str(i) + ']/div/div[1]/div[2]/div/div[1]/div[1]/h3//text()').get()

            # getall() returns every text fragment inside the message body.
            post_content = response.xpath(
                '//section[' + str(i) + ']/div/div[1]/div[2]/div[2]/descendant-or-self::*/text()').getall()

            yield {
                'date': date,
                'name': name,
                'title': title,
                'post_content': post_content,
            }
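
One possible variation on the loop above, sketched under assumptions: instead of counting sections and then addressing them by position, iterate over the matched section nodes and query relative to each one. The sketch again uses xml.etree.ElementTree as a stand-in, with a made-up SAMPLE_THREAD and a hypothetical rows_by_node helper. In Scrapy the same shape would be a loop over response.xpath('//section[@jscontroller="ywEdOe"]') with section.xpath('./...') inside it. This does not reveal the hidden messages; it only guarantees the loop never asks for a section that is not in the response, and that unrelated section elements are not counted by position.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: two message sections plus one unrelated section.
SAMPLE_THREAD = """
<html><body>
  <section><h3>sidebar</h3><p>not a message</p></section>
  <section jscontroller="ywEdOe"><h3>alice</h3><p>first post</p></section>
  <section jscontroller="ywEdOe"><h3>bob</h3><p>second post</p></section>
</body></html>
"""


def rows_by_node(page):
    """Yield one row per matched message section, using relative paths."""
    root = ET.fromstring(page.strip())
    rows = []
    # Only sections carrying the message attribute are visited.
    for section in root.iterfind(".//section[@jscontroller='ywEdOe']"):
        rows.append({
            "name": section.findtext("./h3"),
            "post_content": section.findtext("./p"),
        })
    return rows


thread_rows = rows_by_node(SAMPLE_THREAD)
for row in thread_rows:
    print(row)
```

With this shape, the number of rows always equals the number of message sections actually present, so a shortfall against the thread's real message count points at missing markup rather than at the indexing.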

I'd appreciate any suggestions. I'm new at this, so I've probably overlooked something.
