Scrapy-playwright with multiple start_urls


A similar problem was discussed here, but I was not able to make my code work. The aim is to have scrapy-playwright generate a request/response for each URL in start_urls and parse each response the same way. The CSV with the URLs is read correctly into a list, but start_requests generates no requests. See the commented code below.

import scrapy
import asyncio
from scrapy_playwright.page import PageMethod

class MySpider(scrapy.Spider):
    name = "Forum01"
    allowed_domains = ["example.com"]

    def start_requests(self):
        with open('FullLink.csv') as file:
            start_urls = [line.strip() for line in file]
        print(start_urls) # When Scrapy crawls, the list of URLs is printed correctly
        
        for u in self.start_urls:    
            yield scrapy.Request(
                u,
                meta=dict(
                    playwright=True,
                    playwright_include_page=False,
                    playwright_page_methods=[
                        PageMethod("wait_for_selector", "div.modal-body > p")
                    ], # End of methods
                ), # End of meta
                callback=self.parse
            )

    async def parse(self, response): # does not work with either a sync or an async parse
        for item in response.css('div.modal-content'):
            yield{
                'title': item.css('h1::text').get(),
                'info': item.css('.row+ p::text').get(),
            }   

Do you have an idea how to correctly feed the URLs to the spider? Thank you!


2 Answers

Alexander (BEST ANSWER)

You are iterating over an empty sequence in your for loop instead of the list extracted from the CSV file.

Unless explicitly overwritten, self.start_urls always refers to the empty list created in the scrapy.Spider constructor. Removing the self part of self.start_urls should solve your problem.
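
For context, this is roughly what happens inside Scrapy's Spider constructor (a simplified paraphrase of Scrapy's source, not the exact code):

class Spider:
    def __init__(self, name=None, **kwargs):
        ...
        # If the subclass defined no start_urls of its own,
        # fall back to an empty list
        if not hasattr(self, 'start_urls'):
            self.start_urls = []

The local variable created inside start_requests never touches this attribute, so the loop walks the empty fallback list and yields nothing. With that in mind, here is the corrected spider: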

import scrapy
import asyncio
from scrapy_playwright.page import PageMethod

class MySpider(scrapy.Spider):
    name = "Forum01"
    allowed_domains = ["example.com"]

    def start_requests(self):
        with open('FullLink.csv') as file:
            start_urls = [line.strip() for line in file] 
        print(start_urls) # When Scrapy crawls, the list of URLs is printed correctly
        
        for u in start_urls: # <- changed self.start_urls to just start_urls
            yield scrapy.Request(
                u,
                meta=dict(
                    playwright=True,
                    playwright_include_page=False,
                    playwright_page_methods=[
                        PageMethod("wait_for_selector", "div.modal-body > p")
                    ], # End of methods
                ), # End of meta
                callback=self.parse
            )

    async def parse(self, response):
        for item in response.css('div.modal-content'):
            yield{
                'title': item.css('h1::text').get(),
                'info': item.css('.row+ p::text').get(),
            }  
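
With that one change the spider yields one request per line in FullLink.csv. For completeness, scrapy-playwright also needs its download handlers and the asyncio reactor registered in settings.py; these are the settings from the scrapy-playwright README (skip this if your project already sets them):

# settings.py -- required for scrapy-playwright to handle requests
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

Then run the spider with, for example, scrapy crawl Forum01 -O items.json.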

Tappetinoorange

Problem

The error comes from for u in self.start_urls: you are looping over an empty list.

In start_requests you assign the file contents to a local variable with start_urls = [line.strip() for line in file], but the loop reads the attribute self.start_urls. One name has self and the other does not, so the loop iterates Scrapy's default empty list instead of the list you just built.
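
A minimal sketch of the difference (the Demo class and URL here are just for illustration):

import scrapy

class Demo(scrapy.Spider):
    name = "demo"

    def start_requests(self):
        start_urls = ["https://example.com"]  # local name, gone after this call
        print(start_urls)       # ['https://example.com']
        print(self.start_urls)  # [] -- the attribute your original loop iterates
        return []               # yield nothing; this only contrasts the two names

list(Demo().start_requests())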

Solution

There are two ways to make scrapy-playwright generate a request/response for each URL in start_urls and parse each response the same way.

Solution #1

The first solution (but I'm not sure if it works) is to add self to start_urls:

def start_requests(self):
    with open('FullLink.csv') as file:
        self.start_urls = [line.strip() for line in file] #EDIT HERE, WITH SELF
    print(self.start_urls) #WITH SELF here as well, otherwise this raises NameError

    for u in self.start_urls: #WITH SELF
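
(For the record, this variant does work: assigning to self.start_urls overwrites the empty list that the scrapy.Spider constructor created, so the loop then sees the URLs from the file.)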

Solution #2

The second solution, which is guaranteed to work, is simpler: remove self entirely and use the local start_urls everywhere, both in start_urls = [line.strip() for line in file] and in for u in start_urls:

def start_requests(self):
    with open('FullLink.csv') as file:
        start_urls = [line.strip() for line in file] #NO SELF
    print(start_urls) # When Scrapy crawls, the list of URLs is printed correctly
    
    for u in start_urls: #EDIT HERE, NO SELF

Everything else in your code is correct; you just need to fix the part with self.
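
As a side note, if FullLink.csv ever gains a header row or extra columns, plain line.strip() will feed non-URL values into the requests. Python's csv module reads such files more robustly; a sketch, assuming the URL sits in the first column:

import csv

def load_urls(path='FullLink.csv'):
    # Take the first column of each non-empty row; adjust if the URL
    # lives in a different column or the file has a header to skip
    with open(path, newline='') as file:
        return [row[0].strip() for row in csv.reader(file) if row]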