Scrapy-playwright with multiple start_urls


A similar problem was discussed here, but I was not able to make my code work. The aim is to have scrapy-playwright generate a request/response for each URL in start_urls and parse each response the same way. The CSV with the URLs is read correctly into a list, but start_requests generates no requests. See the commented code below.

import scrapy
import asyncio
from scrapy_playwright.page import PageMethod

class MySpider(scrapy.Spider):
    name = "Forum01"
    allowed_domains = ["example.com"]

    def start_requests(self):
        with open('FullLink.csv') as file:
            start_urls = [line.strip() for line in file]
        print(start_urls) # When Scrapy crawls, the list of URLs is printed correctly
        
        for u in self.start_urls:    
            yield scrapy.Request(
                u,
                meta=dict(
                    playwright=True,
                    playwright_include_page=False,
                    playwright_page_methods=[
                        PageMethod("wait_for_selector", "div.modal-body > p")
                    ], # End of methods
                ), # End of meta
                callback=self.parse
            )

    async def parse(self, response): # does not work with either a sync or an async parse
        for item in response.css('div.modal-content'):
            yield{
                'title': item.css('h1::text').get(),
                'info': item.css('.row+ p::text').get(),
            }   

Do you have an idea how to correctly feed the URLs to the spider? Thank you!


2 Answers

Alexander (BEST ANSWER)

You are iterating over an empty sequence in your for loop instead of the list extracted from the CSV file.

Unless explicitly overwritten, self.start_urls always refers to the empty list created in the scrapy.Spider constructor. Removing the self part of self.start_urls should solve your problem.
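
For context, this is roughly what happens inside Scrapy's Spider constructor (a simplified paraphrase of Scrapy's source, not the exact code):

class Spider:
    def __init__(self, name=None, **kwargs):
        ...
        # If the subclass defined no start_urls of its own,
        # fall back to an empty list
        if not hasattr(self, 'start_urls'):
            self.start_urls = []

The local variable created inside start_requests never touches this attribute, so the loop walks the empty fallback list and yields nothing. With that in mind, here is the corrected spider: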

import scrapy
import asyncio
from scrapy_playwright.page import PageMethod

class MySpider(scrapy.Spider):
    name = "Forum01"
    allowed_domains = ["example.com"]

    def start_requests(self):
        with open('FullLink.csv') as file:
            start_urls = [line.strip() for line in file] 
        print(start_urls) # When Scrapy crawls, the list of URLs is printed correctly
        
        for u in start_urls: # <- changed self.start_urls to just start_urls
            yield scrapy.Request(
                u,
                meta=dict(
                    playwright=True,
                    playwright_include_page=False,
                    playwright_page_methods=[
                        PageMethod("wait_for_selector", "div.modal-body > p")
                    ], # End of methods
                ), # End of meta
                callback=self.parse
            )

    async def parse(self, response):
        for item in response.css('div.modal-content'):
            yield{
                'title': item.css('h1::text').get(),
                'info': item.css('.row+ p::text').get(),
            }  
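
With that one change the spider yields one request per line in FullLink.csv. For completeness, scrapy-playwright also needs its download handlers and the asyncio reactor registered in settings.py; these are the settings from the scrapy-playwright README (skip this if your project already sets them):

# settings.py -- required for scrapy-playwright to handle requests
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

Then run the spider with, for example, scrapy crawl Forum01 -O items.json.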

Tappetinoorange

Problem

The error comes from for u in self.start_urls: you are looping over an empty list.

In start_requests you assign the file contents to a local variable with start_urls = [line.strip() for line in file], but the loop reads the attribute self.start_urls. One name has self and the other does not, so the loop iterates Scrapy's default empty list instead of the list you just built.
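
A minimal sketch of the difference (the Demo class and URL here are just for illustration):

import scrapy

class Demo(scrapy.Spider):
    name = "demo"

    def start_requests(self):
        start_urls = ["https://example.com"]  # local name, gone after this call
        print(start_urls)       # ['https://example.com']
        print(self.start_urls)  # [] -- the attribute your original loop iterates
        return []               # yield nothing; this only contrasts the two names

list(Demo().start_requests())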

Solution

There are two ways to make scrapy-playwright generate a request/response for each URL in start_urls and parse each response the same way.

Solution #1

The first solution (but I'm not sure if it works) is to add self to start_urls:

def start_requests(self):
    with open('FullLink.csv') as file:
        self.start_urls = [line.strip() for line in file] #EDIT HERE, WITH SELF
    print(self.start_urls) #WITH SELF here as well, otherwise this raises NameError

    for u in self.start_urls: #WITH SELF
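
(For the record, this variant does work: assigning to self.start_urls overwrites the empty list that the scrapy.Spider constructor created, so the loop then sees the URLs from the file.)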

Solution #2

The second solution, which is guaranteed to work, is simpler: remove self entirely and use the local start_urls everywhere, both in start_urls = [line.strip() for line in file] and in for u in start_urls:

def start_requests(self):
    with open('FullLink.csv') as file:
        start_urls = [line.strip() for line in file] #NO SELF
    print(start_urls) # When Scrapy crawls, the list of URLs is printed correctly
    
    for u in start_urls: #EDIT HERE, NO SELF

Everything else in your code is correct; you just need to fix the part with self.
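
As a side note, if FullLink.csv ever gains a header row or extra columns, plain line.strip() will feed non-URL values into the requests. Python's csv module reads such files more robustly; a sketch, assuming the URL sits in the first column:

import csv

def load_urls(path='FullLink.csv'):
    # Take the first column of each non-empty row; adjust if the URL
    # lives in a different column or the file has a header to skip
    with open(path, newline='') as file:
        return [row[0].strip() for row in csv.reader(file) if row]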