Why does the Scrapy spider run so slowly (or not run at all) after being imported in the main file?


I wrote a Scrapy spider that runs normally as a standalone Python script and scrapes texts as expected. When I import it as a module in the main.py file and try to run it from there, my IDE just keeps running and no text gets scraped. The import itself is correct (I tested it with a print statement). Nothing happens after the following messages; I waited for about an hour. RAM is not the problem. My IDE is Spyder.
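
A minimal sketch of that import check (this is just an assumption about where the print statement went; it is not part of the final main.py shown below):

import wdrspider                          # the spider module shown in full below
print("wdrspider imported:", wdrspider)   # this printed, so the import itself works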

Here are the messages in the console:

request.py:254: ScrapyDeprecationWarning: '2.6' is a deprecated value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting.

It is also the default value. In other words, it is normal to get this warning if you have not defined a value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting. This is so for backward compatibility reasons, but it will change in a future version of Scrapy.

See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
return cls(crawler)
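
(As an aside, I believe this deprecation warning itself can be silenced by pinning the setting in settings.py, so I don't think it is the cause of the hang; something like:)

REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"  # pins the request fingerprinter; only silences the warning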

Does anyone have any idea what the problem could be? Thank you very much!

  1. Here is the code in my Python file for the Scrapy spider:
import scrapy
import urllib.parse
from scrapy.item import Item, Field
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

#%%
#Define Items for the pipeline
class CustomItem(Item):
    links = Field()
    text = Field()

#%%
#Define the spider
#Get the links from the website
class SecondSpiderSpider(scrapy.Spider):
    name = "second_spider"
    allowed_domains = ["www1.wdr.de"]
    start_urls = ["https://www1.wdr.de/abisz120.html"]
    
    custom_settings = {
        'DEPTH_LIMIT': 0,
        'DOWNLOAD_DELAY': 5,
        'ITEM_PIPELINES':{'bremenspider.pipelines.MongoDBPipeline2':300},
        'MONGO_COLLECTION_NAME2': 'human_written_texts(wdr)' # Collection for SecondSpider
    }
    
#%%        
#Get the links from the website https://www1.wdr.de/
    def parse(self, response):
        links1 = response.css('ul.list a::attr(href)').getall()
        base_link = 'https://www1.wdr.de/'
        for link in links1:
            absolute_url = urllib.parse.urljoin(base_link, link)
            yield scrapy.Request(url=absolute_url, callback=self.link_parse)
            
#%%          
#Get the links from the pages one layer deeper than the start page
    def link_parse(self, response):
        links2 = response.css('div.teaser a::attr(href)').getall()
        base_link2 = 'https://www1.wdr.de/'
        for link in links2:
            absolute_url2 = urllib.parse.urljoin(base_link2, link)
            yield scrapy.Request(absolute_url2, callback=self.text_parse,
                                 meta={'url': absolute_url2})
#%%
#Get the text from the links
    def text_parse(self, response):
        item = CustomItem()
        item['text'] = response.xpath('//p[@class="text small"]/descendant-or-self::*/text()').getall()
        item['links'] = response.meta['url']
        extracted_text = ' '.join(item['text']).strip()
        if extracted_text:
            yield item
            
#%%
# Define a function to run the spider from main file.

def run_wdr_spider():
    process = CrawlerProcess(get_project_settings())
    process.crawl(SecondSpiderSpider)
    try:
        process.start(stop_after_crawl=False)
    except KeyboardInterrupt:
        # Handle KeyboardInterrupt (Ctrl+C)
        process.stop()

if __name__ == "__main__":
    run_wdr_spider()
    
  2. Here is the code in the main.py file:
#%%
import sys
import os
import subprocess
import pandas as pd
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


#Module: text crawling.
#Crawl texts from the website.
import wdrspider

#Call the function to crawl texts from the WDR website.
wdrspider.run_wdr_spider()