I wrote a Scrapy spider that runs normally as a standalone Python script and scrapes text. But when I import it as a module in my main.py file and try to run it from there, my IDE just keeps running and no text gets scraped. The import itself is correct; I tested it by writing a print function (see the sketch below). There is no problem with RAM, and my IDE is Spyder.
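Roughly, the import check looked like this (the function name and message are placeholders, not my actual code):

# Temporary check to confirm the import works.
# In wdrspider.py (the spider file shown further down):
def hello_check():
    print("wdrspider was imported")

# In main.py:
import wdrspider
wdrspider.hello_check()   # this prints as expected, so the import itself is fine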
Nothing happens after the following messages in the console (I waited for about an hour):
request.py:254: ScrapyDeprecationWarning: '2.6' is a deprecated value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting.
It is also the default value. In other words, it is normal to get this warning if you have not defined a value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting. This is so for backward compatibility reasons, but it will change in a future version of Scrapy.
See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
return cls(crawler)
Does anyone have any idea what the problem could be? Thank you very much!
- Here is the code in my Python file for the Scrapy spider:
import scrapy
import urllib.parse
from scrapy.item import Item, Field
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

#%%
# Define Items for the pipeline
class CustomItem(Item):
    links = Field()
    text = Field()

#%%
# Define the spider
# Get the links from the website
class SecondSpiderSpider(scrapy.Spider):
    name = "second_spider"
    allowed_domains = ["www1.wdr.de"]
    start_urls = ["https://www1.wdr.de/abisz120.html"]
    custom_settings = {
        'DEPTH_LIMIT': 0,
        'DOWNLOAD_DELAY': 5,
        'ITEM_PIPELINES': {'bremenspider.pipelines.MongoDBPipeline2': 300},
        'MONGO_COLLECTION_NAME2': 'human_written_texts(wdr)'  # Collection for SecondSpider
    }

    #%%
    # Get the links from the website https://www1.wdr.de/
    def parse(self, response):
        links1 = response.css('ul.list a::attr(href)').getall()
        base_link = 'https://www1.wdr.de/'
        for link in links1:
            absolute_url = urllib.parse.urljoin(base_link, link)
            yield scrapy.Request(url=absolute_url, callback=self.link_parse)

    #%%
    # Get the links from the pages one layer deeper than the start page
    def link_parse(self, response):
        links2 = response.css('div.teaser a::attr(href)').getall()
        base_link2 = 'https://www1.wdr.de/'
        for link in links2:
            absolute_url2 = urllib.parse.urljoin(base_link2, link)
            yield scrapy.Request(absolute_url2, callback=self.text_parse,
                                 meta={'url': absolute_url2})

    #%%
    # Get the text from the linked pages
    def text_parse(self, response):
        item = CustomItem()
        item['text'] = response.xpath('//p[@class="text small"]/descendant-or-self::*/text()').getall()
        item['links'] = response.meta['url']
        extracted_text = ' '.join(item['text']).strip()
        if extracted_text:
            yield item

#%%
# Define a function to run the spider from the main file.
def run_wdr_spider():
    process = CrawlerProcess(get_project_settings())
    process.crawl(SecondSpiderSpider)
    try:
        process.start(stop_after_crawl=False)
    except KeyboardInterrupt:
        # Handle KeyboardInterrupt (Ctrl+C)
        process.stop()

if __name__ == "__main__":
    run_wdr_spider()
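The ITEM_PIPELINES setting above points at a MongoDB pipeline in bremenspider/pipelines.py, which stores the items when the spider runs on its own. Simplified, it looks roughly like the sketch below; the MONGO_URI and MONGO_DATABASE names and their defaults are placeholders here, not my exact settings:

# Simplified sketch of bremenspider/pipelines.py (my real pipeline is longer;
# the URI and database defaults below are placeholders).
import pymongo
from itemadapter import ItemAdapter

class MongoDBPipeline2:
    def __init__(self, mongo_uri, mongo_db, collection_name):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
        self.collection_name = collection_name

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI', 'mongodb://localhost:27017'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'wdr_texts'),
            collection_name=crawler.settings.get('MONGO_COLLECTION_NAME2', 'human_written_texts(wdr)'),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Store the scraped links/text in the configured collection.
        self.db[self.collection_name].insert_one(ItemAdapter(item).asdict())
        return item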
- Here is the code in the main.py file:
#%%
import sys
import os
import subprocess
import pandas as pd
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Module: text crawling.
# Crawl the texts from the website.
import wdrspider

# Call the function that crawls texts from the WDR website.
wdrspider.run_wdr_spider()
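The other imports in main.py (sys, os, subprocess, pandas) are used elsewhere in the project and are not involved in the call above. For clarity, the crawling step boils down to the two lines sketched here (a stripped-down sketch, not a separate file I keep):

# Stripped-down sketch of what main.py does for the crawling step.
import wdrspider              # the import itself works (verified with a print check)

wdrspider.run_wdr_spider()    # this call never returns and no text gets scraped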