How to set up Celery Beat for a Scrapy project?


I have a Scrapy project and I want to run my spider every day, so I use Celery to do that. This is my tasks.py file:

from celery import Celery, shared_task
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy_project.scrapy_project.spiders import myspider

app = Celery('tasks', broker='redis://localhost:6379/0')

@shared_task
def scrape_news_website():
    print('SCRAPING RIGHT NOW!')
    settings = get_project_settings()
    process = CrawlerProcess(settings)
    process.crawl(myspider)
    process.start(stop_after_crawl=False)

I've set stop_after_crawl=False because when it is True, I get this error after the first scrape:

raise error.ReactorNotRestartable() 
twisted.internet.error.ReactorNotRestartable

Now, with stop_after_crawl set to False, another problem shows up: after four scrapes (four because the worker concurrency is four), the Celery worker stops doing tasks, because the previous crawl processes are still running and there is no free worker child process left. I don't know how to fix this. I would appreciate your help.


1 answer below

Answer from Сергей Мельник:

The issue you're facing with Celery and Scrapy is that Scrapy's Twisted reactor cannot be restarted once it has stopped, and when you set stop_after_crawl=False the reactor keeps running after a crawl, which causes problems when you try to run multiple crawls in the same process. Here's how you can solve these problems:

Try this variant to fix the problem:

from celery import shared_task
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy_project.scrapy_project.spiders import myspider

def run_spider():
    # Build a fresh CrawlerProcess from the project settings and block until the crawl finishes.
    settings = get_project_settings()
    process = CrawlerProcess(settings)
    process.crawl(myspider)  # myspider must resolve to the Spider class (or its name), not a module
    process.start()

@shared_task
def scrape_news_website():
    print('SCRAPING RIGHT NOW!')
    run_spider()
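
Since the goal is to run the spider every day, you also need to give Celery Beat a schedule for this task. The following is only a minimal sketch, not part of the original setup: it assumes the app and task above live in a module called tasks, and the daily run time (06:00) is just an example.

from celery import Celery
from celery.schedules import crontab

app = Celery('tasks', broker='redis://localhost:6379/0')

# Run the scraping task once a day (the exact time is an example).
app.conf.beat_schedule = {
    'scrape-news-every-day': {
        'task': 'tasks.scrape_news_website',  # dotted path assumes the task lives in tasks.py
        'schedule': crontab(hour=6, minute=0),
    },
}

Then start the scheduler next to the worker, for example with celery -A tasks beat and celery -A tasks worker.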

Regarding the issue where the Celery worker stops picking up tasks after several scrapes, make sure each crawl actually terminates and releases its worker child process, so the pool has free processes for the next run.
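
One common way to achieve that (this part is not in the original answer, just a sketch of one approach) is to run every crawl in its own child process. Each crawl then gets a fresh Twisted reactor, process.start() can keep its default stop_after_crawl=True, and the Celery worker process is freed as soon as the crawl ends:

from multiprocessing import Process

from celery import shared_task
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy_project.scrapy_project.spiders import myspider


def crawl_in_child_process():
    # This runs inside the child process, so the reactor lives and dies with it.
    settings = get_project_settings()
    process = CrawlerProcess(settings)
    process.crawl(myspider)  # assumes myspider resolves to the Spider class
    process.start()  # blocks until the crawl finishes


@shared_task
def scrape_news_website():
    # Isolate the crawl so the worker process itself never starts a reactor
    # and stays reusable for the next scheduled run.
    child = Process(target=crawl_in_child_process)
    child.start()
    child.join()  # keep the task running until the crawl is done

If the default prefork pool complains that daemonic processes are not allowed to have children, a common workaround is to import Process from billiard (the multiprocessing fork that ships with Celery) instead of multiprocessing, or to run the worker with a non-prefork pool such as celery -A tasks worker --pool=solo.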