I have a scraper bot that works fine, but as time passes its scraping speed drops.
I added concurrent requests, 'DOWNLOAD_DELAY': 0 and 'AUTOTHROTTLE_ENABLED': False, but the result is the same: it starts at a fast pace and then gets slower.
I guess it is about caching, but I don't know whether I have to clear a cache, or why it behaves this way.
The code is below; I would like to hear your comments.
import scrapy
from scrapy.crawler import CrawlerProcess
import pandas as pd
import scrapy_xlsx

itemList = []

class plateScraper(scrapy.Spider):
    name = 'scrapePlate'
    allowed_domains = ['dvlaregistrations.dvla.gov.uk']
    FEED_EXPORTERS = {'xlsx': 'scrapy_xlsx.XlsxItemExporter'}
    custom_settings = {'FEED_EXPORTERS': FEED_EXPORTERS, 'FEED_FORMAT': 'xlsx', 'FEED_URI': 'output_r00.xlsx',
                       'LOG_LEVEL': 'INFO', 'DOWNLOAD_DELAY': 0, 'CONCURRENT_ITEMS': 300,
                       'CONCURRENT_REQUESTS': 30, 'AUTOTHROTTLE_ENABLED': False}

    def start_requests(self):
        # read the plate numbers to search for from the input spreadsheet
        df = pd.read_excel('data.xlsx')
        columnA_values = df['PLATE']
        for row in columnA_values:
            global plate_num_xlsx
            plate_num_xlsx = row
            base_url = f"https://dvlaregistrations.dvla.gov.uk/search/results.html?search={plate_num_xlsx}&action=index&pricefrom=0&priceto=&prefixmatches=&currentmatches=&limitprefix=&limitcurrent=&limitauction=&searched=true&openoption=&language=en&prefix2=Search&super=&super_pricefrom=&super_priceto="
            url = base_url
            yield scrapy.Request(url, callback=self.parse, cb_kwargs={'plate_num_xlsx': plate_num_xlsx})

    def parse(self, response, plate_num_xlsx=None):
        plate = response.xpath('//div[@class="resultsstrip"]/a/text()').extract_first()
        price = response.xpath('//div[@class="resultsstrip"]/p/text()').extract_first()
        try:
            a = plate.replace(" ", "").strip()
            if plate_num_xlsx == plate.replace(" ", "").strip():
                item = {"plate": plate_num_xlsx, "price": price.strip()}
                itemList.append(item)
                print(item)
                yield item
            else:
                item = {"plate": plate_num_xlsx, "price": "-"}
                itemList.append(item)
                print(item)
                yield item
        except:
            item = {"plate": plate_num_xlsx, "price": "-"}
            itemList.append(item)
            print(item)
            yield item

process = CrawlerProcess()
process.crawl(plateScraper)
process.start()

import winsound
winsound.Beep(555, 333)
EDIT: log stats from the run:
{'downloader/request_bytes': 1791806,
'downloader/request_count': 3459,
'downloader/request_method_count/GET': 3459,
'downloader/response_bytes': 38304184,
'downloader/response_count': 3459,
'downloader/response_status_count/200': 3459,
'dupefilter/filtered': 6,
'elapsed_time_seconds': 3056.810985,
'feedexport/success_count/FileFeedStorage': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2023, 1, 27, 22, 31, 17, 17188),
'httpcompression/response_bytes': 238767410,
'httpcompression/response_count': 3459,
'item_scraped_count': 3459,
'log_count/INFO': 61,
'log_count/WARNING': 2,
'response_received_count': 3459,
'scheduler/dequeued': 3459,
'scheduler/dequeued/memory': 3459,
'scheduler/enqueued': 3459,
'scheduler/enqueued/memory': 3459,
'start_time': datetime.datetime(2023, 1, 27, 21, 40, 20, 206203)}
2023-01-28 02:31:17 [scrapy.core.engine] INFO: Spider closed (finished)
Process finished with exit code 0
On a very first look the code looks OK. However, I see several points here that can help increase scraping speed (a settings sketch follows this list):

- CONCURRENT_REQUESTS_PER_DOMAIN setting - since it wasn't changed, it keeps its default value of 8 (no more than 8 requests to the same domain at the same time). Recommended to increase it up to the value of CONCURRENT_REQUESTS.
- CONCURRENT_ITEMS setting - we had several reports that increasing the value of this setting may lead to degraded performance (/scrapy/issues/5182). Recommended to keep it at its default.
- .xlsx file - it is a zipped archive of XML documents. The exporter uses openpyxl, which keeps the whole file contents and its parsed XML trees in RAM. Each added row grows the XML tree of the created .xlsx file, so appending every new row becomes more and more CPU intensive. Recommended to compare scraping speed against Scrapy's built-in feed exporters (CSV or JSON lines).