StormCrawler: urlfrontier.StatusUpdaterBolt performance bottleneck


We are running a basic StormCrawler topology, which we want to optimize for high throughput.

Topology (a rough wiring sketch in Java follows the list):

  • FrontierSpout
  • Fetcher
  • JSoupParser
  • DummyIndexer
  • WarcBolt
  • urlfrontier.StatusUpdaterBolt
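
Roughly, the components are wired together as follows (a simplified Java sketch; the class names are the standard ones from the StormCrawler core and urlfrontier modules, the indexer and WARC bolt are left out for brevity, and all parallelism hints except the Spout's are illustrative):

import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

import com.digitalpebble.stormcrawler.Constants;
import com.digitalpebble.stormcrawler.bolt.FetcherBolt;
import com.digitalpebble.stormcrawler.bolt.JSoupParserBolt;
import com.digitalpebble.stormcrawler.urlfrontier.Spout;
import com.digitalpebble.stormcrawler.urlfrontier.StatusUpdaterBolt;

public class CrawlTopologySketch {

    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();

        // URLs are pulled from the URLFrontier (GetURLs) by the spout
        builder.setSpout("frontier_spout", new Spout(), 10);

        builder.setBolt("fetcher", new FetcherBolt(), 10)
                .shuffleGrouping("frontier_spout");
        builder.setBolt("parser", new JSoupParserBolt(), 4)
                .localOrShuffleGrouping("fetcher");
        // DummyIndexer and WarcBolt are attached in the same way (omitted here)

        // every component emits status updates (DISCOVERED, FETCHED, ERROR, ...)
        // on the status stream; they all converge on the single
        // urlfrontier.StatusUpdaterBolt, which pushes them to the Frontier (PutURLs)
        builder.setBolt("status", new StatusUpdaterBolt(), 1)
                .fieldsGrouping("fetcher", Constants.StatusStreamName, new Fields("url"))
                .fieldsGrouping("parser", Constants.StatusStreamName, new Fields("url"));

        // topology config and submission omitted
    }
}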

However, after running the topology for a short time, the cache of the StatusUpdaterBolt seems to fill up and the tuples fail (screenshot from the Storm UI).
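
As far as we understand the bolt, every status update is parked per URL in a time-limited waitAck cache until the Frontier's gRPC stream confirms the corresponding PutURLs call; entries that expire are failed and later replayed by Storm. A simplified sketch of that pattern as we read it (not the actual StormCrawler source; the 60-second expiry is an assumption):

import java.util.LinkedList;
import java.util.List;
import java.util.concurrent.TimeUnit;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.tuple.Tuple;

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.RemovalCause;
import com.google.common.cache.RemovalListener;

class WaitAckSketch {

    private final OutputCollector collector;
    private final Cache<String, List<Tuple>> waitAck;

    WaitAckSketch(OutputCollector collector) {
        this.collector = collector;
        RemovalListener<String, List<Tuple>> onEviction = notification -> {
            if (notification.getCause() == RemovalCause.EXPIRED) {
                // this is what produces the "Evicted ... from waitAck ... [EXPIRED]" warnings;
                // the failed tuples are replayed by Storm and put load on the topology again
                notification.getValue().forEach(this.collector::fail);
            }
        };
        this.waitAck = CacheBuilder.newBuilder()
                .expireAfterWrite(60, TimeUnit.SECONDS) // expiry value is an assumption
                .removalListener(onEviction)
                .build();
    }

    // called after the status update has been handed to the Frontier via PutURLs
    void park(String url, Tuple tuple) {
        waitAck.asMap().computeIfAbsent(url, k -> new LinkedList<>()).add(tuple);
    }

    // called when the Frontier's gRPC stream acknowledges the URL
    void confirm(String url) {
        List<Tuple> parked = waitAck.getIfPresent(url);
        if (parked != null) {
            parked.forEach(collector::ack);
            waitAck.invalidate(url);
        }
    }
}

If the Frontier cannot keep up with the PutURLs calls, the confirmations arrive too late, the parked entries expire, and the cache appears permanently full.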

The situation looks as follows in the Grafana dashboard of the URLFrontier (two screenshots from the Grafana dashboard; the first shortly after the start, the second after a few minutes).

These failures occur especially early when the URLFrontier is running on a remote server rather than locally next to the crawler. However, for our use case of the StormCrawler it is crucial that the crawlers run on different servers, possibly separated by a large geographic distance, while still sharing one Frontier. How is it possible to achieve high throughput and crawling speed in this distributed setup without the StatusUpdaterBolt's cache running full?

What we have tried so far: we modified the crawling pipeline so that DISCOVERED URLs are written to a log file, which is later uploaded to the Frontier, while the remaining URLs are sent to the URLFrontier in the normal way (PutURLs command).
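
A rough sketch of that split (simplified; the "url" and "status" fields are those of the standard StormCrawler status stream, and the file name is arbitrary):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

import org.apache.storm.tuple.Tuple;

import com.digitalpebble.stormcrawler.persistence.Status;

public class DiscoveredSplitSketch {

    // local buffer file for DISCOVERED URLs, bulk-uploaded to the Frontier later
    private final Path discoveredLog = Path.of("discovered.log");

    // routes one status-stream tuple: DISCOVERED to the log file, the rest to the Frontier
    public void route(Tuple tuple) throws IOException {
        String url = tuple.getStringByField("url");
        Status status = (Status) tuple.getValueByField("status");

        if (Status.DISCOVERED.equals(status)) {
            Files.writeString(discoveredLog, url + System.lineSeparator(),
                    StandardCharsets.UTF_8, StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } else {
            // handed to the urlfrontier.StatusUpdaterBolt / PutURLs as before
            sendToFrontier(tuple);
        }
    }

    private void sendToFrontier(Tuple tuple) {
        // placeholder for the normal PutURLs path
    }
}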

Is there another way to address this issue and overcome the bottleneck? Something like:

  • configuring the urlfrontier.StatusUpdaterBolt properly so that the cache does not run full?
  • changing the implementation of the waitAck cache in the StatusUpdaterBolt?
  • or even dropping the waitAck cache completely (see the sketch after this list)? This would mean that all tuples are acked immediately and none fail, so the communication between crawler and Frontier would follow a "fire & forget" principle. What consequences would that have, and would it be a worthwhile trade-off for crawling speed?
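
For illustration, the third option would amount to something like the following (our own sketch, not an existing StormCrawler or URLFrontier feature):

import org.apache.storm.task.OutputCollector;
import org.apache.storm.tuple.Tuple;

public class FireAndForgetSketch {

    private final OutputCollector collector;

    public FireAndForgetSketch(OutputCollector collector) {
        this.collector = collector;
    }

    // called for every status-stream tuple
    public void execute(Tuple tuple) {
        // hand the status update to the asynchronous PutURLs stream here
        // (streamObserver.onNext(...) in gRPC terms), then...
        collector.ack(tuple); // ...ack right away: no waitAck bookkeeping, nothing can expire
        // trade-off: if the Frontier drops or never processes the message, Storm
        // will not replay the tuple, and the status update (e.g. newly DISCOVERED
        // links) is presumably lost for good
    }
}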

Thank you very much for your help!


There is 1 solution below.

Answer by Michael Dinzinger:

1st test: Running the described setup with URLFrontier (default RocksDB implementation)

In this case, the Frontier becomes unresponsive and the StormCrawler slows down, as it no longer receives any new URLs from the Spout. The crawler is configured with parallelism 10 for the Spout, which presumably puts too high a workload on the URLFrontier. Here are the corresponding logs:

11:50:52.413 [grpc-default-executor-2] INFO  c.u.service.AbstractFrontierService - Sent 31 from 25 queue(s) in 80647 msec; tried 1005705 queues. 6c57df91-69e2-4fb7-af81-7a1e$
11:50:52.443 [grpc-default-executor-2] INFO  c.u.service.AbstractFrontierService - Received request to get fetchable URLs [max queues 25, max URLs 10, delay 30] 1c6449d8-76e$
11:50:57.533 [grpc-default-executor-12] INFO  c.u.service.AbstractFrontierService - Sent 26 from 25 queue(s) in 60115 msec; tried 1005705 queues. 3f916309-6f8a-4ec6-bac0-83d$
11:50:57.568 [grpc-default-executor-12] INFO  c.u.service.AbstractFrontierService - Received request to get fetchable URLs [max queues 25, max URLs 10, delay 30] 8f312ecf-36$
11:51:27.429 [grpc-default-executor-9] INFO  c.u.service.AbstractFrontierService - Sent 25 from 25 queue(s) in 79580 msec; tried 1002027 queues. a3628be2-4b32-4e68-a7e2-09c8$

2nd test: Running with URLFrontier (modified OpenSearch implementation)

Due to the modifications to the URLFrontier, the retrieval of URLs from the Frontier is no longer a bottleneck for the crawler. You can see this in the following log entries from the Frontier:

11:59:40.058 [grpc-default-executor-5] INFO  c.p.urlfrontier.OpensearchService - Sent 34 URLs from 25 host(s) in 1 msec; tried 34 URLs. beedb9a6-5b73-46ef-bdaf-35fd8582ad72
11:59:40.617 [grpc-default-executor-5] INFO  c.p.urlfrontier.OpensearchService - Received request to get fetchable URLs [max queues 25, max URLs 10, delay 30] e8bbda41-7d85-$
11:59:40.618 [grpc-default-executor-5] INFO  c.p.urlfrontier.OpensearchService - Sent 30 URLs from 25 host(s) in 1 msec; tried 30 URLs. e8bbda41-7d85-49f9-b80d-477c51c088dd
11:59:40.662 [grpc-default-executor-5] INFO  c.p.urlfrontier.OpensearchService - Received request to get fetchable URLs [max queues 25, max URLs 10, delay 30] ff7426b2-6809-$
11:59:40.663 [grpc-default-executor-5] INFO  c.p.urlfrontier.OpensearchService - Sent 29 URLs from 25 host(s) in 1 msec; tried 29 URLs. ff7426b2-6809-4737-b373-7e0a9658f8d2

So, when aiming for higher throughput in the interplay between StormCrawler and URLFrontier, the performance bottleneck used to be the retrieval of URLs, i.e. GetURLs. In this setup, it now appears that the upload of URLs via the StatusUpdaterBolt and the PutURLs command has become the bottleneck. The crawler crash depicted in the Storm UI screenshot above is therefore unfortunately difficult to reproduce with the default implementation of URLFrontier.

In the worker.log of StormCrawler, I have encountered many log entries like the following:

2023-06-23 09:58:53.478 c.d.s.u.StatusUpdaterBolt ForkJoinPool.commonPool-worker-9 [WARN] Evicted https://jobs.pascoonline.us from waitAck with 1 values. [EXPIRED]
2023-06-23 09:58:53.478 c.d.s.u.StatusUpdaterBolt ForkJoinPool.commonPool-worker-9 [WARN] Evicted https://jobs.lubbockonline.us from waitAck with 1 values. [EXPIRED]
2023-06-23 09:58:53.479 c.d.s.u.StatusUpdaterBolt ForkJoinPool.commonPool-worker-9 [WARN] Evicted https://jobs.eastorangeonline.us from waitAck with 1 values. [EXPIRED]
2023-06-23 09:58:53.479 c.d.s.u.StatusUpdaterBolt ForkJoinPool.commonPool-worker-9 [WARN] Evicted https://www.ciceroonline.us from waitAck with 1 values. [EXPIRED]
2023-06-23 09:58:53.479 c.d.s.u.StatusUpdaterBolt ForkJoinPool.commonPool-worker-9 [WARN] Evicted https://jobs.losangelesonline.us from waitAck with 1 values. [EXPIRED]
2023-06-23 09:58:53.479 c.d.s.u.StatusUpdaterBolt ForkJoinPool.commonPool-worker-9 [WARN] Evicted https://jobs.elkhartonline.us from waitAck with 1 values. [EXPIRED]
2023-06-23 09:58:53.479 c.d.s.u.StatusUpdaterBolt ForkJoinPool.commonPool-worker-9 [WARN] Evicted https://jobs.carrolltononline.us from waitAck with 1 values. [EXPIRED]
2023-06-23 09:58:53.479 c.d.s.u.StatusUpdaterBolt ForkJoinPool.commonPool-worker-9 [WARN] Evicted https://jobs.elmonteonline.us from waitAck with 1 values. [EXPIRED]
2023-06-23 09:58:53.479 c.d.s.u.StatusUpdaterBolt ForkJoinPool.commonPool-worker-9 [WARN] Evicted https://jobs.danburyonline.us from waitAck with 1 values. [EXPIRED]
2023-06-23 09:58:53.479 c.d.s.u.StatusUpdaterBolt ForkJoinPool.commonPool-worker-9 [WARN] Evicted https://jobs.rockhillonline.us from waitAck with 1 values. [EXPIRED]
2023-06-23 09:58:53.479 c.d.s.u.StatusUpdaterBolt ForkJoinPool.commonPool-worker-9 [WARN] Evicted https://jobs.haverhillonline.us from waitAck with 1 values. [EXPIRED]
2023-06-23 09:58:53.479 c.d.s.u.StatusUpdaterBolt ForkJoinPool.commonPool-worker-9 [WARN] Evicted https://jobs.hanoverparkonline.us/retail from waitAck with 1 values. [EXPIRED]

It seems that the waitAck cache is constantly full. This problem might be related to the URLFrontier, which is presumably not able to handle all the PutURLs calls and keep up with the crawling speed, so it is probably necessary to further improve the URLFrontier software. My original question therefore boils down to this: from the StormCrawler's perspective, is it necessary to wait for the ACK confirmation from the Frontier? Could this be omitted as a trade-off for higher crawling speed?