We are running a basic StormCrawler topology, which we want to optimize for high throughput.
Topology:
- FrontierSpout
- Fetcher
- JSoupParser
- DummyIndexer
- WarcBolt
- urlfrontier.StatusUpdaterBolt
However, after running the topology for a short time, the cache of the StatusUpdaterBolt seems to fill up and tuples start to fail (screenshot from the Storm UI).
The situation looks as follows in the Grafana dashboard of the URLFrontier (two screenshots from the Grafana dashboard; the first shortly after start, the second after a few minutes).
This failure occurs especially early when the URLFrontier runs on a remote server rather than locally next to the crawler. However, for our use case of StormCrawler it is crucial that the crawlers run on different servers, possibly separated by a large geographic distance, while still sharing one Frontier. How can we achieve high throughput and crawling speed in this distributed setup without the cache of the StatusUpdaterBolt filling up?
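One way to see why the remote setup fails earlier: the waitAck cache has to hold every status update that has been sent but not yet acknowledged, and by Little's law that number grows linearly with the round-trip time to the Frontier. A rough sketch (a simplified model, not StormCrawler's actual implementation; the throughput and latency figures are made-up assumptions, not measurements from our setup):

```java
// Back-of-the-envelope sizing of the waitAck cache: by Little's law,
// the number of unacknowledged status updates in flight is roughly
// throughput * round-trip time to the Frontier.
class WaitAckSizing {
    // updatesPerSecond: status updates emitted by the StatusUpdaterBolt
    // rttSeconds: round-trip time of a PutURLs call to the Frontier
    static long inFlight(long updatesPerSecond, double rttSeconds) {
        return Math.round(updatesPerSecond * rttSeconds);
    }

    public static void main(String[] args) {
        // Frontier co-located with the crawler: ~2 ms RTT
        System.out.println(inFlight(5000, 0.002)); // -> 10 slots needed
        // Frontier on a geographically distant server: ~200 ms RTT
        System.out.println(inFlight(5000, 0.2));   // -> 1000 slots needed
    }
}
```

Under these assumed numbers, moving the Frontier from 2 ms to 200 ms away multiplies the required cache capacity by 100 at the same crawl rate, which matches the observation that a fixed-size cache fills up much sooner in the remote setup.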
What we have tried so far: we modified the crawling pipeline so that DISCOVERED URLs are written to a log file, which is later uploaded to the Frontier, while the remaining URLs are sent to the URLFrontier in the normal way (via the `PutURLs` command).
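As a minimal sketch of that workaround (class, method, and status names here are illustrative, not part of the StormCrawler API): DISCOVERED updates are appended to a local log for later bulk upload, while all other status updates still go through the normal `PutURLs` path.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of the modified pipeline: DISCOVERED URLs are buffered
// locally (later uploaded to the Frontier in bulk), all other status
// updates are sent directly via PutURLs. Names are illustrative only.
class StatusRouter {
    enum Status { DISCOVERED, FETCHED, REDIRECTION, ERROR }

    final List<String> discoveredLog = new ArrayList<>(); // stand-in for the log file
    final List<String> putUrls = new ArrayList<>();       // stand-in for the PutURLs stream

    void route(String url, Status status) {
        if (status == Status.DISCOVERED) {
            discoveredLog.add(url); // cheap local append, no ACK to wait for
        } else {
            putUrls.add(url);       // normal path to the Frontier
        }
    }
}
```

The point of the split is that DISCOVERED updates typically dominate the volume (every parsed page can emit many outlinks), so diverting them away from the synchronous `PutURLs` path removes most of the pressure on the waitAck cache.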
Is there another way to address this issue and overcome the bottleneck? Something like:
- How can the urlfrontier.StatusUpdaterBolt be configured to prevent the cache from filling up?
- Should the implementation of the waitAck cache in the StatusUpdaterBolt be changed?
- Or could the waitAck cache be dropped completely? This would mean that all tuples are acked and no tuples fail, so the communication between crawler and Frontier would follow a "fire and forget" principle. What consequences would that have, and would it be a meaningful trade-off for crawling speed?
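To make the trade-off in the last option concrete, here is a toy model (not the actual StatusUpdaterBolt code) of the two policies: with a bounded waitAck cache, the sender rejects new updates once the cap of unacknowledged messages is reached; in fire-and-forget mode, every update is accepted immediately, and a failure on the Frontier side simply goes unnoticed.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the two acknowledgement policies (not StormCrawler code).
class AckPolicyDemo {
    final int capacity;          // waitAck cache size
    final boolean fireAndForget; // if true, never wait for Frontier ACKs
    final Map<Long, String> waitAck = new HashMap<>(); // msgId -> URL, pending ACKs
    long nextId = 0;

    AckPolicyDemo(int capacity, boolean fireAndForget) {
        this.capacity = capacity;
        this.fireAndForget = fireAndForget;
    }

    /** Returns true if the status update was accepted for sending. */
    boolean send(String url) {
        if (fireAndForget) {
            return true;            // ack the tuple right away, track nothing
        }
        if (waitAck.size() >= capacity) {
            return false;           // cache full: the tuple fails, as in the Storm UI
        }
        waitAck.put(nextId++, url); // remember the message until the Frontier ACKs it
        return true;
    }

    void ackFromFrontier(long msgId) {
        waitAck.remove(msgId);      // frees a slot; never called in fire-and-forget mode
    }
}
```

With a slow or distant Frontier, `ackFromFrontier` lags behind `send` and the acked variant starts rejecting updates, which is exactly the failure mode described above. The fire-and-forget variant keeps crawling at full speed, but it loses the ability to retry or even notice failed `PutURLs` calls, so status updates can be silently dropped and the Frontier's view of the crawl can drift from reality.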
Thank you very much for your help!
1st test: Running the described setup with the URLFrontier (default RocksDB implementation). In this case, the Frontier becomes unresponsive and the StormCrawler slows down, as it no longer receives new URLs from the Spout. The crawler is configured with `parallelism: 10` for the Spout, which presumably puts too high a workload on the URLFrontier. Here are the corresponding logs:

2nd test: Running with the URLFrontier (modified OpenSearch implementation). Due to the modifications to the URLFrontier, the retrieval of URLs from the Frontier is no longer a bottleneck for the crawler. You can see this in the following log entries from the Frontier:
So, when aiming for higher throughput in the interplay between StormCrawler and URLFrontier, the performance bottleneck used to be the retrieval of URLs, i.e. `GetURLs`. In the current setup, it seems that the upload of URLs via the StatusUpdaterBolt and the `PutURLs` command has now become the performance bottleneck. Unfortunately, the crawler crash depicted in the Storm UI screenshot above is difficult to reproduce with the default implementation of the URLFrontier.

In the `worker.log` of StormCrawler, I have encountered many of the following log entries:

It seems that the waitAck cache is constantly full. This problem might be related to the URLFrontier, which is presumably unable to handle all the `PutURLs` calls and keep up with the crawling speed, so it is probably necessary to further improve the URLFrontier software. My original question was aimed at the following: from the perspective of StormCrawler, is it necessary to wait for the ACK confirmation from the Frontier? Could this be omitted as a trade-off for higher crawling speed?