Fetching images in StormCrawler without indexing them in status

Asked by aeranginkaman on 23 June 2021 at 04:00

I want to download all the images in the web pages and feed them to a machine learning algorithm that classifies them and extracts the objects within them. I do not want to index them in the status collection; instead, I want to extract them in the JsoupParser bolt, emit their addresses, download them in the topology, and feed them to a computer vision algorithm. Is this possible in StormCrawler?

Julien Nioche On 23 June 2021 at 08:38

If you want to fetch them in the topology, they need to be in the status index. They obviously don't need to be in the content index, as there is no text content to query against. You need to write a custom bolt to save the content of the images to whichever form of storage you want. If you run your crawls on EC2, for example, AWS S3 would be a good fit.
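As a rough illustration of the storage step such a custom bolt would perform, here is a minimal sketch in plain Java that writes the fetched bytes to a local directory, using a SHA-256 hash of the URL as the file name. The class `ImageStore` and its methods `keyFor` and `save` are hypothetical names, not part of StormCrawler. The sketch assumes the bolt receives the URL and the binary content from the fetcher (in a Storm bolt this would typically be read from the incoming tuple inside `execute`), and local disk stands in here for S3 or whatever storage you choose.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not a StormCrawler class: it only shows the
// "save the image bytes somewhere" part of a custom bolt.
final class ImageStore {

    private ImageStore() {
    }

    // Derives a stable, filesystem-safe key from the image URL
    // (lowercase hex SHA-256 of the UTF-8 encoded URL).
    static String keyFor(String url) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(url.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }

    // Writes the image bytes under dir, overwriting any earlier copy
    // of the same URL, and returns the path that was written.
    static Path save(Path dir, String url, byte[] content) throws IOException {
        Files.createDirectories(dir);
        Path target = dir.resolve(keyFor(url));
        return Files.write(target, content);
    }
}
```

Inside the bolt, you would call something like `ImageStore.save(outputDir, url, content)` for each fetched image and then ack the tuple. Swapping the local write for an S3 upload changes only the body of `save`.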

Definitely doable with StormCrawler, in fact several companies use it for that purpose.
