Fetching images in StormCrawler without indexing them in status


I want to download all the images in the web pages and feed them to a machine learning algorithm for classification and for extracting the objects within those images. I do not want to index them in the status collection. Instead, I want to extract them in the JsoupParser bolt, emit their addresses, download them in the topology, and feed them to a computer vision algorithm. Is this possible in StormCrawler?

Julien Nioche On

If you want to fetch them in the topology, they need to be in the status index. They obviously don't need to be in the content index, as there is no text content to query against. Instead, you need to write a custom bolt that saves the content of the images to whichever form of storage you want. If you run your crawls on EC2, for example, AWS S3 would be a good fit.
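As a rough illustration of the storage part of such a bolt, here is a minimal sketch in plain Java. `ImageStore`, `keyFor` and `save` are hypothetical names, not part of StormCrawler, and the local filesystem stands in for whatever storage you pick (S3 or otherwise). The assumption is that your custom bolt would read the URL and the raw bytes from the incoming tuple in its `execute` method and hand them to something like this.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper (not a StormCrawler class): stores fetched image bytes
// under a file name derived from the image URL.
final class ImageStore {
    private final Path baseDir;

    ImageStore(Path baseDir) {
        this.baseDir = baseDir;
    }

    // Hex-encoded SHA-256 of the URL, used as a filesystem-safe, stable key.
    static String keyFor(String url) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(url.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }

    // Writes the bytes to baseDir/<key>, overwriting any earlier copy,
    // and returns the path that was written.
    Path save(String url, byte[] content) {
        try {
            Files.createDirectories(baseDir);
            return Files.write(baseDir.resolve(keyFor(url)), content);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Hashing the URL avoids problems with slashes and query strings in file names. The same key could be reused as an object key if you swap the filesystem for S3.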

This is definitely doable with StormCrawler. In fact, several companies use it for that purpose.