How to pass args to shouldVisit() method in crawler4j?


I want to pass arguments to the shouldVisit() method in crawler4j. I saw an example on the library's documentation page on GitHub that uses a factory, but I can't understand it. Could someone please provide a sample showing how to achieve that?

rzo1 On

Variant 1: Injecting additional parameters as constructor arguments

Additional arguments, beyond the method parameters of shouldVisit(...), need to be passed as constructor arguments to every WebCrawler instance.

That means you can achieve this by using a factory class:

MyWebCrawler.class with two custom arguments (customArgument1 and customArgument2):

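import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyWebCrawler extends WebCrawler {

    // Extra arguments supplied by the factory when the crawler is created
    private final String customArgument1;
    private final String customArgument2;

    public MyWebCrawler(String customArgument1, String customArgument2) {
        this.customArgument1 = customArgument1;
        this.customArgument2 = customArgument2;
    }

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        // The URL is lower-cased, so the arguments must be lower-case to match
        String href = url.getURL().toLowerCase();
        return customArgument1.equals(href) || customArgument2.equals(href);
    }

    @Override
    public void visit(Page page) {
        //do something
    }
}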

For this to work, the factory should be something like this:

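import edu.uci.ics.crawler4j.crawler.CrawlController;

public class MyCrawlerFactory implements CrawlController.WebCrawlerFactory<MyWebCrawler> {

    // Must return a crawler instance (MyWebCrawler), not the factory itself
    @Override
    public MyWebCrawler newInstance() throws Exception {
        return new MyWebCrawler("some argument", "some other argument");
    }
}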

Every time a new instance of MyWebCrawler is created, you can pass your custom arguments.
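If the arguments should not be hard-coded inside the factory, the factory can take them through its own constructor and forward them to each crawler it creates. The following is only a minimal sketch of that idea: so that it runs without crawler4j on the classpath, SketchCrawler and SketchCrawlerFactory are hypothetical stand-ins for WebCrawler and CrawlController.WebCrawlerFactory, and createCrawlers only imitates a controller asking the factory for one crawler per thread.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for crawler4j's WebCrawler (not the real class).
abstract class SketchCrawler {
    abstract boolean shouldVisit(String url);
}

// Hypothetical stand-in for CrawlController.WebCrawlerFactory<T>.
interface SketchCrawlerFactory<T extends SketchCrawler> {
    T newInstance() throws Exception;
}

// Crawler that receives its extra arguments through the constructor.
class CustomArgCrawler extends SketchCrawler {
    private final String customArgument1;
    private final String customArgument2;

    CustomArgCrawler(String customArgument1, String customArgument2) {
        this.customArgument1 = customArgument1;
        this.customArgument2 = customArgument2;
    }

    @Override
    boolean shouldVisit(String url) {
        String href = url.toLowerCase();
        return customArgument1.equals(href) || customArgument2.equals(href);
    }
}

// Factory that stores the arguments once and passes them to every new crawler.
class CustomArgCrawlerFactory implements SketchCrawlerFactory<CustomArgCrawler> {
    private final String customArgument1;
    private final String customArgument2;

    CustomArgCrawlerFactory(String customArgument1, String customArgument2) {
        this.customArgument1 = customArgument1;
        this.customArgument2 = customArgument2;
    }

    @Override
    public CustomArgCrawler newInstance() {
        return new CustomArgCrawler(customArgument1, customArgument2);
    }
}

class FactorySketch {
    // Imitates a controller: requests one fresh crawler per requested thread.
    static <T extends SketchCrawler> List<T> createCrawlers(SketchCrawlerFactory<T> factory, int count) {
        List<T> crawlers = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            try {
                crawlers.add(factory.newInstance());
            } catch (Exception e) {
                throw new IllegalStateException("Could not create crawler", e);
            }
        }
        return crawlers;
    }

    public static void main(String[] args) {
        CustomArgCrawlerFactory factory =
                new CustomArgCrawlerFactory("https://example.com/a", "https://example.com/b");
        for (CustomArgCrawler crawler : createCrawlers(factory, 2)) {
            System.out.println(crawler.shouldVisit("https://example.com/a"));
        }
    }
}
```

With the real library, the same shape would apply to a class implementing CrawlController.WebCrawlerFactory<MyWebCrawler>: the caller decides the values when constructing the factory, and each crawler gets its own copy.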

To use the factory, you would start the crawling process from your CrawlController like this:

controller.start(new MyCrawlerFactory(), numberOfCrawlers);

A similar working example can be found at the official GitHub repository.

Variant 2: Using CrawlController#getCustomData() (deprecated)

You can use customData on the CrawlController object to inject additional data into your web-crawler objects. However, this approach is deprecated and might be removed in future releases of crawler4j.
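To illustrate how this variant differs from the factory, here is a rough sketch of the shared-object idea. It does not use crawler4j itself: SketchController, VisitSettings and CustomDataCrawler are hypothetical stand-ins, and the assumption is that the real controller similarly holds one untyped object that each crawler reads back and casts.

```java
// Hypothetical stand-in for CrawlController: it only stores one shared object.
class SketchController {
    private Object customData;

    void setCustomData(Object customData) {
        this.customData = customData;
    }

    Object getCustomData() {
        return customData;
    }
}

// Hypothetical holder for the extra arguments.
class VisitSettings {
    final String allowedUrl;

    VisitSettings(String allowedUrl) {
        this.allowedUrl = allowedUrl;
    }
}

// Stand-in crawler that reads its arguments from the controller.
class CustomDataCrawler {
    private final SketchController controller;

    CustomDataCrawler(SketchController controller) {
        this.controller = controller;
    }

    boolean shouldVisit(String url) {
        // The data is stored as Object, so a cast is needed on every read.
        VisitSettings settings = (VisitSettings) controller.getCustomData();
        return settings.allowedUrl.equals(url.toLowerCase());
    }
}

class CustomDataSketch {
    public static void main(String[] args) {
        SketchController controller = new SketchController();
        controller.setCustomData(new VisitSettings("https://example.com/start"));

        CustomDataCrawler crawler = new CustomDataCrawler(controller);
        System.out.println(crawler.shouldVisit("https://example.com/start"));
    }
}
```

The cast and the single shared object are the main drawbacks compared with constructor arguments, which are typed and set per crawler instance.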