Directing the search depths in Crawler4j Solr


I am trying to make the crawler abort crawling a certain subdomain whenever it fails to find a relevant page 3 times in a row. After extracting the title and the text of a page, I check whether it is one of the pages I want to submit to my Solr collection (I do not want to add pages that don't match this query).

public void visit(Page page)
{
    int docid = page.getWebURL().getDocid();
    String url = page.getWebURL().getURL();
    String domain = page.getWebURL().getDomain();
    String path = page.getWebURL().getPath();
    String subDomain = page.getWebURL().getSubDomain();
    String parentUrl = page.getWebURL().getParentUrl();
    String anchor = page.getWebURL().getAnchor();

    System.out.println("Docid: " + docid);
    System.out.println("URL: " + url);
    System.out.println("Domain: '" + domain + "'");
    System.out.println("Sub-domain: '" + subDomain + "'");
    System.out.println("Path: '" + path + "'");
    System.out.println("Parent page: " + parentUrl);
    System.out.println("Anchor text: " + anchor);
    System.out.println("ContentType: " + page.getContentType());

    if(page.getParseData() instanceof HtmlParseData) {
        String title, text;

        HtmlParseData theHtmlParseData = (HtmlParseData) page.getParseData();
        title = theHtmlParseData.getTitle();
        text = theHtmlParseData.getText();

        if ((title.toLowerCase().contains(" word1 ") && title.toLowerCase().contains(" word2 "))
                || (text.toLowerCase().contains(" word1 ") && text.toLowerCase().contains(" word2 "))) {
            //
            // submit to SOLR server
            //
            submit(page);

            Header[] responseHeaders = page.getFetchResponseHeaders();
            if (responseHeaders != null) {
                System.out.println("Response headers:");
                for (Header header : responseHeaders) {
                    System.out.println("\t" + header.getName() + ": " + header.getValue());
                }
            }

            failedcounter = 0; // relevant page found: restart counting the 3 consecutive pages

        } else {

            failedcounter++;

        }

        if (failedcounter == 3) {

            failedcounter = 0; // restart counting the next 3 consecutive pages
            int parent = page.getWebURL().getParentDocid();
            parent....HtmlParseData.setOutgoingUrls(null); // pseudocode: this is the line asked about below

        }
    }
}

My question is: how do I write the last line of this code so that I can retrieve the parent page object and delete its outgoing URLs, so that the crawl moves on to the rest of the subdomains? Currently I cannot find a function that gets me from the parent docid to the page data in order to delete the URLs.

1 Answer

Answered by rzo1:

The visit(...) method is called as one of the last statements of processPage(...) (line 523 in WebCrawler).

By the time visit(...) is invoked, the outgoing links have already been added to the crawler's frontier (and might be picked up by other crawler threads as soon as they are added).

You could implement the described behaviour by adjusting shouldVisit(...) or, depending on the exact use case, shouldFollowLinksIn(...) in your crawler.
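
For illustration, here is a minimal sketch of that approach (assuming the crawler4j 4.x WebCrawler API; the per-sub-domain counter map, the MAX_CONSECUTIVE_FAILURES threshold and the submit(...) call are illustrative, not part of crawler4j):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    // crawler4j creates one WebCrawler instance per thread, so the counters are kept
    // in a static, thread-safe map keyed by sub-domain (names are illustrative)
    private static final Map<String, Integer> failuresPerSubDomain = new ConcurrentHashMap<>();
    private static final int MAX_CONSECUTIVE_FAILURES = 3; // illustrative threshold

    @Override
    protected boolean shouldFollowLinksIn(WebURL url) {
        // stop scheduling outgoing links of pages whose sub-domain has produced
        // MAX_CONSECUTIVE_FAILURES irrelevant pages in a row
        return failuresPerSubDomain.getOrDefault(url.getSubDomain(), 0) < MAX_CONSECUTIVE_FAILURES;
    }

    @Override
    public void visit(Page page) {
        String subDomain = page.getWebURL().getSubDomain();

        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData parseData = (HtmlParseData) page.getParseData();
            String title = parseData.getTitle().toLowerCase();
            String text = parseData.getText().toLowerCase();

            boolean relevant = (title.contains(" word1 ") && title.contains(" word2 "))
                    || (text.contains(" word1 ") && text.contains(" word2 "));

            if (relevant) {
                // submit(page);                           // push to Solr as in the question
                failuresPerSubDomain.put(subDomain, 0);    // relevant page: reset the counter
            } else {
                failuresPerSubDomain.merge(subDomain, 1, Integer::sum); // one more miss
            }
        }
    }
}

Because visit(...) runs after the current page's outgoing links have already been scheduled, the counter only takes effect for pages processed afterwards, but that is still enough to stop the crawl from descending further into an unproductive sub-domain.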