I am trying to make the crawler "abort" searching a certain subdomain whenever it fails to find a relevant page 3 times in a row. After extracting the title and the text of the page, I start looking for the correct pages to submit to my Solr collection. (I do not want to add pages that don't match this query.)
public void visit(Page page)
{
    int docid = page.getWebURL().getDocid();
    String url = page.getWebURL().getURL();
    String domain = page.getWebURL().getDomain();
    String path = page.getWebURL().getPath();
    String subDomain = page.getWebURL().getSubDomain();
    String parentUrl = page.getWebURL().getParentUrl();
    String anchor = page.getWebURL().getAnchor();

    System.out.println("Docid: " + docid);
    System.out.println("URL: " + url);
    System.out.println("Domain: '" + domain + "'");
    System.out.println("Sub-domain: '" + subDomain + "'");
    System.out.println("Path: '" + path + "'");
    System.out.println("Parent page: " + parentUrl);
    System.out.println("Anchor text: " + anchor);
    System.out.println("ContentType: " + page.getContentType());

    if (page.getParseData() instanceof HtmlParseData) {
        HtmlParseData theHtmlParseData = (HtmlParseData) page.getParseData();
        String title = theHtmlParseData.getTitle();
        String text = theHtmlParseData.getText();

        if ((title.toLowerCase().contains(" word1 ") && title.toLowerCase().contains(" word2 "))
                || (text.toLowerCase().contains(" word1 ") && text.toLowerCase().contains(" word2 "))) {
            //
            // submit to Solr server
            //
            submit(page);

            Header[] responseHeaders = page.getFetchResponseHeaders();
            if (responseHeaders != null) {
                System.out.println("Response headers:");
                for (Header header : responseHeaders) {
                    System.out.println("\t" + header.getName() + ": " + header.getValue());
                }
            }
            failedcounter = 0; // we start counting for 3 consecutive pages
        } else {
            failedcounter++;
        }

        if (failedcounter == 3) {
            failedcounter = 0; // we start counting for 3 consecutive pages
            int parent = page.getWebURL().getParentDocid();
            // pseudocode -- this is the part I don't know how to write:
            parent....HtmlParseData.setOutgoingUrls(null);
        }
    }
}
My question is: how do I edit the last line of this code so that I can retrieve the parent page object and delete its outgoing URLs, so that the crawl moves on to the rest of the subdomains? Currently I cannot find a function that gets me from the parent docid to the page data in order to delete those URLs.
The visit(...) method is called as one of the last statements of processPage(...) (line 523 in WebCrawler). By that point the outgoing links have already been added to the crawler's frontier (and might be processed by other crawler processes as soon as they are added), so removing them from the parent page inside visit(...) comes too late.

You could implement the behaviour you describe by adjusting shouldVisit(...) or (depending on the exact use case) shouldFollowLinksIn(...) of the crawler.
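A minimal sketch of that approach, assuming crawler4j 4.x method signatures and a single crawler instance (with several crawler processes you would need a shared, thread-safe structure for the counters). The missesPerSubDomain map, the MAX_MISSES constant and the isRelevant(...)/submit(...) helpers are illustrative names, not part of the crawler4j API:

import java.util.HashMap;
import java.util.Map;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    private static final int MAX_MISSES = 3;

    // consecutive non-matching pages seen per subdomain (illustrative)
    private final Map<String, Integer> missesPerSubDomain = new HashMap<>();

    @Override
    protected boolean shouldFollowLinksIn(WebURL url) {
        // stop adding outgoing links of pages from an exhausted subdomain to the frontier
        return missesPerSubDomain.getOrDefault(url.getSubDomain(), 0) < MAX_MISSES;
    }

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        // also refuse to fetch further pages of an exhausted subdomain
        return missesPerSubDomain.getOrDefault(url.getSubDomain(), 0) < MAX_MISSES;
    }

    @Override
    public void visit(Page page) {
        String subDomain = page.getWebURL().getSubDomain();
        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData data = (HtmlParseData) page.getParseData();
            if (isRelevant(data.getTitle(), data.getText())) {
                submit(page);                          // existing Solr submission
                missesPerSubDomain.put(subDomain, 0);  // reset on a hit
            } else {
                missesPerSubDomain.merge(subDomain, 1, Integer::sum);
            }
        }
    }

    // placeholder for the word1/word2 check from the question
    private boolean isRelevant(String title, String text) {
        String t = title.toLowerCase(), b = text.toLowerCase();
        return (t.contains(" word1 ") && t.contains(" word2 "))
                || (b.contains(" word1 ") && b.contains(" word2 "));
    }

    private void submit(Page page) {
        // submit to the Solr server, as in the question
    }
}

The counter is still updated in visit(...), because that is where the page content is available; shouldFollowLinksIn(...) and shouldVisit(...) then keep further links and fetches from that subdomain out of the frontier, which has the same effect as clearing the parent's outgoing URLs would have had.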