I have recently been working with the web crawler Heritrix at the company I work for, and after a while of searching and testing I can't find how to solve our need.
We want to run Heritrix automatically from cron every day to crawl a list of web pages, and check whether any link on those pages points to a site on our list of domains. The difficult part, which I can't find a way to do, is to log the whole trace leading to a link that points to one of our domains.
The job's log file stores every link with some information, but not the trace. For example, we could run a script when the job is done that greps for brazzers, which is a domain in the list; if it finds "brazzers" in the crawl log, it should write the result to another log with the whole trace from start to end:
2015-10-25T20:18:58.369Z 200 91 http://cdn1.ads.brazzers.com/robots.txt XLEP http://cdn1.ads.brazzers.com/ text/plain #021 20151025201857643+726 sha1:CPA63O5POU3CVLCH3VDDIMBJCCWRVLPC - -
Is it possible to do this, or is there another way? I feel very stupid with this stuff and I am not very good at programming.
Thank you very much in advance
Enrique.
Actually, there is a way to analyse the final log of the crawl job once it finishes. Thanks to a response from a Heritrix developer (https://groups.yahoo.com/neo), I now have the rule for getting the trace of a web link:
With this, one way to sort out the lines of the log file and build the web link trace is to write a small snippet, for example in PHP, that follows the rules given.
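As a rough illustration, here is a minimal sketch in Python (not PHP, and not the developer's own code). It assumes that in each crawl.log line the 4th field is the URI, the 5th is the discovery path and the 6th is the referrer, which matches the sample line in the question, and that the trace can be rebuilt by following the referrer back until a line has "-" there (taken to be a start URL). The function names are made up for this example.

```python
def parse_crawl_log(lines):
    """Map each crawled URI to its (discovery_path, referrer) pair.

    Assumes whitespace-separated fields with the URI in position 4,
    the discovery path in position 5 and the referrer in position 6.
    """
    records = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or truncated lines
        uri, hop_path, referrer = fields[3], fields[4], fields[5]
        # Keep the first entry seen for a URI.
        records.setdefault(uri, (hop_path, referrer))
    return records


def trace(uri, records):
    """Follow referrers back from uri; return the chain from start to uri."""
    chain = [uri]
    seen = {uri}
    while True:
        record = records.get(chain[-1])
        if record is None:
            break  # the referrer itself was never logged
        referrer = record[1]
        if referrer == "-" or referrer in seen:
            break  # reached a start URL (assumption) or a loop
        chain.append(referrer)
        seen.add(referrer)
    chain.reverse()
    return chain


def traces_matching(lines, needle):
    """Return one trace per logged URI that contains needle."""
    records = parse_crawl_log(lines)
    return [trace(uri, records) for uri in records if needle in uri]
```

Something like `traces_matching(open("crawl.log"), "brazzers")` would then give, for each matching line, the list of URLs from the start page down to the matching link, which can be written to a separate log. The whole crawl log is held in memory, so a very large log may need a different approach.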