Heritrix 3.2.x: how to read content from WARC files?


Using Heritrix 3.2.x, I crawled a website. Now I want to read the HTML content from the WARC files it created. Can anyone help? I have tried the Python warc tool and the Java-based warc-tools.jar.


There are 3 best solutions below

0
zuups On

To get an idea of what a WARC file consists of, just open it in a text editor (decompress it first if it is a .warc.gz). For a graphical view, you need a tool such as webarchiveplayer, pywb, or OpenWayback.
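What you see in the text editor is a sequence of records: a `WARC/1.0` version line, named header fields, a blank line, and then `Content-Length` bytes of content. A minimal sketch of reading that layout using only the Python standard library could look like the following. The function names (`open_warc`, `iter_warc_records`, `iter_html_pages`) are made up for this illustration and do not come from any WARC library. It also assumes the stored HTTP bodies are not chunked or compressed, which may not hold for every crawl.

```python
import gzip


def open_warc(path):
    """Open a .warc or .warc.gz file as a binary stream (hypothetical helper)."""
    if path.endswith(".gz"):
        # gzip.open reads concatenated gzip members transparently
        return gzip.open(path, "rb")
    return open(path, "rb")


def iter_warc_records(stream):
    """Yield (headers, block) for each record in a WARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.startswith(b"WARC/"):
            continue  # skip the blank lines that separate records
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the WARC header section
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        length = int(headers.get("content-length", 0))
        yield headers, stream.read(length)


def iter_html_pages(stream):
    """Yield (url, html_bytes) for response records that look like HTML."""
    for headers, block in iter_warc_records(stream):
        if headers.get("warc-type") != "response":
            continue
        # The block is the raw HTTP response: status line and headers, then body.
        http_head, sep, body = block.partition(b"\r\n\r\n")
        if not sep:
            continue
        # Crude check: look for text/html anywhere in the HTTP headers.
        if b"text/html" in http_head.lower():
            yield headers.get("warc-target-uri", ""), body
```

To use it, pass the stream from `open_warc("your-file.warc.gz")` (placeholder file name) to `iter_html_pages` and loop over the `(url, html)` pairs. For anything beyond a quick look, a maintained WARC library is likely the safer choice, since this sketch ignores transfer and content encodings.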

0
YMomb On

Have you tried programming a reader using JWAT, or using the JWAT-Tools command line?

jwattools.cmd extract path.to.warc(.gz)
0
Du-Lacoste On

I am using the same version of Heritrix as you. For playback, OpenWayback is used.

OpenWayback is bundled with CDX-Indexer, which can be used to extract an index of the WARC contents. The index is written to a CDX file, from which you can obtain the links to the captured HTML pages, among other things.