Using Heritrix 3.2.x, I crawled a website. Now I want to read the HTML content from the WARC files it created. Can anyone help? I tried the Python warc tool and the Java-based warc-tools.jar.
Heritrix 3.2.x: how to read content from WARC files?
563 Views Asked by Jatinder At
Have you tried writing a reader with JWAT, or using the JWAT-Tools command line?
jwattools.cmd extract path.to.warc(.gz)
To get an idea of what a WARC file consists of, open it in a text editor. For a graphical view, you need a tool such as webarchiveplayer, pywb, or OpenWayback.
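If you would rather read the records yourself, here is a minimal Python sketch that uses only the standard library. It relies on the basic WARC layout you can see in a text editor: a `WARC/x.y` version line, named header fields, a blank line, then `Content-Length` bytes of content. The function names `iter_warc_records` and `html_responses` are made up for this example and are not part of JWAT, Heritrix, or any WARC library. It assumes headers are not folded across lines and that the HTTP body is stored without chunked or compressed transfer encoding, which may not hold for every crawl.

```python
import gzip


def iter_warc_records(stream):
    """Yield (warc_headers, block_bytes) for each record in a binary stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank lines separate one record from the next
        if not line.startswith(b"WARC/"):
            raise ValueError("not a WARC record: %r" % line[:40])
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the WARC header section
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        length = int(headers["content-length"])
        yield headers, stream.read(length)


def html_responses(path):
    """Yield (target_uri, body_bytes) for HTML responses in a .warc or .warc.gz file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as stream:
        for headers, block in iter_warc_records(stream):
            if headers.get("warc-type") != "response":
                continue  # skip warcinfo, request, metadata records, etc.
            # A response block is the raw HTTP response: headers, blank line, body.
            http_head, _, body = block.partition(b"\r\n\r\n")
            if b"text/html" in http_head.lower():
                yield headers.get("warc-target-uri"), body


# Example use (the path is a placeholder for one of your crawl's files):
# for uri, body in html_responses("path.to.warc.gz"):
#     print(uri, len(body))
```

For anything beyond a quick look, a dedicated WARC library is likely a safer choice, since it will also deal with chunked bodies, content encodings, and malformed records.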