Assuming I have:
- the link of the CC*.warc file (and the file itself, if it helps);
- offset; and
- length
How can I get the HTML content of that page?
Thanks for your time and attention.
The below command worked for me.
warcio extract --payload CC-MAIN-20230928130936-20230928160936-00336.warc.gz 988155426 > nutsaboutmoney_from_warcio.html
The resulting HTML file can then be opened in a browser.
This was run in a virtual environment using Python 3.11.6. warcio had been installed in that virtual environment with: pip install warcio
Using warcio it would be simply:
Alternatively, fetch the WARC record using the HTTP range request and then extract the payload at offset 0:
The range starts at offset and ends at offset+length-1, because HTTP byte ranges include both endpoints. See also: getting WARC file.
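For anyone who wants to do the range request from Python instead, here is a minimal sketch using only the standard library. The function names (byte_range, fetch_record, extract_payload) are my own and not part of warcio or any Common Crawl tool. It assumes the fetched range is a single gzip-compressed WARC response record and that the HTTP body is stored without chunked or compressed encoding; warcio handles those cases more robustly, so treat this as an illustration of the offset arithmetic rather than a replacement.

```python
import gzip
import urllib.request


def byte_range(offset: int, length: int) -> str:
    """Build the Range header value for one record.

    HTTP byte ranges are inclusive at both ends, hence the -1.
    """
    return f"bytes={offset}-{offset + length - 1}"


def fetch_record(warc_url: str, offset: int, length: int) -> bytes:
    """Download only the bytes of the record that starts at `offset`."""
    request = urllib.request.Request(
        warc_url, headers={"Range": byte_range(offset, length)}
    )
    with urllib.request.urlopen(request) as response:
        return response.read()


def extract_payload(gzipped_record: bytes) -> bytes:
    """Return the HTTP body (the HTML) of one gzipped WARC response record.

    Assumed layout: WARC headers, blank line, HTTP headers, blank line, body.
    """
    record = gzip.decompress(gzipped_record)
    warc_headers, _, rest = record.partition(b"\r\n\r\n")

    # The WARC Content-Length covers the HTTP headers plus the body,
    # which lets us cut off the trailing record separator cleanly.
    block_length = next(
        int(line.split(b":", 1)[1])
        for line in warc_headers.split(b"\r\n")
        if line.lower().startswith(b"content-length:")
    )
    block = rest[:block_length]

    # Drop the HTTP status line and headers; what remains is the page itself.
    _http_headers, _, body = block.partition(b"\r\n\r\n")
    return body
```

With the WARC file URL, offset and length from the question, extract_payload(fetch_record(url, offset, length)) should give the page bytes, which can be written to a .html file in binary mode.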