Extracting the payload of a single Common Crawl WARC


I can query all occurrences of a certain base URL within a given Common Crawl index, save them all to a file, and get a specific article (test_article_num) using the code below. However, I have not come across a way to extract the raw HTML for that article from the specific crawl data (filename in the output), even though I know the offset and length of the data I want. I feel like there should be a way to do this in Python (similar to this warcio command), maybe using requests and warcio, but I'm not sure. Any help is greatly appreciated.

EDIT:

I found exactly what I needed in this notebook:

import requests
import pathlib
import json
from pprint import pprint

news_website_base = 'hobbsnews.com'
URL = "https://index.commoncrawl.org/CC-MAIN-2022-05-index?url="+news_website_base+"/*&output=json"
website_output = requests.get(URL)
pathlib.Path('data.json').write_bytes(website_output.content)

news_articles = []
test_article_num=300
for line in open('data.json', 'r'):
    news_articles.append(json.loads(line))

pprint(news_articles[test_article_num]) 

news_URL=news_articles[test_article_num]['url']
news_warc_file=news_articles[test_article_num]['filename']
news_offset=news_articles[test_article_num]['offset']
news_length=news_articles[test_article_num]['length']

Code output:

{'digest': 'GY2UDG4G3V3S5TXDL3H7HE6VCSRBD3XR',
 'filename': 'crawl-data/CC-MAIN-2022-05/segments/1642320303729.69/crawldiagnostics/CC-MAIN-20220122012907-20220122042907-00614.warc.gz',
 'length': '40062',
 'mime': 'text/html',
 'mime-detected': 'text/html',
 'offset': '21016412',
 'status': '404',
 'timestamp': '20220122015439',
 'url': 'https://www.hobbsnews.com/2020/03/22/no-new-positive-covid-19-tests-in-lea-in-last-24-hours/%7B%7B%20data.link',
 'urlkey': 'com,hobbsnews)/2020/03/22/no-new-positive-covid-19-tests-in-lea-in-last-24-hours/{{%20data.link'}
https://www.hobbsnews.com/2020/03/22/no-new-positive-covid-19-tests-in-lea-in-last-24-hours/%7B%7B%20data.link
crawl-data/CC-MAIN-2022-05/segments/1642320300343.4/crawldiagnostics/CC-MAIN-20220117061125-20220117091125-00631.warc.gz
21016412
40062
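One detail worth noting about the output above: the index returns offset and length as JSON strings, so they have to be converted to integers before building the HTTP Range header that fetches just this record. A minimal sketch using the values from the output above (the https://data.commoncrawl.org/ prefix is where Common Crawl serves the WARC files):

```python
import json

# A single index record, copied from the output above; note that
# offset and length are JSON strings, not numbers.
record = json.loads(
    '{"filename": "crawl-data/CC-MAIN-2022-05/segments/1642320303729.69/'
    'crawldiagnostics/CC-MAIN-20220122012907-20220122042907-00614.warc.gz",'
    ' "offset": "21016412", "length": "40062"}'
)

offset = int(record['offset'])
length = int(record['length'])

# Range header covering exactly this one gzipped WARC record
byte_range = f'bytes={offset}-{offset + length - 1}'
warc_url = 'https://data.commoncrawl.org/' + record['filename']
print(byte_range)  # bytes=21016412-21056473
```

The byte_range string can then be passed as a Range header to requests.get, as the accepted answer below does.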

2 Answers

Sebastian Nagel (best answer)

With the WARC URL and the WARC record's offset and length, it's simply:

  • download the range from offset until offset+length-1
  • pass the downloaded bytes to a WARC parser

Using curl and warcio CLI:

curl -s -r250975924-$((250975924+6922-1)) \
   https://data.commoncrawl.org/crawl-data/CC-MAIN-2021-10/segments/1614178365186.46/warc/CC-MAIN-20210303012222-20210303042222-00595.warc.gz \
   >warc_temp.warc.gz
warcio extract --payload warc_temp.warc.gz 0

Or with Python requests and warcio (cf. here):

import io

import requests
import warcio

warc_filename = 'crawl-data/CC-MAIN-2021-10/segments/1614178365186.46/warc/CC-MAIN-20210303012222-20210303042222-00595.warc.gz'
warc_record_offset = 250975924
warc_record_length = 6922

response = requests.get(f'https://data.commoncrawl.org/{warc_filename}',
                        headers={'Range': f'bytes={warc_record_offset}-{warc_record_offset + warc_record_length - 1}'})

with io.BytesIO(response.content) as stream:
    for record in warcio.ArchiveIterator(stream):
        html = record.content_stream().read()
Chuck_Berry

You can find out if a webpage (URL) is present in a crawl with, for example:

curl -0 --retry 1000 --retry-all-errors --retry-delay 1 "https://index.commoncrawl.org/CC-MAIN-2023-40-index?url=nutsaboutmoney.com"

This will provide you with the length and offset as shown in the result:

com,nutsaboutmoney)/ 20230928151425 {"url": "https://www.nutsaboutmoney.com/", "mime": "text/html", "mime-detected": "text/html", "status": "200", "digest": "6U7WFSNZIRCXSORGFWW6JDQYGDDPC7XV", "length": "7914", "offset": "988155426", "filename": "crawl-data/CC-MAIN-2023-40/segments/1695233510412.43/warc/CC-MAIN-20230928130936-20230928160936-00336.warc.gz", "languages": "eng", "encoding": "UTF-8"}
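Each line of that index response is the URL key, a timestamp, and a JSON object. A minimal sketch of pulling the offset, length, and filename out of such a line in Python (using the line above, abbreviated to the fields needed here):

```python
import json

# One line of CDX index output, as shown above (reduced to a few fields)
cdx_line = ('com,nutsaboutmoney)/ 20230928151425 '
            '{"url": "https://www.nutsaboutmoney.com/", "status": "200", '
            '"length": "7914", "offset": "988155426", '
            '"filename": "crawl-data/CC-MAIN-2023-40/segments/1695233510412.43/'
            'warc/CC-MAIN-20230928130936-20230928160936-00336.warc.gz"}')

# Split off the first two space-separated fields; the remainder is JSON.
urlkey, timestamp, payload = cdx_line.split(' ', 2)
fields = json.loads(payload)
print(fields['offset'], fields['length'])
```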

You can then download the file CC-MAIN-20230928130936-20230928160936-00336.warc.gz, for example with:

wget -c -t 0 --retry-on-http-error=503 https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-40/segments/1695233510412.43/warc/CC-MAIN-20230928130936-20230928160936-00336.warc.gz

You can then extract the data in the .warc.gz file at the offset with this command:

warcio extract --payload CC-MAIN-20230928130936-20230928160936-00336.warc.gz 988155426 > nutsaboutmoney_from_warcio.html

Note: commands were run in a python virtual environment where warcio had been installed with: pip install warcio.