I can query all occurrences of a certain base URL within a given Common Crawl index, save them all to a file, and get a specific article (test_article_num) using the code below. However, I have not come across a way to extract the raw HTML for that article from the specific crawl data (filename in the output), even though I know the offset and length of the data I want. I feel like there should be a way to do this in Python (similar to this warcio command), maybe using requests and warcio, but I'm not sure. Any help is greatly appreciated.
EDIT:
I found exactly what I needed in this notebook:
import requests
import pathlib
import json
from pprint import pprint
news_website_base = 'hobbsnews.com'
URL = "https://index.commoncrawl.org/CC-MAIN-2022-05-index?url="+news_website_base+"/*&output=json"
website_output = requests.get(URL)
pathlib.Path('data.json').write_bytes(website_output.content)
news_articles = []
test_article_num=300
# The index response is JSON Lines: one JSON object per line
for line in open('data.json', 'r'):
    news_articles.append(json.loads(line))
pprint(news_articles[test_article_num])
news_URL=news_articles[test_article_num]['url']
news_warc_file=news_articles[test_article_num]['filename']
news_offset=news_articles[test_article_num]['offset']
news_length=news_articles[test_article_num]['length']
Code output:
{'digest': 'GY2UDG4G3V3S5TXDL3H7HE6VCSRBD3XR',
'filename': 'crawl-data/CC-MAIN-2022-05/segments/1642320303729.69/crawldiagnostics/CC-MAIN-20220122012907-20220122042907-00614.warc.gz',
'length': '40062',
'mime': 'text/html',
'mime-detected': 'text/html',
'offset': '21016412',
'status': '404',
'timestamp': '20220122015439',
'url': 'https://www.hobbsnews.com/2020/03/22/no-new-positive-covid-19-tests-in-lea-in-last-24-hours/%7B%7B%20data.link',
'urlkey': 'com,hobbsnews)/2020/03/22/no-new-positive-covid-19-tests-in-lea-in-last-24-hours/{{%20data.link'}
https://www.hobbsnews.com/2020/03/22/no-new-positive-covid-19-tests-in-lea-in-last-24-hours/%7B%7B%20data.link
crawl-data/CC-MAIN-2022-05/segments/1642320300343.4/crawldiagnostics/CC-MAIN-20220117061125-20220117091125-00631.warc.gz
21016412
40062
With the WARC URL and the WARC record's offset and length, it's simply:
Using curl and warcio CLI:
Or with Python requests and warcio (cf. here):
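The original snippet is not reproduced here, so the following is only a minimal sketch of how the requests and warcio approach might look, not necessarily the code the answer linked to. It assumes the WARC files are served under https://data.commoncrawl.org/ and that the server honours HTTP Range requests; the function names byte_range_header and fetch_warc_record_html are made up for illustration.

```python
import io


def byte_range_header(offset, length):
    """Build an HTTP Range header for one WARC record.

    The index returns offset and length as strings, so convert them first.
    Range end positions are inclusive, hence the -1.
    """
    start = int(offset)
    end = start + int(length) - 1
    return {"Range": f"bytes={start}-{end}"}


def fetch_warc_record_html(filename, offset, length,
                           base_url="https://data.commoncrawl.org/"):
    """Download a single record and return its HTTP payload as bytes.

    base_url is an assumption about where the crawl data is hosted.
    Third-party imports are kept local so the helper above has no dependencies.
    """
    import requests
    from warcio.archiveiterator import ArchiveIterator

    resp = requests.get(base_url + filename,
                        headers=byte_range_header(offset, length),
                        timeout=60)
    resp.raise_for_status()

    # Each record is its own gzip member, so the slice parses as a tiny WARC file.
    for record in ArchiveIterator(io.BytesIO(resp.content)):
        if record.rec_type == "response":
            return record.content_stream().read()
    return None
```

With the variables from the question, this could be called as `html = fetch_warc_record_html(news_warc_file, news_offset, news_length)`. The result is raw bytes, so it still needs decoding with the page's character encoding. Note that the sample record above has status 404, so its payload would be the site's error page rather than an article.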