Common Crawl data search all pages by keyword


I am wondering whether it is possible to look up a keyword using the Common Crawl API in Python and retrieve the pages that contain it. For example, if I look up "stack overflow", it would find the pages whose HTML contains the keyword "stack overflow". I have looked at the APIs, but I can only do a URL lookup, not a keyword lookup. Thank you in advance for any responses!

NameKhan72 On

If I were you, I would not use Common Crawl for this. Its index only supports lookups by URL, so to search by keyword you would have to iterate over the entire Common Crawl dataset. That's 2.8 billion web pages!

My suggested alternative is Microsoft's Bing Web Search API. It is easy to use and includes 1000 free calls per month.
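As a rough sketch of what such a call could look like, using only the standard library: the endpoint URL, the `Ocp-Apim-Subscription-Key` header and the `webPages.value[].url` response shape are what I understand the v7 Web Search API to use, so treat them as assumptions and check them against your own subscription. The function names `build_search_request`, `extract_urls` and `search` are made up for this example.

```python
import json
import urllib.parse
import urllib.request

# Assumption: v7 endpoint of the Bing Web Search API; verify for your account.
BING_ENDPOINT = "https://api.bing.microsoft.com/v7.0/search"


def build_search_request(keyword, api_key, count=10):
    """Build (but do not send) a search request for `keyword`."""
    # Wrapping the keyword in double quotes asks for an exact phrase match.
    query = urllib.parse.urlencode({"q": f'"{keyword}"', "count": count})
    return urllib.request.Request(
        f"{BING_ENDPOINT}?{query}",
        # Assumption: the subscription key is passed in this header.
        headers={"Ocp-Apim-Subscription-Key": api_key},
    )


def extract_urls(response_json):
    """Pull the result URLs out of a decoded JSON search response."""
    pages = response_json.get("webPages", {}).get("value", [])
    return [page["url"] for page in pages]


def search(keyword, api_key, count=10):
    """Send the request and return the URLs of the matching pages."""
    request = build_search_request(keyword, api_key, count)
    with urllib.request.urlopen(request, timeout=10) as response:
        return extract_urls(json.load(response))
```

With a valid key, `search("stack overflow", api_key)` should give you a list of URLs to work with.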

A search through this API returns web pages that contain the queried keyword. From there, you can download the HTML source of each page and scan it in Python to find every occurrence of your keyword.
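The download-and-scan step could look something like the sketch below. It uses only the standard library, and the helper names (`fetch_html`, `html_to_text`, `find_keyword`) are my own. I am assuming you only care about visible text, so the contents of `<script>` and `<style>` are skipped; drop that part if you want to match against the raw HTML instead.

```python
import re
import urllib.request
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the page text with runs of whitespace collapsed to one space."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def find_keyword(html, keyword, context=40):
    """Return a snippet of surrounding text for every case-insensitive match."""
    text = html_to_text(html)
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    snippets = []
    for match in pattern.finditer(text):
        start = max(match.start() - context, 0)
        snippets.append(text[start:match.end() + context])
    return snippets


def fetch_html(url):
    """Download a page and decode it with its declared charset (UTF-8 fallback)."""
    with urllib.request.urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

You would then call `find_keyword(fetch_html(url), "stack overflow")` for each URL returned by the search and keep the pages where the list is not empty.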