Common Crawl data search all pages by keyword

1.3k Views Asked by Python 123 At 26 March 2021 at 04:26

I am wondering whether it is possible to look up a keyword using the Common Crawl API in Python and retrieve the pages that contain it. For example, if I look up "stack overflow", it should find the pages whose HTML contains the keyword "stack overflow". I have looked at the APIs, but I can only do a URL lookup, not a keyword lookup. Thank you in advance for any responses!


There is 1 solution below

NameKhan72 On 31 March 2021 at 11:07

If I were you, I would not use Common Crawl for this. Its index only supports lookups by URL, so to find pages containing a keyword you would have to iterate over the entire Common Crawl dataset yourself. That's 2.8 billion webpages!
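To illustrate why, here is a minimal sketch of what a Common Crawl index lookup looks like, assuming the public CDX-style index server at index.commoncrawl.org. The query is keyed by a URL pattern and there is no keyword parameter. The helper names are hypothetical, and the crawl identifier in the usage comment is only a placeholder for whichever crawl you want to query.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public Common Crawl index server.
INDEX_HOST = "https://index.commoncrawl.org"


def build_index_query(crawl_id, url_pattern):
    """Build an index query URL for one crawl; the lookup key is a URL pattern."""
    params = urllib.parse.urlencode({"url": url_pattern, "output": "json"})
    return f"{INDEX_HOST}/{crawl_id}-index?{params}"


def parse_index_response(text):
    """The index answers with one JSON object per line; parse each non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def lookup_url(crawl_id, url_pattern):
    """Fetch the index records for a URL pattern (needs network access)."""
    query = build_index_query(crawl_id, url_pattern)
    with urllib.request.urlopen(query, timeout=30) as response:
        return parse_index_response(response.read().decode("utf-8"))


# Usage (placeholder crawl id, not run here):
# records = lookup_url("CC-MAIN-2021-10", "stackoverflow.com/*")
```

Each record only tells you where a captured page lives inside the crawl archives. To test for a keyword you would still have to download and scan the page content itself.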

My suggested alternative is Microsoft's Bing Web Search API. It is an easy-to-use API with 1,000 free calls per month.

Searching through this API yields webpages that contain the queried keyword. From there, you can download the HTML source of each webpage and scan it in Python to find every occurrence of your keyword.
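The two steps above could be sketched roughly as follows. This is not a definitive implementation: the endpoint URL, the `Ocp-Apim-Subscription-Key` header and the `webPages.value[].url` response shape are what I understand the Bing Web Search v7 API to use, so check them against the current documentation. The function names and `YOUR_API_KEY` are placeholders.

```python
import json
import re
import urllib.parse
import urllib.request

# Assumption: Bing Web Search v7 endpoint; verify against the current docs.
BING_ENDPOINT = "https://api.bing.microsoft.com/v7.0/search"


def search_bing(query, api_key, count=10):
    """Return the result URLs for a query (needs network access and a valid key)."""
    params = urllib.parse.urlencode({"q": query, "count": count})
    request = urllib.request.Request(
        f"{BING_ENDPOINT}?{params}",
        headers={"Ocp-Apim-Subscription-Key": api_key},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        payload = json.load(response)
    # Assumed response shape: {"webPages": {"value": [{"url": ...}, ...]}}
    return [item["url"] for item in payload.get("webPages", {}).get("value", [])]


def fetch_html(url):
    """Download a page and decode it, falling back to UTF-8."""
    with urllib.request.urlopen(url, timeout=30) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def find_keyword(html, keyword):
    """Return the start offset of every case-insensitive occurrence of keyword."""
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    return [match.start() for match in pattern.finditer(html)]


# Usage (not run here):
# for url in search_bing('"stack overflow"', "YOUR_API_KEY"):
#     print(url, len(find_keyword(fetch_html(url), "stack overflow")))
```

Note that `find_keyword` scans the raw HTML, so it also matches the keyword inside tags, attributes and scripts. If you only care about visible text, run the page through an HTML parser first.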
