Querying HTML Content in Common Crawl Dataset Using Amazon Athena

321 Views Asked by Cauder At 06 October 2023 at 01:22

I am currently exploring the massive Common Crawl dataset hosted on Amazon S3 and am attempting to use Amazon Athena to query it. My objective is to search within the HTML content of the web pages to identify those that contain specific strings within their tags. Essentially, I want to select the websites whose HTML content matches particular criteria.

I am aware that Athena is capable of querying large datasets on S3 using standard SQL. However, I am not entirely sure about the feasibility and the approach to directly query inside the HTML content of the web pages in the Common Crawl dataset.

Here's a simplified version of what I am looking to achieve:

sql

SELECT * 
FROM "common_crawl_dataset" 
WHERE html_content LIKE '%specific-string%';

Is it possible to directly query the HTML content of the web pages in the Common Crawl dataset using Athena? If yes, what would be the best approach to accomplish this, considering efficiency and cost-effectiveness? Are there any limitations or challenges that I should be aware of?


There is 1 best solution below

RonC On 13 October 2023 at 12:33

I was recently researching how to search Common Crawl page data for specific phrases. Unfortunately, I don't have a direct answer to your question, but I have a bit to share that you may find useful.

The closest I came to finding an example on the web for searching Common Crawl page data was this reference, written by Ilya Kreymer, who used to work at the Internet Archive and led the Wayback Machine development. Apparently, he created an index of the page data and exposed it as an API, or more specifically as an HTTP GET URL endpoint.

In the article he mentions:

For example, the following query looks up “wikipedia.org” in the CC-MAIN-2015-11 (Feb 2015) crawl: https://index.commoncrawl.org/CC-MAIN-2015-11-index?url=wikipedia.org

More importantly, the code used to serve this endpoint is open source and available at https://github.com/webrecorder/pywb, so it may offer useful clues.
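To make the endpoint above concrete, here is a minimal Python sketch of the same lookup. It is only an illustration under a few assumptions: the helper names (`build_index_url`, `parse_index_lines`, `lookup`) are made up for this example, and the `output=json` parameter (one JSON object per line) is assumed from the pywb CDX server API rather than taken from the article. As far as I can tell, the records describe where a capture is stored (URL, timestamp, archive file and offset), not the page HTML itself, so this does not search page contents.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL taken from the example query quoted above.
INDEX_BASE = "https://index.commoncrawl.org"


def build_index_url(crawl_id, url_pattern):
    """Build the index lookup URL for one crawl (hypothetical helper).

    'output=json' is an assumption based on the pywb CDX server API.
    """
    query = urlencode({"url": url_pattern, "output": "json"})
    return f"{INDEX_BASE}/{crawl_id}-index?{query}"


def parse_index_lines(text):
    """Parse a response body that holds one JSON object per line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def lookup(crawl_id, url_pattern, timeout=30):
    """Fetch and parse index records; performs a network request."""
    with urlopen(build_index_url(crawl_id, url_pattern), timeout=timeout) as resp:
        return parse_index_lines(resp.read().decode("utf-8"))


# Example (not run here because it needs network access):
# records = lookup("CC-MAIN-2015-11", "wikipedia.org")
# for record in records[:5]:
#     print(record.get("timestamp"), record.get("url"))
```

The URL building and parsing are kept separate from the network call so they can be checked offline; the explicit timeout is there because the endpoint can be slow to respond.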

Sadly, in my own experiments, the API responded so slowly that it was unusable for my intended purpose.

It's very possible that you already know all of this but I thought it was at least worth mentioning in case it's new info to you.

I hope you receive other answers that are much better than mine, because I too would like to find a performant way to search the contents of the Common Crawl data.
