I am currently exploring the massive Common Crawl dataset hosted on Amazon S3 and am attempting to use Amazon Athena to query it. My objective is to search within the HTML content of the web pages to identify those that contain specific strings within their tags. Essentially, I am looking to filter for websites whose HTML content matches particular criteria.
I am aware that Athena is capable of querying large datasets on S3 using standard SQL. However, I am not entirely sure about the feasibility and the approach to directly query inside the HTML content of the web pages in the Common Crawl dataset.
Here's a simplified version of what I am looking to achieve:
```sql
SELECT *
FROM "common_crawl_dataset"
WHERE html_content LIKE '%specific-string%';
```
Is it possible to directly query the HTML content of the web pages in the Common Crawl dataset using Athena? If yes, what would be the best approach to accomplish this, considering efficiency and cost-effectiveness? Are there any limitations or challenges that I should be aware of?
I was recently researching how to search Common Crawl page data for specific phrases. Unfortunately, I don't have a direct answer to your question, but I have a bit to share that you may find useful.
The closest I came to finding an example on the web for searching Common Crawl page data was this reference, written by Ilya Kreymer, who used to work at the Internet Archive and led the Wayback Machine development. Apparently, he created an index of the page data and exposed it as an API, or more specifically as an HTTP GET URL endpoint.
In the article he mentions:
More importantly, the code that is used to service this endpoint is open source and available at https://github.com/webrecorder/pywb so it may offer clues that are useful.
Sadly, in my own experiments, the API responded so slowly as to be unusable for my intended use.
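In case it helps anyone trying the same thing, here is a minimal sketch of how a query against that kind of HTTP GET index endpoint can be put together. It assumes the public Common Crawl index server at `https://index.commoncrawl.org` and its usual JSON fields (`filename`, `offset`, `length`); neither the host nor the crawl ID is named in the article text I quoted, so treat both as assumptions. Note that, as far as I can tell, the index only covers URLs and capture metadata, not the page contents, so searching the HTML itself still means fetching each WARC record by byte range and scanning it yourself.

```python
import json
from urllib.parse import urlencode

# Assumption: the public Common Crawl index server (not named in the post above).
INDEX_HOST = "https://index.commoncrawl.org"


def build_query_url(crawl_id, url_pattern, limit=5):
    """Build the GET URL for one crawl's index; crawl_id is a placeholder you supply."""
    params = urlencode({"url": url_pattern, "output": "json", "limit": limit})
    return f"{INDEX_HOST}/{crawl_id}-index?{params}"


def parse_records(body):
    """The endpoint returns one JSON object per line; skip blank lines."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


def byte_range(record):
    """HTTP Range header value covering one WARC record in its archive file."""
    start = int(record["offset"])
    end = start + int(record["length"]) - 1
    return f"bytes={start}-{end}"
```

To actually run a lookup, you would pass the result of `build_query_url(...)` to `urllib.request.urlopen`, feed the decoded response to `parse_records`, and then request each record's `filename` with the `Range` header from `byte_range`. The record comes back gzip-compressed, and only after decompressing it can you search the HTML for your string, which is a large part of why this approach is slow at scale.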
It's very possible that you already know all of this but I thought it was at least worth mentioning in case it's new info to you.
I hope you receive some other answers that are much better than mine, because I too would like to find a performant way to search the contents of the Common Crawl data.