I am new to AWS and I'm following this tutorial to access the columnar dataset in Common Crawl. I executed this query:
SELECT COUNT(*) AS count,
       url_host_registered_domain
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2018-05'
AND subset = 'warc'
AND url_host_tld = 'no'
GROUP BY url_host_registered_domain
HAVING (COUNT(*) >= 100)
ORDER BY count DESC
And I keep getting this error:
Error opening Hive split s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2018-05/subset=warc/part-00082-248eba37-08f7-4a53-a4b4-d990640e4be4.c000.gz.parquet (offset=0, length=33554432): com.amazonaws.services.s3.model.AmazonS3Exception: Please reduce your request rate. (Service: Amazon S3; Status Code: 503; Error Code: SlowDown; Request ID: ZSRS4FD2ZTNJY9PV; S3 Extended Request ID: IvDfkWdbDYXjjOPhmXSQD3iVkBiE2Kl1/K3xaFc1JulOhCIcDbWUhnbww7juthZIUm2hZ9ICiwg=; Proxy: null), S3 Extended Request ID: IvDfkWdbDYXjjOPhmXSQD3iVkBiE2Kl1/K3xaFc1JulOhCIcDbWUhnbww7juthZIUm2hZ9ICiwg=
What's the reason? And how do I resolve it?
You are hitting the S3 request rate limit (HTTP 503 SlowDown) because your query is reading too many Parquet files in parallel. Consider compacting the underlying files into fewer, larger ones.
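Since the commoncrawl bucket is not yours, one way to do that is to copy just the slice you need into your own bucket with an Athena CTAS statement and run the aggregation against that copy. The sketch below is only illustrative: the table name ccindex_no_2018_05 and the location s3://your-bucket/ccindex-no-2018-05/ are placeholders, and it assumes your Athena workgroup can write to that bucket. The CTAS itself still reads the original files once (so it may need a retry if it gets throttled), but later queries only touch the compacted copy.

-- Write a compacted copy of the rows you need into your own bucket
-- (placeholder table name and S3 location)
CREATE TABLE ccindex_no_2018_05
WITH (
  format = 'PARQUET',
  external_location = 's3://your-bucket/ccindex-no-2018-05/'
) AS
SELECT url_host_registered_domain
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2018-05'
  AND subset = 'warc'
  AND url_host_tld = 'no';

The original aggregation can then run against the compacted table, without the partition filters:

-- Same aggregation, now against far fewer, larger Parquet files
SELECT COUNT(*) AS count,
       url_host_registered_domain
FROM ccindex_no_2018_05
GROUP BY url_host_registered_domain
HAVING COUNT(*) >= 100
ORDER BY count DESC;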