I have ~250 gzipped JSONL files (~150 MB each) sitting in an S3 bucket that I'm unnesting into a flattened table (Table 1). I'm then using CTAS to write the results into a table `WITH (format = 'PARQUET')` (Table 2) so that I'm not paying the unnesting cost at the query layer. My files in S3 are just subsets of a larger dataset; they aren't organized by date or anything similar.
Sample columns I'm importing:
- User ID (string)
- Department (string)
- Start Date (date)
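To make the setup concrete, here's a minimal sketch of the two tables. All names (tables, bucket paths) are placeholders, and I've elided the actual unnesting logic:

```sql
-- Table 1: external table over the gzipped JSONL (Athena handles
-- gzip transparently); unnesting/flattening elided for brevity.
CREATE EXTERNAL TABLE table1 (
  user_id    string,
  department string,
  start_date date
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-bucket/raw/';

-- Table 2: CTAS into Parquet so queries don't pay the unnest cost.
CREATE TABLE table2
WITH (
  format = 'PARQUET',
  external_location = 's3://my-bucket/parquet/'
) AS
SELECT user_id, department, start_date
FROM table1;
```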
Problem: I'm trying to partition the results so I can query them more efficiently, but my understanding is that I can't partition on columns I'm also loading into Table 1's column list; doing so raises the "column repeated in partitioning columns" error, which is exactly what I've been getting. During Table 1 creation I can declare partitions on unimported columns like Phone Number, Email, etc., but I don't think that helps me.
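For concreteness, this is the shape of the Table 1 DDL that triggers the error (placeholder names again); the partition column is repeated in the main column list:

```sql
-- Throws "column repeated in partitioning columns" because
-- department appears both in the column list and in PARTITIONED BY:
CREATE EXTERNAL TABLE table1 (
  user_id    string,
  department string,   -- <- repeated below
  start_date date
)
PARTITIONED BY (department string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-bucket/raw/';
```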
Questions:
- Am I confusing partitioning with indexing? Do I WANT to add partitions on my unimported columns so that my imported columns end up slightly more grouped together?
- Should I be bucketing by (`bucketed_by`) my imported columns? I tried that as well with no success; roughly what I attempted is sketched after this list.
- Is AWS Glue something I need here?
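Re the bucketing bullet, this is roughly the CTAS I attempted, using the `bucketed_by` / `bucket_count` CTAS properties (placeholder names; the actual column and bucket count may differ):

```sql
-- Roughly the bucketed CTAS I tried; it didn't give me the
-- query-side grouping I was hoping for:
CREATE TABLE table2
WITH (
  format = 'PARQUET',
  external_location = 's3://my-bucket/parquet/',
  bucketed_by = ARRAY['user_id'],
  bucket_count = 50
) AS
SELECT user_id, department, start_date
FROM table1;
```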
Thanks in advance for any help!