How to understand partitions vs indexing in AWS Athena?

40 Views Asked by At

I have ~250 gzipped JSONL files (150mb each) sitting in an S3 bucket that i'm unnesting into a flattened table (Table 1). I'm then using CTAS to write the results into a table WITH format=Parquet (Table 2) so that I'm not unnesting at the query layer. My files in S3 are just subsets of a larger file, they aren't organized by date or anything similar.

Sample cols I'm importing :

  • User ID (string)
  • Department (string)
  • Start Date (date)

Problem: I'm trying to partition the results so I can better query them, but my understanding of partitioning is that I can't partition columns that I'm loading into Table 1, because I'll get the "column repeated in partitioning columns" error, which I've been getting. During Table 1 creation, I can add partitions to unimported columns like Phone Number, Email, etc, but I don't think that helps me.

Questions:

  • Am I confusing partitioning for indexing? Do I WANT to add partitions for my unimported columns so that my imported columns are slightly more grouped together?
  • Should I be BUCKETING BY my imported columns? Tried that as well and had no success
  • Is AWS Glue something I need here?

Thanks in advance for any help!

0

There are 0 best solutions below