I have ~250 gzipped JSONL files (~150 MB each) sitting in an S3 bucket that I'm unnesting into a flattened table (Table 1). I'm then using CTAS to write the results into a table `WITH (format = 'PARQUET')` (Table 2) so that I'm not paying the unnesting cost at the query layer. My files in S3 are just subsets of a larger dataset; they aren't organized by date or anything similar.
Sample columns I'm importing:
- User ID (string)
- Department (string)
- Start Date (date)
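To make the setup concrete, here's a minimal sketch of the two tables. All names (tables, bucket paths) are placeholders, and I've elided the actual unnesting logic:

```sql
-- Table 1: external table over the gzipped JSONL (Athena handles
-- gzip transparently); unnesting/flattening elided for brevity.
CREATE EXTERNAL TABLE table1 (
  user_id    string,
  department string,
  start_date date
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-bucket/raw/';

-- Table 2: CTAS into Parquet so queries don't pay the unnest cost.
CREATE TABLE table2
WITH (
  format = 'PARQUET',
  external_location = 's3://my-bucket/parquet/'
) AS
SELECT user_id, department, start_date
FROM table1;
```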
Problem: I'm trying to partition the results so I can query them more efficiently, but my understanding is that I can't partition on columns I'm also loading into Table 1's column list; doing so raises the "column repeated in partitioning columns" error, which is exactly what I've been getting. During Table 1 creation I can declare partitions on unimported columns like Phone Number, Email, etc., but I don't think that helps me.
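For concreteness, this is the shape of the Table 1 DDL that triggers the error (placeholder names again); the partition column is repeated in the main column list:

```sql
-- Throws "column repeated in partitioning columns" because
-- department appears both in the column list and in PARTITIONED BY:
CREATE EXTERNAL TABLE table1 (
  user_id    string,
  department string,   -- <- repeated below
  start_date date
)
PARTITIONED BY (department string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-bucket/raw/';
```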
Questions:
- Am I confusing partitioning with indexing? Do I WANT to add partitions on my unimported columns so that my imported columns end up slightly more grouped together?
- Should I be bucketing by (`bucketed_by`) my imported columns? I tried that as well with no success; roughly what I attempted is sketched after this list.
- Is AWS Glue something I need here?
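Re the bucketing bullet, this is roughly the CTAS I attempted, using the `bucketed_by` / `bucket_count` CTAS properties (placeholder names; the actual column and bucket count may differ):

```sql
-- Roughly the bucketed CTAS I tried; it didn't give me the
-- query-side grouping I was hoping for:
CREATE TABLE table2
WITH (
  format = 'PARQUET',
  external_location = 's3://my-bucket/parquet/',
  bucketed_by = ARRAY['user_id'],
  bucket_count = 50
) AS
SELECT user_id, department, start_date
FROM table1;
```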
Thanks in advance for any help!