Context: I'm using Kinesis to stream data from my Lambda into an S3 bucket according to a Glue schema. Then I run a crawler on my S3 bucket to catalog my data. When written to the Kinesis Firehose, my data has the following attributes: 'dataset_datetime, attr1, attr2, attr3, attr4...'. I do not define any partitions in the data written from Lambda, in my Kinesis Firehose, or in my Glue catalog. However, when data is stored inside my S3 bucket, it's stored in the following dir structure:
`year/month/day/hour/dataFile.parquet`
Then, when I run my crawler over it, the crawler creates 4 additional partition keys that map to year, month, day, and hour. I don't want these attributes to be created...
Question: Why does glue crawler create these additional attributes and how can I prevent it from creating them? Or, how can I prevent kinesis from creating the above dir structure inside S3 and instead just dump the file with some timestamp?
To clarify, Kinesis Firehose is partitioning the data as it writes it to S3. The default behavior is to partition the data by `year`, `month`, `day`, and `hour`.

Glue Crawler creates partitions (or tables) based on the schema of the data being crawled. If schemas for files in the include path are similar, then the crawler will create a single table with partitions for each subfolder from the include path down to the file.
Example: If the include path is `s3://<bucket>/prefix/` and `file1.parquet` and `file2.parquet` have a similar schema, then the crawler will create 1 table with 4 partition columns (1 column for the `2022` subfolder, 1 column for the `07` subfolder, etc.).

You can't directly prevent the crawler from creating partitions. You can manipulate the include path to go deeper into the subfolder directory (e.g. set the include path to `s3://<bucket>/prefix/2022/07/27/08`), which will prevent partitions from being created, depending on how deep the include path is. However, this is probably not what you want to do, since it will result in multiple tables being created.

Reference: How does a crawler determine when to create partitions? (AWS)
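To make the include-path reasoning concrete, here is a small sketch (plain Python, no AWS calls; the bucket and key names are made up) of how the number of inferred partition columns follows from the depth between the include path and the data files:

```python
def inferred_partition_count(include_path: str, file_key: str) -> int:
    """Number of partition columns a Glue crawler would infer:
    one per subfolder between the include path and the data file."""
    assert file_key.startswith(include_path)
    relative = file_key[len(include_path):].strip("/")
    parts = relative.split("/")
    return len(parts) - 1  # the last component is the file itself

# Default include path: the four date subfolders become 4 partition columns.
print(inferred_partition_count(
    "s3://bucket/prefix/",
    "s3://bucket/prefix/2022/07/27/08/file1.parquet"))  # 4

# Include path pointed at a single hour folder: nothing left to partition on.
print(inferred_partition_count(
    "s3://bucket/prefix/2022/07/27/08",
    "s3://bucket/prefix/2022/07/27/08/file1.parquet"))  # 0
```

The second case is the "deep include path" workaround described above: no partitions, but you would need one crawler target (and end up with one table) per hour folder.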
You may be able to achieve what you want with Dynamic Partitioning. Dynamic partitioning allows you to override the default `year/month/day/hour` partitioning. If your schema has some static-value field, you could theoretically configure Firehose to partition the data based on that field and then configure the Glue Crawler include path to include that partition subfolder.

Example: Firehose is configured to dynamically partition data based on the `static_field` schema field (`static_field` always has the same value). If the Glue Crawler include path is set to `s3://<bucket>/static_field=value/`, then a single table will be created with only columns from the schema (no partitions).

Reference: Dynamic Partitioning in Kinesis Data Firehose (AWS)
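As a sketch of what that Firehose setup might look like, here is the relevant fragment of an `ExtendedS3DestinationConfiguration` (the shape passed to e.g. boto3's `create_delivery_stream`); the bucket ARN, the `static_field` name, and the JQ query are assumptions for illustration:

```python
# Sketch of a Firehose dynamic-partitioning destination config.
# Bucket ARN and field name are hypothetical; adapt to your stream.
extended_s3_config = {
    "BucketARN": "arn:aws:s3:::my-bucket",  # hypothetical bucket
    # Namespace expression pulls the partition key extracted below.
    "Prefix": "static_field=!{partitionKeyFromQuery:static_field}/",
    "ErrorOutputPrefix": "errors/",
    "DynamicPartitioningConfiguration": {"Enabled": True},
    "ProcessingConfiguration": {
        "Enabled": True,
        "Processors": [
            {
                # Extract static_field from each JSON record with JQ.
                "Type": "MetadataExtraction",
                "Parameters": [
                    {"ParameterName": "MetadataExtractionQuery",
                     "ParameterValue": "{static_field: .static_field}"},
                    {"ParameterName": "JsonParsingEngine",
                     "ParameterValue": "JQ-1.6"},
                ],
            }
        ],
    },
}
```

With this prefix, objects land under `static_field=<value>/` instead of the default date folders, so a crawler include path of `s3://my-bucket/static_field=value/` sees no subfolders to turn into partition columns.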
Suggestion: There are a few different ways to manipulate the data/partitioning. My suggestion is not to go against the default behavior of Firehose and the Glue Crawler. Instead, consider how the partitioning implementation can be abstracted away from the clients/consumers of this data. For example, create a materialized view that excludes the partition columns.
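For instance, assuming consumers query the cataloged table through Athena (the table and view names below are hypothetical), a plain view can expose only the original schema columns and hide the crawler-added partition columns:

```python
# Build the DDL for a view that hides the crawler-added partition columns.
# Table/view names are placeholders; run the SQL via the Athena console,
# boto3's start_query_execution, or your query tool of choice.
schema_columns = ["dataset_datetime", "attr1", "attr2", "attr3", "attr4"]

view_sql = (
    "CREATE OR REPLACE VIEW streamed_data_view AS\n"
    f"SELECT {', '.join(schema_columns)}\n"
    "FROM streamed_data_table"
)
print(view_sql)
```

Consumers then query `streamed_data_view` and never see `year`/`month`/`day`/`hour`, while queries that do want partition pruning can still hit the underlying table directly.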