Why is an AWS Glue job creating partition file names that include 'unnamed'?


We are using an AWS Glue job to load and de-dupe data. We are changing it so that it no longer relies on the crawler to determine schema metadata; we now define the schema explicitly.

As a result, we are using method 2 from the AWS documentation linked here, with the code shown below:

https://docs.aws.amazon.com/glue/latest/dg/update-from-job.html

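# tgt_path, partition_key, tgt_db, tgt_table and last_transform are set earlier in the job.
sink = glueContext.getSink(connection_type="s3",
                           path=tgt_path,
                           enableUpdateCatalog=True,  # update the Data Catalog from the job
                           partitionKeys=partition_key)
sink.setFormat("glueparquet")
sink.setCatalogInfo(catalogDatabase=tgt_db, catalogTableName=tgt_table)
sink.writeFrame(last_transform)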

We use this code in two separate jobs. The first job writes partition files with the following naming convention:

  • run-timestamp-part-block-0-0-r-someNumber-snappy.parquet

Example: run-1659269628417-part-block-0-0-r-00001-snappy.parquet

However, the second job is writing the files with the following naming convention:

  • run-unnamed-36-part-block-0-0-r-someNumber-snappy.parquet

Example: run-unnamed-36-part-block-0-0-r-00001.snappy.parquet
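To check which files under a prefix follow which convention, a small helper like the one below may be useful. It is only a sketch: the pattern is inferred from the two example names above and is not a documented Glue naming format, and the function name classify is hypothetical.

```python
import re

# Pattern inferred from the two example file names in this question only;
# it is an assumption, not a documented AWS Glue naming format.
RUN_FILE = re.compile(
    r"^run-(?P<run_id>\d+|unnamed-\d+)"
    r"-part-block-\d+-\d+-r-(?P<part>\d+)[-.]snappy\.parquet$"
)


def classify(filename):
    """Return (kind, run_id, part) for a matching name, else None.

    kind is "timestamp" for names like run-1659269628417-...
    and "unnamed" for names like run-unnamed-36-...
    """
    match = RUN_FILE.match(filename)
    if match is None:
        return None
    run_id = match.group("run_id")
    kind = "unnamed" if run_id.startswith("unnamed-") else "timestamp"
    return kind, run_id, match.group("part")


if __name__ == "__main__":
    for name in (
        "run-1659269628417-part-block-0-0-r-00001-snappy.parquet",
        "run-unnamed-36-part-block-0-0-r-00001.snappy.parquet",
    ):
        print(name, "->", classify(name))
```

This only reports which convention a file name follows; it does not explain why the second job produces the unnamed form.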

Does anyone know why "unnamed" is applied to the file name instead of a timestamp? I have searched the AWS documentation without finding an answer. The question linked below indicates that the output file name cannot be specified at write time; it can only be changed afterwards.

Note: the data in the unnamed file appears to be accurate.

AWS Glue Job Output File Name

Any help is appreciated.
