Amazon Glue - load to Redshift failures with decimal fields

64 Views Asked by At

I have a very simple Glue job loading data from S3 to Redshift, with a Transform in between to rename fields and change their types:

enter image description here

First execution runs (almost) without issues - data gets loaded into Redshift. All following executions fail. The reason is, Glue properly creates the Redshift table (first load) but incorrectly handles it when it already exists.

This happens for all fields converted to decimal (did not test all other types though).

CSV file:

Text value,Average whatever,Another string,Just a number
A1,2.2,test,5
A2,5,test2,7

Transform (change schema):

enter image description here

Generated code (I did not edit the code, it's still a "Visual" job):

...
# Script generated for node Amazon S3
AmazonS3_node1710618800725 = glueContext.create_dynamic_frame.from_options(format_options={"quoteChar": "\"", "withHeader": True, "separator": ","}, connection_type="s3", format="csv", connection_options={"paths": ["s3://<source-s3-bucket>/test/gonna_fail/data.csv"]}, transformation_ctx="AmazonS3_node1710618800725")

# Script generated for node Change Schema
ChangeSchema_node1710691042153 = ApplyMapping.apply(frame=AmazonS3_node1710618800725, mappings=[("Text value", "string", "text_value", "string"), ("Average whatever", "string", "average_whatever", "decimal"), ("Another string", "string", "another_string", "string"), ("Just a number", "string", "just_a_number", "decimal")], transformation_ctx="ChangeSchema_node1710691042153")

# Script generated for node Amazon Redshift
AmazonRedshift_node1710618808047 = glueContext.write_dynamic_frame.from_options(frame=ChangeSchema_node1710691042153, connection_type="redshift", connection_options={"redshiftTmpDir": "s3://aws-glue-assets-xxx-eu-central-1/temporary/", "useConnectionProperties": "true", "dbtable": "raw_data.gonna_fail", "connectionName": "serverless-redshift", "preactions": "DROP TABLE IF EXISTS raw_data.gonna_fail; CREATE TABLE IF NOT EXISTS raw_data.gonna_fail (text_value VARCHAR, average_whatever DECIMAL, another_string VARCHAR, just_a_number DECIMAL);"}, transformation_ctx="AmazonRedshift_node1710618808047")
  1. First execution (all queries come from Redshift's log):
7:26:26 PM  CREATE TABLE IF NOT EXISTS "raw_data"."gonna_fail" ("text_value" VARCHAR(MAX), "average_whatever" DECIMAL(10,2), "another_string" VARCHAR(MAX), "just_a_number" DECIMAL(10,2)) DISTSTYLE EVEN                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     

Table gets properly created. After that there is an error - "Expected command status BEGIN, got CREATE TABLE" - (how to avoid it?) but the job retries successfully after 30 seconds:

7:26:56 PM  CREATE TABLE IF NOT EXISTS "raw_data"."gonna_fail" ("text_value" VARCHAR(MAX), "average_whatever" DECIMAL(10,2), "another_string" VARCHAR(MAX), "just_a_number" DECIMAL(10,2)) DISTSTYLE EVEN 
7:26:56 PM  DROP TABLE IF EXISTS raw_data.gonna_fail 
7:26:56 PM   CREATE TABLE IF NOT EXISTS raw_data.gonna_fail (text_value VARCHAR, average_whatever DECIMAL, another_string VARCHAR, just_a_number DECIMAL) 
7:26:56 PM  COPY "raw_data"."gonna_fail" ("text_value","average_whatever","another_string","just_a_number") FROM 's3://aws-glue-assets-xxx-eu-central-1/temporary/63e30430-67f0-4ab2-b539-22180ae2920b/manifest.json' FORMAT AS CSV NULL AS '@NULL@' manifest CREDENTIALS '' 
  1. Second execution:

For each decimal field a new one is created, with a name being concatenated name and data type:

7:29:19 PM  ALTER TABLE raw_data.gonna_fail add "average_whatever_decimal(10,2)" DECIMAL(10,2) default NULL; 
7:29:19 PM   ALTER TABLE raw_data.gonna_fail add "just_a_number_decimal(10,2)" DECIMAL(10,2) default NULL; 

This load also fails (did not check why) and retries after 30 seconds:

Create table gets executed (not sure why create statement is executed two times, "automatically", and using preactions):

7:29:54 PM  CREATE TABLE IF NOT EXISTS "raw_data"."gonna_fail" ("text_value" VARCHAR(MAX), "average_whatever" DECIMAL(10,2), "another_string" VARCHAR(MAX), "just_a_number" DECIMAL(10,2), "just_a_number_decimal(10,2)" DECIMAL(10,2), "average_whatever_decimal(10,2)" DECIMAL(10,2)) DISTSTYLE EVEN 

Preactions:

7:29:54 PM  DROP TABLE IF EXISTS raw_data.gonna_fail 
7:29:54 PM   CREATE TABLE IF NOT EXISTS raw_data.gonna_fail (text_value VARCHAR, average_whatever DECIMAL, another_string VARCHAR, just_a_number DECIMAL) 

Incorrect copy statement:

7:29:54 PM COPY "raw_data"."gonna_fail" ("text_value","average_whatever","another_string","just_a_number","just_a_number_decimal(10,2)","average_whatever_decimal(10,2)") FROM 's3://aws-glue-assets-xxx-eu-central-1/temporary/d46ca4ae-86cc-4444-addd-6c54c376a2a1/manifest.json' FORMAT AS CSV NULL AS '@NULL@' manifest CREDENTIALS ''

This operations fails, Spark retries 3 times and fails the load. Error visible in Glue:

Caused by: com.amazon.redshift.util.RedshiftException: ERROR: column "just_a_number_decimal(10,2)" of relation "gonna_fail" does not exist

I did not find those additional/incorrect fields in the frame's .schema().fields.

1

There are 1 best solutions below

0
Bogdan On

I dont think there is much you can do from the visual tool. Clone the job and update the script as following :

AmazonRedshift_node1710618808047 = glueContext.write_dynamic_frame.from_options(frame=ChangeSchema_node1710691042153, 
connection_type="redshift", 
connection_options={"redshiftTmpDir": "s3://aws-glue-assets-xxx-eu-central-1/temporary/", 
"dbtable": "raw_data.temp_gonna_fail", 
"connectionName": "serverless-redshift", 
"preactions": "DROP TABLE IF EXISTS raw_data.gonna_fail; CREATE TABLE IF NOT EXISTS raw_data.gonna_fail (text_value VARCHAR, average_whatever DECIMAL, another_string VARCHAR, just_a_number DECIMAL);",
"postactions": "BEGIN; INSERT INTO raw_data.gonna_fail SELECT  * from raw_data.temp_gonna_fail; drop table if exists raw_data.temp_gonna_fail; END;"
}, 
transformation_ctx="AmazonRedshift_node1710618808047"), 

Added "postactions" to the connection_options and removed "useConnectionProperties": "true" (if the script fails because of this, add it again)