I'm trying to convert a large number of Avro files to JSON. No compression or repartitioning is needed; a plain 1-to-1 conversion would work. I did this before on one batch of files and it worked fine, but on a different batch I'm getting "An error occurred while calling o100.pyWriteDynamicFrame. Invalid sync!". I'm using the standard code from the AWS docs, so I'm not sure what's causing this; I suspect it's something about reading the Avro or writing the JSON. Any help appreciated.
data_source_frame = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": [S3_inputpath]
    },
    format="avro",
)

data_destination_frame = glueContext.write_dynamic_frame.from_options(
    frame=data_source_frame,
    connection_type="s3",
    connection_options={"path": S3_outputpath},
    format="json",
)
Here's the error I'm getting:
py4j.protocol.Py4JJavaError: An error occurred while calling o100.pyWriteDynamicFrame.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 52 in stage 0.0 failed 4 times, most recent failure: Lost task 52.3 in stage 0.0 (TID 73) (172.35.94.104 executor 5): org.apache.avro.AvroRuntimeException: java.io.IOException: Invalid sync!
I read in the AWS docs that partitioning and grouping don't work with the Avro format in Glue, so I'm not using those. I've also tried changing the format to Parquet and CSV, and I checked the input directory for null and empty files and removed them, but I'm still getting the same error.
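
In case it's useful, here's a rough sketch of how I'd try to narrow it down to a specific bad file next. The bucket and prefix names are placeholders, it assumes boto3 and fastavro are available, and it isn't part of the Glue job itself; it just tries to read every Avro object under the prefix and reports the ones that fail:

import io

import boto3
import fastavro

BUCKET = "my-input-bucket"   # placeholder
PREFIX = "avro/batch-2/"     # placeholder

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

bad_keys = []
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if not key.endswith(".avro"):
            continue
        body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
        try:
            # fastavro.reader walks every block in the file, so a truncated
            # file or a corrupted sync marker should raise an exception here.
            for _ in fastavro.reader(io.BytesIO(body)):
                pass
        except Exception as exc:
            bad_keys.append((key, str(exc)))

print("files that failed to read:", bad_keys)

Is this kind of per-file check the right way to track down an "Invalid sync!" error, or is there something in the Glue read/write options I should be setting instead?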