I have to validate fixed-width files that I am reading from S3 into Glue. I know the length of each column, and I have to write a Glue job to validate these files.
How do I efficiently check the lengths of every row to filter out the records which don't have the correct total_length?
What is the best way to read such files?
I tried reading the file as CSV into a single column, col0, in a DynamicFrame and then filtering on length with FILTER, but this gives me a dictionary:
bad_length_DF = dynamicFramerawtxt.filter(lambda x: len(x['col0']) != total_row_len)
How do I remove the records that have the wrong length from my DynamicFrame and put them into a separate error DynamicFrame?
My general recommendation is to use a Spark DataFrame instead of a Glue DynamicFrame unless you need the built-in Glue transformations (doc) or Glue job bookmarks (doc).
Below is a complete PySpark script for your scenario.
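What follows is a minimal sketch of that approach, not a definitive implementation. It assumes a hypothetical three-column layout (COLUMNS) and two illustrative helper names (column_slices, split_by_length); replace them with your real column spec. It reads every line as a single string, separates rows whose length does not match the expected total, and cuts the valid rows into columns with substring.

```python
from itertools import accumulate

# Hypothetical layout for illustration: (column name, width in characters).
COLUMNS = [("id", 5), ("name", 10), ("amount", 8)]
TOTAL_ROW_LEN = sum(width for _, width in COLUMNS)


def column_slices(columns):
    """Return (name, 1-based start, width) triples suitable for substring()."""
    widths = [width for _, width in columns]
    # Each column starts one character after the running total of the
    # widths that precede it; the first column starts at position 1.
    starts = [1] + [1 + end for end in accumulate(widths)][:-1]
    return [(name, start, width)
            for (name, width), start in zip(columns, starts)]


def split_by_length(spark, path, columns=COLUMNS):
    """Read a fixed-width file and return (parsed_good_rows, bad_rows)."""
    # Imported here so the helper above can be used without pyspark installed.
    from pyspark.sql import functions as F

    total_len = sum(width for _, width in columns)

    # spark.read.text puts each line into one string column named "value".
    raw = spark.read.text(path).withColumn("row_len", F.length("value"))

    # Rows with an unexpected length go to the error DataFrame untouched.
    bad = raw.filter(F.col("row_len") != total_len)
    good = raw.filter(F.col("row_len") == total_len)

    # Cut each valid row into its columns. Padding is kept as-is here;
    # wrap the expression in F.trim(...) if you want it stripped.
    parsed = good.select(
        *[F.substring("value", start, width).alias(name)
          for name, start, width in column_slices(columns)]
    )
    return parsed, bad
```

In a Glue job you would typically pass in glueContext.spark_session as spark and an S3 path as path. If a later step needs DynamicFrames (for example, to write through a Glue sink), either result can be converted back with DynamicFrame.fromDF(df, glueContext, "some_name").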