AWS Glue grokLog pattern for variable column layouts/schemas


I am reading a fixed-width file from S3 in Glue using create_dynamic_frame.from_options (not using the Data Catalog):

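# glueContext and s3_path are assumed to be defined earlier in the job script.
dynamicFrame = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": [s3_path]},
    format="GrokLog",
    format_options={
        # One Grok field per fixed-width column; the remainder goes to "extras".
        "logFormat": "%{GETEMPID:EMPID}%{GETNAME:Name}%{GETDOB:DOB}%{GETCOMPANY:Company}%{GREEDYDATA:extras}",
        # Column widths: EMPID 5, Name 8, DOB 5, Company 7 characters.
        "customPatterns": "GETEMPID ([^*]{5})\nGETNAME ([^*]{8})\nGETDOB ([^*]{5})\nGETCOMPANY ([^*]{7})"
    },
)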

I am using this only to parse the fixed-width file and split it into columns. Anything after the last defined column is captured by %{GREEDYDATA:extras}.
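To make the intended split concrete, here is a minimal local sketch that uses Python's re module as a stand-in for the Grok pattern. It is not Glue code: the regex simply reuses the same column widths and field names, and the sample record and the split_record helper are hypothetical.

```python
import re

# Plain-regex stand-in for the Grok pattern: same widths, same field names.
FIXED_WIDTH = re.compile(
    r"(?P<EMPID>[^*]{5})"
    r"(?P<Name>[^*]{8})"
    r"(?P<DOB>[^*]{5})"
    r"(?P<Company>[^*]{7})"
    r"(?P<extras>.*)"
)


def split_record(line):
    """Return the fields as a dict, or None when the line does not match."""
    match = FIXED_WIDTH.match(line)
    return match.groupdict() if match else None


# Hypothetical sample record: 5 + 8 + 5 + 7 characters, then trailing data.
sample = "E0001" + "John".ljust(8) + "01Jan" + "ACME".ljust(7) + "XYZ"
print(split_record(sample))
```

Each field keeps its padding, so the values would still need trimming after the split.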

The problem is that the same fixed-width file can arrive with an extra column or, more importantly, with two fewer columns. When the column layout itself is variable, can I write the logFormat pattern so that it recognizes this and parses accordingly?

I tried reading a file in which some records have fewer columns. In that case, the short records get skipped. I need those records to be kept, because I have to generate an error DataFrame at the end.
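One idea I am considering, sketched below in plain Python rather than Glue, is to make the trailing columns optional so that short records still match and can be routed to an error set instead of being dropped. The classify helper, the error reasons, and the sample layout are hypothetical, and I am assuming that the two missing columns are always the last two (DOB and Company).

```python
import re

# Mandatory leading fields, an optional block for the two trailing columns,
# and a catch-all. The widths mirror the customPatterns in the question.
VARIABLE_LAYOUT = re.compile(
    r"^(?P<EMPID>[^*]{5})"
    r"(?P<Name>[^*]{8})"
    r"(?:(?P<DOB>[^*]{5})(?P<Company>[^*]{7}))?"
    r"(?P<extras>.*)$"
)


def classify(lines):
    """Split lines into (parsed, errors) instead of silently dropping any."""
    parsed, errors = [], []
    for line in lines:
        match = VARIABLE_LAYOUT.match(line)
        if match is None:
            # Not even the two mandatory columns could be read.
            errors.append({"raw": line, "reason": "does not match EMPID and Name"})
            continue
        fields = match.groupdict()
        if fields["DOB"] is None:
            # The optional block did not match, so this is a short record.
            errors.append({"raw": line, "reason": "DOB/Company missing or incomplete"})
        else:
            parsed.append(fields)
    return parsed, errors
```

If Grok in Glue allows regex syntax around pattern references, the equivalent logFormat might look like %{GETEMPID:EMPID}%{GETNAME:Name}(?:%{GETDOB:DOB}%{GETCOMPANY:Company})?%{GREEDYDATA:extras}, but I have not verified that the GrokLog reader accepts this.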
