I'm trying to query a dynamodb export using AWS Glue and Athena. I set up a glue crawler to create tables from the exported file, but the output table of interest "data" has only one column "item". Item is a struct which has an assortment of nested files such that the table definition looks like this:
root
|-- Item: struct
| |-- pk: struct
| | |-- S: string
| |-- sk: struct
| | |-- S: string
| |-- createdAt: struct
| | |-- N: string
| |-- event: struct
| | |-- S: string
| |-- data: struct
| | |-- S: string
| |-- resultId: struct
| | |-- NULL: boolean
| | |-- S: string
| |-- updatedAt: struct
| | |-- N: string
| |-- answers: struct
| | |-- S: string
| |-- result: struct
| | |-- S: string
| |-- interactions: struct
| | |-- S: string
What do I need to do to be able to use Athena to query the table so that I get back a response with columns: pk, sk, createdAt etc. rather than just a row with everything packed into one struct?
I've tried applying a transform in a Glue ETL script,using function that looks like:
def transform(rec):
rec["pk"] = rec["Item"]["pk"]["S"]
del rec["Item"]["pk"]
...
This appears to work for the first few keys (which I think are in every row, but when I hit columns that are not in every row, it begins to fail...
For that you need to use a JSON classifier.
https://docs.aws.amazon.com/glue/latest/dg/custom-classifier.html#custom-classifier-json
The classifier should look something similar to this:
$[*]