I'm working through a Databricks example. The schema for the dataframe looks like:
|-- authors: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- author: struct (nullable = true)
| | | |-- key: string (nullable = true)
| | |-- key: string (nullable = true)
| | |-- type: string (nullable = true)
I'm trying to get a dataframe with the following schema:
|-- author_key: string (nullable = true)
|-- key: string (nullable = true)
|-- type: string (nullable = true)
I don't know how to explode a nested struct, so as a first step I tried to extract just the key and type fields using explode, but I'm not sure this is the right approach. Here is what I did:
- code
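import org.apache.spark.sql.functions.explode
df
  .select(explode($"authors"))      // explode names its output column "col" by default
  .select($"col.key", $"col.type")  // top-level fields of each array element
  .show()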
- output
+---------------+----------------------+
| key | type |
+---------------+----------------------+
|/authors/<key1>| null |
|/authors/<key2>| null |
|/authors/<key3>| null |
|/authors/<key4>| null |
| null |{"key":"/type/auth..."|
|/authors/<key6>| null |
|/authors/<key7>| null |
+---------------+----------------------+
You could use the explode function to explode the array, then extract the needed data into separate columns, something like this:
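A minimal sketch of that approach, assuming Spark's Scala API and a schema like the one in the question. The sample JSON record and the names `exploded` and `flattened` are hypothetical, made up only so the snippet runs on its own; in a Databricks notebook `spark` already exists and you would use your own `df`:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

// In a Databricks notebook `spark` is predefined; getOrCreate() simply reuses it.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical sample record shaped like the schema in the question.
val df = spark.read.json(Seq(
  """{"authors":[{"author":{"key":"/authors/A1"},"type":"/type/author_role"},{"key":"/authors/A2"}]}"""
).toDS())

// Step 1: one row per array element, kept as a struct column named "a".
val exploded = df.select(explode(col("authors")).as("a"))

// Step 2: pull each field, including the nested one, into its own column.
val flattened = exploded.select(
  col("a.author.key").as("author_key"), // nested struct field via dot path
  col("a.key").as("key"),
  col("a.type").as("type")
)

flattened.printSchema()
flattened.show(truncate = false)
```

The nested field is reached with a dot path (`a.author.key`), so no second explode is needed: `author` is a struct, not an array. Elements where a field is missing simply yield null in that column, which matches the nulls in the output above. Note that `explode` drops rows whose `authors` array is null or empty; `explode_outer` keeps them if that matters for your data.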