I am using Spark SQL Java (Not scala or python). Here is a simplified example of my dataset (it has many more columns than this):
| ID | References | E_id | F_id | G_id |
|---|---|---|---|---|
| 0 | [E_id] | 0000 | | |
| 1 | [E_id, F_id, G_id] | 0010 | 1000 | 1111 |
There is a column called References which contains an array of the column names that this row of data has values for. So the first row only has an E_id value and no F_id or G_id values, while the second row has all three. For each row, I need to select the columns that every row has (like ID and References) and then also select the respective columns that are listed in that row's References.
At the moment, this is my code:
final Dataset<Row> partitionedDf = df.select(col("ID"), col("References"));
I can't hardcode any of the E_id, F_id, G_id, etc. columns into my select because the possible columns inside References are dynamic. I need to figure out a way to read References for each row and then select the columns it names as well, so they're all included in my Dataset.
One thing to note is that the columns inside References all follow a pattern (they all end with "_id"), so I'm not sure if regex or suffix matching could be used here instead of relying on what is inside the References array.
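For what it's worth, the suffix-matching idea could be sketched roughly like this: take the schema's column names from `df.columns()`, keep the fixed columns plus every name ending in `"_id"`, and pass the result to `select`. The name-filtering part is plain Java (the `SelectIdColumns` class and `columnsToSelect` helper below are just names I made up for the sketch; only the commented-out last line is Spark API):

```java
import java.util.Arrays;
import java.util.stream.Stream;

public class SelectIdColumns {
    // Keep the columns every row has, plus every column whose name ends with "_id".
    static String[] columnsToSelect(String[] allColumns) {
        return Stream.concat(
                Stream.of("ID", "References"),
                Arrays.stream(allColumns).filter(c -> c.endsWith("_id")))
            .toArray(String[]::new);
    }

    public static void main(String[] args) {
        // Stand-in for df.columns() from the example dataset.
        String[] schema = {"ID", "References", "E_id", "F_id", "G_id"};
        System.out.println(Arrays.toString(columnsToSelect(schema)));
        // prints: [ID, References, E_id, F_id, G_id]
        // With Spark this would feed something like:
        // df.selectExpr(columnsToSelect(df.columns()));
    }
}
```

This selects every `_id` column that exists in the schema, so rows that don't list a given column in References would just carry nulls there, which may or may not be acceptable for my use case.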