Query an Array column with an indices column

43 Views Asked by At

So I have two columns

| col_arr | col_ind | 
|[1, 2, 3]| [0, 2]  |
|[5, 1]   | [1]     |

and I'd like my result to be an extraction of the values in col_arr by col_ind resulting in col_val below:

| col_arr | col_ind | col_val |
|[1, 2, 3]| [0, 2]  | [1, 3]  |
|[5, 1]   | [1]     | [1]     |

What would be the most elegant and efficient way to do this?


My first idea is to use a UDF, but it feels like an overkill

@udf
def sub_select(arr, inds):
    if (arr is not None) and (inds is not None):
        return [arr[ind] for ind in inds]

I was also thinking about the array_position function dynamically with expr. It is not clear to me how to do so in a way that is flexible to the length of col_ind.

F.expr("array_position(col_arr, array_position(col_ind, 0))")

Simplifications and complications:

  1. Assume the length of indices to query is bounded by a small number (ex. the max len of a col_ind list is 5).
  2. What if I have several col_arr columns (col_arr1, col_arr2, col_arr3), and only one col_ind?
1

There are 1 best solutions below

0
Shubham Sharma On

Apply a transformation function on col_ind and for each index return the corresponding value from col_arr

df = df.withColumn('col_val', F.expr("transform(col_ind, i -> col_arr[i])"))

# df.show()
# +---------+-------+-------+
# |  col_arr|col_ind|col_val|
# +---------+-------+-------+
# |[1, 2, 3]| [0, 2]| [1, 3]|
# |   [5, 1]|    [1]|    [1]|
# +---------+-------+-------+

Bonus: What if I have several col_arr columns (col_arr1, col_arr2, col_arr3), and only one col_ind? Same approach can be used for each column seperately