PyAthena parses an ARRAY<VARCHAR> column correctly but the result is a string

212 Views Asked by At

Starting from a single column abc of type ARRAY<STRING> with one row:

SELECT ARRAY ['a', 'b', 'c'] AS abc

If we execute the query with pyathena using the ArrowCursor:

cursor = pyathena.connect(**AWS_PARAMETERS).cursor(ArrowCursor)
execution_result = cursor.execute(query)

Where the imports and parameters are:

import pyathena
from pyathena.arrow.cursor import ArrowCursor

AWS_PARAMETERS = {
    'aws_access_key_id': ***,
    'aws_secret_access_key': ***,
    'region_name': ***,
    's3_staging_dir': ***,
    'work_group': ***,
}

Inspecting the execution_results:

execution_result.description

It seems that pyathena got the column type right as array:

[('abc', 'array', None, None, 0, 0, 'UNKNOWN')]

However, when "materializing" the pyarrow table:

table = execution_result.as_arrow()

Unfortunately the column content becomes a STRING instead of an ARRAY<STRING>:

pyarrow.Table
abc: string
----
abc: [["[a, b, c]"]]

All I am trying to do here is to get that query result as a Polars (or Pandas) DataFrame with a column of type ARRAY<STRING>.

Is there a better and obvious way I am missing here to get the "type right"?

Should I really parse myself that string into a list of strings myself or even convert that string to JSON then to a list of string?

1

There are 1 best solutions below

0
Filippo Vitale On

8 months after asking this question the current version of pyathena-3.0.10 is correctly materializing the arrow table:

pyarrow.Table
abc: list<array_element: string>
  child 0, array_element: string
----
abc: [[["a","b","c"]]]