I'm reading a ROOT file with Uproot and converting parts of it into a DataFrame using the arrays method.
This works fine until I try to save the DataFrame to Parquet with its to_parquet method. Sample code is given below.
    import numpy as np
    import pandas as pd
    import uproot

    # The next three lines choose which data to keep and how to rename the columns
    data = pd.read_csv(dictFile, header=None, delim_whitespace=True)
    dataFile, dataKey = data[0], data[1]
    content_ele = {dataKey[i]: dataFile[i] for i in np.arange(len(dataKey))}

    # We run over the different files to save a simplified version of each
    file_list = pd.read_csv(file_list_loc, names=["Loc"])
    for file_loc in file_list.Loc:
        tree = uproot.open(f"{file_path}/{file_loc}:CollectionTree")
        arrays = tree.arrays(dataKey, library="pd").rename(columns=content_ele)
        save_loc = f"{save_path}/{file_loc[:-6]}reduced.parquet"
        arrays.to_parquet(path=save_loc)
Doing so results in the following error: __arrow_array__() got an unexpected keyword argument 'type'
It seems to originate from pa.array, if that helps out.
Of note, the simplest DataFrame I've had this error with has 2 columns, where each row holds awkward arrays (awkward.highlevel.Array) whose length varies from row to row but is the same across both columns. An example is given below.

       A                 B
    0  [31, 26, 17, 23]  [-2.1, 1.3, 0.5, -0.4]
    1  [75, 15, 49]      [2.4, -1.8, 0.8]
    2  [58, 45, 64, 47]  [-1.9, -0.4, -2.5, 1.3]
    3  [26]              [-1.1]
I've tried both restricting which elements I run on (for example, only integers) and reducing the number of columns, as above.
However, running this exact same method with to_json gives no errors. The problem with that method is that once I read the file back, what were previously awkward arrays are now just lists, which makes them much less practical to work with whenever I want to calculate something like array.A/2. Yes, I could just convert them, but it seems wiser to keep the original format, and it is easier since I don't have to convert each time.
Solution: Upgrade your awkward-pandas package. When I first tried to reproduce your problem with awkward-pandas version 2022.12a1, I saw the same error; then I upgraded to 2023.8.0 and it's gone.

Detective work: I'm writing all of this down because I'm so proud of myself. :)

I'm guessing that the data in f"{file_path}/{file_loc}:CollectionTree" is ragged. There's no indication of this in your example, but if it were purely numerical data types (no variable-length lists or nested data structures), then arrays would be a normal Pandas DataFrame. If, in that case, you got an error, it would be a Pandas error: possible, but less likely because someone else would have noticed it first.

So assuming that arrays is a DataFrame of ragged data (and this is Uproot >= 5.0), the data types in each column are managed with awkward-pandas. If so, I should be able to reproduce the error with a small DataFrame of ragged data, and I do (with awkward-pandas version 2022.12a1).

(For the future: including a whole stack trace would remove a lot of guesswork.)
I first thought, "Maybe awkward-pandas hasn't implemented the __arrow_array__ protocol." But no, the AwkwardExtensionArray has an __arrow_array__ method.

Then, "Maybe it has an __arrow_array__ method, but that method doesn't take a type argument," which is what the error message is saying. Aha! That's it! So I was about to write an issue on awkward-pandas, and in so doing, point out the function definition that's missing a type argument. But the function definition isn't missing a type argument:

https://github.com/intake/awkward-pandas/blob/1f8cf19fdc9cb0786642f39cfaf7c084c3c5c9bc/src/awkward_pandas/array.py#L148-L151

It's just that my copy of the package was old. This is an old bug that has since been fixed.
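
The mismatch behind that error message can be illustrated without pyarrow or awkward-pandas installed. The sketch below is not the real code from either library: OldStyleArray, NewStyleArray, convert, and try_convert are hypothetical names, and the assumption is only that the caller passes type by keyword, which is what the error message suggests.

```python
class OldStyleArray:
    """Hypothetical stand-in for an outdated extension array."""

    def __arrow_array__(self):  # no `type` parameter, like the old bug
        return [1, 2, 3]


class NewStyleArray:
    """Hypothetical stand-in for the fixed extension array."""

    def __arrow_array__(self, type=None):  # accepts the keyword
        return [1, 2, 3]


def convert(obj, type=None):
    """Mimic a caller that always passes `type=` by keyword (an assumption)."""
    return obj.__arrow_array__(type=type)


def try_convert(obj):
    """Return "ok" on success, or the TypeError message on failure."""
    try:
        convert(obj)
        return "ok"
    except TypeError as err:
        return str(err)


print(try_convert(OldStyleArray()))  # message mentions the unexpected 'type' keyword
print(try_convert(NewStyleArray()))
```

The method exists in both classes, so a check such as hasattr(obj, "__arrow_array__") would pass either way; only the signature differs, which is why the failure shows up as a TypeError at call time.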
I upgraded my awkward-pandas and it all works now: writing the DataFrame with to_parquet raises no errors, and the file reads back appropriately.
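
Since the fix depends on which awkward-pandas version is installed, it may help to check that before and after upgrading. This is a minimal sketch using only the standard library; installed_version is a hypothetical helper name, and the list of packages to report is my own choice.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the packages involved in the Parquet round trip.
for name in ("awkward-pandas", "awkward", "pandas", "pyarrow", "uproot"):
    print(f"{name}: {installed_version(name)}")
```

If the reported awkward-pandas version is 2022.12a1 or similarly old, upgrading (for example with pip install --upgrade awkward-pandas) should bring in the fixed method.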