I have a huge multiindex dataframe in long format. There's only one "value" column. Some entries in "value" are of type pd.Timedelta.
I got an error when trying to save that dataframe as parquet file using pd.to_parquet:
pyarrow.lib.ArrowInvalid: ("Could not convert Timedelta('2903 days 00:00:00') with type Timedelta: tried to convert to double", 'Conversion failed for column value with type object')
If I convert the particular value from the error message to numpy, I get the following:
array([Timedelta('2903 days 00:00:00')], dtype=object)
I set up a toy example to check if it is possible at all to convert pd.Timedelta to parquet. The following code works just fine:
import pandas as pd
idx = pd.MultiIndex.from_tuples(
[("A", "x"), ("A", "y"), ("B", "x"), ("B", "y")],
names=["idx1", "idx2"])
data = {"value": [
pd.Timedelta(days=10),
2.5,
pd.Timedelta(days=20),
5
]}
df = pd.DataFrame(data=data, index=idx)
df.to_parquet("test.parquet")
x = pd.read_parquet("test.parquet")
A simple df.iloc[0, :].to_numpy() delivers the exact same type as in my real dataframe: array([Timedelta('10 days 00:00:00')], dtype=object).
I am wondering what might be the difference of my original dataframe compared to my toy example?
This part of your error
'Conversion failed for column value with type object', indicates that your column is a mixture between floats and time delta, as you expected.The problem comes from the fact that pyarrow when you have a big dataframe infers, the types from the first rows of the pandas dataframe. Most probably in your case they are all floats. This code (the same example you have with a small modification) will reproduce your error:
The problem here is that a column in parquet cannot have multiple types. In your example if you load the saved parquet you will see that everything has been converted to timedelta. For example
x = pd.read_parquet("test.parquet")follow byx.iloc[1, :].to_numpy()delivers thisarray([2], dtype='timedelta64[us]')If the float values do not represent a delta time you should do as the comment suggest and unpivot the data. If the float values represent a deltatime, you can transform them before saving, or pass a schema when saving.
or