Most efficient way to save / load huge DataFrames?

I have a single pd.DataFrame with 4.3 million rows and 2 columns, like so:

Id     Features
27693  [1.6555750043281372e-09, -6.016701912292214e-2...]
27694  [-1.5324687581597672e-32, 1.0946759676292507e-4...]

Features is a column holding 512-element numpy arrays. I need this structure persisted on disk so it can be loaded on demand, but I am not sure what the best way is to achieve reasonable load times. Currently, my solution is to split the DataFrame into 9 equally sized partitions (~500,000 rows each) and save them to Feather files.
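For reference, here is a minimal sketch of that current setup (the file names and helper functions are hypothetical):

import pandas as pd

N_PARTS = 9

def save_partitions(df: pd.DataFrame) -> None:
    # Write 9 roughly equal row chunks to Feather (requires pyarrow).
    step = -(-len(df) // N_PARTS)  # ceiling division
    for i in range(N_PARTS):
        chunk = df.iloc[i * step:(i + 1) * step].reset_index(drop=True)
        chunk.to_feather(f"features_part{i}.feather")

def load_partitions() -> pd.DataFrame:
    # This is the step whose runtime I want to minimize.
    parts = [pd.read_feather(f"features_part{i}.feather") for i in range(N_PARTS)]
    return pd.concat(parts, ignore_index=True)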

Loading these 9 files consistently takes around 21.6 seconds. Hypothetically, let's say this time-to-load needs to be as fast as possible, while time-to-save isn't an issue.

Are there better formats or techniques to efficiently load large DataFrames with numpy-array columns into memory?

1 Answer

Answered by Matan Bendak

It depends on your data: if it is binary, you could store it as a raw byte array.

If not, convert the data to numpy arrays and save them using pickle, keeping the index and column names in separate files so the DataFrame can be rebuilt on load.
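A minimal sketch of that idea, assuming Features really holds fixed-length 512-element float arrays; the file names here are hypothetical, and numpy's native np.save is used, which writes the raw buffer in one go much like pickling the array would:

import numpy as np
import pandas as pd

def save_as_numpy(df: pd.DataFrame) -> None:
    # Stack the per-row arrays into one contiguous (n_rows, 512) float
    # array; saving that is one big binary write instead of millions
    # of small Python objects.
    np.save("features.npy", np.stack(df["Features"].to_numpy()))
    np.save("ids.npy", df["Id"].to_numpy())

def load_as_numpy() -> pd.DataFrame:
    features = np.load("features.npy")  # mmap_mode="r" would defer the read
    ids = np.load("ids.npy")
    # Rebuilding the two-column DataFrame recreates one object per row;
    # working directly with the 2-D array and the ids is faster still.
    return pd.DataFrame({"Id": ids, "Features": list(features)})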

A pandas DataFrame carries more overhead than plain numpy arrays, so deserializing the arrays directly is lighter.