I'm trying to read a big Feather file and my process gets killed:
gerardo@hal9000:~/Projects/Morriello$ du -h EtOH.feather
5.3G EtOH.feather
I'm using pandas and pyarrow; here are the versions:
gerardo@hal9000:~/Projects/Morriello$ pip freeze | grep "pandas\|pyarrow"
pandas==2.2.1
pyarrow==15.0.0
When I try to load the dataset into a DataFrame, the process just gets killed:
In [1]: import pandas as pd
In [2]: df = pd.read_feather("EtOH.feather", dtype_backend='pyarrow')
Killed
I'm on Linux, using Python 3.12, on a machine with 16 GB of RAM.
I can see the process gets killed due to an out-of-memory error:
Out of memory: Killed process 918058 (ipython) total-vm:24890996kB, anon-rss:8455548kB, file-rss:640kB, shmem-rss:0kB, UID:1000 pgtables:17228kB oom_score_adj:100
I've also tried reading it in batches, as suggested here and by @David, but the process still gets killed:
In [4]: import pyarrow
In [5]: reader = pyarrow.ipc.open_file('./EtOH.feather')
In [6]: first_batch = reader.get_batch(0)
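For reference, the fuller batched loop I was trying looks roughly like this (the per-batch processing is just a placeholder):

import pyarrow as pa
import pyarrow.ipc

# Open the Feather (Arrow IPC) file and walk over its record batches
# one at a time instead of materialising the whole table at once.
reader = pa.ipc.open_file('./EtOH.feather')

for i in range(reader.num_record_batches):
    batch = reader.get_batch(i)   # a single pyarrow.RecordBatch
    chunk = batch.to_pandas()     # placeholder: real processing would go here
    del chunk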
How do I read the file in this case? And if I manage to read it, would there be any noticeable advantage in converting it to Parquet format?
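In case it matters, the conversion to Parquet I had in mind is something along these lines (untested; written as a batch-by-batch copy so it hopefully never needs the whole table in memory):

import pyarrow as pa
import pyarrow.ipc
import pyarrow.parquet as pq

# Copy the Feather file to Parquet one record batch at a time,
# reusing the original schema for the Parquet writer.
reader = pa.ipc.open_file('./EtOH.feather')
with pq.ParquetWriter('EtOH.parquet', reader.schema) as writer:
    for i in range(reader.num_record_batches):
        writer.write_batch(reader.get_batch(i))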