I have some relatively large .mat files that I'm reading into Python to eventually use in PyTorch. The files range in the number of rows (~55k to ~111k), but each has a little under 11k columns, with no header, and all entries are floats. File sizes range from 5.8 GB to 11.8 GB.

The .mat files came from a prior data-processing step in Perl, so I'm not sure about the mat version; when I tried to load a file using scipy.io.loadmat, I received the following error: ValueError: Unknown mat file type, version 46, 56.

I've tried pandas, dask, and astropy successfully, but it takes between 4 and 6 minutes to load a single file. Here's the code for loading with each of the methods mentioned above, run as a timing experiment:
import pandas as pd
import dask.dataframe as dd
from astropy.io import ascii as aio
import numpy as np
import time

dataPath = 'path/to/datafile'  # placeholder; set to the file being tested

numberIterations = 6
daskTime = np.zeros((numberIterations,), dtype=float)
pandasTime = np.zeros((numberIterations,), dtype=float)
astropyTime = np.zeros((numberIterations,), dtype=float)

for ii in range(numberIterations):
    # dask
    t0 = time.time()
    data = dd.read_csv(dataPath, delimiter='\t', dtype=np.float64, header=None)
    daskTime[ii] = time.time() - t0
    del data

    # pandas
    t0 = time.time()
    data = pd.read_csv(dataPath, delimiter='\t', dtype=np.float64, header=None)
    pandasTime[ii] = time.time() - t0
    del data

    # astropy
    t0 = time.time()
    data = aio.read(dataPath, format='fast_no_header', delimiter='\t',
                    header_start=None, guess=False)
    astropyTime[ii] = time.time() - t0
    del data
When I time these methods, dask is by far the slowest (by almost 3x), followed by pandas, and then astropy. For the largest file, the load times (in seconds) over 6 runs are:
dask: 1006.15 (avg), 1.14 (std)
pandas: 337.50 (avg), 5.84 (std)
astropy: 314.61 (avg), 2.02 (std)
I'm wondering if there is a faster way to load these files, since this is still quite long. Specifically, I'm wondering whether there is a better library for consistently loading tabular float data, and/or whether there is a way to incorporate C/C++ or bash to read the files faster. I realize this question is a little open-ended; I'm hoping to get some ideas for reading these files in faster, so that less of the total time is wasted just on reading them in.
Given these were generated in Perl, and given the code above works, these are tab-separated text files, not MATLAB files (which is what scipy.io.loadmat would expect). Generally, reading in text is slow, and will depend heavily on compression and I/O limitations.
FWIW pandas is already pretty well optimised under the hood, and I doubt you would get significant gains from using C directly.
If you plan to use these files frequently, it might be worth using zarr or hdf5 to represent tabular float data. I'd lean towards zarr if you have some experience with dask already; they work nicely together.
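For a concrete picture, here's a minimal one-time conversion sketch (untested; the paths, the column count, the pandas chunk size, and the zarr chunk shape are all placeholders you'd tune, and it assumes the zarr v2 array API). The idea is to pay the slow text parse once, after which every later load is a fast binary read:

import numpy as np
import pandas as pd
import zarr

dataPath = 'path/to/datafile'    # placeholder: one of the original tab-separated files
storePath = 'path/to/data.zarr'  # placeholder: on-disk zarr store
ncols = 11000                    # placeholder: the actual column count of the file

# Create an empty, chunked float64 array we can append to row-wise.
z = zarr.open(storePath, mode='w', shape=(0, ncols),
              chunks=(1000, ncols), dtype='f8')

# Stream the text through pandas in bounded-memory chunks; this slow
# parse only ever has to happen once per file.
for chunk in pd.read_csv(dataPath, delimiter='\t', header=None,
                         dtype=np.float64, chunksize=5000):
    z.append(chunk.to_numpy())

# Later loads read binary data directly:
data = zarr.open(storePath, mode='r')[:]  # eager NumPy array
# or, fitting the existing dask workflow:
# import dask.array as da
# data = da.from_zarr(storePath)

An hdf5 store (e.g. via h5py or pandas' HDFStore) would work the same way; the point is to parse the text once and keep the floats in a binary, chunked layout from then on.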