I have a large data file of shape (N, 4) which I am mapping line-by-line. My files are ~10 GB; a simplistic implementation is given below. Though the following works, it takes a huge amount of time.
I would like to implement this logic such that the text file is read directly and I can access the elements. Thereafter, I need to sort the whole (mapped) file based on column-2 elements.
The examples I see online assume a smaller piece of data (d) and use f[:] = d[:], but I can't do that since d is huge in my case and eats up my RAM.
PS: I know how to load the file using np.loadtxt and sort it using argsort, but that logic fails (memory error) at GB file sizes. Would appreciate any direction.
import numpy as np

nrows, ncols = 20000000, 4  # nrows is really larger than this; this is just for illustration

f = np.memmap('memmapped.dat', dtype=np.float32,
              mode='w+', shape=(nrows, ncols))

filename = "my_file.txt"
with open(filename) as file:
    for i, line in enumerate(file):
        # parse one comma-separated line into four floats
        floats = [float(x) for x in line.split(',')]
        f[i, :] = floats
del f
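For reference, the in-memory approach mentioned in the PS is roughly the following (my sketch, not my exact code); it fails with a MemoryError at this file size:

import numpy as np

# Load everything into RAM, then sort all rows by column 2.
# Works for small files but raises MemoryError for ~10 GB inputs.
data = np.loadtxt("my_file.txt", delimiter=',', dtype=np.float32)
data_sorted = data[data[:, 1].argsort()]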
EDIT: Instead of do-it-yourself chunking, it's better to use the chunking feature of pandas, which is much, much faster than numpy's loadtxt. In chunked mode, the pd.read_csv function returns a special object that can be iterated over with for chunk in chunks:. At every iteration it reads a chunk of the file and returns its contents as a pandas DataFrame, which can be treated as a numpy array in this case. The names parameter is needed to prevent read_csv from treating the first line of the CSV file as column names.
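A minimal sketch of this, reusing the memmap from the question (the chunk size and the column names 'a'..'d' are placeholders I chose; tune them to your data and RAM):

import numpy as np
import pandas as pd

nrows, ncols = 20000000, 4           # placeholder shape, as in the question
f = np.memmap('memmapped.dat', dtype=np.float32,
              mode='w+', shape=(nrows, ncols))

chunksize = 1_000_000                # rows per chunk; adjust to available RAM
chunks = pd.read_csv("my_file.txt", names=['a', 'b', 'c', 'd'],
                     chunksize=chunksize, dtype=np.float32)

start = 0
for chunk in chunks:
    values = chunk.to_numpy()        # DataFrame chunk viewed as a numpy array
    f[start:start + len(values)] = values
    start += len(values)

f.flush()
del f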
Old answer below

The numpy.loadtxt function works with a filename or with any object that returns lines one at a time when iterated over in a loop. It doesn't even need to pretend to be a file; a list of strings will do!
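For instance, a toy illustration (with made-up values) of passing a plain list of strings:

import numpy as np

# np.loadtxt accepts any iterable of lines, not just an open file
lines = ["1.0, 2.0, 3.0, 4.0",
         "5.0, 6.0, 7.0, 8.0"]
arr = np.loadtxt(lines, delimiter=',', dtype=np.float32)
print(arr)   # [[1. 2. 3. 4.]
             #  [5. 6. 7. 8.]]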
We can read chunks of the file that are small enough to fit in memory and provide these batches of lines to np.loadtxt, as sketched below.
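A minimal sketch of that batching approach, writing into the memmap from the question (itertools.islice and the batch size are my choices, not from the original code):

import itertools
import numpy as np

nrows, ncols = 20000000, 4           # placeholder shape, as in the question
f = np.memmap('memmapped.dat', dtype=np.float32,
              mode='w+', shape=(nrows, ncols))

batch_rows = 1_000_000               # lines per batch; small enough to fit in RAM
start = 0
with open("my_file.txt") as file:
    while True:
        batch = list(itertools.islice(file, batch_rows))
        if not batch:
            break
        # np.loadtxt happily parses a list of lines
        values = np.loadtxt(batch, delimiter=',', dtype=np.float32)
        values = values.reshape(-1, ncols)   # keep 2-D even if the batch has one line
        f[start:start + len(values)] = values
        start += len(values)

f.flush()
del f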
Disclaimer: I tested this on Linux. I expect this to work on Windows as well, but it could be that the handling of '\r' characters causes problems.