I need to load a time-series dataset to train a network. The dataset was split into many chunks train_x_0.npy, train_x_1.npy, ..., train_x_40.npy (41 chunks) because of memory issues when extracting these .npy files from the raw data. However, the total size (around 1000 GB) is too large to load everything into RAM. I have been considering two ways to solve this problem:
- Loading the data chunks using `np.load()` with the argument `mmap_mode='r+'`. The memory-mapped chunks are stored in a Python list `self.data`. In the `__getitem__(self, idx)` method of the PyTorch `Dataset` class, I convert `idx` to `chunk_idx` and `sample_idx`, then get the sample via `self.data[chunk_idx][sample_idx]` (a minimal sketch of this option is shown after the list).
- Extracting the `.npy` files from the raw data again, but saving the data sample by sample, i.e. one `.npy` file is now one sample, not a data chunk. In the `__getitem__(self, idx)` method, I get one sample by loading it with `np.load(sample_path)`.
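For context, here is a minimal sketch of the first option, assuming every chunk is one array with samples along its first axis (class and variable names are just illustrative):

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class MemmapChunkDataset(Dataset):
    """Option 1: keep every chunk memory-mapped and read single samples on demand."""

    def __init__(self, chunk_paths):
        # mmap_mode only maps the files; no data is read until a sample is accessed
        self.data = [np.load(p, mmap_mode='r') for p in chunk_paths]
        # cumulative sample counts, used to translate a global idx into (chunk_idx, sample_idx)
        self.cum_sizes = np.cumsum([len(chunk) for chunk in self.data])

    def __len__(self):
        return int(self.cum_sizes[-1])

    def __getitem__(self, idx):
        chunk_idx = int(np.searchsorted(self.cum_sizes, idx, side='right'))
        sample_idx = idx if chunk_idx == 0 else idx - self.cum_sizes[chunk_idx - 1]
        # the actual disk read happens here, one sample at a time
        return torch.from_numpy(np.array(self.data[chunk_idx][sample_idx]))
```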
Assuming a PyTorch DataLoader will be used to iterate through all samples, which method will be faster?
If you have another suggestion to extract the raw data or to load the .npy files, please share your opinion.
Both suggested approaches will be limited by your filesystem's IO, since each sample is loaded from disk on demand (memory mapping does not speed up the actual read once a given sample is requested).
Especially when you are planning to train for many epochs, you can achieve a strong speedup by loading your original chunks `train_x_0.npy`, `train_x_1.npy`, etc. one at a time (or as many as you can hold in RAM) and training multiple epochs on each chunk before switching to the next.

For this, you would need control over the sample indices requested by the dataloader. You could define a sampler which is passed only the sample indices available in the currently cached data chunk. In pseudocode, your training loop could look something like this when caching one chunk at a time:
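(A rough sketch: `train_one_epoch`, `model`, the batch size, and the number of epochs per chunk are placeholders for your own code, and `cache_chunk` / `get_chunk_sample_inds` are the `Dataset` methods described below.)

```python
from torch.utils.data import DataLoader, SubsetRandomSampler

n_chunks = 41
epochs_per_chunk = 5  # how many epochs to run on one cached chunk before moving on

for chunk_idx in range(n_chunks):
    # pull the whole chunk into RAM, replacing the previously cached one
    dataset.cache_chunk(chunk_idx)
    # global sample indices that live inside this chunk
    chunk_inds = dataset.get_chunk_sample_inds(chunk_idx)
    # restrict the DataLoader to the cached chunk (shuffled within the chunk)
    sampler = SubsetRandomSampler(chunk_inds)
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)
    for _ in range(epochs_per_chunk):
        train_one_epoch(model, loader)
```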
Here, your `Dataset` class needs to take care of two things:

- loading and caching the requested chunk (`cache_chunk` method)
- providing the sample indices contained in a given chunk (`get_chunk_sample_inds` method)

If you use a fast GPU (which is often limited by shuffling data back and forth between RAM and VRAM, even for RAM-cached data), you can expect several orders of magnitude of speedup using this approach, as opposed to attempting to fill the VRAM from the HDD for each sample individually.
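A possible sketch of such a `Dataset` (again only a sketch, assuming every chunk is one array with samples along its first axis; adapt the loading and shapes to your data):

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class ChunkCachingDataset(Dataset):
    """Keeps exactly one chunk fully loaded in RAM at a time."""

    def __init__(self, chunk_paths):
        self.chunk_paths = chunk_paths
        # read only the array headers (via mmap) to learn how many samples each chunk holds
        chunk_lens = [len(np.load(p, mmap_mode='r')) for p in chunk_paths]
        self.chunk_starts = np.concatenate([[0], np.cumsum(chunk_lens)])
        self.cached_idx = None
        self.cached_chunk = None

    def cache_chunk(self, chunk_idx):
        # fully load the requested chunk into RAM, replacing the previous one
        self.cached_chunk = np.load(self.chunk_paths[chunk_idx])
        self.cached_idx = chunk_idx

    def get_chunk_sample_inds(self, chunk_idx):
        # global sample indices covered by this chunk, to be passed to the sampler
        start, stop = self.chunk_starts[chunk_idx], self.chunk_starts[chunk_idx + 1]
        return list(range(start, stop))

    def __len__(self):
        return int(self.chunk_starts[-1])

    def __getitem__(self, idx):
        # only indices inside the cached chunk are ever requested, because the
        # sampler is restricted to get_chunk_sample_inds(cached_idx)
        sample = self.cached_chunk[idx - self.chunk_starts[self.cached_idx]]
        return torch.from_numpy(sample)
```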