I am working on a project where I aim to train a PyTorch model on multiple GPUs.
My input data is stored in separate files, one per training example, and during preprocessing I save each example to a .pt file with `torch.save`. Later, I load these files through a `DataLoader`, where I want to set `num_workers > 0` to speed up loading. However, it seems that `num_workers` can only be set to > 0 when the input data is on CPU.
My question is: should I save CUDA tensors up front and just use `num_workers=0`, or should I store CPU tensors, set `num_workers > 0`, and then move each batch as a whole to the GPU?
I'm uncertain which approach would be more efficient for training speed (time) on multiple GPUs. Any insights or best practices on this matter would be greatly appreciated.
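For reference, this is roughly the setup I have in mind; `FileDataset` and the file names below are simplified placeholders for my actual code:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class FileDataset(Dataset):
    """Loads one preprocessed training example per .pt file."""

    def __init__(self, file_paths):
        self.file_paths = file_paths

    def __len__(self):
        return len(self.file_paths)

    def __getitem__(self, idx):
        # Each .pt file was written during preprocessing with torch.save
        return torch.load(self.file_paths[idx])

dataset = FileDataset(["example_000.pt", "example_001.pt"])  # placeholder paths
# Setting num_workers > 0 only works cleanly when the loaded tensors live on CPU
loader = DataLoader(dataset, batch_size=2, num_workers=4)
```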
In the scenario you described, it's generally recommended to store the tensors on CPU, set `num_workers > 0`, and then move the batches to GPU during training (see the sketch after this list). A few reasons:

- **Data Loading and Preprocessing:** When you set `num_workers > 0` in the `DataLoader`, it creates multiple worker processes that load and preprocess the data in parallel, so you can use the `DataLoader` to speed up data loading and preprocessing while the GPU stays busy with training.
- **Memory Efficiency:** Keeping the dataset as CPU tensors means it lives in host RAM, and only the current batches occupy GPU memory, leaving more room for the model, activations, and gradients.
- **Data Transfer Overhead:** Copying one batch at a time to the GPU is relatively cheap, and with pinned memory (`pin_memory=True`) and `non_blocking=True` copies it can overlap with computation, so the per-batch transfer is rarely the bottleneck.
- **Flexibility and Scalability:** CPU-stored data works the same on a single GPU or across multiple GPUs (e.g. with `DistributedDataParallel`), so your saved files are not tied to a particular device.
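Here is a minimal sketch of that pattern, assuming each .pt file holds a (features, label) tuple of CPU tensors; `FileDataset`, the file names, and the toy model are placeholders rather than anything from your code:

```python
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader

class FileDataset(Dataset):
    def __init__(self, file_paths):
        self.file_paths = file_paths

    def __len__(self):
        return len(self.file_paths)

    def __getitem__(self, idx):
        features, label = torch.load(self.file_paths[idx])  # CPU tensors
        return features, label

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(128, 10).to(device)  # toy model as a stand-in

loader = DataLoader(
    FileDataset(["example_000.pt", "example_001.pt"]),  # placeholder paths
    batch_size=64,
    num_workers=4,    # parallel loading/preprocessing on CPU
    pin_memory=True,  # page-locked host memory speeds up host-to-GPU copies
)

for features, label in loader:
    # Move the whole batch to GPU; non_blocking=True lets the copy overlap
    # with computation when the source tensors are pinned
    features = features.to(device, non_blocking=True)
    label = label.to(device, non_blocking=True)
    output = model(features)
    # ... loss, backward, optimizer step ...
```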
Therefore, the recommended approach is to store the tensors on CPU, set `num_workers > 0` in the `DataLoader`, and then move the batches to GPU during training. This lets you leverage parallel data loading and preprocessing on CPU while using GPU memory only for the model and the current batches.

However, keep in mind that the optimal approach may vary depending on your specific use case, dataset size, and available hardware. If the whole dataset fits comfortably in GPU memory and the per-batch loading and transfer overhead is significant, you could consider storing the tensors on GPU and using `num_workers=0`. Experimenting with both setups on your own data is the best way to decide.
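If you want to measure which option wins on your hardware, a rough timing sketch along these lines can help; `build_loader` is a hypothetical helper you would write for each configuration, not an existing API:

```python
import time
import torch

def time_one_epoch(loader, device):
    """Rough wall-clock timing of iterating a DataLoader and moving batches to GPU."""
    start = time.perf_counter()
    for features, label in loader:
        features = features.to(device, non_blocking=True)
        label = label.to(device, non_blocking=True)
        # ... forward/backward would go here ...
    torch.cuda.synchronize(device)  # make sure pending GPU copies are counted
    return time.perf_counter() - start

device = torch.device("cuda")
# Hypothetical comparison: CPU-stored data with workers vs. CUDA-stored data without
# cpu_loader = build_loader(storage="cpu", num_workers=4, pin_memory=True)
# gpu_loader = build_loader(storage="cuda", num_workers=0)
# print("CPU + workers   :", time_one_epoch(cpu_loader, device))
# print("CUDA + 0 workers:", time_one_epoch(gpu_loader, device))
```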