Is it better to store CUDA or CPU tensors that are loaded by torch DataLoader?


I am working on a project where I aim to train a PyTorch model on multiple GPUs.

My input data is stored in separate files for each training example, and during preprocessing, I save them using the torch.save method to .pt files. Later, I load these files using DataLoader, where I want to set num_workers > 0 to speed up the process. However, it seems that num_workers can only be set to >0 when the input data is on CPU.

My question is: Should I save the tensors as CUDA tensors and use num_workers=0, or should I store CPU tensors, set num_workers > 0, and then move each batch as a whole to the GPU?

I'm uncertain which approach would be more efficient for training speed (time) on multiple GPUs. Any insights or best practices on this matter would be greatly appreciated.


There are 3 answers below

Pranesh Rajeswaran (Best Answer)

In the scenario you described, it's generally recommended to store the tensors on CPU, set num_workers > 0, and then move the batches to GPU during training.

  1. Data Loading and Preprocessing:

    • When you set num_workers > 0 in the DataLoader, it creates multiple worker processes that load and preprocess the data in parallel.
    • These worker processes typically run on CPU and can efficiently load data from storage (e.g., disk or memory) and perform preprocessing operations.
    • By storing the tensors on CPU, you can leverage the DataLoader's multi-processing to speed up data loading and preprocessing.
  2. Memory Efficiency:

    • Storing all the tensors on GPU memory can quickly exhaust the available GPU memory, especially if you have a large dataset or limited GPU memory.
    • By keeping the tensors on CPU and only moving the batches to GPU during training, you can efficiently utilize GPU memory and avoid out-of-memory issues.
  3. Data Transfer Overhead:

    • Moving data from CPU to GPU does incur some overhead, but this overhead is typically small compared to the computation time on the GPU.
    • Modern systems connect the CPU and GPU over a fast interconnect (e.g., PCIe or NVLink) that allows rapid transfer of batches from CPU memory to GPU memory.
    • The time spent transferring a batch of data from CPU to GPU is usually negligible compared to the time spent on forward and backward passes during training.
  4. Flexibility and Scalability:

    • By storing tensors on CPU, you have more flexibility in terms of data preprocessing and augmentation.
    • You can apply various transformations and data augmentation techniques on the CPU before moving the data to GPU for training.
    • This approach also allows you to scale your training to multiple GPUs more easily, as you can load and preprocess data on CPU and distribute the batches to different GPUs.

Therefore, the recommended approach is to store the tensors on CPU, set num_workers > 0 in the DataLoader, and then move the batches to GPU during training. This approach allows you to leverage parallel data loading and preprocessing on CPU while efficiently utilizing GPU memory for training.
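The recommended pipeline can be sketched as follows. This is a minimal illustration, not a definitive implementation: the `FileDataset` class, file layout, and tensor shapes are hypothetical stand-ins for your own preprocessed `.pt` files.

```python
import os
import tempfile

import torch
from torch.utils.data import DataLoader, Dataset


class FileDataset(Dataset):
    """One preprocessed example per .pt file; tensors stay on CPU."""

    def __init__(self, paths):
        self.paths = paths

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # map_location="cpu" keeps tensors on CPU even if they were
        # accidentally saved from a CUDA device during preprocessing.
        return torch.load(self.paths[idx], map_location="cpu")


# Dummy .pt files standing in for the preprocessed dataset
tmpdir = tempfile.mkdtemp()
paths = []
for i in range(8):
    p = os.path.join(tmpdir, f"sample_{i}.pt")
    torch.save(torch.randn(3, 4), p)
    paths.append(p)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
loader = DataLoader(
    FileDataset(paths),
    batch_size=4,
    num_workers=2,                         # parallel loading works: data is on CPU
    pin_memory=torch.cuda.is_available(),  # speeds up host-to-device copies
)

for batch in loader:
    batch = batch.to(device, non_blocking=True)  # move the whole batch to GPU here
    # ... forward / backward pass ...
```

Note that `pin_memory=True` plus `non_blocking=True` lets the host-to-device copy overlap with computation when a GPU is present; on a CPU-only machine both are harmless no-ops here.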

However, keep in mind that the optimal approach may vary depending on your specific use case, dataset size, and available hardware. If your dataset is small enough to fit comfortably in GPU memory and data loading is not the bottleneck, you could consider storing the tensors on GPU and using num_workers=0. Experiment with your own setup to confirm.

dlPFC

num_workers > 0 parallelizes data loading on the CPU, not on the GPU. Generally, the best practice is to perform the data loading and preprocessing across multiple worker processes (using num_workers) on the CPU.

After the batch is collated by the DataLoader, transfer the entire batch to the GPU in your training loop.
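The transfer step in the training loop can look like this (a sketch; the dataset and shapes are dummy placeholders for your collated batches):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Dummy CPU data standing in for the preprocessed dataset
ds = TensorDataset(torch.randn(32, 10), torch.randint(0, 2, (32,)))
loader = DataLoader(ds, batch_size=8, num_workers=2,
                    pin_memory=torch.cuda.is_available())

for inputs, targets in loader:
    # Move the collated batch to the GPU inside the training loop;
    # non_blocking=True overlaps the copy with compute when memory is pinned.
    inputs = inputs.to(device, non_blocking=True)
    targets = targets.to(device, non_blocking=True)
    # ... forward / loss / backward / optimizer step ...
```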

For training on multiple GPUs you can use PyTorch's DistributedDataParallel (DDP) wrapper (generally preferred over the older DataParallel) to distribute the model and data across the GPUs.

Karl

Generally speaking, you want to load and preprocess your data on the CPU and only move to the GPU right before passing to the model.

This is because GPU memory is limited. You don't want to use GPU memory storing data that isn't being used.

Maybe there are niche scenarios where you have preprocessed tensors saved and you can load directly to the GPU. However, this would require loading from disk to the GPU (to avoid having the full dataset sitting on GPU memory), which is rather slow. Doing your data processing on CPU allows you to load your dataset into memory and transfer data from CPU memory to GPU memory which is much faster.

For the case you're describing, it sounds like you're bottlenecked on i/o loading dataset files from disk. In this scenario, it would be best to have your dataset load/preprocess on the CPU, then move to GPU after batching in your dataloader.
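One way to realize this is to pay the slow disk I/O once at startup, hold the whole dataset in CPU RAM, and then do only the fast CPU-to-GPU copy per batch. A minimal sketch (the in-memory tensors stand in for `torch.load`-ing each per-example `.pt` file once; sizes are illustrative):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-ins for loading each per-example .pt file from disk once at startup
examples = [torch.randn(3, 4) for _ in range(16)]

# Hold the whole dataset in CPU RAM; only batches ever touch GPU memory
data = torch.stack(examples)            # shape: (16, 3, 4)
loader = DataLoader(TensorDataset(data), batch_size=4)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
for (batch,) in loader:
    batch = batch.to(device)            # fast CPU-RAM-to-GPU copy, one batch at a time
```

This only works when the full dataset fits in CPU RAM; otherwise, keep the per-file Dataset approach and let the DataLoader workers hide the disk latency.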