Questions about batch size and learning rate settings for DDP vs. single-GPU training
With single-GPU training, I use batch size = 8 and learning rate = 10e-4.
I have now switched to DDP single-machine multi-GPU training (one node, 4 GPUs), initialized as follows:
import torch.distributed as dist
dist.init_process_group(backend='nccl', init_method=init_method, world_size=args.nprocs, rank=local_rank)
With the batch size still set to 8 and the learning rate still 10e-4, DDP trains faster than the single GPU. However, looking at the loss curve, it clearly takes many more epochs to converge than the single-GPU run.
I would like to ask: if the batch size stays unchanged in DDP, does the learning rate need to be increased, and if so, how?
If I want the same training behavior as the single GPU, should I set the batch size to 8 / (number of GPUs)?
I have tried setting the DDP batch size to 2, i.e. the single-GPU batch size divided by the number of GPUs, but I don't have a conclusion yet.
The batch size you pass in DDP is the batch size per GPU. With four GPUs, your effective batch size is 8 * 4 = 32. A larger batch size means fewer batches per epoch and therefore fewer gradient updates, so in your case you would need 4 epochs of DDP training to get the same number of parameter update steps as one epoch of single-GPU training. DDP averages gradients across processes, so you don't need to worry about changing the learning rate (this would not be the case if gradients were summed).
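To make the "per GPU" point concrete, here is a minimal sketch of the data-loading side, assuming init_process_group has already been called as in your snippet; train_dataset is a placeholder for your own Dataset.

import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler

world_size = dist.get_world_size()            # 4 in your setup

# batch_size here is per process/GPU, so the effective batch is 8 * 4 = 32
sampler = DistributedSampler(train_dataset)   # each rank gets a disjoint 1/world_size shard of the data
loader = DataLoader(train_dataset, batch_size=8, sampler=sampler)

# To match the single-GPU effective batch of 8, use batch_size = 8 // world_size = 2 instead:
# loader = DataLoader(train_dataset, batch_size=8 // world_size, sampler=sampler)

With the second variant you keep the same number of optimizer steps per epoch as the single-GPU run, just with the work of each step split across the 4 GPUs.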