How can I move tensors from one GPU to another in training_step of pl.LightningModule?
In a plain PyTorch pipeline I use the following function to exchange tensors between neighbouring ranks during multi-GPU training:
import torch
import torch.distributed


def neighbour_exchange_bidir(left_rank, right_rank, tensor_to_left, tensor_to_right, group=None):
    # Receive buffers: what arrives from the left has the shape of what we send right, and vice versa.
    tensor_from_left = torch.zeros_like(tensor_to_right)
    tensor_from_right = torch.zeros_like(tensor_to_left)
    # Non-blocking sends to both neighbours.
    send_op_left = torch.distributed.P2POp(
        torch.distributed.isend,
        tensor_to_left,
        left_rank,
        group=group,
    )
    send_op_right = torch.distributed.P2POp(
        torch.distributed.isend,
        tensor_to_right,
        right_rank,
        group=group,
    )
    # Matching non-blocking receives from both neighbours.
    recv_op_left = torch.distributed.P2POp(
        torch.distributed.irecv,
        tensor_from_left,
        left_rank,
        group=group,
    )
    recv_op_right = torch.distributed.P2POp(
        torch.distributed.irecv,
        tensor_from_right,
        right_rank,
        group=group,
    )
    # Launch all four operations as one batch, then wait for each to complete.
    reqs = torch.distributed.batch_isend_irecv([send_op_right, send_op_left, recv_op_right, recv_op_left])
    for req in reqs:
        req.wait()
    return tensor_from_right, tensor_from_left
However, it's not clear how to use it in Lightning, as I need the devices' ids to move the tensors, and I found no examples of using torch.distributed.P2POp with Lightning.
It looks like self.trainer.local_rank, self.trainer.global_rank, and self.trainer.world_size may help. I will reply myself as soon as I have tried it.
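For anyone landing here, a minimal sketch of how those rank attributes could feed the function above. It assumes a ring topology (each rank's neighbours are rank - 1 and rank + 1, wrapping around) and that Lightning's DDP strategy has already initialised the default process group. `ring_neighbours` is a hypothetical helper, not part of Lightning or PyTorch, and the commented `training_step` is untested.

```python
from __future__ import annotations


def ring_neighbours(rank: int, world_size: int) -> tuple[int, int]:
    """Return (left_rank, right_rank) of `rank` in a ring of `world_size` processes."""
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    # Modulo wraps the ends of the ring: rank 0's left neighbour is the last rank.
    left = (rank - 1) % world_size
    right = (rank + 1) % world_size
    return left, right


# Hypothetical use inside a LightningModule (assumption, not verified):
#
#     def training_step(self, batch, batch_idx):
#         left, right = ring_neighbours(self.trainer.global_rank,
#                                       self.trainer.world_size)
#         feats = self(batch)  # lives on this process's own device
#         from_right, from_left = neighbour_exchange_bidir(left, right, feats, feats)
```

The peer argument of torch.distributed.P2POp is a rank in the process group rather than a device id, so the global rank is likely the value to use here. Each process keeps its tensors on its own device and the backend carries out the transfer. Note that with only two processes both neighbours are the same rank.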