I'm using PyTorch DDP on SageMaker PyTorch Training DLC 1.8.1 The code seems properly DDP-formatted. I'm using instance_count = 2, and launching torch.distributed.launch and I believe the ranks and world size are properly set however the dist.init_process_group waits and times out
RuntimeError: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00)
What could go wrong? machines not networked together?
This is usually something to do with the way local_rank is retrieved and used during initialization. Please refer to the below example and see if you can figure out the difference.
https://github.com/aruncs2005/pytorch-ddp-sagemaker-example