PyTorch multi-node training returns TCPStore( RuntimeError: Address already in use


I am training a network on 2 machines, each with two GPUs. I have checked the port number used to connect the two machines to each other, but I get an error every time.

How to find the port number?

sudo lsof -i :22 | grep LISTEN

sshd    2101    root    3u  IPv4  57356      0t0  TCP *:ssh (LISTEN)
sshd    2101    root    4u  IPv6  57358      0t0  TCP *:ssh (LISTEN)
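A note on reading this output: in lsof's default column layout the listening port appears in the last (NAME) column, here *:ssh, which is port 22 and already owned by sshd. The value 57356 sits in the DEVICE column and is not a port number. As a minimal sketch of one way to pick a port for --dist-url, the operating system can be asked for a currently unused TCP port by binding to port 0. The helper name find_free_port is hypothetical and not part of the training script, and the port is only known to be free at the moment of the call.

```python
import socket


def find_free_port(host: str = "") -> int:
    """Return a TCP port that is unused on this machine right now.

    Hypothetical helper for illustration. Binding to port 0 lets the OS
    choose any free port; another process could still take the port
    between this call and the start of training.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # port 0 means "let the OS choose"
        return sock.getsockname()[1]  # (address, port) -> port


if __name__ == "__main__":
    # Run this on the machine that will host rank 0, then pass the printed
    # port in --dist-url on every machine.
    print(find_free_port())
```

The same port must then be used in --dist-url on both machines, together with the IP address of the machine that runs rank 0.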

Script

python imagenet_multi_node.py -a resnet50 --dist-url tcp://10.246.246.22:57356 --dist-backend 'nccl' --multiprocessing-distributed --world-size 2 --rank 0 -b 128 /home2/coremax/Documents/ILSVRC/Data/CLS-LOC/
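With a tcp:// init method, the process with rank 0 hosts the TCPStore and every other process connects to it as a client, so each process needs a unique rank. One possible cause of "Address already in use" is two processes on the same machine both ending up with rank 0, another is a leftover process from an earlier run still holding the port. The sketch below shows the usual mapping from node rank and local GPU index to a global rank when one process is spawned per GPU. It assumes imagenet_multi_node.py is meant to follow that pattern; the function names are hypothetical and whether the script actually does this cannot be confirmed from the question.

```python
def global_rank(node_rank: int, gpus_per_node: int, local_gpu: int) -> int:
    """Unique rank of one GPU process across all machines.

    node_rank is the value passed as --rank on each machine (0 or 1 here);
    local_gpu is the process index that mp.spawn passes to the worker.
    """
    return node_rank * gpus_per_node + local_gpu


def total_world_size(num_nodes: int, gpus_per_node: int) -> int:
    """Total number of processes when one process runs per GPU."""
    return num_nodes * gpus_per_node


if __name__ == "__main__":
    # The setup from the question: 2 machines with 2 GPUs each.
    num_nodes, gpus_per_node = 2, 2
    for node in range(num_nodes):
        for gpu in range(gpus_per_node):
            print(f"node {node}, gpu {gpu} -> rank "
                  f"{global_rank(node, gpus_per_node, gpu)}")
    print("world size:", total_world_size(num_nodes, gpus_per_node))
```

Under this mapping, only GPU 0 on the machine started with --rank 0 receives global rank 0 and binds the port; the second machine would be started with --rank 1.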

Traceback:

Use GPU: 1 for training
Use GPU: 0 for training
Traceback (most recent call last):
  File "/home2/coremax/Documents/GridMask/imagenet_grid/imagenet_multi_node.py", line 511, in <module>
    main()
  File "/home2/coremax/Documents/GridMask/imagenet_grid/imagenet_multi_node.py", line 117, in main
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
  File "/home2/coremax/anaconda3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home2/coremax/anaconda3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home2/coremax/anaconda3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home2/coremax/anaconda3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home2/coremax/Documents/GridMask/imagenet_grid/imagenet_multi_node.py", line 137, in main_worker
    dist.init_process_group(backend=args.dist_backend, init_method=args.dist_url,
  File "/home2/coremax/anaconda3/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 576, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home2/coremax/anaconda3/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 183, in _tcp_rendezvous_handler
    store = _create_c10d_store(result.hostname, result.port, rank, world_size, timeout)
  File "/home2/coremax/anaconda3/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 157, in _create_c10d_store
    return TCPStore(
RuntimeError: Address already in use