import torch
import os

torch.distributed.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])

if local_rank > 0:
    torch.distributed.barrier()

print(f"Entered process {local_rank}")

if local_rank == 0:
    torch.distributed.barrier()
The above code hangs forever, but if I remove both torch.distributed.barrier() calls, both print statements get executed.
On the command line I launch it with torchrun --nnodes=1 --nproc_per_node 2 test.py, where test.py is the name of the script above.
I tried the above code with and without the torch.distributed.barrier() calls. With the barrier() statements, I expected the print to run for one GPU and the script to exit -- it did not behave as expected. Without the barrier() statements, I expected both processes to print -- which behaves as expected.
Am I missing something here?
It is better to put your multiprocessing initialization code inside an if __name__ == "__main__": guard to avoid endless process generation, and to re-design the control flow to fit your purpose:
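Here is a minimal sketch of that structure, assuming the intent is the common "rank 0 does its work first, every rank then meets at a single shared barrier" pattern; the main() wrapper, the torch.cuda.set_device call, and destroy_process_group are illustrative additions and not part of the question's script:

import os
import torch
import torch.distributed as dist

def main():
    local_rank = int(os.environ["LOCAL_RANK"])
    # Pin each process to its own GPU before running any NCCL collective;
    # leaving this out is a common cause of hangs (assumption, not from the answer).
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    if local_rank == 0:
        print(f"Rank {local_rank} does its one-time setup here")

    # Every rank reaches the same single barrier, so no process is left waiting.
    dist.barrier()
    print(f"Entered process {local_rank}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

Launched the same way (torchrun --nnodes=1 --nproc_per_node 2 test.py), each rank calls barrier() exactly once, so the collective can complete instead of one rank waiting on a barrier the other never reaches.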