import torch
import os

torch.distributed.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])

if local_rank > 0:
    torch.distributed.barrier()

print(f"Entered process {local_rank}")

if local_rank == 0:
    torch.distributed.barrier()
The above code hangs forever, but if I remove both torch.distributed.barrier() calls, both print statements get executed.
On the command line I launch it with torchrun --nnodes=1 --nproc_per_node 2 test.py, where test.py is the name of the script above.
I tried the above code with and without the torch.distributed.barrier() calls. With the barrier() statements, I expected the print to run for one GPU and the script to exit -- it did not behave as expected. Without the barrier() statements, I expected both processes to print -- which behaves as expected.
Am I missing something here?
It is better to put your multiprocessing initialization code inside an if __name__ == "__main__": guard to avoid endless process generation, and to re-design the control flow to fit your purpose:
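Here is a minimal sketch of that structure, assuming the intent is the common "rank 0 does its work first, every rank then meets at a single shared barrier" pattern; the main() wrapper, the torch.cuda.set_device call, and destroy_process_group are illustrative additions and not part of the question's script:

import os
import torch
import torch.distributed as dist

def main():
    local_rank = int(os.environ["LOCAL_RANK"])
    # Pin each process to its own GPU before running any NCCL collective;
    # leaving this out is a common cause of hangs (assumption, not from the answer).
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    if local_rank == 0:
        print(f"Rank {local_rank} does its one-time setup here")

    # Every rank reaches the same single barrier, so no process is left waiting.
    dist.barrier()
    print(f"Entered process {local_rank}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

Launched the same way (torchrun --nnodes=1 --nproc_per_node 2 test.py), each rank calls barrier() exactly once, so the collective can complete instead of one rank waiting on a barrier the other never reaches.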