MPI spawning in a loop leads to shared memory error


I have two Python scripts, parent.py and child.py. The parent.py script spawns child.py with the desired number of processes in a loop. A minimal working example of both files is shown below:

parent.py

import sys, psutil
from mpi4py import MPI

nprocs = 16

for i in range(200):

    print("Spawning {}".format(i+1), flush=True)

    # Spawn the child script; args is passed as a list of argument strings
    child_comm = MPI.COMM_WORLD.Spawn(sys.executable, args=["child.py"], maxprocs=nprocs)

    # Parent side of the gather: collect one PID from every child
    pid_list = child_comm.gather(None, root=MPI.ROOT)

    print("Data: {}".format(pid_list))

    # Wrap each PID in a psutil.Process handle
    pid_list = [psutil.Process(pid) for pid in pid_list]

    child_comm.Disconnect()

    # Busy-wait until all the spawned processes have terminated
    # (rebuilding the list avoids removing items while iterating over it)
    while pid_list:
        pid_list = [proc for proc in pid_list if proc.is_running()]

    print("End {}".format(i+1))

child.py

import os
from mpi4py import MPI

comm = MPI.COMM_WORLD

# Intercommunicator connecting the children back to the parent
parent_comm = MPI.Comm.Get_parent()

print("Sending the PID, rank: {}".format(comm.rank), flush=True)

# Child side of the gather: send this process's PID to the parent
parent_comm.gather(os.getpid(), root=0)

parent_comm.Disconnect()

When the number of iterations is around 100-150, it works fine. But when I run it for a larger number of iterations (> 150), I get the following error:

[1697381880.404508] [a073:2653601:0]             sys.c:915  UCX  ERROR   shmget(size=12288 flags=0x7b0) for mm_recv_fifo failed: No space left on device, please check shared memory limits by 'ipcs -l'
[1697381880.404522] [a073:2653601:0]         mm_sysv.c:114  UCX  ERROR   failed to allocate 8447 bytes with mm for mm_recv_fifo
[1697381880.404527] [a073:2653601:0]         uct_mem.c:157  UCX  ERROR   failed to allocate 8447 bytes using md sysv for mm_recv_fifo: Out of memory
[1697381880.404531] [a073:2653601:0]        mm_iface.c:781  UCX  ERROR mm_iface failed to allocate receive FIFO
pml_ucx.c:309  Error: Failed to create UCP worker
--------------------------------------------------------------------------
No components were able to be opened in the pml framework.

This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.

  Host:      a073
  Framework: pml
--------------------------------------------------------------------------
PML ucx cannot be selected

I am running this on an HPC cluster and all the processes are spawned on the same node; I do not get this error on my local system. Also, the number of iterations after which the error occurs is different every time. One possible solution is to increase the number of shared memory segments, but I cannot do that since I don't have administrator privileges. I contacted the HPC support folks and they said that the problem is in the way I am using OpenMPI.
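For reference, the shared memory limits and the currently allocated segments can at least be inspected without administrator privileges, as the UCX error message suggests (shown here from Python for convenience; it is equivalent to running ipcs -l and ipcs -m in a shell):

# Inspect System V shared memory limits and live segments (no root needed)
import subprocess
print(subprocess.run(["ipcs", "-l"], capture_output=True, text=True).stdout)
print(subprocess.run(["ipcs", "-m"], capture_output=True, text=True).stdout)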

I think the problem is the spawning of processes in a loop. Whenever a spawn occurs, I think it reserves a set of shared memory segments (I am new to OpenMPI, so I am not very sure; please correct me), and when I do this a large number of times, that memory eventually runs out. I cannot use os.system("mpirun -n 16 python child.py") in the parent.py file, since parent.py initializes MPI and becomes an MPI singleton.
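Is spawning once and keeping the children alive across iterations the intended pattern? A rough sketch of what I mean (untested, not my actual code; the per-iteration work would move into a loop inside child.py):

# Sketch of parent.py: spawn once and reuse the intercommunicator
import sys
from mpi4py import MPI

child_comm = MPI.COMM_WORLD.Spawn(sys.executable, args=["child.py"], maxprocs=16)

for i in range(200):
    child_comm.bcast(True, root=MPI.ROOT)             # tell the children: keep working
    results = child_comm.gather(None, root=MPI.ROOT)  # collect per-iteration results
    print("Iteration {}: {}".format(i + 1, results))

child_comm.bcast(False, root=MPI.ROOT)                # tell the children: exit
child_comm.Disconnect()

# Sketch of child.py: loop until the parent signals stop
import os
from mpi4py import MPI

parent_comm = MPI.Comm.Get_parent()

while parent_comm.bcast(None, root=0):
    # ... per-iteration work goes here ...
    parent_comm.gather(os.getpid(), root=0)

parent_comm.Disconnect()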

How can I resolve this error? Is there an alternative to repeated spawning (e.g. the spawn-once sketch above)? Or are there any environment variables that need to be set? Any help will be much appreciated!
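For instance, would forcing OpenMPI/UCX off the shared-memory transports avoid the segment exhaustion? A sketch of what I have in mind (untested; I am unsure of the performance cost and whether the spawned children inherit these variables):

# Untested idea: steer UCX away from the System V shared memory transport.
# These must be set before mpi4py initializes MPI at import time.
import os
os.environ["UCX_TLS"] = "tcp,self"   # restrict UCX to TCP + loopback
# ...or bypass UCX entirely via OpenMPI MCA parameters:
# os.environ["OMPI_MCA_pml"] = "ob1"
# os.environ["OMPI_MCA_btl"] = "tcp,self"

from mpi4py import MPI  # import only after the environment is set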

Additional Info:
OpenMPI: 4.1.4
mpi4py: 3.1.4
Python: 3.9
