Error: non-fatal temporary exhaustion of send tid dma descriptors for MPI run on a cluster


I have set up a job for an MPI application on a cluster. It works fine with up to 240 processors, but with 480 processors or more I get an error whose root cause I can't identify. As soon as the first communication starts, every processor reports the following message:

Non-fatal temporary exhaustion of send tid dma descriptors (elapsed=1023.354s, source LID=0x5f1/context=5, count=880767386) (err=0)

The job is configured to run on 10 nodes, each with 48 cores.
