Dask: Would storage network speed cause a worker to die

250 Views Asked by schierkolk At 19 February 2021 at 12:31

I am running a process that writes large files across the storage network. When I run the process in a simple loop, I get no failures. When I run it with distributed and jobqueue during off-peak hours, no workers fail. However, when I run the same command during peak hours, workers start killing themselves.

I have ample memory for the task and plenty of workers, so I am not sitting in a queue.

The error logs usually contain a bunch of warnings about exceeding garbage collection limits, followed by a worker being killed with signal 9.


There is 1 best solution below

mdurant On 19 February 2021 at 16:56

Signal 9 (SIGKILL) suggests that the process has violated some system limit, not that Dask has decided the worker should die. Since this only happens under high disk I/O at busy times, I agree that the network storage is the likely culprit: for example, a lot of writes have been buffered but are not being flushed fast enough over the relatively low bandwidth.

Dask also uses local storage for temporary files, and "local" might actually be the network storage. If you have real local disks on the nodes, you should use those; if not, consider turning off disk spilling altogether. See https://docs.dask.org/en/latest/setup/hpc.html#local-storage
