dask sshcluster results in: RuntimeError: Cluster failed to start: Worker failed to start


I am trying to get a Dask SSHCluster (https://docs.dask.org/en/latest/deploying-ssh.html) up and running, but it fails with:

RuntimeError: Cluster failed to start: Worker failed to start

Python == 3.11.6

Dask == 2023.11.0

Distributed == 2023.11.0

Windows 11 Pro, Version 10.0.22621 Build 22621

PyCharm 2023.2.5 (Community Edition), Build #PC-232.10227.11, built on November 14, 2023

If I run the following little program:

from dask.distributed import SSHCluster, Client

cluster = SSHCluster(["localhost"])
client = Client(cluster)
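As a debugging variation (not a confirmed fix), the same cluster can be created with everything spelled out explicitly. `connect_options`, `remote_python`, `scheduler_options`, and `worker_options` are documented `SSHCluster` keyword arguments; the concrete values below are guesses for this Windows/Anaconda setup:

```python
# Sketch: same SSHCluster, but with the remote interpreter and SSH options
# made explicit so nothing is inferred on the remote side. The interpreter
# path and option values are assumptions for this particular setup.
def make_cluster_kwargs(python_path):
    """Collect the SSHCluster keyword arguments in one place for inspection."""
    return {
        "connect_options": {"known_hosts": None},  # skip host-key checks while debugging
        "remote_python": python_path,              # interpreter launched on the remote host
        "scheduler_options": {"port": 8786},
        "worker_options": {"nthreads": 2},
    }

kwargs = make_cluster_kwargs(r"D:\Users\fourb\anaconda3\envs\Py11\python.exe")

def start_cluster(hosts=("localhost", "localhost")):
    """Create the cluster; run this on the real machine (needs SSH + dask).

    The first host runs the scheduler, the remaining hosts run workers.
    """
    from dask.distributed import SSHCluster, Client
    cluster = SSHCluster(list(hosts), **kwargs)
    return Client(cluster)
```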

I get the following error message when executing it in PyCharm:

cluster=SSHCluster(["localhost"])

DEBUG:distributed.deploy.ssh:Created Scheduler Connection
WARNING:distributed.deploy.spec:Cluster closed without starting up
Traceback (most recent call last):
  File "D:\Users\fourb\anaconda3\envs\Py11\Lib\site-packages\distributed\deploy\spec.py", line 325, in _start
    self.scheduler = await self.scheduler
                     ^^^^^^^^^^^^^^^^^^^^
  File "D:\Users\fourb\anaconda3\envs\Py11\Lib\site-packages\distributed\deploy\spec.py", line 74, in _
    await self.start()
  File "D:\Users\fourb\anaconda3\envs\Py11\Lib\site-packages\distributed\deploy\ssh.py", line 250, in start
    raise Exception(
Exception: Scheduler failed to set DASK_INTERNAL_INHERIT_CONFIG variable

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "D:\Users\fourb\anaconda3\envs\Py11\Lib\site-packages\distributed\utils.py", line 408, in f
    result = yield future
             ^^^^^^^^^^^^
  File "D:\Users\fourb\anaconda3\envs\Py11\Lib\site-packages\tornado\gen.py", line 767, in run
    value = future.result()
            ^^^^^^^^^^^^^^^
  File "D:\Users\fourb\anaconda3\envs\Py11\Lib\site-packages\distributed\deploy\spec.py", line 335, in _start
    raise RuntimeError(f"Cluster failed to start: {e}") from e
RuntimeError: Cluster failed to start: Scheduler failed to set DASK_INTERNAL_INHERIT_CONFIG variable

Here is a hint on how to solve this issue: changing `cmd /c ver` to `cmd.exe /c ver` in row 244 of ssh.py, from https://github.com/dask/distributed/issues/5411
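To see which probe command actually works over SSH on this box, both variants can be run by hand. This is a sketch that assumes an OpenSSH client on PATH; the command strings mirror the GitHub issue, not dask's internals verbatim:

```python
# Sketch: run a command and capture its exit code and stdout, to compare what
# "cmd /c ver" and "cmd.exe /c ver" return over ssh (assumes `ssh` is on PATH).
import subprocess

def run_capture(argv, timeout=30):
    """Run argv and return (exit_code, stripped stdout)."""
    result = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    return result.returncode, result.stdout.strip()

if __name__ == "__main__":
    import shutil
    if shutil.which("ssh"):
        ssh_opts = ["-o", "BatchMode=yes", "-o", "ConnectTimeout=5"]
        for cmd in ("cmd /c ver", "cmd.exe /c ver"):
            print(cmd, "->", run_capture(["ssh", *ssh_opts, "localhost", cmd]))
```

If the patched `cmd.exe /c ver` prints a Windows version string with exit code 0, the scheduler-side probe should succeed.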

I changed that in ssh.py. After that, I ran into the following error:

DEBUG:distributed.deploy.ssh:Created Scheduler Connection
INFO:distributed.deploy.ssh:Traceback (most recent call last):
INFO:distributed.deploy.ssh:File "<frozen runpy>", line 198, in _run_module_as_main
INFO:distributed.deploy.ssh:File "<frozen runpy>", line 88, in _run_code
INFO:distributed.deploy.ssh:File "D:\Users\fourb\anaconda3\envs\Py11\Lib\site-packages\distributed\cli\dask_spec.py", line 67, in <module>
INFO:distributed.deploy.ssh:main()
INFO:distributed.deploy.ssh:File "D:\Users\fourb\anaconda3\envs\Py11\Lib\site-packages\click\core.py", line 1157, in __call__
INFO:distributed.deploy.ssh:return self.main(*args, **kwargs)
INFO:distributed.deploy.ssh:^^^^^^^^^^^^^^^^^^^^^^^^^^
INFO:distributed.deploy.ssh:File "D:\Users\fourb\anaconda3\envs\Py11\Lib\site-packages\click\core.py", line 1078, in main
INFO:distributed.deploy.ssh:rv = self.invoke(ctx)
INFO:distributed.deploy.ssh:^^^^^^^^^^^^^^^^
INFO:distributed.deploy.ssh:File "D:\Users\fourb\anaconda3\envs\Py11\Lib\site-packages\click\core.py", line 1434, in invoke
INFO:distributed.deploy.ssh:return ctx.invoke(self.callback, **ctx.params)
INFO:distributed.deploy.ssh:^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
INFO:distributed.deploy.ssh:File "D:\Users\fourb\anaconda3\envs\Py11\Lib\site-packages\click\core.py", line 783, in invoke
INFO:distributed.deploy.ssh:return __callback(*args, **kwargs)
INFO:distributed.deploy.ssh:^^^^^^^^^^^^^^^^^^^^^^^^^^^
INFO:distributed.deploy.ssh:File "D:\Users\fourb\anaconda3\envs\Py11\Lib\site-packages\distributed\cli\dask_spec.py", line 33, in main
INFO:distributed.deploy.ssh:spec.update(json.loads(spec))
INFO:distributed.deploy.ssh:^^^^^^^^^^^^^^^^
INFO:distributed.deploy.ssh:File "D:\Users\fourb\anaconda3\envs\Py11\Lib\json\__init__.py", line 346, in loads
INFO:distributed.deploy.ssh:return _default_decoder.decode(s)
INFO:distributed.deploy.ssh:^^^^^^^^^^^^^^^^^^^^^^^^^^
INFO:distributed.deploy.ssh:File "D:\Users\fourb\anaconda3\envs\Py11\Lib\json\decoder.py", line 337, in decode
INFO:distributed.deploy.ssh:obj, end = self.raw_decode(s, idx=_w(s, 0).end())
INFO:distributed.deploy.ssh:^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
INFO:distributed.deploy.ssh:File "D:\Users\fourb\anaconda3\envs\Py11\Lib\json\decoder.py", line 355, in raw_decode
INFO:distributed.deploy.ssh:raise JSONDecodeError("Expecting value", s, err.value) from None
INFO:distributed.deploy.ssh:json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
WARNING:distributed.deploy.spec:Cluster closed without starting up
Traceback (most recent call last):
  File "D:\Users\fourb\anaconda3\envs\Py11\Lib\site-packages\distributed\deploy\spec.py", line 325, in _start
    self.scheduler = await self.scheduler
                     ^^^^^^^^^^^^^^^^^^^^
  File "D:\Users\fourb\anaconda3\envs\Py11\Lib\site-packages\distributed\deploy\spec.py", line 74, in _
    await self.start()
  File "D:\Users\fourb\anaconda3\envs\Py11\Lib\site-packages\distributed\deploy\ssh.py", line 270, in start
    raise Exception("Worker failed to start")
Exception: Worker failed to start

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "D:\Users\fourb\anaconda3\envs\Py11\Lib\site-packages\distributed\utils.py", line 408, in f
    result = yield future
             ^^^^^^^^^^^^
  File "D:\Users\fourb\anaconda3\envs\Py11\Lib\site-packages\tornado\gen.py", line 767, in run
    value = future.result()
            ^^^^^^^^^^^^^^^
  File "D:\Users\fourb\anaconda3\envs\Py11\Lib\site-packages\distributed\deploy\spec.py", line 335, in _start
    raise RuntimeError(f"Cluster failed to start: {e}") from e
RuntimeError: Cluster failed to start: Worker failed to start
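For what it's worth, the JSONDecodeError above comes from `dask_spec.py` trying to parse the worker spec it received on the command line; if the remote Windows shell mangles or drops the quoted JSON argument, `json.loads` sees something that is not JSON at all. A minimal illustration (the payload strings are made up, not the exact spec dask sends):

```python
import json

# A well-formed spec parses fine.
intact = '{"cls": "dask.distributed.Worker", "opts": {"nthreads": 2}}'
assert json.loads(intact)["opts"]["nthreads"] == 2

# If shell quoting eats the argument entirely (empty string) or strips the
# JSON punctuation, json.loads fails exactly like in the log above.
for mangled in ("", "cls=dask.distributed.Worker"):
    try:
        json.loads(mangled)
    except json.JSONDecodeError as e:
        print(e)  # Expecting value: line 1 column 1 (char 0)
```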

Do you have any hints on how to solve this issue? SSH is configured for passwordless (key-based) login.

If I start the cluster from the command line (https://docs.dask.org/en/latest/deploying-cli.html), it works:

(Py11) C:\Users\fourb>dask scheduler

-----------------------------------------------
WARNING:bokeh.server.util:Host wildcard '*' will allow connections originating from multiple (or possibly all) hostnames or IPs. Use non-wildcard values to restrict access explicitly
State start
Clear task state
  Scheduler at: tcp://192.168.178.69:8786
  dashboard at: http://192.168.178.69:8787/status
Registering Worker plugin shuffle
Register worker <WorkerState 'tcp://192.168.178.69:60357', status: init, memory: 0, processing: 0>
Starting worker compute stream, tcp://192.168.178.69:60357
Worker status init -> running - <WorkerState 'tcp://192.168.178.69:60357', status: running, memory: 0, processing: 0>

(Py11) C:\Users\fourb>dask worker tcp://192.168.178.69:8786


WARNING:bokeh.server.util:Host wildcard '*' will allow connections originating from multiple (or possibly all) hostnames or IPs. Use non-wildcard values to restrict access explicitly
       Start worker at: tcp://192.168.178.69:60357
         Listening to: tcp://192.168.178.69:60357
         dashboard at: 192.168.178.69:60358
Waiting to connect to: tcp://192.168.178.69:8786


              Threads:                         24
               Memory:                  31.92 GiB
      Local Directory: C:\Users\fourb\AppData\Local\Temp\dask-scratch-space\worker-z6d6ss4p

-------------------------------------------------


Starting Worker plugin shuffle 
Registered to:  tcp://192.168.178.69:8786
Heartbeat: tcp://192.168.178.69:60357
Heartbeat: tcp://192.168.178.69:60357
Heartbeat: tcp://192.168.178.69:60357
Heartbeat: tcp://192.168.178.69:60357
Heartbeat: tcp://192.168.178.69:60357
Heartbeat: tcp://192.168.178.69:60357
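Since the manually started scheduler and worker run fine, one interim workaround is to attach a `Client` directly to that running scheduler and skip `SSHCluster` altogether. The address below is the one printed by `dask scheduler` above and must be adjusted per machine:

```python
# Sketch: connect to an already-running scheduler instead of spawning one.
def scheduler_address(host, port=8786):
    """Build the tcp:// address string that Client accepts."""
    return f"tcp://{host}:{port}"

def connect(host="192.168.178.69"):
    """Attach to the running scheduler (requires dask and a live scheduler)."""
    from dask.distributed import Client
    return Client(scheduler_address(host))

# Usage, on the machine where the scheduler from the log above is running:
#   client = connect()
#   client.submit(sum, [1, 2, 3]).result()
```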

If I try the following:

(Py11) C:\Users\fourb>dask-ssh localhost

D:\Users\fourb\anaconda3\envs\Py11\Lib\site-packages\distributed\cli\dask_ssh.py:150: FutureWarning: dask-ssh is deprecated and will be removed in a future release; use `dask ssh` instead
warnings.warn(

---------------------------------------------------------------

                 Dask.distributed v2023.11.0


Worker nodes:
  0: localhost

scheduler node: localhost:8786

[ worker localhost ] : D:\Users\fourb\anaconda3\envs\Py11\python.exe -m distributed.cli.dask_worker localhost:8786 --nthreads 0 --host localhost --memory-limit auto
[ scheduler localhost:8786 ] : D:\Users\fourb\anaconda3\envs\Py11\python.exe -m distributed.cli.dask_scheduler --port 8786
Exception in thread Thread-2 (async_ssh):
Traceback (most recent call last):
File "D:\Users\fourb\anaconda3\envs\Py11\Lib\threading.py", line 1045, in _bootstrap_inner
self.run()
File "D:\Users\fourb\anaconda3\envs\Py11\Lib\threading.py", line 982, in run
self._target(*self._args, **self._kwargs)
File "D:\Users\fourb\anaconda3\envs\Py11\Lib\site-packages\distributed\deploy\old_ssh.py", line 198, in async_ssh
channel.send(b"\x03")  # Ctrl-C
^^^^^^^^^^^^^^^^^^^^^
File "D:\Users\fourb\anaconda3\envs\Py11\Lib\site-packages\paramiko\channel.py", line 799, in send
return self._send(s, m)
^^^^^^^^^^^^^^^^
File "D:\Users\fourb\anaconda3\envs\Py11\Lib\site-packages\paramiko\channel.py", line 1196, in _send
raise socket.error("Socket is closed")
OSError: Socket is closed
