I ran the basics example 7.
It seems that a TCP error is raised at the end of the run, when the receiving end of the connection hangs up (which is expected at that point). This may have been introduced by the fix for `[WARNING][dask_runner.py:135] No workers are available. This could mean workers crashed.`, i.e.
try:
    self._client.wait_for_workers(n_workers=1, timeout=1200)
except asyncio.exceptions.TimeoutError as error:
    logger.debug(f"No worker could be scheduled in time after {self._worker_timeout}s on the cluster. "
                 "Try increasing `worker_timeout`.")
    raise error
The problem is that this may obscure real problems. Full traceback:
Traceback (most recent call last):
File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/comm/tcp.py", line 491, in connect
stream = await self.client.connect(
File "/home/username/.python3.10.6/lib/python3.10/site-packages/tornado/tcpclient.py", line 282, in connect
af, addr, stream = await connector.start(connect_timeout=timeout)
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/username/.python3.10.6/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
return fut.result()
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/comm/core.py", line 343, in connect
comm = await wait_for(
File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/utils.py", line 1920, in wait_for
return await asyncio.wait_for(fut, timeout)
File "/home/username/.python3.10.6/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/username/SMAC3_wrapped/example7.py", line 119, in <module>
incumbent = smac.optimize()
File "/home/username/.python3.10.6/lib/python3.10/site-packages/smac/facade/abstract_facade.py", line 319, in optimize
incumbents = self._optimizer.optimize(data_to_scatter=data_to_scatter)
File "/home/username/.python3.10.6/lib/python3.10/site-packages/smac/main/smbo.py", line 304, in optimize
self._runner.submit_trial(trial_info=trial_info, **dask_data_to_scatter)
File "/home/username/.python3.10.6/lib/python3.10/site-packages/smac/runner/dask_runner.py", line 135, in submit_trial
if self.count_available_workers() <= 0:
File "/home/username/.python3.10.6/lib/python3.10/site-packages/smac/runner/dask_runner.py", line 182, in count_available_workers
return sum(self._client.nthreads().values()) - len(self._pending_trials)
File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/client.py", line 3919, in nthreads
return self.sync(self.scheduler.ncores, workers=workers, **kwargs)
File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/utils.py", line 359, in sync
return sync(
File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/utils.py", line 426, in sync
raise exc.with_traceback(tb)
File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/utils.py", line 399, in f
result = yield future
File "/home/username/.python3.10.6/lib/python3.10/site-packages/tornado/gen.py", line 767, in run
value = future.result()
File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/core.py", line 1344, in send_recv_from_rpc
comm = await self.pool.connect(self.addr)
File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/core.py", line 1626, in connect
return connect_attempt.result()
File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/core.py", line 1516, in _connect
comm = await connect(
File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/comm/core.py", line 369, in connect
raise OSError(
OSError: Timed out trying to connect to tcp://....166.214:34417 after 30 s
msg: {'status': 'OK', 'time': 1694708544.0040095, 'heartbeat-interval': 0.5}
msg: {'status': 'OK', 'time': 1694708550.0110245, 'heartbeat-interval': 0.5}
With or without the `sleep(6)`, there is a disconnect.
err:
cat tmp/smac_dask_slurm/dask-worker-9277642.err
cpu-bind=MASK - dlcgpu26, task 0 0 [25985]: mask 0x8000000080 set
2023-09-14 18:53:42,082 - distributed.nanny - INFO - Start Nanny at: 'tcp://....166.186:34089'
2023-09-14 18:53:42,662 - distributed.worker - INFO - Start worker at: tcp://....166.186:38039
2023-09-14 18:53:42,662 - distributed.worker - INFO - Listening to: tcp://....166.186:38039
2023-09-14 18:53:42,662 - distributed.worker - INFO - Worker name: SLURMCluster-3
2023-09-14 18:53:42,662 - distributed.worker - INFO - dashboard at: ....166.186:46501
2023-09-14 18:53:42,662 - distributed.worker - INFO - Waiting to connect to: tcp://....166.214:33319
2023-09-14 18:53:42,662 - distributed.worker - INFO - -------------------------------------------------
2023-09-14 18:53:42,662 - distributed.worker - INFO - Threads: 1
2023-09-14 18:53:42,662 - distributed.worker - INFO - Memory: 0.93 GiB
2023-09-14 18:53:42,662 - distributed.worker - INFO - Local Directory: /tmp/dask-scratch-space/worker-ls7s1wou
2023-09-14 18:53:42,662 - distributed.worker - INFO - -------------------------------------------------
2023-09-14 18:53:42,669 - distributed.worker - INFO - Starting Worker plugin shuffle
2023-09-14 18:53:42,670 - distributed.worker - INFO - Registered to: tcp://....166.214:33319
2023-09-14 18:53:42,670 - distributed.worker - INFO - -------------------------------------------------
2023-09-14 18:53:42,670 - distributed.core - INFO - Starting established connection to tcp://....166.214:33319
slurmstepd-dlcgpu26: error: *** JOB 9277642 ON dlcgpu26 CANCELLED AT 2023-09-14T18:54:25 ***
2023-09-14 18:54:25,422 - distributed.worker - INFO - Stopping worker at tcp://....166.186:38039. Reason: scheduler-close
2023-09-14 18:54:25,423 - distributed.batched - INFO - Batched Comm Closed <TCP (closed) Worker->Scheduler local=tcp://....166.186:54362 remote=tcp://....166.214:33319>
Traceback (most recent call last):
File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/comm/tcp.py", line 317, in write
raise StreamClosedError()
tornado.iostream.StreamClosedError: Stream is closed
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/batched.py", line 115, in _background_send
nbytes = yield coro
File "/home/username/.python3.10.6/lib/python3.10/site-packages/tornado/gen.py", line 767, in run
value = future.result()
File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/comm/tcp.py", line 328, in write
convert_stream_closed_error(self, e)
File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/comm/tcp.py", line 143, in convert_stream_closed_error
raise CommClosedError(f"in {obj}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) Worker->Scheduler local=tcp://....166.186:54362 remote=tcp://....166.214:33319>: Stream is closed
2023-09-14 18:54:25,431 - distributed.nanny - INFO - Closing Nanny gracefully at 'tcp://....166.186:34089'. Reason: scheduler-close
2023-09-14 18:54:25,442 - distributed.core - INFO - Received 'close-stream' from tcp://....166.214:33319; closing.
2023-09-14 18:54:25,443 - distributed.nanny - INFO - Worker closed
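For what it's worth, below is a rough sketch (not the actual `dask_runner.py` code; the standalone helper, its signature, and the caught exception types are my assumptions) of how the availability check could treat an unreachable scheduler as "no capacity" instead of letting the raw TCP timeout propagate out of `submit_trial`:

import asyncio

from distributed import Client
from distributed.comm.core import CommClosedError

def count_available_workers(client: Client, n_pending_trials: int) -> int:
    # Hypothetical variant of the check in count_available_workers, hardened
    # against a scheduler that has already gone away (e.g. the SLURM job was
    # cancelled during shutdown).
    try:
        n_threads = sum(client.nthreads().values())
    except (OSError, CommClosedError, asyncio.TimeoutError):
        # Connection to the scheduler is gone; report no free workers instead
        # of surfacing "OSError: Timed out trying to connect to tcp://..." to
        # the user.
        return 0
    return n_threads - n_pending_trials

Whether to swallow the error like this, or to re-raise it whenever the run is not actually shutting down, is of course a design decision for the maintainers.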