tcp communication error CancelledError #1073

Open
mens-artis opened this issue Sep 14, 2023 · 2 comments

mens-artis commented Sep 14, 2023

I ran basics example 7. It seems there is a TCP error at the end, when the receiving end of the connection hangs up (that part is expected). This may have been introduced by the fix for `[WARNING][dask_runner.py:135] No workers are available. This could mean workers crashed.`, i.e.:

try:
    self._client.wait_for_workers(n_workers=1, timeout=1200)
except asyncio.exceptions.TimeoutError as error:
    logger.debug(f"No worker could be scheduled in time after {self._worker_timeout}s on the cluster. "
                 "Try increasing `worker_timeout`.")
    raise error

The problem is that this handling may obscure real problems: the debug hint points at `worker_timeout` even when the underlying failure is a connection error.
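For illustration, here is a sketch (hypothetical, not SMAC's actual code; `wait_for_workers_checked` is a made-up name) of how the handler could probe the scheduler before re-raising, so that connection-level failures are not reported as a slow worker start-up:

import asyncio

from distributed import Client

def wait_for_workers_checked(client: Client, timeout: float) -> None:
    # Hypothetical wrapper around wait_for_workers(): if the wait times
    # out, first check whether the scheduler is reachable at all, so a
    # dead connection is not folded into the "increase worker_timeout" hint.
    try:
        client.wait_for_workers(n_workers=1, timeout=timeout)
    except asyncio.exceptions.TimeoutError:
        try:
            # scheduler_info() round-trips to the scheduler; if this also
            # fails, the connection itself is the problem.
            client.scheduler_info()
        except OSError as comm_error:
            raise RuntimeError(
                "Scheduler unreachable while waiting for workers."
            ) from comm_error
        raise  # no worker came up in time, but the scheduler is fine

The traceback I get: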

Traceback (most recent call last):
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/comm/tcp.py", line 491, in connect
    stream = await self.client.connect(
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/tornado/tcpclient.py", line 282, in connect
    af, addr, stream = await connector.start(connect_timeout=timeout)
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/username/.python3.10.6/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
    return fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/comm/core.py", line 343, in connect
    comm = await wait_for(
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/utils.py", line 1920, in wait_for
    return await asyncio.wait_for(fut, timeout)
  File "/home/username/.python3.10.6/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/username/SMAC3_wrapped/example7.py", line 119, in <module>
    incumbent = smac.optimize()
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/smac/facade/abstract_facade.py", line 319, in optimize
    incumbents = self._optimizer.optimize(data_to_scatter=data_to_scatter)
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/smac/main/smbo.py", line 304, in optimize
    self._runner.submit_trial(trial_info=trial_info, **dask_data_to_scatter)
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/smac/runner/dask_runner.py", line 135, in submit_trial
    if self.count_available_workers() <= 0:
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/smac/runner/dask_runner.py", line 182, in count_available_workers
    return sum(self._client.nthreads().values()) - len(self._pending_trials)
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/client.py", line 3919, in nthreads
    return self.sync(self.scheduler.ncores, workers=workers, **kwargs)
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/utils.py", line 359, in sync
    return sync(
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/utils.py", line 426, in sync
    raise exc.with_traceback(tb)
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/utils.py", line 399, in f
    result = yield future
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/tornado/gen.py", line 767, in run
    value = future.result()
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/core.py", line 1344, in send_recv_from_rpc
    comm = await self.pool.connect(self.addr)
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/core.py", line 1626, in connect
    return connect_attempt.result()
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/core.py", line 1516, in _connect
    comm = await connect(
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/comm/core.py", line 369, in connect
    raise OSError(
OSError: Timed out trying to connect to tcp://....166.214:34417 after 30 s
msg: {'status': 'OK', 'time': 1694708544.0040095, 'heartbeat-interval': 0.5}
msg: {'status': 'OK', 'time': 1694708550.0110245, 'heartbeat-interval': 0.5}
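As far as I understand, the 30 s in the OSError is distributed's comm connect timeout (the `distributed.comm.timeouts.connect` setting, 30 s by default). If the scheduler were merely slow, it could be raised via dask.config before the cluster is created; a minimal sketch:

import dask

# Raise the comm connect timeout from the default 30 s to 60 s; this only
# papers over slowness, it does not fix a scheduler that has gone away.
dask.config.set({"distributed.comm.timeouts.connect": "60s"})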

With or without the `sleep(6)`, there is a disconnect.
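For what it's worth, here is a minimal sketch (assuming dask_jobqueue's SLURMCluster as in the example; the resource parameters are placeholders) of an explicit shutdown order that should make a deliberate teardown distinguishable from a crash:

from dask_jobqueue import SLURMCluster
from distributed import Client

cluster = SLURMCluster(cores=1, memory="1GB", walltime="00:30:00")  # placeholder resources
cluster.scale(jobs=1)
client = Client(cluster)
try:
    ...  # run smac.optimize() here
finally:
    client.close()   # close comms to the scheduler first
    cluster.close()  # then tear down the SLURM worker jobs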

err:

cat tmp/smac_dask_slurm/dask-worker-9277642.err
cpu-bind=MASK - dlcgpu26, task  0  0 [25985]: mask 0x8000000080 set
2023-09-14 18:53:42,082 - distributed.nanny - INFO -         Start Nanny at: 'tcp://....166.186:34089'
2023-09-14 18:53:42,662 - distributed.worker - INFO -       Start worker at:   tcp://....166.186:38039
2023-09-14 18:53:42,662 - distributed.worker - INFO -          Listening to:   tcp://....166.186:38039
2023-09-14 18:53:42,662 - distributed.worker - INFO -           Worker name:             SLURMCluster-3
2023-09-14 18:53:42,662 - distributed.worker - INFO -          dashboard at:         ....166.186:46501
2023-09-14 18:53:42,662 - distributed.worker - INFO - Waiting to connect to:   tcp://....166.214:33319
2023-09-14 18:53:42,662 - distributed.worker - INFO - -------------------------------------------------
2023-09-14 18:53:42,662 - distributed.worker - INFO -               Threads:                          1
2023-09-14 18:53:42,662 - distributed.worker - INFO -                Memory:                   0.93 GiB
2023-09-14 18:53:42,662 - distributed.worker - INFO -       Local Directory: /tmp/dask-scratch-space/worker-ls7s1wou
2023-09-14 18:53:42,662 - distributed.worker - INFO - -------------------------------------------------
2023-09-14 18:53:42,669 - distributed.worker - INFO - Starting Worker plugin shuffle
2023-09-14 18:53:42,670 - distributed.worker - INFO -         Registered to:   tcp://....166.214:33319
2023-09-14 18:53:42,670 - distributed.worker - INFO - -------------------------------------------------
2023-09-14 18:53:42,670 - distributed.core - INFO - Starting established connection to tcp://....166.214:33319
slurmstepd-dlcgpu26: error: *** JOB 9277642 ON dlcgpu26 CANCELLED AT 2023-09-14T18:54:25 ***
2023-09-14 18:54:25,422 - distributed.worker - INFO - Stopping worker at tcp://....166.186:38039. Reason: scheduler-close
2023-09-14 18:54:25,423 - distributed.batched - INFO - Batched Comm Closed <TCP (closed) Worker->Scheduler local=tcp://....166.186:54362 remote=tcp://....166.214:33319>
Traceback (most recent call last):
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/comm/tcp.py", line 317, in write
    raise StreamClosedError()
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/batched.py", line 115, in _background_send
    nbytes = yield coro
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/tornado/gen.py", line 767, in run
    value = future.result()
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/comm/tcp.py", line 328, in write
    convert_stream_closed_error(self, e)
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/comm/tcp.py", line 143, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) Worker->Scheduler local=tcp://....166.186:54362 remote=tcp://....166.214:33319>: Stream is closed
2023-09-14 18:54:25,431 - distributed.nanny - INFO - Closing Nanny gracefully at 'tcp://....166.186:34089'. Reason: scheduler-close
2023-09-14 18:54:25,442 - distributed.core - INFO - Received 'close-stream' from tcp://....166.214:33319; closing.
2023-09-14 18:54:25,443 - distributed.nanny - INFO - Worker closed

out:

(venv3106) username@node1:$ cat tmp/smac_dask_slurm/dask-worker-9277642.out
/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/comm/core.py : trying for another 29.999988079071045 seconds
/home/username/.python3.10.6/lib/python3.10/site-packages/tornado/tcpclient.py-- timeout = None
msg: {'compression': None, 'python': (3, 10, 6), 'pickle-protocol': 5}
msg: {'op': 'register_nanny', 'address': 'tcp://....166.186:34089'}
mens-artis changed the title from "tcp communication error" to "tcp communication error CancelledError" on Sep 14, 2023
mens-artis (Author) commented:

I wonder if it was always based on Dask.

alexandertornede (Contributor) commented:

Once again, thanks for reporting this. We will look into it.
