The system crashes when running the “Hello NAS” example code with GPU #5746

Open
misagamisaga opened this issue Feb 21, 2024 · 4 comments

Comments

@misagamisaga

The system crashes when running the “Hello NAS” example code with GPU

  • The crash occurs in three environments: Colab, Windows 11 with conda, and Windows 11 without conda.
  • In all three environments I tried downgrading PyTorch to version 1.13.0, but it still crashes.
  • It crashes whether the code runs as a .py script or in an .ipynb notebook.

My steps

  • Colab: use pip to install nni and lightning, then run the “Hello NAS” example code.
  • Windows 11: install PyTorch (following the official installation instructions, which also install torchvision), nni, lightning, ipykernel, and jupyterlab, then run the same example code.

I cleared my environment beforehand, so there are no extra package conflicts.
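
For anyone trying to reproduce this, a small script like the one below can record the exact package versions in each environment. It is only a sketch: `installed_versions` is a hypothetical helper name, and the distribution names `torch`, `torchvision`, `nni` and `lightning` are assumed to match what pip installed.

```python
from importlib import metadata


def installed_versions(names):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Assumed distribution names for the packages listed in the steps above.
    packages = ["torch", "torchvision", "nni", "lightning"]
    for name, version in installed_versions(packages).items():
        print(f"{name}: {version or 'not installed'}")
```

Pasting the printed lines next to the traceback makes it easier to compare the Colab and Windows 11 setups.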

The details of one of the errors

(Env: Windows 11, using conda, PyTorch 2.2.0)
Before the crash, I saw many python.exe processes in Task Manager.
I recorded the error output at that time:

[2024-02-20 22:24:17] Creating experiment, Experiment ID: 5p9fhwgt
[2024-02-20 22:24:17] Starting web server...
[2024-02-20 22:24:20] Setting up...
[2024-02-20 22:24:20] Web portal URLs: http://26.26.26.1:8084 http://169.254.77.17:8084 http://169.254.202.152:8084 http://169.254.67.238:8084 http://192.168.101.15:8084 http://127.0.0.1:8084
[2024-02-20 22:24:21] Successfully update searchSpace.
[2024-02-20 22:24:21] Checkpoint saved to C:\Users\DELL\nni-experiments\5p9fhwgt\checkpoint.
[2024-02-20 22:24:21] Experiment initialized successfully. Starting exploration strategy...
[2024-02-20 22:24:59] ERROR: Strategy failed to execute.
Traceback (most recent call last):
  File "E:\conda\envs\pytorch_nni\lib\site-packages\urllib3\connectionpool.py", line 537, in _make_request
    response = conn.getresponse()
  File "E:\conda\envs\pytorch_nni\lib\site-packages\urllib3\connection.py", line 466, in getresponse
    httplib_response = super().getresponse()
  File "E:\conda\envs\pytorch_nni\lib\http\client.py", line 1375, in getresponse
    response.begin()
  File "E:\conda\envs\pytorch_nni\lib\http\client.py", line 318, in begin
    version, status, reason = self._read_status()
  File "E:\conda\envs\pytorch_nni\lib\http\client.py", line 279, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "E:\conda\envs\pytorch_nni\lib\socket.py", line 705, in readinto
    return self._sock.recv_into(b)
TimeoutError: timed out

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "E:\conda\envs\pytorch_nni\lib\site-packages\requests\adapters.py", line 486, in send
    resp = conn.urlopen(
  File "E:\conda\envs\pytorch_nni\lib\site-packages\urllib3\connectionpool.py", line 847, in urlopen
    retries = retries.increment(
  File "E:\conda\envs\pytorch_nni\lib\site-packages\urllib3\util\retry.py", line 470, in increment
    raise reraise(type(error), error, _stacktrace)
  File "E:\conda\envs\pytorch_nni\lib\site-packages\urllib3\util\util.py", line 39, in reraise
    raise value
  File "E:\conda\envs\pytorch_nni\lib\site-packages\urllib3\connectionpool.py", line 793, in urlopen
    response = self._make_request(
  File "E:\conda\envs\pytorch_nni\lib\site-packages\urllib3\connectionpool.py", line 539, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "E:\conda\envs\pytorch_nni\lib\site-packages\urllib3\connectionpool.py", line 370, in _raise_timeout
    raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='localhost', port=8084): Read timed out. (read timeout=20)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "f:\today\nni\try_nni.py", line 144, in <module>
    exp3.run(port=8084)
  File "E:\conda\envs\pytorch_nni\lib\site-packages\nni\experiment\experiment.py", line 236, in run
    return self._run_impl(port, wait_completion, debug)
  File "E:\conda\envs\pytorch_nni\lib\site-packages\nni\experiment\experiment.py", line 205, in _run_impl
    self.start(port, debug)
  File "E:\conda\envs\pytorch_nni\lib\site-packages\nni\nas\experiment\experiment.py", line 270, in start
    self._start_engine_and_strategy()
  File "E:\conda\envs\pytorch_nni\lib\site-packages\nni\nas\experiment\experiment.py", line 230, in _start_engine_and_strategy
    self.strategy.run()
  File "E:\conda\envs\pytorch_nni\lib\site-packages\nni\nas\strategy\base.py", line 170, in run
    self._run()
  File "E:\conda\envs\pytorch_nni\lib\site-packages\nni\nas\strategy\bruteforce.py", line 220, in _run
    if not self.wait_for_resource():
  File "E:\conda\envs\pytorch_nni\lib\site-packages\nni\nas\strategy\base.py", line 100, in wait_for_resource
    if not self.engine.budget_available():
  File "E:\conda\envs\pytorch_nni\lib\site-packages\nni\nas\execution\training_service.py", line 271, in budget_available
    return self.nodejs_binding.get_status() in ['INITIALIZED', 'RUNNING', 'TUNER_NO_MORE_TRIAL']
  File "E:\conda\envs\pytorch_nni\lib\site-packages\nni\experiment\experiment.py", line 413, in get_status
    resp = rest.get(self.port, '/check-status', self.url_prefix)
  File "E:\conda\envs\pytorch_nni\lib\site-packages\nni\experiment\rest.py", line 43, in get
    return request('get', port, api, prefix=prefix)
  File "E:\conda\envs\pytorch_nni\lib\site-packages\nni\experiment\rest.py", line 31, in request
    resp = requests.request(method, url, timeout=timeout)
  File "E:\conda\envs\pytorch_nni\lib\site-packages\requests\api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "E:\conda\envs\pytorch_nni\lib\site-packages\requests\sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "E:\conda\envs\pytorch_nni\lib\site-packages\requests\sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "E:\conda\envs\pytorch_nni\lib\site-packages\requests\adapters.py", line 532, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPConnectionPool(host='localhost', port=8084): Read timed out. (read timeout=20)
[2024-02-20 22:24:59] Stopping experiment, please wait...
[2024-02-20 22:25:00] Checkpoint saved to C:\Users\DELL\nni-experiments\5p9fhwgt\checkpoint.
[2024-02-20 22:25:20] ERROR: HTTPConnectionPool(host='localhost', port=8084): Read timed out. (read timeout=20)
Traceback (most recent call last):
  File "E:\conda\envs\pytorch_nni\lib\site-packages\urllib3\connectionpool.py", line 537, in _make_request
    response = conn.getresponse()
  File "E:\conda\envs\pytorch_nni\lib\site-packages\urllib3\connection.py", line 466, in getresponse
    httplib_response = super().getresponse()
  File "E:\conda\envs\pytorch_nni\lib\http\client.py", line 1375, in getresponse
    response.begin()
  File "E:\conda\envs\pytorch_nni\lib\http\client.py", line 318, in begin
    version, status, reason = self._read_status()
  File "E:\conda\envs\pytorch_nni\lib\http\client.py", line 279, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "E:\conda\envs\pytorch_nni\lib\socket.py", line 705, in readinto
    return self._sock.recv_into(b)
TimeoutError: timed out

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "E:\conda\envs\pytorch_nni\lib\site-packages\requests\adapters.py", line 486, in send
    resp = conn.urlopen(
  File "E:\conda\envs\pytorch_nni\lib\site-packages\urllib3\connectionpool.py", line 847, in urlopen
    retries = retries.increment(
  File "E:\conda\envs\pytorch_nni\lib\site-packages\urllib3\util\retry.py", line 470, in increment
    raise reraise(type(error), error, _stacktrace)
  File "E:\conda\envs\pytorch_nni\lib\site-packages\urllib3\util\util.py", line 39, in reraise
    raise value
  File "E:\conda\envs\pytorch_nni\lib\site-packages\urllib3\connectionpool.py", line 793, in urlopen
    response = self._make_request(
  File "E:\conda\envs\pytorch_nni\lib\site-packages\urllib3\connectionpool.py", line 539, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "E:\conda\envs\pytorch_nni\lib\site-packages\urllib3\connectionpool.py", line 370, in _raise_timeout
    raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='localhost', port=8084): Read timed out. (read timeout=20)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "E:\conda\envs\pytorch_nni\lib\site-packages\nni\experiment\experiment.py", line 171, in _stop_nni_manager
    rest.delete(self.port, '/experiment', self.url_prefix)
  File "E:\conda\envs\pytorch_nni\lib\site-packages\nni\experiment\rest.py", line 52, in delete
    request('delete', port, api, prefix=prefix)
  File "E:\conda\envs\pytorch_nni\lib\site-packages\nni\experiment\rest.py", line 31, in request
    resp = requests.request(method, url, timeout=timeout)
  File "E:\conda\envs\pytorch_nni\lib\site-packages\requests\api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "E:\conda\envs\pytorch_nni\lib\site-packages\requests\sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "E:\conda\envs\pytorch_nni\lib\site-packages\requests\sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "E:\conda\envs\pytorch_nni\lib\site-packages\requests\adapters.py", line 532, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPConnectionPool(host='localhost', port=8084): Read timed out. (read timeout=20)
[2024-02-20 22:25:20] WARNING: Cannot gracefully stop experiment, killing NNI process...
[2024-02-20 22:25:21] ERROR: Failed to receive command. Retry in 0s
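
Every failure in the log above is a read timeout (20 seconds) on a request to localhost:8084. To tell a slow web server from a dead one, the status call could be wrapped in a retry loop such as the sketch below. This is not part of NNI: `call_with_retries`, its parameters, and the backoff schedule are assumptions made for illustration, and it does not fix whatever makes the server stop responding.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retries(
    fetch: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (TimeoutError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() until it succeeds, doubling the wait after each timeout.

    Hypothetical helper: re-raises the last error once `attempts` is used up.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fetch()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original timeout
            sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ...
    raise AssertionError("unreachable")
```

With the requests library, `fetch` could be a lambda that issues the GET with `timeout=20`, and `retry_on` would be `(requests.exceptions.ReadTimeout,)`, the exception class shown in the traceback. If every attempt times out, the web server process is most likely hung or gone rather than just slow.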
@534145232

I have the same issue.

@Imfire-waw

Same issue here.
But I found that if we set exp.config.trial_gpu_number = 0, the experiment can be launched without using the GPU.
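
A minimal sketch of that workaround, assuming `exp` is the experiment object built in the “Hello NAS” example. `set_trial_gpus` is a hypothetical helper, and the stand-in object at the bottom only mimics the `config` attribute so the snippet runs without NNI installed.

```python
from types import SimpleNamespace


def set_trial_gpus(exp, gpus):
    """Set how many GPUs each trial requests; 0 means trials run without a GPU."""
    if gpus < 0:
        raise ValueError("gpus must be zero or positive")
    exp.config.trial_gpu_number = gpus
    return exp


if __name__ == "__main__":
    # Stand-in for the real experiment object (assumption, for illustration only).
    exp = SimpleNamespace(config=SimpleNamespace(trial_gpu_number=1))
    set_trial_gpus(exp, 0)
    print(exp.config.trial_gpu_number)
    # With the real object, exp.run(port=8084) would follow, as in the traceback above.
```

The setting has to be applied before `exp.run(...)` is called.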

@ranranrannervous

Same issue here. Setting exp.config.trial_gpu_number = 0 does let the experiment launch without using the GPU,
but that is too slow.

@zhxn30663

It may be caused by dwm.exe or the NVIDIA driver. Updating the GPU driver or switching to the Studio driver didn't help.

Windows 11 22631.3447, Intel i9-14900HX, RTX 4090, NVIDIA Studio driver 552.22.
