ERROR (nni.runtime.msg_dispatcher_base/Thread-2) #5769

Open
C-Comfundo opened this issue Apr 12, 2024 · 0 comments

Describe the issue:
I created the experiment with nnictl create --config xx --p xxxx.
After a while I checked it with nnictl experiment --all and found that it had stopped. dispatcher.log shows the error below.
However, the corresponding trial process is still running on the GPU.
By the way, this error did not occur the last time I used NNI, and I don't know what caused it.

Environment:

  • NNI version: 2.10.1
  • Training service (local|remote|pai|aml|etc): local
  • Client OS: linux
  • Server OS (for remote mode only):
  • Python version: 3.8.13
  • PyTorch/TensorFlow version: pytorch 1.10.1
  • Is conda/virtualenv/venv used?: conda
  • Is running in Docker?: no

Configuration:

  • Experiment config (remember to remove secrets!):
    trialCommand: CUDA_VISIBLE_DEVICES=0 python k+1_gan.py
    trialConcurrency: 2
    maxTrialNumber: 1000
    maxExperimentDuration: 200h
    experimentWorkingDirectory: "/home/yiran/codes/Knowledge-Enriched-DMI/nni-experiment"
    tuner:
      name: TPE
      classArgs:
        optimize_mode: maximize
    trainingService:
      platform: local

  • Search space:
    {
      "lr": {"_type": "choice", "_value": [0.00005, 0.0001, 0.0002, 0.0005, 0.001]},
      "beta1": {"_type": "choice", "_value": [0.001, 0.0001, 0.00001]},
      "beta2": {"_type": "choice", "_value": [0.9, 0.999]},
      "lambda_e": {"_type": "choice", "_value": [0.00005]}
    }

Log message:

  • nnimanager.log:
    [2024-04-12 18:48:34] INFO (main) Start NNI manager
    [2024-04-12 18:48:34] INFO (NNIDataStore) Datastore initialization done
    [2024-04-12 18:48:34] INFO (RestServer) Starting REST server at port 8080, URL prefix: "/"
    [2024-04-12 18:48:34] INFO (RestServer) REST server started.
    [2024-04-12 18:48:35] INFO (NNIManager) Starting experiment: b7edpl94
    [2024-04-12 18:48:35] INFO (NNIManager) Setup training service...
    [2024-04-12 18:48:35] INFO (LocalTrainingService) Construct local machine training service.
    [2024-04-12 18:48:35] INFO (NNIManager) Setup tuner...
    [2024-04-12 18:48:35] INFO (NNIManager) Change NNIManager status from: INITIALIZED to: RUNNING
    [2024-04-12 18:48:36] INFO (NNIManager) Add event listeners
    [2024-04-12 18:48:36] INFO (LocalTrainingService) Run local machine training service.
    [2024-04-12 18:48:36] INFO (NNIManager) NNIManager received command from dispatcher: ID,
    [2024-04-12 18:48:36] INFO (NNIManager) NNIManager received command from dispatcher: TR, {"parameter_id": 0, "parameter_source": "algorithm", "parameters": {"lr": 0.0002, "beta1": 0.0001, "beta2": 0.999, "lambda_e": 5e-05}, "parameter_index": 0}
    [2024-04-12 18:48:36] INFO (NNIManager) NNIManager received command from dispatcher: TR, {"parameter_id": 1, "parameter_source": "algorithm", "parameters": {"lr": 0.001, "beta1": 1e-05, "beta2": 0.9, "lambda_e": 5e-05}, "parameter_index": 0}
    [2024-04-12 18:48:41] INFO (NNIManager) submitTrialJob: form: {
      sequenceId: 0,
      hyperParameters: {
        value: '{"parameter_id": 0, "parameter_source": "algorithm", "parameters": {"lr": 0.0002, "beta1": 0.0001, "beta2": 0.999, "lambda_e": 5e-05}, "parameter_index": 0}',
        index: 0
      },
      placementConstraint: { type: 'None', gpus: [] }
    }
    [2024-04-12 18:48:41] INFO (NNIManager) submitTrialJob: form: {
      sequenceId: 1,
      hyperParameters: {
        value: '{"parameter_id": 1, "parameter_source": "algorithm", "parameters": {"lr": 0.001, "beta1": 1e-05, "beta2": 0.9, "lambda_e": 5e-05}, "parameter_index": 0}',
        index: 0
      },
      placementConstraint: { type: 'None', gpus: [] }
    }
    [2024-04-12 18:48:51] INFO (NNIManager) Trial job ZlXeN status changed from WAITING to RUNNING
    [2024-04-12 18:48:51] INFO (NNIManager) Trial job Rh0Pn status changed from WAITING to RUNNING
    [2024-04-12 18:49:42] ERROR (tuner_command_channel.WebSocketChannel) Error: Error: tuner_command_channel: Tuner closed connection
    at WebSocket.handleWsClose (/home/yiran/.local/lib/python3.8/site-packages/nni_node/core/tuner_command_channel/websocket_channel.js:83:26)
    at WebSocket.emit (node:events:538:35)
    at WebSocket.emitClose (/home/yiran/.local/lib/python3.8/site-packages/nni_node/node_modules/express-ws/node_modules/ws/lib/websocket.js:246:10)
    at Socket.socketOnClose (/home/yiran/.local/lib/python3.8/site-packages/nni_node/node_modules/express-ws/node_modules/ws/lib/websocket.js:1127:15)
    at Socket.emit (node:events:526:28)
    at TCP.<anonymous> (node:net:687:12)

  • dispatcher.log:
    [2024-04-12 18:48:35] INFO (numexpr.utils/MainThread) Note: NumExpr detected 64 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
    [2024-04-12 18:48:35] INFO (numexpr.utils/MainThread) NumExpr defaulting to 8 threads.
    [2024-04-12 18:48:36] INFO (nni.tuner.tpe/MainThread) Using random seed 1314744945
    [2024-04-12 18:48:36] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher started
    [2024-04-12 18:49:19] ERROR (nni.runtime.msg_dispatcher_base/Thread-2) Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
    Traceback (most recent call last):
    File "/home/yiran/.local/lib/python3.8/site-packages/nni/runtime/msg_dispatcher_base.py", line 108, in command_queue_worker
    self.process_command(command, data)
    File "/home/yiran/.local/lib/python3.8/site-packages/nni/runtime/msg_dispatcher_base.py", line 154, in process_command
    command_handlers[command](data)
    File "/home/yiran/.local/lib/python3.8/site-packages/nni/runtime/msg_dispatcher.py", line 144, in handle_report_metric_data
    data['value'] = load(data['value'])
    File "/home/yiran/.local/lib/python3.8/site-packages/nni/common/serializer.py", line 443, in load
    return json_tricks.loads(string, obj_pairs_hooks=hooks, **json_tricks_kwargs)
    File "/home/yiran/.local/lib/python3.8/site-packages/json_tricks/nonp.py", line 259, in loads
    return _strip_loads(string, hook, True, **jsonkwargs)
    File "/home/yiran/.local/lib/python3.8/site-packages/json_tricks/nonp.py", line 266, in _strip_loads
    return json_loads(string, object_pairs_hook=object_pairs_hook, **jsonkwargs)
    File "/home/yiran/anaconda3/envs/pytorch/lib/python3.8/json/__init__.py", line 370, in loads
    return cls(**kw).decode(s)
    File "/home/yiran/anaconda3/envs/pytorch/lib/python3.8/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
    File "/home/yiran/anaconda3/envs/pytorch/lib/python3.8/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
    File "/home/yiran/.local/lib/python3.8/site-packages/json_tricks/decoders.py", line 46, in __call__
    map = hook(map, properties=self.properties)
    File "/home/yiran/.local/lib/python3.8/site-packages/json_tricks/utils.py", line 66, in wrapper
    return encoder(*args, **{k: v for k, v in kwargs.items() if k in names})
    File "/home/yiran/.local/lib/python3.8/site-packages/nni/common/serializer.py", line 877, in _json_tricks_any_object_decode
    return _wrapped_cloudpickle_loads(b)
    File "/home/yiran/.local/lib/python3.8/site-packages/nni/common/serializer.py", line 883, in _wrapped_cloudpickle_loads
    return cloudpickle.loads(b)
    File "/home/yiran/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/storage.py", line 161, in _load_from_bytes
    return torch.load(io.BytesIO(b))
    File "/home/yiran/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/serialization.py", line 608, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
    File "/home/yiran/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/serialization.py", line 787, in _legacy_load
    result = unpickler.load()
    File "/home/yiran/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/serialization.py", line 743, in persistent_load
    deserialized_objects[root_key] = restore_location(obj, location)
    File "/home/yiran/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/serialization.py", line 175, in default_restore_location
    result = fn(storage, location)
    File "/home/yiran/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/serialization.py", line 151, in _cuda_deserialize
    device = validate_cuda_device(location)
    File "/home/yiran/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/serialization.py", line 135, in validate_cuda_device
    raise RuntimeError('Attempting to deserialize object on a CUDA '
    RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
    [2024-04-12 18:49:40] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher exiting...
    [2024-04-12 18:49:42] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher terminiated

  • nnictl stdout and stderr:


Experiment b7edpl94 start: 2024-04-12 18:48:34.614673

node:events:504
throw er; // Unhandled 'error' event
^

Error: tuner_command_channel: Tuner closed connection
at WebSocket.handleWsClose (/home/yiran/.local/lib/python3.8/site-packages/nni_node/core/tuner_command_channel/websocket_channel.js:83:26)
at WebSocket.emit (node:events:538:35)
at WebSocket.emitClose (/home/yiran/.local/lib/python3.8/site-packages/nni_node/node_modules/express-ws/node_modules/ws/lib/websocket.js:246:10)
at Socket.socketOnClose (/home/yiran/.local/lib/python3.8/site-packages/nni_node/node_modules/express-ws/node_modules/ws/lib/websocket.js:1127:15)
at Socket.emit (node:events:526:28)
at TCP.<anonymous> (node:net:687:12)
Emitted 'error' event at:
at WebSocketChannelImpl.handleError (/home/yiran/.local/lib/python3.8/site-packages/nni_node/core/tuner_command_channel/websocket_channel.js:135:22)
at WebSocket.handleWsClose (/home/yiran/.local/lib/python3.8/site-packages/nni_node/core/tuner_command_channel/websocket_channel.js:83:14)
at WebSocket.emit (node:events:538:35)
[... lines matching original stack trace ...]
at TCP.<anonymous> (node:net:687:12)
Thrown at:
at handleWsClose (/home/yiran/.local/lib/python3.8/site-packages/nni_node/core/tuner_command_channel/websocket_channel.js:83:26)
at emit (node:events:538:35)
at emitClose (/home/yiran/.local/lib/python3.8/site-packages/nni_node/node_modules/express-ws/node_modules/ws/lib/websocket.js:246:10)
at socketOnClose (/home/yiran/.local/lib/python3.8/site-packages/nni_node/node_modules/express-ws/node_modules/ws/lib/websocket.js:1127:15)
at emit (node:events:526:28)
at node:net:687:12
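
A possible reading of the dispatcher.log traceback above: the failure happens in handle_report_metric_data while NNI decodes a reported metric, and the decode path ends in torch's _cuda_deserialize inside the dispatcher process. That would be consistent with the trial reporting a CUDA torch.Tensor instead of a plain number, but the trial script is not shown here, so this is an assumption rather than a confirmed cause. Under that assumption, a minimal sketch of a workaround is to convert the metric to plain Python floats before reporting it. The helper name to_plain_metric is hypothetical and not part of NNI; it relies only on the reported object exposing .item(), as 0-dim torch tensors and NumPy scalars do.

```python
import json
from numbers import Number


def to_plain_metric(value):
    """Convert a metric into plain Python floats (or a dict of floats).

    Hypothetical helper, not an NNI API. Objects exposing ``.item()``
    (for example a 0-dim torch.Tensor) are reduced to a float, so no
    device-bound tensor storage ends up in the reported payload.
    """
    if isinstance(value, dict):
        # NNI accepts a dict metric; convert each entry separately.
        return {str(key): to_plain_metric(val) for key, val in value.items()}
    if isinstance(value, Number):
        return float(value)
    if hasattr(value, "item"):
        # .item() raises for multi-element tensors, which is intended:
        # a metric should be a single scalar.
        return float(value.item())
    raise TypeError(f"unsupported metric type: {type(value).__name__}")


class FakeTensor:
    """Stand-in for a 0-dim tensor so the sketch runs without torch."""

    def __init__(self, number):
        self._number = number

    def item(self):
        return self._number


metric = to_plain_metric({"default": FakeTensor(0.25), "loss": 1})
# The result contains only floats, so it serializes as ordinary JSON.
payload = json.dumps(metric)
```

In the trial script this would be used as nni.report_intermediate_result(to_plain_metric(score)) and nni.report_final_result(to_plain_metric(score)), where score stands for whatever value k+1_gan.py currently reports.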

How to reproduce it?:
