When I execute “torchrun --nproc_per_node 1 llamacpp_mock_api.py”, the following error occurs. #6

HwJhx opened this issue Sep 4, 2023 · 2 comments

Comments

HwJhx commented Sep 4, 2023

torchrun --nproc_per_node 1 llamacpp_mock_api.py \
    --ckpt_dir CodeLlama-7b-Instruct/ \
    --tokenizer_path CodeLlama-7b-Instruct/tokenizer.model \
    --max_seq_len 128 --max_batch_size 4

initializing model parallel with size 1
initializing ddp with size 1
initializing pipeline with size 1
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 16713) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
llamacpp_mock_api.py FAILED


Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2023-09-04_12:12:41
host : 13edd873e909
rank : 0 (local_rank: 0)
exitcode : -9 (pid: 16713)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 16713
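
For what it's worth, exit code -9 means the worker was killed with SIGKILL before CUDA was ever touched, which on Linux most often points to the kernel OOM killer terminating the process while the checkpoint is being read into host RAM. A quick way to test that hypothesis is sketched below (a minimal check, assuming psutil is installed; the directory name simply mirrors the --ckpt_dir argument):

```python
# Minimal pre-flight check (assumes psutil is installed; paths are illustrative).
# torch.load() needs roughly the full checkpoint size in free host RAM before
# anything is moved to the GPU, so too little RAM tends to end in SIGKILL (-9).
import os
import psutil

ckpt_dir = "CodeLlama-7b-Instruct"  # same directory passed to --ckpt_dir
ckpt_bytes = sum(
    os.path.getsize(os.path.join(ckpt_dir, name))
    for name in os.listdir(ckpt_dir)
    if name.endswith(".pth")
)
avail_bytes = psutil.virtual_memory().available

print(f"checkpoint size : {ckpt_bytes / 1024**3:.1f} GiB")
print(f"available RAM   : {avail_bytes / 1024**3:.1f} GiB")
if avail_bytes < ckpt_bytes:
    print("Not enough free host RAM; the OOM killer would likely SIGKILL the worker.")
```

If the available figure comes out smaller than the checkpoint, adding RAM or swap would matter more than any of the torchrun flags.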


HwJhx commented Sep 4, 2023

My GPU info is below:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17 Driver Version: 525.105.17 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 Off | 00000000:00:04.0 Off | 0 |
| N/A 32C P8 9W / 70W | 0MiB / 15360MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
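
As a rough, back-of-the-envelope sanity check (approximate numbers, not taken from the repo): the 7B model in fp16 needs about 13 GiB just for its weights, which already nearly fills the 15 GiB T4 before the KV cache or activations are counted, and the host needs at least that much free RAM while the checkpoint loads:

```python
# Back-of-the-envelope estimate for CodeLlama-7b-Instruct in fp16.
# Rough numbers only; KV cache and activations are not included.
n_params = 7e9           # ~7 billion parameters
bytes_per_param = 2      # fp16 / bf16

weights_gib = n_params * bytes_per_param / 1024**3
print(f"weights alone : ~{weights_gib:.1f} GiB")        # ~13.0 GiB

t4_gib = 15360 / 1024    # 15360 MiB reported by nvidia-smi
print(f"T4 capacity   : {t4_gib:.1f} GiB")
print(f"headroom      : ~{t4_gib - weights_gib:.1f} GiB before cache/activations")
```

So even if the host-RAM side is resolved, the T4 would be right at the limit for fp16 inference of the 7B model.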

@BoazimMatrix

Did you figure it out? I have the same problem.
