storycloze_gen evaluation fails with torch.distributed.elastic.multiprocessing.api.SignalException [Feature] #1142

Open
dh12306 opened this issue May 12, 2024 · 0 comments

dh12306 commented May 12, 2024

Describe the feature

The evaluation on this dataset fails with the following error:

 6%|▋         | 117/1871 [30:11<6:33:49, 13.47s/it]Keyword arguments {'add_special_tokens': False} not recognized.
  6%|▋         | 118/1871 [30:23<6:22:25, 13.09s/it]Keyword arguments {'add_special_tokens': False} not recognized.
  6%|▋         | 119/1871 [30:41<7:04:45, 14.55s/it]Keyword arguments {'add_special_tokens': False} not recognized.
  6%|▋         | 120/1871 [30:59<7:34:04, 15.56s/it]Keyword arguments {'add_special_tokens': False} not recognized.
  6%|▋         | 121/1871 [31:17<7:54:24, 16.27s/it]Keyword arguments {'add_special_tokens': False} not recognized.
  7%|▋         | 122/1871 [31:34<8:08:38, 16.76s/it]Keyword arguments {'add_special_tokens': False} not recognized.
  7%|▋         | 123/1871 [31:50<7:58:26, 16.42s/it]Keyword arguments {'add_special_tokens': False} not recognized.
  7%|▋         | 124/1871 [32:04<7:39:55, 15.80s/it]Keyword arguments {'add_special_tokens': False} not recognized.
  7%|▋         | 125/1871 [32:22<7:58:09, 16.43s/it]Keyword arguments {'add_special_tokens': False} not recognized.
  7%|▋         | 126/1871 [32:40<8:10:50, 16.88s/it]Keyword arguments {'add_special_tokens': False} not recognized.
[2024-05-12 13:31:48,848] torch.distributed.elastic.agent.server.api: [WARNING] Received Signals.SIGHUP death signal, shutting down workers
[2024-05-12 13:31:48,849] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 34836 closing signal SIGHUP
Traceback (most recent call last):
  File "/opt/conda/envs/pytorch/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 255, in launch_agent
    result = agent.run()
  File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 124, in wrapper
    result = f(*args, **kwargs)
  File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 736, in run
    result = self._invoke_run(role)
  File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 877, in _invoke_run
    time.sleep(monitor_interval)
  File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 62, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 34780 got signal: 1

I am loading a local model; below is the command I ran. Is something going wrong with the distributed launch?

python run.py --datasets storycloze_gen --hf-path /home/ec2-user/models/Llama-2-13b-chat-hf  \
--tokenizer-path /home/ec2-user/models/Llama-2-13b-chat-hf --model-kwargs device_map='auto' \
 --tokenizer-kwargs padding_side='left' truncation='left' use_fast=False  \
--max-out-len 100  --max-seq-len 2048 --batch-size 1 --no-batch-padding  \
--num-gpus 4  --max-workers-per-gpu 1 --accelerator hf 
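
For context, signal 1 is SIGHUP, which is usually delivered when the launching terminal or SSH session closes rather than by torch.distributed itself (the log above also shows "Received Signals.SIGHUP death signal"). As a minimal check, assuming the job is started from an interactive shell, the same command can be rerun detached from the terminal so a dropped session cannot kill it:

# Run detached so a closed SSH session does not send SIGHUP to torchrun;
# the command and flags are copied unchanged from above, output goes to a log file.
nohup python run.py --datasets storycloze_gen \
  --hf-path /home/ec2-user/models/Llama-2-13b-chat-hf \
  --tokenizer-path /home/ec2-user/models/Llama-2-13b-chat-hf \
  --model-kwargs device_map='auto' \
  --tokenizer-kwargs padding_side='left' truncation='left' use_fast=False \
  --max-out-len 100 --max-seq-len 2048 --batch-size 1 --no-batch-padding \
  --num-gpus 4 --max-workers-per-gpu 1 --accelerator hf \
  > storycloze_gen.log 2>&1 &

If the run still dies with SIGHUP when detached (or inside tmux/screen), that would point to something other than the terminal session, e.g. the scheduler or another process signalling the job.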

Will you implement it?

  • I would like to implement this feature and create a PR!