segmentation fault while launching the worker #443

Open
xuexiaxie opened this issue May 13, 2023 · 1 comment

@xuexiaxie

When I launched the worker for distributed training with the environment configuration below, I got a segmentation fault. The error output is as follows:

BytePS launching worker
enable NUMA finetune...
Command: numactl --physcpubind 0-5,24-29 python3 /usr/local/byteps/example/pytorch/benchmark_byteps.py --model vgg16 --num-iters 10

[20:24:40] src/postoffice.cc:63: Creating Van: zmq. group_size=1
[20:24:40] src/./zmq_van.h:66: BYTEPS_ZMQ_MAX_SOCKET set to 1024
[20:24:40] src/./zmq_van.h:71: BYTEPS_ZMQ_NTHREADS set to 4
[20:24:40] src/van.cc:581: Bind to [role=worker, ip=192.168.108.230, port=56583, is_recovery=0, aux_id=-1, num_ports=1]
[20:24:40] src/./zmq_van.h:351: Start ZMQ recv thread

[20:24:40] src/./zmq_van.h:159: Zmq connecting to node [role=scheduler, id=1, ip=192.168.108.228, port=1234, is_recovery=0, aux_id=-1, num_ports=1]. My node is [role=worker, ip=192.168.108.230, port=56583, is_recovery=0, aux_id=-1, num_ports=1]
[20:24:40] src/van.cc:673: zeromq 32767 sent: ? => 1. Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ [role=worker, ip=192.168.108.230, port=56583, is_recovery=0, aux_id=-1, num_ports=1] } }. NOT DATA MSG!
Segmentation fault (core dumped)
Traceback (most recent call last):
File "/usr/local/bin/bpslaunch", line 4, in
import('pkg_resources').run_script('byteps==0.2.5', 'bpslaunch')
File "/usr/lib/python3/dist-packages/pkg_resources/init.py", line 658, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/usr/lib/python3/dist-packages/pkg_resources/init.py", line 1438, in run_script
exec(code, namespace, namespace)
File "/usr/local/lib/python3.7/dist-packages/byteps-0.2.5-py3.7-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 281, in
launch_bps()
File "/usr/local/lib/python3.7/dist-packages/byteps-0.2.5-py3.7-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 267, in launch_bps
join_threads(t)
File "/usr/local/lib/python3.7/dist-packages/byteps-0.2.5-py3.7-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 230, in join_threads
threads[idx].join()
File "/usr/local/lib/python3.7/dist-packages/byteps-0.2.5-py3.7-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 40, in join
raise self.exc
File "/usr/local/lib/python3.7/dist-packages/byteps-0.2.5-py3.7-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 31, in run
self.ret = self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.7/dist-packages/byteps-0.2.5-py3.7-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 199, in worker
stdout=sys.stdout, stderr=sys.stderr, shell=True)
File "/usr/lib/python3.7/subprocess.py", line 363, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'numactl --physcpubind 0-5,24-29 python3 /usr/local/byteps/example/pytorch/benchmark_byteps.py --model vgg16 --num-iters 10' returned non-zero exit status 139.

My envs and command:

export DMLC_ROLE=worker
export DMLC_PS_ROOT_URI=192.168.108.228
export DMLC_PS_ROOT_PORT=1234
export DMLC_WORKER_ID=0
export DMLC_NUM_WORKER=1
export DMLC_INTERFACE=eno1
export NVIDIA_VISIBLE_DEVICES=2
export BYTEPS_FORCE_DISTRIBUTED=1
export PS_VERBOSE=2
bpslaunch python3 /usr/local/byteps/example/pytorch/benchmark_byteps.py --model vgg16 --num-iters 10
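For context, distributed BytePS also expects a scheduler and at least one server process besides the workers. A rough sketch of what those nodes would export, mirroring the worker setup above (the single-server count is an assumption, not my actual config):

# Scheduler node (192.168.108.228) -- sketch only
export DMLC_ROLE=scheduler
export DMLC_PS_ROOT_URI=192.168.108.228
export DMLC_PS_ROOT_PORT=1234
export DMLC_NUM_WORKER=1
export DMLC_NUM_SERVER=1   # assumed: one server process
bpslaunch

# Server node -- same env as the scheduler except for the role
export DMLC_ROLE=server
export DMLC_PS_ROOT_URI=192.168.108.228
export DMLC_PS_ROOT_PORT=1234
export DMLC_NUM_WORKER=1
export DMLC_NUM_SERVER=1
bpslaunch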

@yxwdsb commented May 15, 2023

Check what your envs and command look like on all the other nodes.
