Running multiple workers on a single GPU machine #430

Open
hamidralmasi opened this issue Feb 18, 2022 · 0 comments

I have two machines, each equipped with one GPU, and I want to run multiple workers on each machine. Is this possible in BytePS? I tried running 4 worker processes (2 processes on each machine) and 2 servers (1 server process on each machine), but the last 3 worker processes fail with the following error and the first worker hangs. I launched them with the same commands I would use for the normal setup of 1 worker per GPU machine, which works.

BytePS launching worker
enable NUMA finetune...
Command: numactl --physcpubind 0-4,20-24 python /users/halmas3/byteps/example/pytorch/benchmark_byteps.py --model=vgg19 --batch-size=64

[19:28:20] src/postoffice.cc:63: Creating Van: zmq. group_size=1
[19:28:20] src/./zmq_van.h:351: Start ZMQ recv thread
[19:28:58] src/./zmq_van.h:351: Start ZMQ recv thread
[19:28:58] src/./zmq_van.h:351: Start ZMQ recv thread
[19:28:58] src/./zmq_van.h:351: Start ZMQ recv thread
[2022-02-17 19:29:02.800368: F byteps/common/operations.cc:290] Check failed: (size) > (0) init tensor size not larger than 0
Aborted (core dumped)
Traceback (most recent call last):
  File "/usr/local/bin/bpslaunch", line 4, in <module>
    __import__('pkg_resources').run_script('byteps==0.2.5', 'bpslaunch')
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 667, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 1463, in run_script
    exec(code, namespace, namespace)
  File "/usr/local/lib/python3.8/dist-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 254, in <module>
    launch_bps()
  File "/usr/local/lib/python3.8/dist-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 240, in launch_bps
    t[i].join()
  File "/usr/local/lib/python3.8/dist-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 34, in join
    raise self.exc
  File "/usr/local/lib/python3.8/dist-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 27, in run
    self.ret = self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.8/dist-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 192, in worker
    subprocess.check_call(command, env=my_env,
  File "/usr/lib/python3.8/subprocess.py", line 364, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'numactl --physcpubind 0-4,20-24 python /users/halmas3/byteps/example/pytorch/benchmark_byteps.py --model=vgg19 --batch-size=64' returned non-zero exit status 134.
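
For reference, the layout described above (4 worker processes and 2 server processes across two single-GPU machines) can be written out as per-process environments. This is only a sketch of one plausible reading of that setup, not the exact commands that were run. It assumes the environment variables from the BytePS tutorial (DMLC_ROLE, DMLC_NUM_WORKER, DMLC_NUM_SERVER, DMLC_WORKER_ID, DMLC_PS_ROOT_URI, DMLC_PS_ROOT_PORT, NVIDIA_VISIBLE_DEVICES). The machine names, scheduler address, port, and the choice of one worker ID per process are hypothetical, and the scheduler process is left out.

```python
# Hypothetical reconstruction of the per-process environments.
# Addresses, port, machine names and the worker-ID numbering are
# assumptions for illustration; none of them come from the log above.
SCHEDULER_URI = "10.0.0.1"   # placeholder scheduler address
SCHEDULER_PORT = "1234"      # placeholder scheduler port
MACHINES = ["machine-a", "machine-b"]  # hypothetical host names
WORKERS_PER_MACHINE = 2
SERVERS_PER_MACHINE = 1


def build_envs():
    """Return a list of (machine, env) pairs, one per worker or server process."""
    num_workers = len(MACHINES) * WORKERS_PER_MACHINE
    num_servers = len(MACHINES) * SERVERS_PER_MACHINE
    common = {
        "DMLC_NUM_WORKER": str(num_workers),
        "DMLC_NUM_SERVER": str(num_servers),
        "DMLC_PS_ROOT_URI": SCHEDULER_URI,
        "DMLC_PS_ROOT_PORT": SCHEDULER_PORT,
    }
    envs = []
    worker_id = 0
    for machine in MACHINES:
        for _ in range(WORKERS_PER_MACHINE):
            env = dict(common)
            env["DMLC_ROLE"] = "worker"
            env["DMLC_WORKER_ID"] = str(worker_id)
            # Both workers on a machine point at the same single GPU.
            env["NVIDIA_VISIBLE_DEVICES"] = "0"
            envs.append((machine, env))
            worker_id += 1
        for _ in range(SERVERS_PER_MACHINE):
            env = dict(common)
            env["DMLC_ROLE"] = "server"
            envs.append((machine, env))
    return envs


if __name__ == "__main__":
    for machine, env in build_envs():
        print(machine, env)
```

Each environment would then be passed to a separate bpslaunch invocation on the named machine. In this reading, the two workers on a machine share GPU 0, which is the part that differs from the 1 worker per GPU machine setup that works.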

I have two questions here:

  1. Is it possible to run BytePS with multiple workers on a single GPU machine?
  2. Is it possible to run BytePS with CPU-only machines as the workers?

Thank you!
