DeepSpeed Multi node

Romain Beaumont edited this page May 26, 2021 · 1 revision

Running DeepSpeed across multiple GPU nodes lets you increase the batch size by a factor of N, where N is the number of machines, and the throughput (samples/s) increases almost linearly.

Detailed information is available at https://www.deepspeed.ai/getting-started/

For DALLE-pytorch in particular, add a --hostfile=deepspeed_host argument right after deepspeed on the command line. The deepspeed_host file should look like this:

my_machine1 slots=2
my_machine2 slots=2
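
As a minimal sketch of how the hostfile and the launch command fit together, the snippet below writes the two-machine hostfile shown above and sums its slots to get the total GPU count. The total_slots helper is hypothetical (it is not part of DeepSpeed), and the commented launch line assumes the training script is train_dalle.py run with its --deepspeed flag; substitute your own script and arguments.

```shell
#!/usr/bin/env bash
# Write the two-machine hostfile from the example above.
cat > deepspeed_host <<'EOF'
my_machine1 slots=2
my_machine2 slots=2
EOF

# Hypothetical helper: sum every slots=<n> field to get the total GPU count.
total_slots() {
    awk '{
        for (i = 2; i <= NF; i++)
            if ($i ~ /^slots=/) { sub(/^slots=/, "", $i); n += $i }
    } END { print n + 0 }' "$1"
}

echo "total GPUs: $(total_slots deepspeed_host)"

# Launch line (assumed script name and arguments, adjust to your setup):
# deepspeed --hostfile=deepspeed_host train_dalle.py <your usual arguments> --deepspeed
```

With the hostfile above, the script reports 4 GPUs in total (2 machines with 2 slots each).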

The slots value is the number of GPUs on that machine. my_machine1 and my_machine2 must be machines you can connect to with ssh my_machine1 and ssh my_machine2 without a password. Define each of them in ~/.ssh/config like this (gpu2 here stands in for the name used in the hostfile):

Host gpu2
     User youruser
     HostName 1.2.3.4
     Port 22
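
Before launching, it can help to confirm that every machine in the hostfile really accepts a passwordless login. The sketch below is one way to do that; hostfile_hosts and check_hosts are hypothetical helper names, and skipping blank or #-prefixed lines is a defensive choice of this sketch rather than documented hostfile behaviour. BatchMode=yes makes ssh fail instead of prompting for a password.

```shell
#!/usr/bin/env bash
# Print the host names (first column) of a hostfile,
# skipping blank lines and lines starting with '#'.
hostfile_hosts() {
    awk 'NF && $1 !~ /^#/ { print $1 }' "$1"
}

# Attempt a non-interactive login on each host and report the result.
# Returns non-zero if at least one host could not be reached.
check_hosts() {
    local host failed=0
    for host in $(hostfile_hosts "$1"); do
        if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true 2>/dev/null; then
            echo "ok      $host"
        else
            echo "FAILED  $host"
            failed=1
        fi
    done
    return "$failed"
}

# Usage: check_hosts deepspeed_host
```

Any host reported as FAILED needs its ~/.ssh/config entry or its SSH key setup fixed before the multi-node run can start.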

If you use a virtual environment, add source .env/bin/activate to the ~/.bashrc file on each machine.
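
A sketch of that bashrc step, under a few assumptions: add_activate_line is a hypothetical helper, and $HOME/.env in the usage line is an assumed location for the virtual environment. The helper puts the line at the top of the file because on some distributions (Debian and Ubuntu, for instance) the default ~/.bashrc returns early for non-interactive shells, and remote launches over SSH are typically non-interactive.

```shell
#!/usr/bin/env bash
# Put 'source "<venv>/bin/activate"' at the top of an rc file
# unless that exact line is already present.
# Usage: add_activate_line <rc_file> <venv_dir>
add_activate_line() {
    local rc_file=$1 venv_dir=$2
    local line="source \"$venv_dir/bin/activate\""
    touch "$rc_file"
    # Exact whole-line match, so repeated runs do not add duplicates.
    if ! grep -qxF "$line" "$rc_file"; then
        local tmp
        tmp=$(mktemp)
        { printf '%s\n' "$line"; cat "$rc_file"; } > "$tmp"
        # Overwrite in place to keep the file's permissions and ownership.
        cat "$tmp" > "$rc_file"
        rm -f "$tmp"
    fi
}

# Example, to be run on every machine listed in the hostfile:
# add_activate_line "$HOME/.bashrc" "$HOME/.env"
```

Using an absolute path for the environment avoids depending on the directory the SSH session starts in.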