questions about applying for nodes and gpus #558

Open
ThomaswellY opened this issue May 24, 2023 · 9 comments

@ThomaswellY

ThomaswellY commented May 24, 2023

Hi, I have been using the mpi-operator for distributed training recently.
The command I use most is "kubectl apply -f <yaml>". Let me take the following MPIJob yaml as an example:
apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  name: cifar
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          nodeName:
          containers:
          - image: 10.252.39.13:5000/deepspeed_ms:v2
            name: mpijob-cifar-deepspeed-container
            imagePullPolicy: Always
            command:
            - mpirun
            - --allow-run-as-root
            - python
            - cifar/cifar10_deepspeed.py
            - --epochs=100
            - --deepspeed_mpi
            - --deepspeed
            - --deepspeed_config
            - cifar/ds_config.json
            env:
            - name: OMP_NUM_THREADS
              value: "1"
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          nodeName:
          containers:
          - image: 10.252.39.13:5000/deepspeed_ms:v2
            name: deepspeed-mpijob-container
            resources:
              limits:
                cpu: 2
                memory: 8Gi
                nvidia.com/gpu: 2
There are some questions I'm confused about:

  1. The GPU resource request seems to be in "Worker". Are the cifar-worker-0 and cifar-worker-1 pods each separately requesting a node (in the k8s cluster) with 2 GPUs? Then what role does "slotsPerWorker" play?
  2. I have executed "kubectl apply -f" on the example yaml with different replicas, like "replicas: 1" and "replicas: 4", while the resource limit was fixed at "nvidia.com/gpu: 1". I found interesting results:
    * When replicas is set to a large number, it takes a bit more time for the cifar-launcher pod to complete.
    * The logs printed in the cifar-launcher pod (when replicas: 4) looked like the result (when replicas: 1) repeated 4 times.
    So does this mean that the four pods have separately applied for one GPU each (from a node in the k8s cluster, preferentially from the same node if enough GPUs are available) and printed out the average result, and the whole process had nothing to do with distribution?
    * By the way, when setting "replicas: 3", this error was reported in my case:
    train_batch_size is not equal to micro_batch_per_gpu * gradient_acc_step * world_size 64 != 21 * 1 * 3
    This did confuse me.
  3. If I have node-A with 1 GPU and node-B with 3 GPUs, and want to request 4 GPUs, how should I modify the "Worker" part?
    Thanks in advance for your reply~
@tenzen-y
Member

@ThomaswellY Can you create an issue on https://github.com/kubeflow/training-operator, since the mpi-operator doesn't support the v1 API?

@alculquicondor
Collaborator

or you can consider upgrading to the v2beta1 API :)

To answer some of your questions:
Ideally, the number of workers should match the number of nodes you want to run on. The slotsPerWorker field denotes how many tasks will run in each worker. Generally, this should match the number of GPUs you have per worker.
You don't need to set OMP_NUM_THREADS, since that's actually what slotsPerWorker sets.

If I have node-A with 1 GPU and node-B with 3 GPUs, and want to request 4 GPUs, how should I modify the "Worker" part?

In that case, you might want to set the number of GPUs per worker to 1 (along with slotsPerWorker to 1) and have replicas=4. Not ideal, but it should work.
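
For illustration, a minimal sketch of that layout (only the relevant fields are shown; the image and container name are carried over from your example yaml, and everything else just follows the suggestion above):

spec:
  slotsPerWorker: 1                  # one MPI slot per worker pod
  mpiReplicaSpecs:
    Worker:
      replicas: 4                    # four worker pods, each requesting a single GPU
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - image: 10.252.39.13:5000/deepspeed_ms:v2
            name: deepspeed-mpijob-container
            resources:
              limits:
                nvidia.com/gpu: 1    # the scheduler can then place 1 pod on node-A and 3 on node-B

With 4 workers and 1 slot each, mpirun sees 4 slots in total, one per GPU.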

@ThomaswellY
Author

ThomaswellY commented May 26, 2023

@ThomaswellY Can you create an issue on https://github.com/kubeflow/training-operator, since the mpi-operator doesn't support the v1 API?

Thanks for your reply~
The api-resources of my k8s cluster are shown below:
(base) [root@gpu-233 operator]# kubectl api-resources | grep jobs
cronjobs      cj    batch/v1          true    CronJob
jobs                batch/v1          true    Job
mpijobs             kubeflow.org/v1   true    MPIJob
mxjobs              kubeflow.org/v1   true    MXJob
pytorchjobs         kubeflow.org/v1   true    PyTorchJob
tfjobs              kubeflow.org/v1   true    TFJob
xgboostjobs         kubeflow.org/v1   true    XGBoostJob
Doesn't that indicate that, in my k8s cluster environment, mpijobs are supported by the kubeflow.org/v1 API?
I have applied the example yaml with the kubeflow.org/v1 API successfully, and have seen no significant errors in the pod logs.
@tenzen-y

@ThomaswellY
Author

ThomaswellY commented May 26, 2023

Thanks for your reply~
I am a little confused about which API version can support my resource (MPIJob in my case).
The command "kubectl api-resources" shows that mpijobs in my k8s cluster are supported by kubeflow.org/v1;
if not, what is the proper way to confirm which API supports my mpijobs resource? Any official docs would be helpful~

or you can consider upgrading to the v2beta1 API :)

To answer some of your questions: Ideally, the number of workers should match the number of nodes you want to run on. The slotsPerWorker field denotes how many tasks will run in each worker. Generally, this should match the number of GPUs you have per worker. You don't need to set OMP_NUM_THREADS, since that's actually what slotsPerWorker sets.

If I have node-A with 1 GPU and node-B with 3 GPUs, and want to request 4 GPUs, how should I modify the "Worker" part?

In that case, you might want to set the number of GPUs per worker to 1 (along with slotsPerWorker to 1) and have replicas=4. Not ideal, but it should work.

I have applied the example yaml in this way successfully, but it seems that the 4 GPUs are used separately by 4 pods, and what each worker executed was a single-GPU training. So it is not distributed training (by which I mean multi-node training with a single GPU per node), and the whole process takes more time than single-GPU training in one pod with "replicas: 1". What confuses me is that the value of "replicas" seems to only serve as a multiplier for "nvidia.com/gpu".
In general, there are some things I want to confirm:

  1. How do I confirm which API version supports the mpi-operator? If "kubectl api-resources" does not work, which command should I run?
  2. When the resource limit sets the GPU number to 1 (because one node of the k8s cluster has only one GPU available in this case), distributed training cannot be launched; even though multiple pods can separately execute single-GPU training when replicas > 1, that is in fact just a repetition of single-GPU training.
  3. If I have node-1 with 2 GPUs and node-2 with 4 GPUs, the most effective distributed training that the mpi-operator can launch is 2 nodes with 2 GPUs per node, and the ideal config is setting "slotsPerWorker: 2", "replicas: 2", and "nvidia.com/gpu: 2" (see the sketch after this list).
    The questions are a little too many; I am sorry if that troubles you.
    Thank you in advance~
    @alculquicondor
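
Following up on point 3, a minimal sketch of that config (only the relevant fields are shown; the rest would stay as in the example yaml above):

spec:
  slotsPerWorker: 2                  # matches the 2 GPUs requested per worker
  mpiReplicaSpecs:
    Worker:
      replicas: 2                    # one worker pod per node
      template:
        spec:
          containers:
          - image: 10.252.39.13:5000/deepspeed_ms:v2
            name: deepspeed-mpijob-container
            resources:
              limits:
                nvidia.com/gpu: 2    # 2 GPUs per worker pod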

@alculquicondor
Collaborator

Doesn't that indicate that, in my k8s cluster environment, mpijobs are supported by the kubeflow.org/v1 API?

That is correct. @tenzen-y's point is that the v1 implementation is no longer hosted in this repo.
If you wish to use the newer v2beta1 version, you have to disable the training-operator and install the operator from this repo: https://github.com/kubeflow/mpi-operator#installation
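
If you do switch, the manifest's apiVersion changes to the v2beta1 API group; the top-level fields used in your example (slotsPerWorker, runPolicy, mpiReplicaSpecs) also exist in v2beta1. A sketch of the header only:

apiVersion: kubeflow.org/v2beta1   # served by the operator in this repo
kind: MPIJob
metadata:
  name: cifar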

The rest of the questions:

  1. The command did work; you are running v1.
  2. It sounds like a problem in your application, not the mpi-operator. Did you miss any parameters in your command? I'm not familiar with DeepSpeed.
  3. Yes.

@tenzen-y
Member

@ThomaswellY
Thanks @alculquicondor.
Yes, I meant that this repo doesn't support kubeflow.org/v1; it supports only kubeflow.org/v2beta1.
Currently, kubeflow.org/v1 is supported in https://github.com/kubeflow/training-operator.

Also, I would suggest the v2beta1 MPIJob for DeepSpeed; see kubeflow/training-operator#1792 (comment).

@alculquicondor
Collaborator

Also, it seems that #549 has proof that v2beta1 can run DeepSpeed.

@ThomaswellY
Author

@alculquicondor @tenzen-y thanks for your kind help! Maybe I should use v2beta1 for DeepSpeed.
Anyway, I have run #549 successfully even on v1; however, it seems that only cifar10_deepspeed.py needs no modifications. For gan_deepspeed_train.py, an extra modification is necessary (like args.local_rank = int(os.environ['LOCAL_RANK'])).
So #549 is only an example of using the mpi-operator with DeepSpeed; maybe we can do more so that other scripts can also run with DeepSpeed as-is.

@tenzen-y
Member

@ThomaswellY Thank you for the report!

So #549 is only an example of using the mpi-operator with DeepSpeed; maybe we can do more so that other scripts can also run with DeepSpeed as-is.

Feel free to open PRs. I'm happy to review them :)
