The byteps in K8S Pod doesn't have DMLC_WORKER_ID configured. #418

Open
jackjinj opened this issue Dec 16, 2021 · 0 comments


Describe the bug
The BytePS worker Pod on Kubernetes does not have DMLC_WORKER_ID set, so bpslaunch complains that it cannot find the variable and exits with an error.

To Reproduce
Steps to reproduce the behavior:

  1. Prepare Kubernetes 1.19.
  2. Install Kubeflow 1.2, which includes the MXJob operator.
  3. Download the YAML from https://github.com/kubeflow/mxnet-operator/blob/master/examples/train/byteps_dist_gpu_v1.yaml
  4. Run kubectl apply -f byteps_dist_gpu_v1.yaml
  5. Run kubectl get pod:
    byteps-mxnet-job-scheduler-0 1/1 Running 0 8s
    byteps-mxnet-job-server-0 1/1 Running 0 8s
    byteps-mxnet-job-server-1 1/1 Running 0 8s
    byteps-mxnet-job-worker-0 0/1 Completed 0 8s
    byteps-mxnet-job-worker-1 0/1 Completed 0 7s

$ kubectl describe pod byteps-mxnet-job-worker-0
DMLC_WORKER_ID is not among the environment variables listed for the worker:
DMLC_PS_ROOT_PORT: 9091
DMLC_PS_ROOT_URI: byteps-mxnet-job-scheduler-0
DMLC_NUM_SERVER: 2
DMLC_NUM_WORKER: 2
DMLC_ROLE: worker
DMLC_USE_KUBERNETES: 1
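
To make the gap explicit, here is a minimal Python sketch that compares the variables from the pod description above against the set a worker appears to need. The list `EXPECTED_WORKER_ENVS` and the helper `missing_envs` are hypothetical names for illustration; the list is assembled only from the output above plus the variable bpslaunch reports as missing, so it is an assumption rather than an authoritative list.

```python
# Assumption: names taken from the pod description plus the variable that
# bpslaunch reports as missing; this is not an official BytePS list.
EXPECTED_WORKER_ENVS = [
    "DMLC_ROLE",
    "DMLC_NUM_WORKER",
    "DMLC_NUM_SERVER",
    "DMLC_PS_ROOT_URI",
    "DMLC_PS_ROOT_PORT",
    "DMLC_WORKER_ID",
]


def missing_envs(environ, expected=EXPECTED_WORKER_ENVS):
    """Return the expected variable names that are absent from environ."""
    return [name for name in expected if name not in environ]


# Environment copied from `kubectl describe pod byteps-mxnet-job-worker-0`.
pod_env = {
    "DMLC_PS_ROOT_PORT": "9091",
    "DMLC_PS_ROOT_URI": "byteps-mxnet-job-scheduler-0",
    "DMLC_NUM_SERVER": "2",
    "DMLC_NUM_WORKER": "2",
    "DMLC_ROLE": "worker",
    "DMLC_USE_KUBERNETES": "1",
}

print(missing_envs(pod_env))  # ['DMLC_WORKER_ID']
```

Inside a Pod, the same helper could be called with `os.environ` instead of the hard-coded dictionary.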

To reproduce it inside the Pod, modify the YAML as below so the Pod keeps running without launching bpslaunch:
command: ["/bin/bash", "-c"]
args: [
"sleep 3600"
]

This replaces the original command and args:
command: ["bpslaunch"]
args: ["python3", "/usr/local/byteps/example/mxnet/train_imagenet_byteps.py", "--benchmark", "1", "--batch-size=32"]

Then apply the YAML. The worker Pods now stay running:
byteps-mxnet-job-server-0 1/1 Running 0 15s
byteps-mxnet-job-server-1 1/1 Running 0 15s
byteps-mxnet-job-worker-0 1/1 Running 0 15s
byteps-mxnet-job-worker-1 1/1 Running 0 14s

Then open a shell in the worker Pod and run bpslaunch by hand:
$ kubectl exec -it byteps-mxnet-job-worker-0 -- bash
root@byteps-mxnet-job-worker-0:/#
root@byteps-mxnet-job-worker-0:/# env |grep DMLC_WORKER_ID
root@byteps-mxnet-job-worker-0:/# bpslaunch
BytePS launching worker
The env DMLC_WORKER_ID is missing
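
One possible workaround, sketched here in Python and not confirmed by the maintainers, is to derive the worker ID from the Pod name before calling bpslaunch. It assumes the Pod hostname ends in `-worker-<index>`, as in the `kubectl get pod` output above, and that this index is an acceptable value for DMLC_WORKER_ID. The helpers `worker_id_from_hostname` and `with_worker_id` are hypothetical names, not part of BytePS.

```python
import re


def worker_id_from_hostname(hostname):
    """Extract the trailing index from a name like 'byteps-mxnet-job-worker-1'."""
    match = re.search(r"-worker-(\d+)$", hostname)
    if match is None:
        raise ValueError(f"no worker index in hostname: {hostname!r}")
    return int(match.group(1))


def with_worker_id(environ, hostname):
    """Return a copy of environ with DMLC_WORKER_ID filled in if it is absent."""
    env = dict(environ)
    # Assumption: the Pod ordinal is a valid worker ID; an existing value wins.
    env.setdefault("DMLC_WORKER_ID", str(worker_id_from_hostname(hostname)))
    return env


env = with_worker_id({"DMLC_ROLE": "worker"}, "byteps-mxnet-job-worker-1")
print(env["DMLC_WORKER_ID"])  # 1
```

In a real Pod, a wrapper script could pass `socket.gethostname()` and `os.environ` to `with_worker_id` and then start bpslaunch with the resulting environment.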

Expected behavior
The worker Pods should start bpslaunch successfully and keep running.

Environment (please complete the following information):

  • OS:
  • GCC version:
  • CUDA and NCCL version:
  • Framework (TF, PyTorch, MXNet):

Additional context

If I need to run PyTorch DDP with BytePS on Kubernetes, do I still have to use the MXJob operator, or can I use the PyTorchJob operator?

Thanks

Jack
