Launcher has been unable to complete initialization #481
Comments
Probably worth looking at the driver pod logs. |
Do you mean the mpi-operator controller's log, or the elastic worker's log? |
|
Also, have you tried the v2 controller? |
tensorflow-mnist-elastic-launcher's log
Does "v2 controller" refer to the mpi-operator controller for the v2beta1 version? |
I think you can specify which container to look at, and then you should select the initContainer to see what's going on.
Yes, but you need to install the v2 controller as well. |
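As a rough sketch of the suggestion above about selecting the initContainer: the helpers below list a pod's init containers and print the logs of one named container. The pod name and namespace are the ones from the original report; the function names `init_container_names` and `container_logs` are made up for illustration.

```shell
# Names taken from the original report; adjust for your own job.
POD=tensorflow-mnist-elastic-launcher
NS=mpi-operator

# Print the names of the init containers declared on a pod.
init_container_names() {
  kubectl get pod "$1" -n "$2" -o jsonpath='{.spec.initContainers[*].name}'
}

# Print the logs of one named container (init or regular) in a pod.
container_logs() {
  kubectl logs "$1" -n "$2" -c "$3"
}

# Example (needs cluster access, so it is left commented out):
# for c in $(init_container_names "$POD" "$NS"); do
#   echo "== $c =="
#   container_logs "$POD" "$NS" "$c"
# done
```

`kubectl describe pod "$POD" -n "$NS"` also shows the state of each init container, which can help when the logs are empty.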
@alculquicondor Sorry, I don't know if I accidentally closed this issue, can you reopen it for me? |
I also don't have permissions to do so. Unless this works? /reopen |
@alculquicondor: Reopened this issue. |
@alculquicondor
|
Did you upgrade the controller to v2? What are the logs in the workers? |
Yes, I have switched to v2beta1. The worker's log is as follows:
launcher's log
|
The logs seem to indicate that the network was not configured... Do you see any service for your MPIJob? |
Also which cluster provider are you using, if any? |
Any chance you are using Istio? It's not currently supported. |
pi-launcher's logs
|
I'm not using Istio. |
k8s version
|
Have you tried a newer version? 1.18 is quite old and out of support. I can't guarantee that all the features we use in mpi-operator are supported there. |
Our E2E test runs on 1.21 https://github.com/kubeflow/mpi-operator/blob/master/v2/test/e2e/e2e_suite_test.go#L45 |
OK, I'll try it on version 1.21. |
I think I ran into this same problem with 1.25. I was going to try and add an example with YuniKorn using the MPIJob but I ran into this issue. I'll see if I can figure it out. |
It might be helpful to examine the Service and Endpoint objects. |
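A minimal sketch of that check, assuming the Service is named `pi-worker` (taken from the `pi-worker-0.pi-worker` hostname mentioned in this thread) and lives in the `default` namespace, which is a guess. The function names are hypothetical.

```shell
# Assumed values; replace with your own Service name and namespace.
SVC=pi-worker
NS=default

# Show the Service and its Endpoints object side by side.
show_service_and_endpoints() {
  kubectl get service "$1" -n "$2" -o wide
  kubectl get endpoints "$1" -n "$2" -o yaml
}

# Succeed if the Service is headless, i.e. its clusterIP is "None".
is_headless() {
  [ "$(kubectl get service "$1" -n "$2" -o jsonpath='{.spec.clusterIP}')" = "None" ]
}

# Example (needs cluster access, so it is left commented out):
# show_service_and_endpoints "$SVC" "$NS"
# is_headless "$SVC" "$NS" && echo "$SVC is headless"
```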
You know, right after I posted this, it started working. I'm still a little confused about how these things work under the hood. What I see is that it takes about 3-4 retries for it to work locally. I get this in my logs at first:
It takes some time, but then I get these logs (3rd try):
What I don't really understand is where pi-worker-0.pi-worker is coming from and how that maps to a service with no endpoint.
Endpoint object:
|
It's a headless service: https://kubernetes.io/docs/concepts/services-networking/service/#headless-services Failures are expected while the network is being set up. That's why we use a Job that can retry. |
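To illustrate where a name like `pi-worker-0.pi-worker` comes from: behind a headless Service, a pod whose hostname and subdomain are set gets a DNS record of the form `<hostname>.<service>`, optionally followed by `.<namespace>.svc.<cluster-domain>`. The sketch below builds such a name and retries a lookup until it resolves. It assumes the `<job>-worker` naming seen in this thread and the default `cluster.local` cluster domain; `worker_dns_name` and `retry` are hypothetical helpers, not part of mpi-operator.

```shell
# Build the DNS name of worker <index> of job <job>.
# With a namespace argument, the fully qualified form is returned
# (assumes the default cluster domain "cluster.local").
worker_dns_name() {
  job=$1
  index=$2
  ns=${3:-}
  name="${job}-worker-${index}.${job}-worker"
  if [ -n "$ns" ]; then
    name="${name}.${ns}.svc.cluster.local"
  fi
  printf '%s\n' "$name"
}

# Run a command up to <attempts> times, sleeping <delay> seconds
# between failed attempts. Returns 0 on the first success.
retry() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    echo "attempt $i/$attempts failed: $*" >&2
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}
```

From a shell inside a pod in the same namespace, something like `retry 5 3 getent hosts "$(worker_dns_name pi 0)"` would keep trying until the record appears, provided the image ships `getent`. The first few attempts failing matches the behaviour described above.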
Hello, I tried to run the mpi-operator/examples/v1/horovod/tensorflow-mnist-elastic.yaml example and found that the launcher never completes initialization. The logs show the following:
kubectl get pods -o wide -n mpi-operator
kubectl logs tensorflow-mnist-elastic-launcher -n mpi-operator
How can I troubleshoot or solve this?