
Launcher has been unable to complete initialization #481

Open
LY-today opened this issue Nov 2, 2022 · 25 comments

@LY-today commented Nov 2, 2022

Hello, I tried to run the mpi-operator/examples/v1/horovod/tensorflow-mnist-elastic.yaml example and found that the launcher never finishes initializing. The output is shown below.

kubectl get pods -o wide -n mpi-operator

tensorflow-mnist-elastic-launcher   0/1     Init:0/1   0          6m11s
tensorflow-mnist-elastic-worker-0   1/1     Running    0          6m11s
tensorflow-mnist-elastic-worker-1   1/1     Running    0          6m11s

kubectl logs tensorflow-mnist-elastic-launcher -n mpi-operator

Error from server (BadRequest): container "mpi-launcher" in pod "tensorflow-mnist-elastic-launcher" is waiting to start: PodInitializing

How can I troubleshoot or resolve this?

LY-today changed the title from "mpi-operator/examples/v1/horovod/tensorflow-mnist-elastic.yaml is not running" to "Launcher has been unable to complete initialization" on Nov 2, 2022
@alculquicondor (Collaborator)

Probably worth looking at the driver pod logs.

@LY-today (Author) commented Nov 2, 2022

Probably worth looking at the driver pod logs.

Do you mean the mpi-operator controller's log, or the elastic worker's log?

@LY-today (Author) commented Nov 2, 2022

Probably worth looking at the driver pod logs.

I1102 15:12:47.609487       1 event.go:258] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"mpi-operator", Name:"tensorflow-mnist-elastic", UID:"8b7c0c2c-37ec-46aa-b2b3-19dd8692ef20", APIVersion:"kubeflow.org/v1", ResourceVersion:"59951895", FieldPath:""}): type: 'Normal' reason: 'MPIJobCreated' MPIJob mpi-operator/tensorflow-mnist-elastic is created.
I1102 15:12:47.661816       1 mpi_job_controller.go:439] Finished syncing job "mpi-operator/tensorflow-mnist-elastic" (47.272401ms)
I1102 15:12:47.661841       1 mpi_job_controller.go:421] Successfully synced 'mpi-operator/tensorflow-mnist-elastic'
I1102 15:12:47.667544       1 mpi_job_controller.go:439] Finished syncing job "mpi-operator/tensorflow-mnist-elastic" (5.684899ms)
E1102 15:12:47.667580       1 mpi_job_controller.go:426] error syncing 'mpi-operator/tensorflow-mnist-elastic': Operation cannot be fulfilled on mpijobs.kubeflow.org "tensorflow-mnist-elastic": the object has been modified; please apply your changes to the latest version and try again
I1102 15:12:47.667733       1 mpi_job_controller.go:439] Finished syncing job "mpi-operator/tensorflow-mnist-elastic" (136.121µs)
I1102 15:12:47.667754       1 mpi_job_controller.go:421] Successfully synced 'mpi-operator/tensorflow-mnist-elastic'
I1102 15:12:47.674149       1 mpi_job_controller.go:439] Finished syncing job "mpi-operator/tensorflow-mnist-elastic" (148.304µs)
I1102 15:12:47.674170       1 mpi_job_controller.go:421] Successfully synced 'mpi-operator/tensorflow-mnist-elastic'
I1102 15:12:47.684830       1 mpi_job_controller.go:439] Finished syncing job "mpi-operator/tensorflow-mnist-elastic" (121.306µs)
I1102 15:12:47.684847       1 mpi_job_controller.go:421] Successfully synced 'mpi-operator/tensorflow-mnist-elastic'
I1102 15:12:47.699492       1 mpi_job_controller.go:439] Finished syncing job "mpi-operator/tensorflow-mnist-elastic" (122.873µs)
I1102 15:12:47.699511       1 mpi_job_controller.go:421] Successfully synced 'mpi-operator/tensorflow-mnist-elastic'
I1102 15:12:47.733078       1 mpi_job_controller.go:439] Finished syncing job "mpi-operator/tensorflow-mnist-elastic" (137.972µs)
I1102 15:12:47.733095       1 mpi_job_controller.go:421] Successfully synced 'mpi-operator/tensorflow-mnist-elastic'
I1102 15:12:48.907583       1 mpi_job_controller.go:439] Finished syncing job "mpi-operator/tensorflow-mnist-elastic" (12.006217ms)
I1102 15:12:48.907617       1 mpi_job_controller.go:421] Successfully synced 'mpi-operator/tensorflow-mnist-elastic'
I1102 15:12:48.911159       1 mpi_job_controller.go:439] Finished syncing job "mpi-operator/tensorflow-mnist-elastic" (158.817µs)
I1102 15:12:48.911178       1 mpi_job_controller.go:421] Successfully synced 'mpi-operator/tensorflow-mnist-elastic'
I1102 15:12:49.453188       1 mpi_job_controller.go:439] Finished syncing job "mpi-operator/tensorflow-mnist-elastic" (11.34971ms)
I1102 15:12:49.453214       1 mpi_job_controller.go:421] Successfully synced 'mpi-operator/tensorflow-mnist-elastic'
I1102 15:12:49.457259       1 mpi_job_controller.go:439] Finished syncing job "mpi-operator/tensorflow-mnist-elastic" (131.943µs)
I1102 15:12:49.457277       1 mpi_job_controller.go:421] Successfully synced 'mpi-operator/tensorflow-mnist-elastic'
I1102 15:12:49.689400       1 mpi_job_controller.go:439] Finished syncing job "mpi-operator/tensorflow-mnist-elastic" (146.521µs)
I1102 15:12:49.689418       1 mpi_job_controller.go:421] Successfully synced 'mpi-operator/tensorflow-mnist-elastic'

@alculquicondor (Collaborator)

tensorflow-mnist-elastic-launcher

Also, have you tried the v2 controller?

LY-today closed this as completed Nov 2, 2022
@LY-today (Author) commented Nov 2, 2022

tensorflow-mnist-elastic-launcher

Also, have you tried the v2 controller?

The tensorflow-mnist-elastic-launcher log:

Error from server (BadRequest): container "mpi-launcher" in pod "tensorflow-mnist-elastic-launcher" is waiting to start: PodInitializing

Does "v2 controller" refer to the v2beta1 version of the mpi-operator controller?

@alculquicondor (Collaborator)

I think you can specify which container to look at, and then you should select the initContainer to see what's going on.
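
For reference, a minimal sketch of doing that with kubectl, assuming the v1 controller's launcher pod uses an init container named kubectl-delivery (that name is an assumption; kubectl describe pod will show the actual init container name and its state):

kubectl describe pod tensorflow-mnist-elastic-launcher -n mpi-operator
kubectl logs tensorflow-mnist-elastic-launcher -n mpi-operator -c kubectl-delivery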

Does "v2 controller" refer to the v2beta1 version of the mpi-operator controller?

Yes, but you need to install the v2 controller as well.
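
A minimal sketch of installing the v2beta1 controller; the manifest path below is an assumption based on the repository layout, so check the project README for the exact command for your release:

kubectl apply -f https://raw.githubusercontent.com/kubeflow/mpi-operator/master/deploy/v2beta1/mpi-operator.yaml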

@LY-today (Author) commented Nov 2, 2022

@alculquicondor Sorry, I may have accidentally closed this issue; can you reopen it for me?

@alculquicondor (Collaborator)

I also don't have permissions to do so. Unless this works?

/reopen

@google-oss-prow (bot)

@alculquicondor: Reopened this issue.

In response to this:

I also don't have permissions to do so. Unless this works?

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

google-oss-prow (bot) reopened this Nov 2, 2022
@LY-today (Author) commented Nov 3, 2022

@alculquicondor
I tried the v2beta1 version. When I created mpi-operator/examples/v2beta1/tensorflow-benchmarks/tensorflow-benchmarks.yaml, the launcher went into CrashLoopBackOff, and its log shows:

ssh: Could not resolve hostname tensorflow-benchmarks-worker-0.tensorflow-benchmarks-worker: Name or service not known
ssh: Could not resolve hostname tensorflow-benchmarks-worker-1.tensorflow-benchmarks-worker: Name or service not known
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon
located on the indicated node:

  my node:   tensorflow-benchmarks-launcher
  target node:  tensorflow-benchmarks-worker-1.tensorflow-benchmarks-worker

This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.
--------------------------------------------------------------------------

@alculquicondor (Collaborator)

Did you upgrade the controller to v2? What are the logs in the workers?

@LY-today (Author) commented Nov 3, 2022

Did you upgrade the controller to v2? What are the logs in the workers?

Yes, I have switched to v2beta1. The worker's log is as follows:

Server listening on 0.0.0.0 port 22.
Server listening on :: port 22

The launcher's log:

ssh: Could not resolve hostname tensorflow-benchmarks-worker-0.tensorflow-benchmarks-worker: Name or service not known
ssh: Could not resolve hostname tensorflow-benchmarks-worker-1.tensorflow-benchmarks-worker: Name or service not known
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon
located on the indicated node:

  my node:   tensorflow-benchmarks-launcher
  target node:  tensorflow-benchmarks-worker-1.tensorflow-benchmarks-worker

This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.
--------------------------------------------------------------------------

@alculquicondor (Collaborator)

The logs seem to indicate that the network was not configured... Do you see any service for your MPIJob?
Have you tried the pi example? https://github.com/kubeflow/mpi-operator/tree/master/examples/v2beta1/pi
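
A quick way to check, assuming the controller follows a <job-name>-worker naming convention for the headless Service and applies training.kubeflow.org labels to the pods (both assumptions here, derived from the job name in this report):

kubectl get svc tensorflow-benchmarks-worker
kubectl get endpoints tensorflow-benchmarks-worker
kubectl get pods -l training.kubeflow.org/job-name=tensorflow-benchmarks -o wide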

@alculquicondor (Collaborator)

Also which cluster provider are you using, if any?

@alculquicondor (Collaborator)

Any chance you are using istio? It's not currently supported.

@LY-today (Author) commented Nov 4, 2022

The logs seem to indicate that the network was not configured... Do you see any service for your MPIJob? Have you tried the pi example? https://github.com/kubeflow/mpi-operator/tree/master/examples/v2beta1/pi

The pi-launcher logs:

ssh: Could not resolve hostname pi-worker-1.pi-worker: Name or service not known
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon
located on the indicated node:

  my node:   pi-launcher
  target node:  pi-worker-0.pi-worker

This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.
--------------------------------------------------------------------------

@LY-today (Author) commented Nov 4, 2022

Any chance you are using istio? It's not currently supported.

No, I'm not using Istio.

@LY-today (Author) commented Nov 4, 2022

Also which cluster provider are you using, if any?

The Kubernetes version is:

Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.2", GitCommit:"52c56ce7a8272c798dbc29846288d7cd9fbae032", GitTreeState:"archive", BuildDate:"2020-05-20T04:14:03Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.2", GitCommit:"52c56ce7a8272c798dbc29846288d7cd9fbae032", GitTreeState:"archive", BuildDate:"2020-05-20T04:11:48Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/amd64"}

@alculquicondor (Collaborator)

Have you tried a newer version? 1.18 is quite old and out of support.

I can't guarantee that all the features we use in the mpi-operator are supported on it.

@alculquicondor (Collaborator)

Our E2E test runs on 1.21 https://github.com/kubeflow/mpi-operator/blob/master/v2/test/e2e/e2e_suite_test.go#L45

@LY-today (Author) commented Nov 7, 2022

Our E2E test runs on 1.21 https://github.com/kubeflow/mpi-operator/blob/master/v2/test/e2e/e2e_suite_test.go#L45

OK, I'll try it on version 1.21.

@kannon92

I think I ran into this same problem with 1.25. I was going to try and add an example with YuniKorn using the MPIJob but I ran into this issue.

I'll see if I can figure it out.

@alculquicondor (Collaborator)

It might be helpful to examine the Service and Endpoints objects.

@kannon92

You know, right after I posted this, it started working.

So I'm still a little confused about how these things work under the hood.

What I see is that it takes about 3-4 retries for it to work locally.

I get this in my logs at first:

Warning: Permanently added 'pi-worker-0.pi-worker,10.244.0.26' (ECDSA) to the list of known hosts.
ssh: Could not resolve hostname pi-worker-1.pi-worker: Name or service not known

It takes some time, but then I get these logs (on the third try):

[ec2-user@ip-172-31-93-184 mpi-operator-kevin]$ kubectl logs pi-launcher-z7v5g
Warning: Permanently added 'pi-worker-0.pi-worker,10.244.0.26' (ECDSA) to the list of known hosts.
Warning: Permanently added 'pi-worker-1.pi-worker,10.244.0.28' (ECDSA) to the list of known hosts.
Workers: 2
Rank 0 on host pi-worker-0
Rank 1 on host pi-worker-1
pi is approximately 3.1410376000000002

What I don't really understand is where pi-worker-0.pi-worker comes from and how it maps to a Service with no endpoints.
I see the following Service object:

[ec2-user@ip-172-31-93-184 mpi-operator-kevin]$ kubectl describe service pi-worker
Name:              pi-worker
Namespace:         default
Labels:            app=pi
Annotations:       <none>
Selector:          training.kubeflow.org/job-name=pi,training.kubeflow.org/job-role=worker,training.kubeflow.org/operator-name=mpi-operator
Type:              ClusterIP
IP Family Policy:  SingleStack
IP Families:       IPv4
IP:                None
IPs:               None
Session Affinity:  None
Events:            <none>

The Endpoints object:

[ec2-user@ip-172-31-93-184 mpi-operator-kevin]$ kubectl describe endpoints pi-worker
Name:         pi-worker
Namespace:    default
Labels:       app=pi
              service.kubernetes.io/headless=
Annotations:  <none>
Subsets:
Events:  <none>

@alculquicondor
Copy link
Collaborator

It's a headless Service: https://kubernetes.io/docs/concepts/services-networking/service/#headless-services

Failures are expected while the network is being set up. That's why we use a Job that can retry.
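
A quick way to confirm that the headless Service resolves the per-pod DNS names, assuming the worker image ships getent or nslookup (it may not, depending on the base image):

kubectl exec pi-worker-0 -- getent hosts pi-worker-1.pi-worker
kubectl exec pi-worker-0 -- nslookup pi-worker-1.pi-worker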
