
Launcher has been unable to complete initialization #481

Open
LY-today opened this issue Nov 2, 2022 · 25 comments

@LY-today commented Nov 2, 2022

Hello, I tried to run the mpi-operator/examples/v1/horovod/tensorflow-mnist-elastic.yaml example and found that the launcher never finishes initializing. The output is shown below.

kubectl get pods -o wide -n mpi-operator

tensorflow-mnist-elastic-launcher   0/1     Init:0/1   0          6m11s
tensorflow-mnist-elastic-worker-0   1/1     Running    0          6m11s
tensorflow-mnist-elastic-worker-1   1/1     Running    0          6m11s

kubectl logs tensorflow-mnist-elastic-launcher -n mpi-operator

Error from server (BadRequest): container "mpi-launcher" in pod "tensorflow-mnist-elastic-launcher" is waiting to start: PodInitializing

How can I troubleshoot or resolve this?

LY-today changed the title from "mpi-operator/examples/v1/horovod/tensorflow-mnist-elastic.yaml is not running" to "Launcher has been unable to complete initialization" on Nov 2, 2022
@alculquicondor (Collaborator)

Probably worth looking at the driver pod logs.

@LY-today (Author) commented Nov 2, 2022

Probably worth looking at the driver pod logs.

Do you mean the mpi-operator controller's log, or the elastic worker's log?

@LY-today (Author) commented Nov 2, 2022

Probably worth looking at the driver pod logs.

I1102 15:12:47.609487       1 event.go:258] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"mpi-operator", Name:"tensorflow-mnist-elastic", UID:"8b7c0c2c-37ec-46aa-b2b3-19dd8692ef20", APIVersion:"kubeflow.org/v1", ResourceVersion:"59951895", FieldPath:""}): type: 'Normal' reason: 'MPIJobCreated' MPIJob mpi-operator/tensorflow-mnist-elastic is created.
I1102 15:12:47.661816       1 mpi_job_controller.go:439] Finished syncing job "mpi-operator/tensorflow-mnist-elastic" (47.272401ms)
I1102 15:12:47.661841       1 mpi_job_controller.go:421] Successfully synced 'mpi-operator/tensorflow-mnist-elastic'
I1102 15:12:47.667544       1 mpi_job_controller.go:439] Finished syncing job "mpi-operator/tensorflow-mnist-elastic" (5.684899ms)
E1102 15:12:47.667580       1 mpi_job_controller.go:426] error syncing 'mpi-operator/tensorflow-mnist-elastic': Operation cannot be fulfilled on mpijobs.kubeflow.org "tensorflow-mnist-elastic": the object has been modified; please apply your changes to the latest version and try again
I1102 15:12:47.667733       1 mpi_job_controller.go:439] Finished syncing job "mpi-operator/tensorflow-mnist-elastic" (136.121µs)
I1102 15:12:47.667754       1 mpi_job_controller.go:421] Successfully synced 'mpi-operator/tensorflow-mnist-elastic'
I1102 15:12:47.674149       1 mpi_job_controller.go:439] Finished syncing job "mpi-operator/tensorflow-mnist-elastic" (148.304µs)
I1102 15:12:47.674170       1 mpi_job_controller.go:421] Successfully synced 'mpi-operator/tensorflow-mnist-elastic'
I1102 15:12:47.684830       1 mpi_job_controller.go:439] Finished syncing job "mpi-operator/tensorflow-mnist-elastic" (121.306µs)
I1102 15:12:47.684847       1 mpi_job_controller.go:421] Successfully synced 'mpi-operator/tensorflow-mnist-elastic'
I1102 15:12:47.699492       1 mpi_job_controller.go:439] Finished syncing job "mpi-operator/tensorflow-mnist-elastic" (122.873µs)
I1102 15:12:47.699511       1 mpi_job_controller.go:421] Successfully synced 'mpi-operator/tensorflow-mnist-elastic'
I1102 15:12:47.733078       1 mpi_job_controller.go:439] Finished syncing job "mpi-operator/tensorflow-mnist-elastic" (137.972µs)
I1102 15:12:47.733095       1 mpi_job_controller.go:421] Successfully synced 'mpi-operator/tensorflow-mnist-elastic'
I1102 15:12:48.907583       1 mpi_job_controller.go:439] Finished syncing job "mpi-operator/tensorflow-mnist-elastic" (12.006217ms)
I1102 15:12:48.907617       1 mpi_job_controller.go:421] Successfully synced 'mpi-operator/tensorflow-mnist-elastic'
I1102 15:12:48.911159       1 mpi_job_controller.go:439] Finished syncing job "mpi-operator/tensorflow-mnist-elastic" (158.817µs)
I1102 15:12:48.911178       1 mpi_job_controller.go:421] Successfully synced 'mpi-operator/tensorflow-mnist-elastic'
I1102 15:12:49.453188       1 mpi_job_controller.go:439] Finished syncing job "mpi-operator/tensorflow-mnist-elastic" (11.34971ms)
I1102 15:12:49.453214       1 mpi_job_controller.go:421] Successfully synced 'mpi-operator/tensorflow-mnist-elastic'
I1102 15:12:49.457259       1 mpi_job_controller.go:439] Finished syncing job "mpi-operator/tensorflow-mnist-elastic" (131.943µs)
I1102 15:12:49.457277       1 mpi_job_controller.go:421] Successfully synced 'mpi-operator/tensorflow-mnist-elastic'
I1102 15:12:49.689400       1 mpi_job_controller.go:439] Finished syncing job "mpi-operator/tensorflow-mnist-elastic" (146.521µs)
I1102 15:12:49.689418       1 mpi_job_controller.go:421] Successfully synced 'mpi-operator/tensorflow-mnist-elastic'

@alculquicondor (Collaborator)

tensorflow-mnist-elastic-launcher

Also, have you tried the v2 controller?

LY-today closed this as completed Nov 2, 2022
@LY-today (Author) commented Nov 2, 2022

tensorflow-mnist-elastic-launcher

Also, have you tried the v2 controller?

The tensorflow-mnist-elastic-launcher log:

Error from server (BadRequest): container "mpi-launcher" in pod "tensorflow-mnist-elastic-launcher" is waiting to start: PodInitializing

Does "v2 controller" refer to the v2beta1 version of the mpi-operator controller?

@alculquicondor (Collaborator)

I think you can specify which container to look at, and then you should select the initContainer to see what's going on.
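
For reference, a minimal sketch of doing that with kubectl, assuming the v1 controller's launcher pod uses an init container named kubectl-delivery (that name is an assumption; kubectl describe pod will show the actual init container name and its state):

kubectl describe pod tensorflow-mnist-elastic-launcher -n mpi-operator
kubectl logs tensorflow-mnist-elastic-launcher -n mpi-operator -c kubectl-delivery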

Does "v2 controller" refer to the v2beta1 version of the mpi-operator controller?

Yes, but you need to install the v2 controller as well.
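
A minimal sketch of installing the v2beta1 controller; the manifest path below is an assumption based on the repository layout, so check the project README for the exact command for your release:

kubectl apply -f https://raw.githubusercontent.com/kubeflow/mpi-operator/master/deploy/v2beta1/mpi-operator.yaml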

@LY-today (Author) commented Nov 2, 2022

@alculquicondor Sorry, I may have accidentally closed this issue; can you reopen it for me?

@alculquicondor (Collaborator)

I also don't have permissions to do so. Unless this works?

/reopen

@google-oss-prow (bot)

@alculquicondor: Reopened this issue.

In response to this:

I also don't have permissions to do so. Unless this works?

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

google-oss-prow (bot) reopened this Nov 2, 2022
@LY-today (Author) commented Nov 3, 2022

@alculquicondor
I tried the v2beta1 version. When I created mpi-operator/examples/v2beta1/tensorflow-benchmarks/tensorflow-benchmarks.yaml, the launcher went into CrashLoopBackOff, and its log shows:

ssh: Could not resolve hostname tensorflow-benchmarks-worker-0.tensorflow-benchmarks-worker: Name or service not known
ssh: Could not resolve hostname tensorflow-benchmarks-worker-1.tensorflow-benchmarks-worker: Name or service not known
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon
located on the indicated node:

  my node:   tensorflow-benchmarks-launcher
  target node:  tensorflow-benchmarks-worker-1.tensorflow-benchmarks-worker

This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.
--------------------------------------------------------------------------

@alculquicondor (Collaborator)

Did you upgrade the controller to v2? What are the logs in the workers?

@LY-today (Author) commented Nov 3, 2022

Did you upgrade the controller to v2? What are the logs in the workers?

Yes, I have switched to v2beta1. The worker's log is as follows:

Server listening on 0.0.0.0 port 22.
Server listening on :: port 22

The launcher's log:

ssh: Could not resolve hostname tensorflow-benchmarks-worker-0.tensorflow-benchmarks-worker: Name or service not known
ssh: Could not resolve hostname tensorflow-benchmarks-worker-1.tensorflow-benchmarks-worker: Name or service not known
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon
located on the indicated node:

  my node:   tensorflow-benchmarks-launcher
  target node:  tensorflow-benchmarks-worker-1.tensorflow-benchmarks-worker

This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.
--------------------------------------------------------------------------

@alculquicondor (Collaborator)

The logs seem to indicate that the network was not configured... Do you see any service for your MPIJob?
Have you tried the pi example? https://github.com/kubeflow/mpi-operator/tree/master/examples/v2beta1/pi
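
A quick way to check, assuming the controller follows a <job-name>-worker naming convention for the headless Service and applies training.kubeflow.org labels to the pods (both assumptions here, derived from the job name in this report):

kubectl get svc tensorflow-benchmarks-worker
kubectl get endpoints tensorflow-benchmarks-worker
kubectl get pods -l training.kubeflow.org/job-name=tensorflow-benchmarks -o wide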

@alculquicondor (Collaborator)

Also which cluster provider are you using, if any?

@alculquicondor (Collaborator)

Any chance you are using istio? It's not currently supported.

@LY-today (Author) commented Nov 4, 2022

The logs seem to indicate that the network was not configured... Do you see any service for your MPIJob? Have you tried the pi example? https://github.com/kubeflow/mpi-operator/tree/master/examples/v2beta1/pi

The pi-launcher logs:

ssh: Could not resolve hostname pi-worker-1.pi-worker: Name or service not known
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon
located on the indicated node:

  my node:   pi-launcher
  target node:  pi-worker-0.pi-worker

This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.
--------------------------------------------------------------------------

@LY-today (Author) commented Nov 4, 2022

Any chance you are using istio? It's not currently supported.

No, I'm not using Istio.

@LY-today (Author) commented Nov 4, 2022

Also which cluster provider are you using, if any?

The Kubernetes version is:

Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.2", GitCommit:"52c56ce7a8272c798dbc29846288d7cd9fbae032", GitTreeState:"archive", BuildDate:"2020-05-20T04:14:03Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.2", GitCommit:"52c56ce7a8272c798dbc29846288d7cd9fbae032", GitTreeState:"archive", BuildDate:"2020-05-20T04:11:48Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/amd64"}

@alculquicondor (Collaborator)

Have you tried a newer version? 1.18 is quite old and out of support.

I can't guarantee that all the features we use in the mpi-operator are supported on it.

@alculquicondor (Collaborator)

Our E2E test runs on 1.21 https://github.com/kubeflow/mpi-operator/blob/master/v2/test/e2e/e2e_suite_test.go#L45

@LY-today (Author) commented Nov 7, 2022

Our E2E test runs on 1.21 https://github.com/kubeflow/mpi-operator/blob/master/v2/test/e2e/e2e_suite_test.go#L45

OK, I'll try it on version 1.21.

@kannon92

I think I ran into this same problem with 1.25. I was going to try and add an example with YuniKorn using the MPIJob but I ran into this issue.

I'll see if I can figure it out.

@alculquicondor (Collaborator)

It might be helpful to examine the Service and Endpoints objects.

@kannon92

You know, right after I posted this, it started working.

So I'm still a little confused about how these things work under the hood.

What I see is that it takes about 3-4 retries for it to work locally.

I get this in my logs at first:

Warning: Permanently added 'pi-worker-0.pi-worker,10.244.0.26' (ECDSA) to the list of known hosts.
ssh: Could not resolve hostname pi-worker-1.pi-worker: Name or service not known

It takes some time, but then I get these logs (on the third try):

[ec2-user@ip-172-31-93-184 mpi-operator-kevin]$ kubectl logs pi-launcher-z7v5g
Warning: Permanently added 'pi-worker-0.pi-worker,10.244.0.26' (ECDSA) to the list of known hosts.
Warning: Permanently added 'pi-worker-1.pi-worker,10.244.0.28' (ECDSA) to the list of known hosts.
Workers: 2
Rank 0 on host pi-worker-0
Rank 1 on host pi-worker-1
pi is approximately 3.1410376000000002

What I don't really understand is where pi-worker-0.pi-worker comes from and how it maps to a Service with no endpoints.
I see the following Service object:

[ec2-user@ip-172-31-93-184 mpi-operator-kevin]$ kubectl describe service pi-worker
Name:              pi-worker
Namespace:         default
Labels:            app=pi
Annotations:       <none>
Selector:          training.kubeflow.org/job-name=pi,training.kubeflow.org/job-role=worker,training.kubeflow.org/operator-name=mpi-operator
Type:              ClusterIP
IP Family Policy:  SingleStack
IP Families:       IPv4
IP:                None
IPs:               None
Session Affinity:  None
Events:            <none>

The Endpoints object:

[ec2-user@ip-172-31-93-184 mpi-operator-kevin]$ kubectl describe endpoints pi-worker
Name:         pi-worker
Namespace:    default
Labels:       app=pi
              service.kubernetes.io/headless=
Annotations:  <none>
Subsets:
Events:  <none>

@alculquicondor
Copy link
Collaborator

It's a headless Service: https://kubernetes.io/docs/concepts/services-networking/service/#headless-services

Failures are expected while the network is being set up. That's why we use a Job that can retry.
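
A quick way to confirm that the headless Service resolves the per-pod DNS names, assuming the worker image ships getent or nslookup (it may not, depending on the base image):

kubectl exec pi-worker-0 -- getent hosts pi-worker-1.pi-worker
kubectl exec pi-worker-0 -- nslookup pi-worker-1.pi-worker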
