
MPIJobs and Istio #480

Open · kimwnasptd opened this issue Nov 2, 2022 · 12 comments

@kimwnasptd (Member)

I tried to run some MPIJobs with Istio enabled in the user namespaces but bumped into a couple of issues. I'll use this issue to document the bugs that occurred as well as proposed solutions, although we might need to break it into smaller issues.

I used the tensorflow-benchmarks example, so this will be my point of reference.

The problems we've observed are the following:

  1. The main container needs to wait for the Istio sidecar to start before it can reach the network (see the sketch after this list).
  2. Workers communicate with the Launcher Pod via Pod IPs, which goes through Istio's PassthroughCluster. This could be a problem in environments with stricter security where the mTLS mode is STRICT: pod-to-pod traffic bypasses mTLS, so requests will be blocked.
  3. The Launcher uses kubectl exec, which can end up exec-ing into the sidecar container instead of the main one.
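
For reference on problem 1, Istio (1.7+) ships a holdApplicationUntilProxyStarts option that reorders container startup so the application only starts once the sidecar is ready. A minimal sketch of the per-pod annotation, shown on a bare Pod with a placeholder image (in an MPIJob it would go on the launcher/worker pod templates); the same can be set mesh-wide via meshConfig.defaultConfig.holdApplicationUntilProxyStarts:

```yaml
# Sketch only: delay the app container until istio-proxy is ready (Istio >= 1.7).
apiVersion: v1
kind: Pod
metadata:
  name: mpi-launcher-example                     # hypothetical name
  annotations:
    proxy.istio.io/config: '{ "holdApplicationUntilProxyStarts": true }'
spec:
  containers:
  - name: launcher
    image: mpioperator/tensorflow-benchmarks:latest   # placeholder image
```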

Apologies if there are already duplicate issues for Istio and MPIJob. Are there any plans for mitigating some of these issues and ensuring MPIJob can work with Istio?

Also, if I should open this issue in kubeflow/training-operator instead, please tell me and I'll open a new one.

@kimwnasptd (Member, Author)

I'll actually create the issue in the training-operator repo, since IIUC this controller's code has been migrated to that repo.

@kimwnasptd (Member, Author)

Closing this in favor of kubeflow/training-operator#1681

@alculquicondor (Collaborator)

It hasn't been fully migrated. There is a new version of the controller that uses ssh instead of exec, which might be a step in the right direction.

We discussed this topic before in #429, but there is no solution yet.

@alculquicondor (Collaborator)

Probably the solution is along the lines of Istio using a TCP proxy instead of an HTTPS proxy.
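
To sketch what that could look like (untested, and the Service name/selector here are hypothetical stand-ins for whatever the operator creates): Istio picks the proxying protocol per Service port from the port name prefix or the appProtocol field, so declaring the workers' SSH port as plain tcp should make the sidecar do a raw TCP passthrough instead of HTTP-level proxying:

```yaml
# Sketch only: force Istio to treat SSH traffic as opaque TCP.
apiVersion: v1
kind: Service
metadata:
  name: tensorflow-benchmarks-worker    # hypothetical headless workers Service
spec:
  clusterIP: None
  selector:
    app: tensorflow-benchmarks-worker   # hypothetical selector
  ports:
  - name: tcp-ssh          # "tcp-" name prefix => raw TCP proxying
    port: 22
    targetPort: 22
    appProtocol: tcp       # the newer, explicit protocol hint
```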

@alculquicondor (Collaborator)

/reopen

Have you tried with the v2 controller?

@google-oss-prow (bot) reopened this Nov 8, 2022

@alculquicondor: Reopened this issue.

@kimwnasptd (Member, Author)

Orthogonal to the technical discussion, but shouldn't we close this issue and keep the context in kubeflow/training-operator#1681? Why was this issue re-opened? Isn't the main development of the MPI Operator now happening in the Training Operator repo?

@alculquicondor (Collaborator)

Not all of it. The v2 operator hasn't been migrated yet.

@kimwnasptd (Member, Author)

Then, on the technical side:

There is a new version of the controller that uses ssh instead of exec, which might be a step in the right direction.

Doesn't using ssh require setting up an SSH daemon in all workers and the launcher, as well as adding keys to the authorized_keys file of each worker? Why is there a preference for this over:

  1. Only installing kubectl in the launcher, and
  2. Configuring the kubectl.kubernetes.io/default-container annotation on each worker, which ensures the launcher execs into the correct container (see the sketch below)?
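
To illustrate option 2, a minimal sketch on a worker pod (pod name and image are placeholders; the annotation itself is the upstream kubectl.kubernetes.io/default-container one):

```yaml
# Sketch only: make "kubectl exec" default to the main container, not istio-proxy.
apiVersion: v1
kind: Pod
metadata:
  name: tensorflow-benchmarks-worker-0     # hypothetical worker pod name
  annotations:
    kubectl.kubernetes.io/default-container: tensorflow-benchmarks
spec:
  containers:
  - name: tensorflow-benchmarks
    image: mpioperator/tensorflow-benchmarks:latest   # placeholder image
  # the injected istio-proxy sidecar appears alongside this container
```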

Probably the solution is along the lines of Istio using a TCP proxy instead of an HTTPS proxy.

Could you clarify a little bit what you mean here? Are you referring to the sidecars? Which HTTPS proxy did you have in mind?

Have you tried with the v2 controller?

How can I try this? Are the manifests for this controller part of this repo? I currently have a KF cluster, so can I have both these standalone manifests and the Training Operator at the same time?

Also, I have some more questions around the relationship between this repo and the Training Operator, but I'm pretty sure this discussion has been had somewhere else. Could you point me to any docs or issues? If we don't have any, I can raise a new issue so as not to pollute this one with that discussion.

@alculquicondor (Collaborator)

Using kubectl exec implies that every job's startup sequence involves a tunnel through the API server, and that connection stays up for the entire runtime. This is not scalable.

Could you clarify a little bit what you mean here? Are you referring to the sidecars? Which HTTPS proxy did you have in mind?

TBH, I'm not familiar with Istio. I picked up that suggestion from #429.

How can I try this? Are the manifests for this controller part of this repo? I currently have a KF cluster, so can I have both these standalone manifests and the Training Operator at the same time?

Yes, they are in this repo. I don't think you can have both running at the same time. IIUC, there is a way to disable specific operators in training-operator, but I'm not familiar with the details.

Also I have some more questions around the relationship between this repo and Training Operator, but I'm pretty sure this discussion has been had somewhere else.

Mainly just lack of contributors: kubeflow/training-operator#1479

@alculquicondor (Collaborator)

Doesn't using ssh require setting up an SSH daemon in all workers and the launcher, as well as adding keys to the authorized_keys file of each worker?

Yes, but the operator does that for you. You just have to create a compatible image (that has the sshd binary) and write your mpirun command.
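
For example, a minimal sketch of a v2 MPIJob, assuming an image that bundles sshd, mpirun, and the benchmark script (field names follow the v2beta1 API; the image and command are placeholders):

```yaml
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: tensorflow-benchmarks
spec:
  slotsPerWorker: 1
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - name: launcher
            image: mpioperator/tensorflow-benchmarks:latest  # placeholder: needs ssh + mpirun
            command:
            - mpirun
            - -np
            - "2"
            - python
            - tf_cnn_benchmarks.py        # placeholder benchmark entrypoint
            - --model=resnet50
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - name: worker
            image: mpioperator/tensorflow-benchmarks:latest  # placeholder: needs sshd running
```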

@alculquicondor (Collaborator)

Also, we know that ssh has its problems too. The long-term solution is to use PMIx: #12
