MPIJobs and Istio #480
Comments
I'll actually create the issue in the training-operator repo since, IIUC, this controller's code has been migrated to that repo.
Closing this in favor of kubeflow/training-operator#1681
It hasn't been fully migrated. There is a new version of the controller that uses ssh instead of exec, which might be a step in the right direction. We discussed this topic before in #429, but there is no solution.
The solution is probably along the lines of Istio using a TCP proxy instead of an HTTPS proxy.
/reopen Have you tried with the v2 controller?
@alculquicondor: Reopened this issue.
Orthogonal to the technical discussion, but shouldn't we close this issue and keep this context on kubeflow/training-operator#1681? Why was this issue re-opened? Isn't the main development of MPI Operator now happening in the Training Operator repo?
Not all of it. The v2 operator hasn't been migrated yet.
Then on the technical aspect
Doesn't using ssh require setting up an ssh agent in all workers and launcher, as well as adding keys to the
Could you clarify a little what you mean here? Are you referring to the sidecars? Which HTTPS proxy did you have in mind?
How can I try this? Are the manifests for this controller part of this repo? I currently have a KF cluster, so can I have both these standalone manifests and Training Operator at the same time? Also, I have some more questions around the relationship between this repo and Training Operator, but I'm pretty sure this discussion has been had somewhere else. Could you point me to any docs or issues? If we don't have any, I can raise a new issue so as not to pollute this one with that discussion.
Using
TBH, I'm not familiar with Istio. I picked up that suggestion from #429.
Yes, they are in this repo. I don't think you can have both running at the same time. IIUC, there is a way to disable specific operators in training-operator, but I'm not familiar with it.
Mainly just a lack of contributors: kubeflow/training-operator#1479
Yes, but the operator does that for you. You just have to create a compatible image (that has the sshd binary) and write your mpirun command.
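As a rough illustration of the "write your mpirun command" part, here is a minimal sketch that assembles a launcher command as an argument list. The helper name `mpirun_command` and the program `train.py` are hypothetical, and the sketch assumes the host list is supplied to mpirun by the operator's own configuration rather than on the command line.

```python
import shlex


def mpirun_command(num_procs, program):
    """Build an mpirun argv list for a launcher container.

    Assumption: the hosts are provided to mpirun by the operator's
    configuration, so no host flags are added here.
    """
    if num_procs < 1:
        raise ValueError("num_procs must be at least 1")
    return ["mpirun", "-np", str(num_procs), *program]


# "train.py" is a placeholder for whatever the image actually runs.
cmd = mpirun_command(2, ["python", "train.py"])
print(shlex.join(cmd))
```

Keeping the command as a list rather than a single string maps directly onto a container's `command` field and avoids shell quoting problems.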
Also, we know that ssh has its problems too. The long term solution is to use PMIx #12
I tried to run some MPIJobs with Istio enabled in the user namespaces but have bumped into a couple of issues. I'll use this issue to expose the bugs that occurred as well as proposed solutions, although we might need to break this into smaller issues.
I used the tensorflow-benchmarks example, so this will be my point of reference.

The problems we've observed are the following:
[…] proxy.istio.io/config: '{"holdApplicationUntilProxyStarts": true}' annotation on the pods
[…] kubectl exec, which can use the sidecar
[…] kubectl.kubernetes.io/default-container annotation in the worker pods

Apologies if there were duplicate issues for Istio and MPIJob. Are there any plans for mitigating some of these issues and ensuring MPIJob can work with Istio?
Also, if I need to open this issue in kubeflow/training-operator, please tell me and I'll open a new one.
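For reference, a minimal sketch of the two annotations named in the issue description, built as plain pod-template metadata. The annotation keys and the `holdApplicationUntilProxyStarts` value are taken from the text above. The helper `pod_annotations`, the container name `worker`, and the choice of which pods get which annotation are assumptions for illustration only.

```python
import json

# Keys and value quoted from the issue description.
HOLD_APP_KEY = "proxy.istio.io/config"
HOLD_APP_VALUE = json.dumps({"holdApplicationUntilProxyStarts": True})
DEFAULT_CONTAINER_KEY = "kubectl.kubernetes.io/default-container"


def pod_annotations(default_container=None):
    """Return pod-template annotations for an Istio-injected pod.

    The first annotation asks Istio to hold the application container
    until the sidecar proxy has started. The second, if a container
    name is given, tells kubectl which container to target by default,
    so that `kubectl exec` does not land in the sidecar.
    """
    annotations = {HOLD_APP_KEY: HOLD_APP_VALUE}
    if default_container is not None:
        annotations[DEFAULT_CONTAINER_KEY] = default_container
    return annotations


# "worker" is a hypothetical container name; use the real one from the pod spec.
worker_meta = {"metadata": {"annotations": pod_annotations("worker")}}
launcher_meta = {"metadata": {"annotations": pod_annotations()}}
print(json.dumps(worker_meta, indent=2))
```

The resulting dictionaries would sit under each replica's pod `template` in the MPIJob manifest. Whether this is enough for a full run under Istio is not established here.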