Replace the plain pod workers with Indexed Job #613

Open
tenzen-y opened this issue Jan 4, 2024 · 4 comments

@tenzen-y
Member

tenzen-y commented Jan 4, 2024

Part-of: #373

Currently, the mpi-operator manages its workers as plain pods. However, that management mechanism closely resembles the Kubernetes batch/v1 Job, so it reinvents the wheel, although I understand that Job did not previously have all the features needed to replace the plain pods.

Because Indexed Jobs have supported elastic scaling (Elastic Indexed Job) by default since Kubernetes v1.27, we can still support MPIJob with elastic semantics, as Horovod does, even if we replace the plain pod management with an Indexed Job.
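As a rough illustration of the elastic semantics mentioned above, here is a minimal sketch that builds a batch/v1 Job manifest in Indexed completion mode as a plain Python dict and scales it by changing `completions` and `parallelism` together, which is how Elastic Indexed Jobs are resized. The helper names (`indexed_worker_job`, `scale_workers`, `worker_hostnames`), the Job name, and the image are hypothetical and are not taken from the mpi-operator code base; the hostname pattern `<job-name>-<index>` is the one Kubernetes assigns to pods of an Indexed Job.

```python
import copy


def indexed_worker_job(name, image, workers):
    """Build a batch/v1 Job manifest (plain dict) in Indexed completion mode.

    Hypothetical helper for illustration; not part of mpi-operator.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "completionMode": "Indexed",
            # Keeping completions == parallelism runs all workers at once.
            "completions": workers,
            "parallelism": workers,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{"name": "worker", "image": image}],
                }
            },
        },
    }


def scale_workers(job, workers):
    """Return a copy of the Job with completions and parallelism set together.

    Elastic Indexed Jobs are resized by mutating both fields to the same value.
    """
    if job["spec"].get("completionMode") != "Indexed":
        raise ValueError("elastic scaling applies to Indexed Jobs only")
    if workers < 0:
        raise ValueError("workers must not be negative")
    scaled = copy.deepcopy(job)
    scaled["spec"]["completions"] = workers
    scaled["spec"]["parallelism"] = workers
    return scaled


def worker_hostnames(job):
    """List the pod hostnames an Indexed Job assigns: <job-name>-<index>."""
    name = job["metadata"]["name"]
    return [f"{name}-{i}" for i in range(job["spec"]["completions"])]


if __name__ == "__main__":
    # The Job name and image below are placeholders.
    job = indexed_worker_job("pi-worker", "example.com/mpi-worker:latest", 2)
    print(worker_hostnames(job))
    print(worker_hostnames(scale_workers(job, 4)))
```

Because each worker keeps a stable index-based hostname, a launcher could regenerate its host list from the current `completions` value after a scale event. How mpi-operator would actually wire this up is left open in this proposal.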

So, I propose replacing the plain pod workers with an Indexed Job once Kubernetes v1.26 (EoL: 2024-02-28) has reached end of life.

Let me know what you think. @alculquicondor @terrytangyuan

@alculquicondor
Collaborator

Is this something that is happening in the training-operator too?
If not, could it make it harder to merge them in the future? I suppose not, as the plan is to use the mpi-operator as a library in the training-operator, right?

@tenzen-y
Member Author

tenzen-y commented Jan 5, 2024

> Is this something that is happening in the training-operator too?

Yes, the training-operator also plans to migrate to Indexed Job: kubeflow/training-operator#1718

However, we (training-operator) haven't decided yet which one (using the mpi-operator as a library or migrating to Indexed Job) we should work on first.

@tenzen-y
Member Author

tenzen-y commented Jan 5, 2024

Ah, in the training-operator, the last missing piece for migrating to Indexed Job is JobSuccessPolicy (KEP-3998).

@terrytangyuan
Member

> Because Indexed Jobs have supported elastic scaling (Elastic Indexed Job) by default since Kubernetes v1.27, we can still support MPIJob with elastic semantics, as Horovod does, even if we replace the plain pod management with an Indexed Job.

This is great. Good to know that elastic semantics can be maintained.

> So, I propose replacing the plain pod workers with an Indexed Job once Kubernetes v1.26 (EoL: 2024-02-28) has reached end of life.

I am OK with the timeline.
