
fix bug about status absence when worker pod spec is invalid #606

Open
congpeiqing wants to merge 1 commit into master
Conversation

congpeiqing (Author) commented:

Closes #604.

When a worker pod fails to be created, the current practice is to retry later. However, retrying does not solve the issue if the failure is due to an invalid Pod spec. In this PR, I check the failure reason first, and if it is due to an invalid Pod spec, I update the Job's status to "Failed" without any retries.
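Below is a minimal sketch of the retry-vs-fail decision described above. It assumes the invalid-spec case can be recognized with apierrors.IsInvalid from k8s.io/apimachinery; createWorker and markJobFailed are hypothetical names, not the actual mpi-operator code.

```go
package controller

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// createWorker returns transient errors so the work queue requeues the key,
// but treats an Invalid response from the API server as terminal.
func createWorker(ctx context.Context, client kubernetes.Interface, namespace string, pod *corev1.Pod, markJobFailed func(error)) error {
	_, err := client.CoreV1().Pods(namespace).Create(ctx, pod, metav1.CreateOptions{})
	if err == nil {
		return nil
	}
	if apierrors.IsInvalid(err) {
		// The API server rejected the Pod spec outright; retrying cannot
		// succeed, so record the terminal failure on the job status instead.
		markJobFailed(err)
		return nil // swallow the error so the item is not requeued
	}
	return err // transient (e.g. network) failure: requeue and retry later
}
```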


[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign alculquicondor for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

terrytangyuan (Member) left a comment:


This looks fine as a fix to unblock the issue. Any thoughts? @alculquicondor @tenzen-y

```diff
@@ -961,8 +961,13 @@ func (c *MPIJobController) getOrCreateWorker(mpiJob *kubeflow.MPIJob) ([]*corev1
 // If an error occurs during Get/Create, we'll requeue the item so we
 // can attempt processing again later. This could have been caused by a
 // temporary network failure, or any other transient reason.
+// But if the error is an invalid pod spec, retrying would be futile,
+// and the status of the job should turn to failed.
 if err != nil {
+	c.recorder.Eventf(mpiJob, corev1.EventTypeWarning, mpiJobFailedReason, "worker pod creation failed: %v", err)
```
Collaborator commented on the diff:

This is only one of the cases where there could be an invalid Pod template.

It might be better to return this error and handle it more generically in syncHandler, so we can handle the launcher pod, the worker pods, and any other validation errors:

if errs := validation.ValidateMPIJob(mpiJob); len(errs) != 0 {
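A rough sketch of how that generic handling in syncHandler might look; the types and helpers below (controllerSketch, reconcilePods, updateStatusFailed, validateMPIJob) are hypothetical stand-ins for the real mpi-operator code, and only the control flow reflects the suggestion.

```go
package controller

import (
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/util/validation/field"
)

// Hypothetical stand-ins so the sketch compiles on its own.
type mpiJob struct{ name string }

type controllerSketch struct{}

func (c *controllerSketch) getMPIJob(key string) (*mpiJob, error)              { return &mpiJob{name: key}, nil }
func (c *controllerSketch) reconcilePods(job *mpiJob) error                    { return nil }
func (c *controllerSketch) updateStatusFailed(job *mpiJob, reason error) error { return nil }
func validateMPIJob(job *mpiJob) field.ErrorList                               { return nil }

// syncHandler treats validation failures and Invalid API responses as
// terminal for the job, and returns every other error so the work queue
// retries it later.
func (c *controllerSketch) syncHandler(key string) error {
	job, err := c.getMPIJob(key)
	if err != nil {
		return err
	}
	// A spec that fails validation can never succeed on retry.
	if errs := validateMPIJob(job); len(errs) != 0 {
		return c.updateStatusFailed(job, errs.ToAggregate())
	}
	// Launcher and worker creation bubble their errors up to this one
	// check, so every invalid Pod template in the job is covered here.
	if err := c.reconcilePods(job); err != nil {
		if apierrors.IsInvalid(err) {
			return c.updateStatusFailed(job, err)
		}
		return err // transient: let the work queue requeue the key
	}
	return nil
}
```

This keeps the per-pod creation paths free of status logic: they only return errors, and a single place decides whether an error is retryable.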

Member replied:

Agree.

congpeiqing (Author) replied:

I have examined how Pod spec validation is performed in the Kubernetes project. The relevant code is in the "k8s.io/kubernetes/pkg/apis/core/validation" package. However, it seems that this package is not usable outside of the Kubernetes project.
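One possible way around this (an assumption about an alternative approach, not necessarily what this PR ends up doing) is to skip client-side validation entirely and rely on the API server's own validation: an invalid spec is rejected with a StatusError whose reason is Invalid, which apierrors.IsInvalid from the importable k8s.io/apimachinery module recognizes.

```go
package main

import (
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/util/validation/field"
)

func main() {
	// Construct the kind of error the API server returns for a bad Pod spec.
	gk := schema.GroupKind{Group: "", Kind: "Pod"}
	errs := field.ErrorList{
		field.Invalid(field.NewPath("spec", "containers"), nil, "at least one container is required"),
	}
	err := apierrors.NewInvalid(gk, "worker-0", errs)

	// Both checks identify the error as permanently invalid.
	fmt.Println(apierrors.IsInvalid(err))                           // true
	fmt.Println(err.ErrStatus.Reason == metav1.StatusReasonInvalid) // true
}
```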

Collaborator replied:

I didn't mean that you should use the validation code from Kubernetes.

I just meant that there are multiple cases in which we can't retry, and this PR covers only one of them.

Successfully merging this pull request may close these issues.

Can't get MPIJob status when pod template is invalid (#604)