
fix bug about status absence when worker pod spec is invalid #606

Open
congpeiqing wants to merge 1 commit into master
Conversation

congpeiqing (Author) commented:

Closes #604.

When a worker pod fails to be created, the current practice is to retry later. However, retrying does not solve the issue if the failure is due to an invalid Pod spec. In this PR, I check the failure reason first, and if it is due to an invalid Pod spec, I update the Job's status to "Failed" without any retries.
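Below is a minimal sketch of the retry-vs-fail decision described above. It assumes the invalid-spec case can be recognized with apierrors.IsInvalid from k8s.io/apimachinery; createWorker and markJobFailed are hypothetical names, not the actual mpi-operator code.

```go
package controller

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// createWorker returns transient errors so the work queue requeues the key,
// but treats an Invalid response from the API server as terminal.
func createWorker(ctx context.Context, client kubernetes.Interface, namespace string, pod *corev1.Pod, markJobFailed func(error)) error {
	_, err := client.CoreV1().Pods(namespace).Create(ctx, pod, metav1.CreateOptions{})
	if err == nil {
		return nil
	}
	if apierrors.IsInvalid(err) {
		// The API server rejected the Pod spec outright; retrying cannot
		// succeed, so record the terminal failure on the job status instead.
		markJobFailed(err)
		return nil // swallow the error so the item is not requeued
	}
	return err // transient (e.g. network) failure: requeue and retry later
}
```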


[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign alculquicondor for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

terrytangyuan (Member) left a comment:


This looks fine as a fix to unblock the issue. Any thoughts? @alculquicondor @tenzen-y

```diff
@@ -961,8 +961,13 @@ func (c *MPIJobController) getOrCreateWorker(mpiJob *kubeflow.MPIJob) ([]*corev1
 // If an error occurs during Get/Create, we'll requeue the item so we
 // can attempt processing again later. This could have been caused by a
 // temporary network failure, or any other transient reason.
+// But if the error is an invalid pod spec, retrying would be futile,
+// and the status of the job should turn to failed.
 if err != nil {
+	c.recorder.Eventf(mpiJob, corev1.EventTypeWarning, mpiJobFailedReason, "worker pod creation failed: %v", err)
```
Collaborator commented on the diff:

This is only one of the cases where there could be an invalid Pod template.

It might be better to return this error and handle it more generically in syncHandler, so we can handle the launcher pod, the worker pods, and any other validation errors:

if errs := validation.ValidateMPIJob(mpiJob); len(errs) != 0 {
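A rough sketch of how that generic handling in syncHandler might look; the types and helpers below (controllerSketch, reconcilePods, updateStatusFailed, validateMPIJob) are hypothetical stand-ins for the real mpi-operator code, and only the control flow reflects the suggestion.

```go
package controller

import (
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/util/validation/field"
)

// Hypothetical stand-ins so the sketch compiles on its own.
type mpiJob struct{ name string }

type controllerSketch struct{}

func (c *controllerSketch) getMPIJob(key string) (*mpiJob, error)              { return &mpiJob{name: key}, nil }
func (c *controllerSketch) reconcilePods(job *mpiJob) error                    { return nil }
func (c *controllerSketch) updateStatusFailed(job *mpiJob, reason error) error { return nil }
func validateMPIJob(job *mpiJob) field.ErrorList                               { return nil }

// syncHandler treats validation failures and Invalid API responses as
// terminal for the job, and returns every other error so the work queue
// retries it later.
func (c *controllerSketch) syncHandler(key string) error {
	job, err := c.getMPIJob(key)
	if err != nil {
		return err
	}
	// A spec that fails validation can never succeed on retry.
	if errs := validateMPIJob(job); len(errs) != 0 {
		return c.updateStatusFailed(job, errs.ToAggregate())
	}
	// Launcher and worker creation bubble their errors up to this one
	// check, so every invalid Pod template in the job is covered here.
	if err := c.reconcilePods(job); err != nil {
		if apierrors.IsInvalid(err) {
			return c.updateStatusFailed(job, err)
		}
		return err // transient: let the work queue requeue the key
	}
	return nil
}
```

This keeps the per-pod creation paths free of status logic: they only return errors, and a single place decides whether an error is retryable.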

Member replied:

Agree.

congpeiqing (Author) replied:

I have examined how Pod spec validation is performed in the Kubernetes project. The relevant code is in the "k8s.io/kubernetes/pkg/apis/core/validation" package. However, it seems that this package is not usable outside of the Kubernetes project.
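One possible way around this (an assumption about an alternative approach, not necessarily what this PR ends up doing) is to skip client-side validation entirely and rely on the API server's own validation: an invalid spec is rejected with a StatusError whose reason is Invalid, which apierrors.IsInvalid from the importable k8s.io/apimachinery module recognizes.

```go
package main

import (
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/util/validation/field"
)

func main() {
	// Construct the kind of error the API server returns for a bad Pod spec.
	gk := schema.GroupKind{Group: "", Kind: "Pod"}
	errs := field.ErrorList{
		field.Invalid(field.NewPath("spec", "containers"), nil, "at least one container is required"),
	}
	err := apierrors.NewInvalid(gk, "worker-0", errs)

	// Both checks identify the error as permanently invalid.
	fmt.Println(apierrors.IsInvalid(err))                           // true
	fmt.Println(err.ErrStatus.Reason == metav1.StatusReasonInvalid) // true
}
```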

Collaborator replied:

I didn't mean that you should use the validation code from Kubernetes.

I just meant that there are multiple cases in which we can't retry, and this PR covers only one of them.

Successfully merging this pull request may close these issues.

Can't get MPIJob status when pod template is invalid (#604)