Support exitCode restartPolicy #537

Syulin7 · 2023-03-03T06:06:45Z

MPIJob doesn't support restartPolicy=ExitCode

tenzen-y · 2023-03-03T06:21:41Z

/kind feature

alculquicondor · 2023-03-03T13:43:08Z

Can you clarify what is the expected behavior?

I wonder if we should have an API that is closer to Job's PodFailurePolicy instead https://kubernetes.io/docs/concepts/workloads/controllers/job/#pod-failure-policy

alculquicondor · 2023-03-03T13:51:57Z

If we add exitCode, we might be further from this goal kubeflow/training-operator#1718

tenzen-y · 2023-03-03T14:02:25Z

Can you clarify what is the expected behavior?

I wonder if we should have an API that is closer to Job's PodFailurePolicy instead https://kubernetes.io/docs/concepts/workloads/controllers/job/#pod-failure-policy

Probably, he expects a similar behavior to other training controllers (e.g., tfjob).

https://github.com/kubeflow/common/blob/4dbf18c85417b54158914d1d01c1744b4e3542f2/pkg/apis/common/v1/types.go#L167-L173

If we add exitCode, we might be further from this goal kubeflow/training-operator#1718

In the future, it would be good to support PodFailurePolicy once we introduce batch/v1 Job to workers.

However, it might be worth supporting a similar logic to other CustomJobs in the short term.
This doesn't mean the changes break current logic and API.

WDYT?

alculquicondor · 2023-03-03T14:30:10Z

I guess we'll need a new API version anyways.

Since the launcher already uses Job, it might be worth translating the behavior of ExitCode into a PodFailurePolicy (I believe it should be possible), instead of manually watching the pods.

tenzen-y · 2023-03-03T14:44:39Z

I guess we'll need a new API version anyways.

I don't think we need a new API version since we can support restartPolicy=ExitCode, the same as other CustomJobs, using the following logic:

			// Get the exit code of the container.
			var exitCode int32 = 0xbeef // magic number
			for _, status := range pod.Status.ContainerStatuses {
				state := status.State
				if status.Name == r.GetDefaultContainerName() && state.Terminated != nil {
					exitCode = state.Terminated.ExitCode
					logger.Infof("Pod: %v.%v exited with code %v", pod.Namespace, pod.Name, exitCode)
					r.GetRecorder().Eventf(job, corev1.EventTypeNormal, "ExitedWithCode", "Pod: %v.%v exited with code %v", pod.Namespace, pod.Name, exitCode)
				}
			}
			// Check if the pod is retryable.
			if spec.RestartPolicy == commonv1.RestartPolicyExitCode {
				if pod.Status.Phase == corev1.PodFailed && trainutil.IsRetryableExitCode(exitCode) {
					failedPodsCount.Inc()
					logger.Infof("Need to restart the pod: %v.%v", pod.Namespace, pod.Name)
					if err = r.DeletePod(ctx, pod.Namespace, pod.Name); err != nil {
						return err
					}
				}
			}

https://github.com/kubeflow/common/blob/4dbf18c85417b54158914d1d01c1744b4e3542f2/pkg/reconciler.v1/common/pod.go#L172-L191

If we support a similarly logic to other CustomJobs, we don't provide a configurable ExitCode similar to PodFailurePolicy.

	// RestartPolicyExitCode policy means that user should add exit code by themselves,
	// The job operator will check these exit codes to
	// determine the behavior when an error occurs:
	// - 1-127: permanent error, do not restart.
	// - 128-255: retryable error, will restart the pod.
	RestartPolicyExitCode RestartPolicy = "ExitCode"

https://github.com/kubeflow/common/blob/4dbf18c85417b54158914d1d01c1744b4e3542f2/pkg/apis/common/v1/types.go#L167-L173

tenzen-y · 2023-03-03T14:45:23Z

Since the launcher already uses Job, it might be worth translating the behavior of ExitCode into a PodFailurePolicy (I believe it should be possible), instead of manually watching the pods.

I agree.

alculquicondor · 2023-03-03T14:45:35Z

Sorry, what I meant is that we might need a new apiversion once we plan to migrate everything to batch/Job.

tenzen-y · 2023-03-03T14:59:23Z

Sorry, what I meant is that we might need a new apiversion once we plan to migrate everything to batch/Job.

Ah, I see. I agree. We need a new API version if we migrate workers to batch/Jobs.

In the short term, I would like to support the restartPolicy=ExitCode, similar to the other CustomJobs in MPIJob v2beta1.
Since if we migrate everything to batch/job, we might need to use Elastic Indexed Job for the elastic training (e.g., elastic horovod).

WDYT?

alculquicondor · 2023-03-03T15:19:24Z

sounds good

tenzen-y · 2023-03-03T17:33:56Z

/assign

google-oss-prow bot added the kind/feature label Mar 3, 2023

google-oss-prow bot assigned tenzen-y Mar 3, 2023

tenzen-y mentioned this issue Jun 8, 2023

Release 0.5.0 #563

Closed

7 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support exitCode restartPolicy #537

Support exitCode restartPolicy #537

Syulin7 commented Mar 3, 2023

tenzen-y commented Mar 3, 2023

alculquicondor commented Mar 3, 2023

alculquicondor commented Mar 3, 2023 •

edited

tenzen-y commented Mar 3, 2023

alculquicondor commented Mar 3, 2023

tenzen-y commented Mar 3, 2023

tenzen-y commented Mar 3, 2023

alculquicondor commented Mar 3, 2023

tenzen-y commented Mar 3, 2023

alculquicondor commented Mar 3, 2023

tenzen-y commented Mar 3, 2023

Support exitCode restartPolicy #537

Support exitCode restartPolicy #537

Comments

Syulin7 commented Mar 3, 2023

tenzen-y commented Mar 3, 2023

alculquicondor commented Mar 3, 2023

alculquicondor commented Mar 3, 2023 • edited

tenzen-y commented Mar 3, 2023

alculquicondor commented Mar 3, 2023

tenzen-y commented Mar 3, 2023

tenzen-y commented Mar 3, 2023

alculquicondor commented Mar 3, 2023

tenzen-y commented Mar 3, 2023

alculquicondor commented Mar 3, 2023

tenzen-y commented Mar 3, 2023

alculquicondor commented Mar 3, 2023 •

edited