Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support exitCode restartPolicy #537

Open
Tracked by #563
Syulin7 opened this issue Mar 3, 2023 · 11 comments
Open
Tracked by #563

Support exitCode restartPolicy #537

Syulin7 opened this issue Mar 3, 2023 · 11 comments
Assignees

Comments

@Syulin7
Copy link

Syulin7 commented Mar 3, 2023

MPIJob doesn't support restartPolicy=ExitCode

ref: kubeflow/training-operator#1768

@tenzen-y
Copy link
Member

tenzen-y commented Mar 3, 2023

/kind feature

@alculquicondor
Copy link
Collaborator

Can you clarify what is the expected behavior?

I wonder if we should have an API that is closer to Job's PodFailurePolicy instead https://kubernetes.io/docs/concepts/workloads/controllers/job/#pod-failure-policy

@alculquicondor
Copy link
Collaborator

alculquicondor commented Mar 3, 2023

If we add exitCode, we might be further from this goal kubeflow/training-operator#1718

@tenzen-y
Copy link
Member

tenzen-y commented Mar 3, 2023

Can you clarify what is the expected behavior?

I wonder if we should have an API that is closer to Job's PodFailurePolicy instead https://kubernetes.io/docs/concepts/workloads/controllers/job/#pod-failure-policy

Probably, he expects a similar behavior to other training controllers (e.g., tfjob).

https://github.com/kubeflow/common/blob/4dbf18c85417b54158914d1d01c1744b4e3542f2/pkg/apis/common/v1/types.go#L167-L173

If we add exitCode, we might be further from this goal kubeflow/training-operator#1718

In the future, it would be good to support PodFailurePolicy once we introduce batch/v1 Job to workers.

However, it might be worth supporting a similar logic to other CustomJobs in the short term.
This doesn't mean the changes break current logic and API.

WDYT?

@alculquicondor
Copy link
Collaborator

I guess we'll need a new API version anyways.

Since the launcher already uses Job, it might be worth translating the behavior of ExitCode into a PodFailurePolicy (I believe it should be possible), instead of manually watching the pods.

@tenzen-y
Copy link
Member

tenzen-y commented Mar 3, 2023

I guess we'll need a new API version anyways.

I don't think we need a new API version since we can support restartPolicy=ExitCode, the same as other CustomJobs, using the following logic:

			// Get the exit code of the container.
			var exitCode int32 = 0xbeef // magic number
			for _, status := range pod.Status.ContainerStatuses {
				state := status.State
				if status.Name == r.GetDefaultContainerName() && state.Terminated != nil {
					exitCode = state.Terminated.ExitCode
					logger.Infof("Pod: %v.%v exited with code %v", pod.Namespace, pod.Name, exitCode)
					r.GetRecorder().Eventf(job, corev1.EventTypeNormal, "ExitedWithCode", "Pod: %v.%v exited with code %v", pod.Namespace, pod.Name, exitCode)
				}
			}
			// Check if the pod is retryable.
			if spec.RestartPolicy == commonv1.RestartPolicyExitCode {
				if pod.Status.Phase == corev1.PodFailed && trainutil.IsRetryableExitCode(exitCode) {
					failedPodsCount.Inc()
					logger.Infof("Need to restart the pod: %v.%v", pod.Namespace, pod.Name)
					if err = r.DeletePod(ctx, pod.Namespace, pod.Name); err != nil {
						return err
					}
				}
			}

https://github.com/kubeflow/common/blob/4dbf18c85417b54158914d1d01c1744b4e3542f2/pkg/reconciler.v1/common/pod.go#L172-L191

If we support a similarly logic to other CustomJobs, we don't provide a configurable ExitCode similar to PodFailurePolicy.

	// RestartPolicyExitCode policy means that user should add exit code by themselves,
	// The job operator will check these exit codes to
	// determine the behavior when an error occurs:
	// - 1-127: permanent error, do not restart.
	// - 128-255: retryable error, will restart the pod.
	RestartPolicyExitCode RestartPolicy = "ExitCode"

https://github.com/kubeflow/common/blob/4dbf18c85417b54158914d1d01c1744b4e3542f2/pkg/apis/common/v1/types.go#L167-L173

@tenzen-y
Copy link
Member

tenzen-y commented Mar 3, 2023

Since the launcher already uses Job, it might be worth translating the behavior of ExitCode into a PodFailurePolicy (I believe it should be possible), instead of manually watching the pods.

I agree.

@alculquicondor
Copy link
Collaborator

Sorry, what I meant is that we might need a new apiversion once we plan to migrate everything to batch/Job.

@tenzen-y
Copy link
Member

tenzen-y commented Mar 3, 2023

Sorry, what I meant is that we might need a new apiversion once we plan to migrate everything to batch/Job.

Ah, I see. I agree. We need a new API version if we migrate workers to batch/Jobs.

In the short term, I would like to support the restartPolicy=ExitCode, similar to the other CustomJobs in MPIJob v2beta1.
Since if we migrate everything to batch/job, we might need to use Elastic Indexed Job for the elastic training (e.g., elastic horovod).

WDYT?

@alculquicondor
Copy link
Collaborator

sounds good

@tenzen-y
Copy link
Member

tenzen-y commented Mar 3, 2023

/assign

@tenzen-y tenzen-y mentioned this issue Jun 8, 2023
7 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants