Optimize jobflow controller to reduce invalid reconciles #3441

Open
wants to merge 1 commit into
base: master

Conversation

calvin0327
Contributor

@calvin0327 calvin0327 commented Apr 26, 2024

I found some error messages when using the JobFlow feature. I created a JobFlow resource based on these examples:
https://github.com/volcano-sh/volcano/blob/master/example/jobflow/JobFlow.yaml
https://github.com/volcano-sh/volcano/blob/master/example/jobflow/JobTemplate.yaml

Here are the controller manager logs:

[root@master01 ~]# kubectl logs -n volcano-system volcano-controllers-744bc4796d-jbncj | grep ^E
E0425 10:34:49.690189       1 jobflow_controller_action.go:69] Failed to update status of JobFlow default/test: Operation cannot be fulfilled on jobflows.flow.volcano.sh "test": the object has been modified; please apply your changes to the latest version and try again
E0425 10:34:49.707411       1 jobflow_controller_action.go:69] Failed to update status of JobFlow default/test: Operation cannot be fulfilled on jobflows.flow.volcano.sh "test": the object has been modified; please apply your changes to the latest version and try again
E0425 10:34:50.321009       1 queue_controller_action.go:85] Failed to update status of Queue default: Operation cannot be fulfilled on queues.scheduling.volcano.sh "default": the object has been modified; please apply your changes to the latest version and try again.
E0425 10:34:51.395417       1 queue_controller_action.go:85] Failed to update status of Queue default: Operation cannot be fulfilled on queues.scheduling.volcano.sh "default": the object has been modified; please apply your changes to the latest version and try again.
E0425 10:35:04.721574       1 jobflow_controller_action.go:69] Failed to update status of JobFlow default/test: Operation cannot be fulfilled on jobflows.flow.volcano.sh "test": the object has been modified; please apply your changes to the latest version and try again
E0425 10:35:04.736015       1 jobflow_controller_action.go:69] Failed to update status of JobFlow default/test: Operation cannot be fulfilled on jobflows.flow.volcano.sh "test": the object has been modified; please apply your changes to the latest version and try again
E0425 10:35:05.568771       1 jobflow_controller_action.go:69] Failed to update status of JobFlow default/test: Operation cannot be fulfilled on jobflows.flow.volcano.sh "test": the object has been modified; please apply your changes to the latest version and try again
E0425 10:35:05.581852       1 jobflow_controller_action.go:69] Failed to update status of JobFlow default/test: Operation cannot be fulfilled on jobflows.flow.volcano.sh "test": the object has been modified; please apply your changes to the latest version and try again
E0425 10:35:20.711708       1 queue_controller_action.go:85] Failed to update status of Queue default: Operation cannot be fulfilled on queues.scheduling.volcano.sh "default": the object has been modified; please apply your changes to the latest version and try again.
E0425 10:35:21.731150       1 queue_controller_action.go:85] Failed to update status of Queue default: Operation cannot be fulfilled on queues.scheduling.volcano.sh "default": the object has been modified; please apply your changes to the latest version and try again.
E0425 10:35:34.692296       1 job_controller.go:334] Failed to get job by <Queue: , Job: default/test-b, Task:default-nginx, Event:PodEvicted, ExitCode:0, Action:, JobVersion: 0> from cache: job <default/test-b> is not ready
E0425 10:35:34.695945       1 job_controller.go:334] Failed to get job by <Queue: , Job: default/test-b, Task:default-nginx, Event:PodEvicted, ExitCode:0, Action:, JobVersion: 0> from cache: job <default/test-b> is not ready
E0425 10:35:34.698687       1 job_controller.go:334] Failed to get job by <Queue: , Job: default/test-c, Task:default-nginx, Event:PodEvicted, ExitCode:0, Action:, JobVersion: 0> from cache: job <default/test-c> is not ready
E0425 10:35:34.701790       1 job_controller.go:334] Failed to get job by <Queue: , Job: default/test-c, Task:default-nginx, Event:PodEvicted, ExitCode:0, Action:, JobVersion: 0> from cache: job <default/test-c> is not ready
E0425 10:35:34.707817       1 job_controller.go:334] Failed to get job by <Queue: , Job: default/test-d, Task:default-nginx, Event:PodEvicted, ExitCode:0, Action:, JobVersion: 0> from cache: job <default/test-d> is not ready
E0425 10:35:34.712693       1 job_controller.go:334] Failed to get job by <Queue: , Job: default/test-d, Task:default-nginx, Event:PodEvicted, ExitCode:0, Action:, JobVersion: 0> from cache: job <default/test-d> is not ready
E0425 10:35:34.714371       1 job_controller.go:334] Failed to get job by <Queue: , Job: default/test-e, Task:default-nginx, Event:PodEvicted, ExitCode:0, Action:, JobVersion: 0> from cache: job <default/test-e> is not ready
E0425 10:35:34.715187       1 jobflow_controller_action.go:300] Failed to delete job of JobFlow default/test: jobs.batch.volcano.sh "test-a" not found
E0425 10:35:34.715210       1 jobflow_controller_action.go:46] Failed to delete jobs of JobFlow default/test: jobs.batch.volcano.sh "test-a" not found
E0425 10:35:34.717377       1 job_controller.go:334] Failed to get job by <Queue: , Job: default/test-e, Task:default-nginx, Event:PodEvicted, ExitCode:0, Action:, JobVersion: 0> from cache: job <default/test-e> is not ready
E0425 10:35:34.723456       1 job_controller.go:334] Failed to get job by <Queue: , Job: default/test-a, Task:default-nginx, Event:PodEvicted, ExitCode:0, Action:, JobVersion: 0> from cache: job <default/test-a> is not ready
E0425 10:35:34.728548       1 job_controller.go:334] Failed to get job by <Queue: , Job: default/test-a, Task:default-nginx, Event:PodEvicted, ExitCode:0, Action:, JobVersion: 0> from cache: failed to find job <default/test-a>

This PR focuses only on the JobFlow controller errors.

@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign shinytang6
You can assign the PR to them by writing /assign @shinytang6 in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Apr 26, 2024
Signed-off-by: calvin <wen.chen@daocloud.io>
@calvin0327 calvin0327 force-pushed the optimizate-workflow-controller branch from fb66ac0 to a6fad98 Compare April 26, 2024 09:01
@@ -63,7 +65,26 @@ func (jf *jobflowcontroller) syncJobFlow(jobFlow *v1alpha1flow.JobFlow, updateSt
}
jobFlow.Status = *jobFlowStatus
updateStateFn(&jobFlow.Status, len(jobFlow.Spec.Flows))
_, err = jf.vcClient.FlowV1alpha1().JobFlows(jobFlow.Namespace).UpdateStatus(context.Background(), jobFlow, metav1.UpdateOptions{})

err = retry.RetryOnConflict(retry.DefaultRetry, func() error {
Contributor Author


Added a retry on resource version conflicts, so that a conflict no longer fails the sync and triggers another reconcile.

Member


This is indeed a problem, but this approach seems like it will be more time-consuming.
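
For readers following this thread, here is a minimal, standard-library-only sketch of the retry-on-conflict pattern under discussion. It is not the code in this PR: `store`, `object`, `retryOnConflict`, and `updateState` are hypothetical stand-ins that simulate the API server's resource version check, and the loop omits the backoff that client-go's `retry.RetryOnConflict` applies between attempts. The point it illustrates is that each retry should work from a freshly read object, otherwise every attempt fails with the same stale resource version.

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict stands in for the API server's "object has been modified" error.
var errConflict = errors.New("conflict: object has been modified")

// object is a hypothetical stand-in for a JobFlow carrying a resource version.
type object struct {
	ResourceVersion int
	State           string
}

// store simulates the API server's optimistic concurrency check.
type store struct {
	current object
}

// Get returns the latest stored copy.
func (s *store) Get() object { return s.current }

// UpdateStatus rejects a write that is based on a stale resource version.
func (s *store) UpdateStatus(o object) error {
	if o.ResourceVersion != s.current.ResourceVersion {
		return errConflict
	}
	o.ResourceVersion++
	s.current = o
	return nil
}

// retryOnConflict re-runs fn while it returns errConflict, up to maxAttempts.
// Assumption: no backoff here, unlike the real client-go helper.
func retryOnConflict(maxAttempts int, fn func() error) error {
	var err error
	for i := 0; i < maxAttempts; i++ {
		if err = fn(); !errors.Is(err, errConflict) {
			return err
		}
	}
	return err
}

// updateState writes the new state, refreshing the object after each conflict.
func updateState(s *store, cached object, state string) (attempts int, err error) {
	obj := cached
	err = retryOnConflict(5, func() error {
		attempts++
		obj.State = state
		if uerr := s.UpdateStatus(obj); uerr != nil {
			obj = s.Get() // re-read the latest version before the next attempt
			return uerr
		}
		return nil
	})
	return attempts, err
}

func main() {
	s := &store{current: object{ResourceVersion: 2, State: "Pending"}}
	cached := object{ResourceVersion: 1, State: "Pending"} // informer copy lags behind
	attempts, err := updateState(s, cached, "Running")
	fmt.Println(attempts, err, s.Get().State, s.Get().ResourceVersion)
}
```

The trade-off raised in the review applies to this sketch as well: retrying inline keeps the worker busy for the extra attempts, while returning the error defers the work to a later reconcile. Which is cheaper depends on how expensive a full sync is compared with one extra read and write.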

@@ -281,6 +302,12 @@ func (jf *jobflowcontroller) deleteAllJobsCreatedByJobFlow(jobFlow *v1alpha1flow
for _, job := range jobList {
err := jf.vcClient.BatchV1alpha1().Jobs(jobFlow.Namespace).Delete(context.Background(), job.Name, metav1.DeleteOptions{})
if err != nil {
if apierrors.IsNotFound(err) {
Contributor Author


Ignore this error instead of returning it if the job no longer exists.
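
As a rough illustration of the idea, the following standard-library-only sketch treats "not found" as success during deletion. It is not the code in this PR: `errNotFound` and `deleteAll` are hypothetical stand-ins for `apierrors.IsNotFound` and the controller's delete loop, and the `del` callback simulates the API call. Deletion is meant to be idempotent here, so a job that is already gone should not abort the loop or surface as an error.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound stands in for the API server's "not found" error.
var errNotFound = errors.New("not found")

// deleteAll calls del for each name and treats "not found" as success,
// so that a job removed by someone else does not fail the whole pass.
func deleteAll(names []string, del func(string) error) error {
	for _, name := range names {
		if err := del(name); err != nil {
			if errors.Is(err, errNotFound) {
				continue // already gone, nothing left to do for this job
			}
			return fmt.Errorf("delete %s: %w", name, err)
		}
	}
	return nil
}

func main() {
	// test-a is deliberately missing, as in the logs above.
	existing := map[string]bool{"test-b": true, "test-c": true}
	del := func(name string) error {
		if !existing[name] {
			return fmt.Errorf("jobs %q: %w", name, errNotFound)
		}
		delete(existing, name)
		return nil
	}
	err := deleteAll([]string{"test-a", "test-b", "test-c"}, del)
	fmt.Println(err, len(existing))
}
```

Any other error still stops the loop and is returned, so real failures continue to trigger a requeue.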

@calvin0327
Contributor Author

/auto-cc

@calvin0327
Contributor Author

@lowang-bh @hwdef PTAL


3 participants