Workflow processing fails to complete due to invalid WorkflowTaskResult from interrupted pod #12993

Open
jesseanttila-cai opened this issue Apr 29, 2024 · 3 comments · May be fixed by #13051
Labels
area/executor · type/bug · type/regression

Comments

@jesseanttila-cai

Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

As part of #12402 (included from v3.5.3 onwards), the workflow pod's wait container was changed to create a placeholder (incomplete) WorkflowTaskResult before waiting for the main container to complete:

// Create a new empty (placeholder) task result with LabelKeyReportOutputsCompleted set to false.
wfExecutor.InitializeOutput(bgCtx)
// Wait for main container to complete
err := wfExecutor.Wait(ctx)

The WorkflowTaskResult is finalized after output artifacts, logs etc. have been handled:

defer wfExecutor.FinalizeOutput(bgCtx) // Ensures the LabelKeyReportOutputsCompleted is set to true.

If the wait-container is interrupted in a way that prevents FinalizeOutput from being called (e.g. pod deletion without sufficient grace period), an incomplete WorkflowTaskResult remains with the workflows.argoproj.io/report-outputs-completed label set to false. Retries of the same task will produce additional WorkflowTaskResults and will not mark the previous one complete. This leaves the workflow stuck in Processing state until the WorkflowTaskResult is manually edited to mark it complete.
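
As a temporary workaround, the stuck WorkflowTaskResult can be relabeled by hand so the controller considers it complete. A minimal sketch, assuming the CRD plural name workflowtaskresults.argoproj.io; the namespace and result name below are taken from the logs further down and should be substituted with your own:

# Mark the leftover task result as complete so the workflow can finish
kubectl label -n ext-namespace workflowtaskresults.argoproj.io \
  workflow-hang-example-hf6nh-1013521106 \
  workflows.argoproj.io/report-outputs-completed=true --overwrite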

The reproduction example workflow simulates forced pod deletion using a pod that deletes itself, leaving behind an incomplete WorkflowTaskResult. The included workflow controller log snippet shows the resulting processing loop.

This issue may be one of the causes of #12103.

Version

v3.5.3

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: workflow-hang-example-
spec:
  entrypoint: self-delete
  templates:
    - name: self-delete
      retryStrategy:
        limit: "1"
        retryPolicy: Always
      podSpecPatch: |
        containers:
          - name: main
            env:
            - name: RETRY_COUNT
              value: "{{retries}}"
        terminationGracePeriodSeconds: 0
      container:
        image: bitnami/kubectl:latest
        env:
          - name: POD_NAME
            valueFrom:
              fieldRef:
                fieldPath: metadata.name
        # Simulate interruption via forced pod deletion on first attempt, complete successfully on retry
        command: ["sh", "-c", "sleep 10; [ $RETRY_COUNT -eq 0 ] && kubectl delete pod $POD_NAME; sleep 10"]

Logs from the workflow controller

time="2024-04-29T10:31:58.427Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=26673983 namespace=ext-namespace workflow=workflow-hang-example-hf6nh
time="2024-04-29T10:31:58.428Z" level=info msg="Task-result reconciliation" namespace=ext-namespace numObjs=2 workflow=workflow-hang-example-hf6nh
time="2024-04-29T10:31:58.429Z" level=debug msg="task result:\n&WorkflowTaskResult{ObjectMeta:{workflow-hang-example-hf6nh-1013521106  ext-namespace  7ad71c97-1148-4c59-aea4-84e8c56d19c4 26673606 1 2024-04-29 10:29:57 +0000 UTC <nil> <nil> map[workflows.argoproj.io/report-outputs-completed:false workflows.argoproj.io/workflow:workflow-hang-example-hf6nh] map[] [{argoproj.io/v1alpha1 Workflow workflow-hang-example-hf6nh af4266ff-e25f-4e4d-b073-dee4b6778da7 <nil> <nil>}] [] [{argoexec Update argoproj.io/v1alpha1 2024-04-29 10:29:57 +0000 UTC FieldsV1 {\"f:metadata\":{\"f:labels\":{\".\":{},\"f:workflows.argoproj.io/report-outputs-completed\":{},\"f:workflows.argoproj.io/workflow\":{}},\"f:ownerReferences\":{\".\":{},\"k:{\\\"uid\\\":\\\"af4266ff-e25f-4e4d-b073-dee4b6778da7\\\"}\":{}}}} }]},NodeResult:NodeResult{Phase:,Message:,Outputs:nil,Progress:,},}" namespace=ext-namespace workflow=workflow-hang-example-hf6nh
time="2024-04-29T10:31:58.429Z" level=debug msg="task result name:\nworkflow-hang-example-hf6nh-1013521106" namespace=ext-namespace workflow=workflow-hang-example-hf6nh
time="2024-04-29T10:31:58.429Z" level=debug msg="Marking task result incomplete workflow-hang-example-hf6nh-1013521106" namespace=ext-namespace workflow=workflow-hang-example-hf6nh
time="2024-04-29T10:31:58.429Z" level=info msg="task-result changed" namespace=ext-namespace nodeID=workflow-hang-example-hf6nh-1013521106 workflow=workflow-hang-example-hf6nh
time="2024-04-29T10:31:58.429Z" level=debug msg="task result:\n&WorkflowTaskResult{ObjectMeta:{workflow-hang-example-hf6nh-3630579407  ext-namespace  3fe40d94-0f39-45ee-aff6-902f6086b578 26673958 2 2024-04-29 10:30:36 +0000 UTC <nil> <nil> map[workflows.argoproj.io/report-outputs-completed:true workflows.argoproj.io/workflow:workflow-hang-example-hf6nh] map[] [{argoproj.io/v1alpha1 Workflow workflow-hang-example-hf6nh af4266ff-e25f-4e4d-b073-dee4b6778da7 <nil> <nil>}] [] [{argoexec Update argoproj.io/v1alpha1 2024-04-29 10:30:57 +0000 UTC FieldsV1 {\"f:metadata\":{\"f:labels\":{\".\":{},\"f:workflows.argoproj.io/report-outputs-completed\":{},\"f:workflows.argoproj.io/workflow\":{}},\"f:ownerReferences\":{\".\":{},\"k:{\\\"uid\\\":\\\"af4266ff-e25f-4e4d-b073-dee4b6778da7\\\"}\":{}}},\"f:outputs\":{}} }]},NodeResult:NodeResult{Phase:,Message:,Outputs:&Outputs{Parameters:[]Parameter{},Artifacts:[]Artifact{},Result:nil,ExitCode:nil,},Progress:,},}" namespace=ext-namespace workflow=workflow-hang-example-hf6nh
time="2024-04-29T10:31:58.429Z" level=debug msg="task result name:\nworkflow-hang-example-hf6nh-3630579407" namespace=ext-namespace workflow=workflow-hang-example-hf6nh
time="2024-04-29T10:31:58.429Z" level=debug msg="Marking task result complete workflow-hang-example-hf6nh-3630579407" namespace=ext-namespace workflow=workflow-hang-example-hf6nh
time="2024-04-29T10:31:58.429Z" level=info msg="task-result changed" namespace=ext-namespace nodeID=workflow-hang-example-hf6nh-3630579407 workflow=workflow-hang-example-hf6nh
time="2024-04-29T10:31:58.429Z" level=debug msg="Skipping artifact GC" namespace=ext-namespace workflow=workflow-hang-example-hf6nh
time="2024-04-29T10:31:58.429Z" level=debug msg="Evaluating node workflow-hang-example-hf6nh: template: *v1alpha1.WorkflowStep (self-delete), boundaryID: " namespace=ext-namespace workflow=workflow-hang-example-hf6nh
time="2024-04-29T10:31:58.429Z" level=debug msg="Resolving the template" base="*v1alpha1.Workflow (namespace=ext-namespace,name=workflow-hang-example-hf6nh)" tmpl="*v1alpha1.WorkflowStep (self-delete)"
time="2024-04-29T10:31:58.429Z" level=debug msg="Getting the template" base="*v1alpha1.Workflow (namespace=ext-namespace,name=workflow-hang-example-hf6nh)" tmpl="*v1alpha1.WorkflowStep (self-delete)"
time="2024-04-29T10:31:58.429Z" level=debug msg="Getting the template by name: self-delete" base="*v1alpha1.Workflow (namespace=ext-namespace,name=workflow-hang-example-hf6nh)" tmpl="*v1alpha1.WorkflowStep (self-delete)"
time="2024-04-29T10:31:58.430Z" level=debug msg="Node workflow-hang-example-hf6nh already completed" namespace=ext-namespace workflow=workflow-hang-example-hf6nh
time="2024-04-29T10:31:58.430Z" level=info msg="TaskSet Reconciliation" namespace=ext-namespace workflow=workflow-hang-example-hf6nh
time="2024-04-29T10:31:58.430Z" level=info msg=reconcileAgentPod namespace=ext-namespace workflow=workflow-hang-example-hf6nh
time="2024-04-29T10:31:58.430Z" level=debug msg="Task results completion status: map[workflow-hang-example-hf6nh-1013521106:false workflow-hang-example-hf6nh-3630579407:true]" namespace=ext-namespace workflow=workflow-hang-example-hf6nh
time="2024-04-29T10:31:58.430Z" level=debug msg="taskresults of workflow are incomplete or still have daemon nodes, so can't mark workflow completed" fromPhase=Running namespace=ext-namespace toPhase=Succeeded workflow=workflow-hang-example-hf6nh
time="2024-04-29T10:31:58.443Z" level=info msg="Workflow update successful" namespace=ext-namespace phase=Running resourceVersion=26673983 workflow=workflow-hang-example-hf6nh

Logs from your workflow's wait container

N/A (wait container must be forcibly interrupted for this issue to appear)
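
For reference, on a pod that is still available these would normally be collected with something like the following (the pod name is a placeholder):

kubectl logs -n ext-namespace <pod-name> -c wait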
@jesseanttila-cai added the type/bug and type/regression labels on Apr 29, 2024
@jswxstw
Member

jswxstw commented Apr 29, 2024

It seems unreliable to determine whether the task result has been successfully reported through AnnotationKeyReportOutputsCompleted alone.

@shuangkun
Member

After reproducing it, I can confirm there are indeed some problems here; we need to think about how to fix them.

@alexlatchford

Yeah, we are just concluding an RCA after a cohort of our workflows got "stuck" in Running state as the workflow-controller was upgraded from v3.4.16 to v3.5.6. We reached the same conclusion that #11947 was likely the root cause, though #12402 seems related too. In either case, it looks like it was a transient issue on our end, caused by an incompatibility between the argoexec:v3.4.16 sidecars inside the pods of workflows active at the time of the upgrade and the new workflow-controller:v3.5.6 control plane. Thanks for the original report, and to the maintainers for all the effort here!

@agilgur5 added this to the v3.5.x patches milestone on May 13, 2024
shuangkun added a commit to shuangkun/argo-workflows that referenced this issue May 14, 2024
…. Fixes:argoproj#12993

Signed-off-by: shuangkun <tsk2013uestc@163.com>