Workflow processing fails to complete due to invalid WorkflowTaskResult from interrupted pod #12993
Pre-requisites
:latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
What happened/what you expected to happen?
As part of #12402 (included from v3.5.3 onwards), workflow pod wait-container behavior was changed to create a placeholder (incomplete) WorkflowTaskResult before waiting for the main-container to complete.
argo-workflows/cmd/argoexec/commands/wait.go
Lines 38 to 42 in 0fdf745
The WorkflowTaskResult is finalized after output artifacts, logs etc. have been handled:
argo-workflows/cmd/argoexec/commands/wait.go
Line 34 in 0fdf745
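The placeholder-then-finalize sequence described above can be modeled roughly as follows. This is only a sketch, not the code from wait.go: the taskResult type and the runWait function are hypothetical stand-ins, and only the label key is taken from this issue.

```go
package main

import "fmt"

// Label key as quoted in this issue; everything else below is a hypothetical model.
const completedLabel = "workflows.argoproj.io/report-outputs-completed"

// taskResult is a hypothetical stand-in for a WorkflowTaskResult.
// Only its labels matter for this illustration.
type taskResult struct {
	labels map[string]string
}

// runWait models the wait-container lifecycle: a placeholder result is
// written first and is finalized only if the container is not interrupted.
func runWait(interrupted bool) taskResult {
	// Placeholder created before waiting for the main container.
	r := taskResult{labels: map[string]string{completedLabel: "false"}}
	if interrupted {
		// Models e.g. pod deletion without a sufficient grace period:
		// finalization never runs, so the placeholder is left as-is.
		return r
	}
	// Stands in for the finalization step (FinalizeOutput in the issue text).
	r.labels[completedLabel] = "true"
	return r
}

func main() {
	fmt.Println(runWait(false).labels[completedLabel]) // prints "true"
	fmt.Println(runWait(true).labels[completedLabel])  // prints "false"
}
```

The second call shows the state this issue is about: the placeholder exists, but nothing ever flips the label.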
If the wait-container is interrupted in a way that prevents FinalizeOutput from being called (e.g. pod deletion without sufficient grace period), an incomplete WorkflowTaskResult remains with the workflows.argoproj.io/report-outputs-completed label set to false. Retries of the same task will produce additional WorkflowTaskResults and will not mark the previous one complete. This leaves the workflow stuck in Processing state until the WorkflowTaskResult is manually edited to mark it complete.
The reproduction example workflow simulates forced pod deletion using a pod that deletes itself, leaving behind an incomplete WorkflowTaskResult. The included workflow controller log snippet shows the resulting processing loop.
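To illustrate why a retry does not unblock the workflow, here is a minimal sketch of the situation. It assumes the controller treats any result whose label is false as still pending. The actual controller logic is not shown in this issue, so incompleteResults and the taskResult type are hypothetical.

```go
package main

import "fmt"

// Label key as quoted in this issue; the rest is a hypothetical model.
const completedLabel = "workflows.argoproj.io/report-outputs-completed"

// taskResult is a hypothetical stand-in for a WorkflowTaskResult.
type taskResult struct {
	name   string
	labels map[string]string
}

// incompleteResults returns the names of results whose completion label
// is "false". Assumption: one such result is enough to keep the workflow
// from completing.
func incompleteResults(results []taskResult) []string {
	var pending []string
	for _, r := range results {
		if r.labels[completedLabel] == "false" {
			pending = append(pending, r.name)
		}
	}
	return pending
}

func main() {
	results := []taskResult{
		// Left behind by the interrupted pod.
		{name: "attempt-0", labels: map[string]string{completedLabel: "false"}},
		// Produced by the retry, which finalized normally.
		{name: "attempt-1", labels: map[string]string{completedLabel: "true"}},
	}
	// The retry added a new result but did not touch the stale one.
	fmt.Println(incompleteResults(results)) // prints "[attempt-0]"

	// Models the manual edit mentioned above: mark the stale result complete.
	results[0].labels[completedLabel] = "true"
	fmt.Println(len(incompleteResults(results))) // prints "0"
}
```

Under this assumption, only the manual edit clears the stale entry, which matches the workaround described in the report.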
This issue may be one of the causes of #12103.
Version
v3.5.3
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
Logs from the workflow controller
Logs from your workflow's wait container