
[BUG] Subworkflow status is not reported to the parent workflow #5335

Closed
pablocasares opened this issue May 8, 2024 · 12 comments · Fixed by #5406

pablocasares commented May 8, 2024

Describe the bug

After upgrading our cluster to v1.12.0, some workflow executions got stuck in the "running" state when a subworkflow node failed. It seems the failed state from the subworkflow is not being reported to the parent workflow, so the parent stays in the running state.

We rolled back flytepropeller to v1.11.0 and the issue disappeared. It looks like a race condition, because it only happens from time to time and restarting the workflow seems to mitigate the problem.

Expected behavior

The subworkflow status should be reported back to the parent workflow.

Additional context to reproduce

No response

Screenshots

(screenshot attached)

The logs keep saying "LaunchPlan running, parallelism is now set to [X]"; the Node Event [failing] seems to be missing.

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
pablocasares added the bug and untriaged labels on May 8, 2024

runllm bot commented May 8, 2024

The issue you described, where a subworkflow's failed state is not reported to the parent workflow, causing the parent to remain in a "running" state, is not explicitly covered in the provided data sources. However, similar issues related to subworkflow status reporting and handling have been documented. For instance, there was a problem where a timed-out task in a subworkflow did not cause the workflow to fail (source). Another related issue involved aborted child workflows not properly reporting the root cause of the abort (source).

These cases suggest that Flyte has had challenges with accurately handling and reporting the status of subworkflows under certain conditions. Your specific issue might be related to these known problems, but without explicit mention in the sources, it's difficult to provide a definitive answer or solution based solely on the provided data.

Would you like more detailed information on how similar issues have been addressed in the past?

kumare3 (Contributor) commented May 8, 2024

This is a pretty serious bug? cc @pvditt

pvditt self-assigned this on May 8, 2024
pvditt added the exo label and removed the untriaged label on May 8, 2024
pvditt (Contributor) commented May 8, 2024

@pablocasares thank you for pointing this out. Were you able to determine if the workflow was still getting executed by propeller or was it just the console showing it as running? (Looking to see if this is just an eventing/persisting of state to admin bug)

Update: seems it's still running as you're still seeing logs.

pvditt (Contributor) commented May 9, 2024

@pablocasares My initial thought was that this had to do with the cache not getting updated, but I'm not noticing anything while stepping through the code and looking at the changes between 1.11.0 and 1.12.0.

Also to clarify, are you observing this behavior with a parent workflow starting subworkflows or external workflows via launch plans?

Are you able to provide a workflow that could reproduce the error? I'm unable to reproduce it running on the flyte sandbox.

pablocasares (Author) commented

Hi @pvditt, thank you for taking a look at this.

We noticed the issue in workflows that have external subworkflows via launch plans.
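
To illustrate the shape of the setup, here is a minimal sketch with placeholder names (assuming flytekit's LaunchPlan API; this is not the actual production workflow):

```python
# Minimal sketch with placeholder names, not the actual production workflow.
# The child workflow is invoked through a launch plan, so it runs as an
# external execution whose failure should propagate back to the parent node.
from flytekit import LaunchPlan, task, workflow

@task
def child_task() -> None:
    # Simulated failure standing in for the real task error.
    raise RuntimeError("child task failed")

@workflow
def child_wf() -> None:
    child_task()

# A launch plan makes the child run as a separate (external) execution.
child_lp = LaunchPlan.get_or_create(workflow=child_wf, name="child_lp")

@workflow
def parent_wf() -> None:
    # On propeller v1.12.0 we occasionally saw the parent stay in RUNNING
    # even after this external execution had already failed.
    child_lp()
```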

We weren't able to reproduce it either, because it doesn't happen in every execution. It seems to happen only in some executions of those workflows, and they don't fail consistently. Due to our high load, we hit the case from time to time.

After we downgraded propeller to v1.11.0 yesterday, this issue did not happen again and the subworkflow tasks that were stuck in "Running" went to "Failed" as expected.

Also, yesterday after the downgrade to v1.11.0 we noticed another issue that might be related to this. I'm not sure if it helps, but I will share it just in case. In one workflow execution the subworkflow failed and the parent failed after that, but one node got stuck in the "Running" state and the error message shown in flyteconsole was:

Workflow[sp-annotation-data:development:sp_annotation_data.workflows.upload_samples_workflow.upload_samples_workflow] failed. RuntimeExecutionError: max number of system retry attempts [51/50] exhausted. Last known status message: Workflow[sp-annotation-data:development:sp_annotation_data.workflows.upload_samples_workflow.upload_samples_workflow] failed. CausedByError: Failed to propagate Abort for workflow. Error: 0: 0: 0: [system] unable to read futures file, maybe corrupted, caused by: [system] Failed to read futures protobuf file., caused by: path:gs://mybucket/metadata/propeller/wf-dev-4a61a44033f/n6-n5/data/0/futures.pb: not found
1: 0: [system] unable to read futures file, maybe corrupted, caused by: [system] Failed to read futures protobuf file., caused by: path:gs://mybucket/metadata/propeller/wf-dev-4a61a44033f/n6-n5/data/0/futures.pb: not found

Please note that we are on Flyte Admin v1.12.0 and Propeller v1.11.0, and we noticed it only for this case. We cannot confirm that this happens when both components are on v1.12.0. I'm sharing it just in case it helps you identify the issue.

Thank you.

pvditt (Contributor) commented May 9, 2024

@pablocasares thank you for the added info. And just to circle back/confirm,

  • Were you able to determine if the parent workflow was still getting executed by propeller or was it just the console showing it as running?

  • When you mention it happens from time to time, does that mean you had cases of external/child workflows failing and then the parent workflow correctly handling that state while running propeller on 1.12.0?

  • When "restarting the workflow seems to mitigate the problem" - with this did you terminate and then relaunch a parent workflow while on propeller 1.12.0?

pablocasares (Author) commented

  • Were you able to determine if the parent workflow was still getting executed by propeller or was it just the console showing it as running?

The parent workflow was still getting executed but stuck, because it thought the subworkflow node was still running (you can check the YAML I sent in our internal Slack channel).

  • When you mention it happens from time to time, does that mean you had cases of external/child workflows failing and then the parent workflow correctly handling that state while running propeller on 1.12.0?

Yes, we have workflows with external subworkflows failing that are correctly handled on 1.12.0.

  • When "restarting the workflow seems to mitigate the problem" - with this did you terminate and then relaunch a parent workflow while on propeller 1.12.0?

Yes, we aborted the workflow and then relaunched it. As I said, this happens only in some of the executions. An external subworkflow failing is needed for this to occur, but an external subworkflow failing doesn't mean the issue happens. In other words, this happens sometimes when the external subworkflow fails; in other executions the external subworkflow fails and it is handled properly.

pvditt (Contributor) commented May 10, 2024

@pablocasares thank you for the follow-up. Apologies for the mix-up; I was just added to the Slack channel. Let me look back into this.

pvditt (Contributor) commented May 13, 2024

@pablocasares would you still have access to your propeller logs? If so, can you check whether "Retrieved Launch Plan status is nil. This might indicate pressure on the admin cache." was getting logged when you noticed the issue with propeller v1.12.0?
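
For example, one quick way to scan for that message (a sketch assuming the kubernetes Python client and placeholder pod/namespace names for the propeller deployment):

```python
# Sketch only: assumes the kubernetes Python client, a kubeconfig with access
# to the cluster, and placeholder pod/namespace names for flytepropeller.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

# Placeholder pod name and namespace; adjust to the actual deployment.
logs = core.read_namespaced_pod_log(name="flytepropeller-0", namespace="flyte")

for line in logs.splitlines():
    if "Retrieved Launch Plan status is nil" in line:
        print(line)
```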

pablocasares (Author) commented

Hi again @pvditt, I checked the logs and I can see that message happening when we had propeller v1.12.0 and even now with v1.11.0. Seems to be happening several times per minute.

pvditt (Contributor) commented May 16, 2024

@pablocasares

I think we've potentially pinned the problem down, but I'm having difficulty reproducing the race condition. Would you still have access to the flyteadmin logs from when child/external workflows were not propagating status to their parent workflow? I'm interested to see whether you're seeing continued polling of GetExecution for the execution_id of a subworkflow by the admin-launcher's cache update loop.

For example:

2024/05/16 01:11:16 /Users/pauldittamo/src/flyte/flyteadmin/pkg/repositories/gormimpl/execution_repo.go:44
[15.565ms] [rows:1] SELECT * FROM "executions" WHERE "executions"."execution_project" = 'flytesnacks' AND "executions"."execution_domain" = 'development' AND "executions"."execution_name" = 'fh662hcses4ry1' LIMIT 1

Note: there could still be logs showing this occasionally, as other parts of Flyte (such as the console) will hit this endpoint. I'm looking to see whether you stop seeing continued logs at roughly the cache sync cycle cadence (defaults to 30s) while the parent workflow is stuck in running.
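
Separately, one way to cross-check what admin has recorded for the two executions (a sketch assuming flytekit's FlyteRemote; the project, domain, and execution names below are placeholders):

```python
# Sketch only: uses flytekit's FlyteRemote to ask admin directly for the
# recorded phases of the parent and child executions. All identifiers below
# are placeholders.
from flytekit.configuration import Config
from flytekit.remote import FlyteRemote

remote = FlyteRemote(
    Config.auto(),
    default_project="flytesnacks",  # placeholder project
    default_domain="development",   # placeholder domain
)

parent = remote.fetch_execution(name="parent-exec-id")  # placeholder id
child = remote.fetch_execution(name="child-exec-id")    # placeholder id

# If admin already shows the child as FAILED while the parent stays RUNNING,
# the gap is between propeller's node state and what it is eventing to admin.
print("parent phase:", parent.closure.phase)
print("child phase:", child.closure.phase)
```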

pablocasares (Author) commented

Hi @pvditt thanks for the update.

I did a quick search on the logs and I found only 1 line with the parent wf execution id:

SELECT * FROM "executions" WHERE "executions"."execution_project" = 'key-metrics-pipelines' AND "executions"."execution_domain" = 'production' AND "executions"."execution_name" = 'nfu3ftwauvzag7e23pgx' LIMIT 1

I used the same filter with the subworkflow execution id and I couldn't find any line.

I don't see continued logs for the parent execution id (just 1 line) and no logs at all for the child wf.
