[BUG] Flyte array plugin fails with "object has been modified" #5273

Open
2 tasks done
pablocasares opened this issue Apr 23, 2024 · 3 comments
Labels
bug Something isn't working

pablocasares commented Apr 23, 2024

Describe the bug

Flyte array plugin tasks fail, apparently because the pod is modified externally, so the pod information stored in propeller no longer matches the actual pod in the cluster.

The error is:

Workflow[ingestion-pipeline:production:ingestion_pipeline.ingestion.ingestion_workflow] failed. RuntimeExecutionError: max number of system retry attempts [51/50] exhausted. Last known status message: failed at Node[n1]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [k8s-array]: Operation cannot be fulfilled on pods "kac2sxex6kvectdvx3vk-n3-0-n1-0-1195": the object has been modified; please apply your changes to the latest version and try again

The system appears to retry 50 times, but I think the updated pod information is never fetched from the cluster again. If that is the case, retrying 50 times will not help.
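To illustrate why the retry count would not matter in that case, here is a minimal, self-contained Go sketch of optimistic concurrency. It does not use the Flyte or Kubernetes client code: `FakeAPIServer`, `Pod`, `ErrConflict`, and `RetryStale` are hypothetical names made up for this illustration, and an integer stands in for the real `resourceVersion` string. It models the suspected behavior only and is not a confirmed trace of what propeller does.

```go
package main

import "errors"

// ErrConflict stands in for the Kubernetes "object has been modified" error.
var ErrConflict = errors.New("the object has been modified")

// Pod is a hypothetical, stripped-down pod with only the fields needed here.
type Pod struct {
	ResourceVersion int // assumption: an int instead of the real opaque string
	Finalizers      []string
}

// FakeAPIServer is a hypothetical in-memory stand-in for the API server.
type FakeAPIServer struct {
	stored Pod
}

// Update accepts a write only when the caller's ResourceVersion matches
// the stored one, then bumps the stored version.
func (s *FakeAPIServer) Update(p Pod) error {
	if p.ResourceVersion != s.stored.ResourceVersion {
		return ErrConflict
	}
	p.ResourceVersion++
	s.stored = p
	return nil
}

// RetryStale resends the same stale object up to maxAttempts times.
// It returns the number of attempts made and the last error.
func RetryStale(s *FakeAPIServer, stale Pod, maxAttempts int) (int, error) {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = s.Update(stale); err == nil {
			return attempt, nil
		}
	}
	return maxAttempts, err
}
```

With a stored version of 2 and a cached version of 1, `RetryStale(server, stale, 50)` uses all 50 attempts and still returns the conflict error, because nothing inside the loop changes the version being sent.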

Maybe there's a missing Pod.Get() around these lines: https://github.com/flyteorg/flyte/blob/master/flyteplugins/go/tasks/plugins/array/k8s/subtask.go#L108-L141

resourceToFinalize seems to always be an empty skeleton because no Get operation is performed on the actual pod.

Expected behavior

If the pod changes externally, the plugin should detect that and properly refresh the object while retrying.
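As a rough sketch of that expected behavior, the loop below re-reads the object after every conflict before trying again. It is a self-contained simulation, not Flyte code: `FakeAPIServer`, `Pod`, `ErrConflict`, and `ClearFinalizersWithRefresh` are hypothetical names, the integer `ResourceVersion` is a simplification, and clearing finalizers is only an assumed example of the write the plugin wants to make.

```go
package main

import "errors"

// ErrConflict stands in for the Kubernetes "object has been modified" error.
var ErrConflict = errors.New("the object has been modified")

// Pod is a hypothetical, stripped-down pod with only the fields needed here.
type Pod struct {
	ResourceVersion int // assumption: an int instead of the real opaque string
	Finalizers      []string
}

// FakeAPIServer is a hypothetical in-memory stand-in for the API server.
type FakeAPIServer struct {
	stored Pod
}

// Get returns a copy of the latest stored object.
func (s *FakeAPIServer) Get() Pod {
	p := s.stored
	p.Finalizers = append([]string(nil), s.stored.Finalizers...)
	return p
}

// Update accepts a write only when the caller's ResourceVersion matches
// the stored one, then bumps the stored version.
func (s *FakeAPIServer) Update(p Pod) error {
	if p.ResourceVersion != s.stored.ResourceVersion {
		return ErrConflict
	}
	p.ResourceVersion++
	s.stored = p
	return nil
}

// ClearFinalizersWithRefresh starts from a possibly stale cached copy and,
// on each conflict, replaces it with a fresh Get before retrying.
// It returns the number of attempts made and the final error, if any.
func ClearFinalizersWithRefresh(s *FakeAPIServer, cached Pod, maxAttempts int) (int, error) {
	pod := cached
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		pod.Finalizers = nil
		err := s.Update(pod)
		if err == nil {
			return attempt, nil
		}
		if !errors.Is(err, ErrConflict) {
			return attempt, err // not a conflict: refreshing will not help
		}
		// Conflict: discard the stale copy and read the latest version.
		pod = s.Get()
	}
	return maxAttempts, ErrConflict
}
```

Starting from a stale cached copy, the first attempt conflicts, the refresh picks up the current version, and the second attempt succeeds. In real client-go code this pattern is commonly written with `retry.RetryOnConflict` from `k8s.io/client-go/util/retry`, with the Get performed inside the retried function; whether that fits the array plugin is for the maintainers to judge.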

Additional context to reproduce

No response

Screenshots

No response

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
pablocasares added the bug and untriaged labels on Apr 23, 2024

runllm bot commented Apr 23, 2024

The issue you're encountering with the Flyte array plugin, where tasks fail due to external modifications to the pod, is related to the handling of Kubernetes resources by Flyte's propeller. This problem typically arises when there are concurrent updates to the same Kubernetes resource, possibly from different controllers or manual interventions.

In the context of Flyte, when a pod is modified externally, the stored version in FlytePropeller becomes outdated, leading to conflicts. The error message you see indicates that the system has tried to apply changes based on an outdated version of the pod, which Kubernetes rejects to prevent unintended overwrites.

The solution might involve implementing a mechanism to fetch the latest pod state from the Kubernetes API before retrying operations, ensuring that FlytePropeller operates on the current state of the pod. This could be achieved by adding a Pod.Get() operation, as you suggested, to refresh the pod's state in the Flyte system before attempting retries or updates.

This issue is not explicitly covered in the provided data sources, so the exact implementation details or confirmation of this behavior in Flyte's codebase cannot be verified from the sources. However, the general behavior of Kubernetes and the pattern of errors you described align with common issues faced in systems interacting with Kubernetes resources under concurrent modifications.

Would you like more detailed guidance on how to potentially implement this fix or further investigate this issue in the Flyte codebase?

This is an AI-generated response and your feedback is appreciated! Please leave a 👍 if this is helpful and 👎 if it is not.

Mention @RunLLM to ask follow-up questions. See runllm.com for more!

Contributor

kumare3 commented Apr 24, 2024

Maybe we should enable finalizers.

pablocasares (Author) commented

We are already setting inject-finalizer: true in our setup.

Would the get() operation I mentioned help there?

eapolinario removed the untriaged label on May 2, 2024