Object stuck in loop with inconsistent status updates and handler failure #1116

Closed
ozlerhakan opened this issue May 15, 2024 · 1 comment
Labels
bug Something isn't working

Comments

ozlerhakan commented May 15, 2024

Long story short

Hi @nolar,

First of all, thank you for the effort you and the contributors have put into this project, providing such a solid framework. We've recently implemented Kopf in prod to manage data for our search clusters.

I want to bring up an unusual situation we've encountered, and I hope we can find a way to mitigate it. Our operator uses a single timer as the main handler, periodically checking our workflow, along with an update handler on the status.storage=[] field whose contents are stored on S3. Recently, our CRD object got stuck in a loop, constantly re-checking the same folder for 7 days in production. The general workflow is: the timer stores the details of each successful update as an object in status.storage=[] via patch.setdefault("status", {})["storage"], and the operator then stores those details to S3 with the help of the status.storage update handler.
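
A minimal sketch of this pattern, using decorator-style registration and a single resource for both handlers; the handler bodies, the interval, and the S3 upload step are illustrative assumptions rather than our actual production code:

import kopf

@kopf.timer("app.coperator.ai/v1", "coperator", interval=60.0)
def timer_workflow_operator(status, patch, **_):
    # Periodically check the workflow; on a successful update, append its
    # details to status.storage for the update handler to pick up.
    details = {"checked": True}  # placeholder for the real update details
    storage = list(status.get("storage") or [])
    storage.append(details)
    patch.setdefault("status", {})["storage"] = storage

@kopf.on.update("app.coperator.ai/v1", "coperator", field="status.storage")
def update_storage_state(new, **_):
    # Persist the latest status.storage entries to S3 (upload call omitted).
    ...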

Regarding this issue, here are some observations:

  1. The update handler did not trigger for 7 days, even though status.storage was updated with the latest item.
  2. Although we used patch.status within the handler, the status.storage that the timer receives through its status parameter did not include the latest item. Instead, it still showed the item written 7 days earlier by the same workflow, and this process kept repeating for 7 days until we noticed the problem.
  3. We are able to suspend a plan's workflow using when= in the handlers, but from time to time neither kubectl edit nor kubectl apply -f triggered any message like Handler 'Operator.resume_operator_handler/spec' succeeded. The changes did show up in the spec when we checked it, yet the plan was still not suspended. I also noticed that kopf.zalando.org/last-handled-configuration was pointing to an old configuration (see the sketch after this list).

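For observation 3, here is a hedged sketch of one way to inspect the stale annotation, assuming a cluster-scoped CRD and a placeholder object name:

import json
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()
obj = api.get_cluster_custom_object("app.coperator.ai", "v1", "coperator", "example-plan")
annotation = obj["metadata"]["annotations"]["kopf.zalando.org/last-handled-configuration"]
last_handled = json.loads(annotation)
# False here means the annotation lags behind the live spec, as we observed.
print(last_handled.get("spec") == obj.get("spec"))
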
We also use the list_cluster_custom_object method to list the CRD objects and check whether any plans overlap on specific values. When we delete a CRD object and recreate the same plan under a different name, the list still includes the old CRD object, causing a validation error even though the object is no longer present in the cluster. Do you think this is related to caching in the operator, or to the way the Kubernetes API is used within the operator?

The workaround is not desirable from our side, but restarting the operator, deleting the plan, and recreating it with a new metadata.name fixed the problem.

Sadly, I have no useful error or warning messages from this extended period. Please let me know if you need further details about our process.

Thanks!

Kopf version

1.37.1

Kubernetes version

v1.27.12

Python version

3.12

Code

import kopf

# Main periodic handler that checks the workflow.
kopf.timer(
    "app.coperator.ai/v1",
    "coperator",
    when=suspend_operation,
)(co.timer_workflow_operator)

# Update handler that persists status.storage to S3.
kopf.on.update(
    "app.operator.ai/v1",
    "operator",
    field="status.storage",
    when=suspend_operation,
)(storage_manager.update_storage_state)

# Listing the CRD objects to check for overlapping plans;
# `custom` is presumably a kubernetes.client.CustomObjectsApi instance.
crs = custom.list_cluster_custom_object(
    group=spec.group,
    version=spec.version,
    plural=spec.plural_kind,
).get("items")

Logs

No response

Additional information

No response

ozlerhakan added the bug (Something isn't working) label on May 15, 2024
ozlerhakan (Author) commented

Follow-up on this: I've removed the status.storage handler, because kopf.zalando.org/last-handled-configuration kept storing the full list of items for this field and gradually became bloated. I've moved that part of the workflow from the handler into our Timer module. For our use case, this seems to be a decent solution at the moment.
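
A rough sketch of the consolidated approach, with hypothetical helper names; the status.storage update handler registration is simply dropped:

import kopf

def upload_details_to_s3(details):
    # Hypothetical stand-in for the actual S3 upload logic.
    pass

@kopf.timer("app.coperator.ai/v1", "coperator", interval=60.0)
def timer_workflow_operator(**_):
    # Check the workflow and, on a successful update, push the details to S3
    # directly from the timer instead of going through a status.storage handler.
    details = {"checked": True}  # placeholder for the real update details
    upload_details_to_s3(details)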

> We also use the list_cluster_custom_object method to list the CRD objects and check whether any plans overlap on specific values. When we delete a CRD object and recreate the same plan under a different name, the list still includes the old CRD object, causing a validation error even though the object is no longer present in the cluster. Do you think this is related to caching in the operator, or to the way the Kubernetes API is used within the operator?

Regarding this case, I've added the watch=False parameter to this call. In our integration tests, no stale items are returned from this method after deleting a CRD object and recreating it with different metadata.
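
A minimal sketch of the adjusted call, assuming custom is a kubernetes.client.CustomObjectsApi instance as in the code snippet above:

crs = custom.list_cluster_custom_object(
    group=spec.group,
    version=spec.version,
    plural=spec.plural_kind,
    watch=False,
).get("items")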

I'm closing this issue. If there are any new hiccups, I'll bring them up here. Cheers!
