Drift detection and Continuous Reconciliation #4895

Open
kutsyk opened this issue Apr 30, 2024 · 8 comments
Labels
kind/question Indicates an issue that is a support question.

Comments

@kutsyk

kutsyk commented Apr 30, 2024

I'm trying to figure out how Karmada handles drift detection and continuous reconciliation.

It is not clear to me how to do the following:

  1. How to check whether all propagation policies and override policies have been rolled out and applied correctly?
  2. How to detect whether one of the managed clusters failed to apply a propagation/override policy?
  3. Which metrics to use to monitor whether all propagated components are working in all managed clusters?

I believe these three questions are nothing new, and most people who use an orchestration tool should already have answered them, but I can't get my head around this with Karmada.

Thanks in advance for the help.

@kutsyk kutsyk added the kind/question Indicates an issue that is a support question. label Apr 30, 2024
@XiShanYongYe-Chang
Member

XiShanYongYe-Chang commented May 6, 2024

How to check whether all propagation policies and override policies have been rolled out and applied correctly?

PP and OP need to be considered separately. For a PropagationPolicy (PP), you can check the FullyApplied condition in the status of the ResourceBinding. For an OverridePolicy (OP), you have to check yourself whether the overridden configuration has taken effect in the Work resource. In general, the current solution is still lacking for automated checks, and status checks cannot be performed directly from the API.
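As a rough illustration of the PP check described above, the sketch below decodes the JSON of a ResourceBinding and looks for the FullyApplied condition. The `resourceBinding` and `condition` types are local stand-ins written for this example (not the Karmada API types), and the sample payload is made up; real input would come from the Karmada API server, for example from `kubectl get resourcebinding <name> -o json`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// condition mirrors the subset of a Kubernetes status condition used here.
type condition struct {
	Type    string `json:"type"`
	Status  string `json:"status"`
	Message string `json:"message"`
}

// resourceBinding is a trimmed-down, hypothetical stand-in for the
// ResourceBinding object; only status.conditions is decoded.
type resourceBinding struct {
	Status struct {
		Conditions []condition `json:"conditions"`
	} `json:"status"`
}

// fullyApplied reports whether the FullyApplied condition is "True" and
// returns the condition message for diagnostics.
func fullyApplied(raw []byte) (bool, string, error) {
	var rb resourceBinding
	if err := json.Unmarshal(raw, &rb); err != nil {
		return false, "", err
	}
	for _, c := range rb.Status.Conditions {
		if c.Type == "FullyApplied" {
			return c.Status == "True", c.Message, nil
		}
	}
	// No such condition yet: treat as not applied.
	return false, "FullyApplied condition not reported yet", nil
}

func main() {
	// Made-up sample payload for illustration only.
	sample := []byte(`{"status":{"conditions":[
		{"type":"Scheduled","status":"True"},
		{"type":"FullyApplied","status":"False","message":"sample message"}]}}`)
	ok, msg, err := fullyApplied(sample)
	fmt.Println(ok, msg, err)
}
```

A missing condition is treated as "not applied" here, which is a conservative choice for alerting; a different default may suit other setups.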

How to detect whether one of the managed clusters failed to apply a propagation/override policy?

A PP or OP either matches a resource or it does not. When matching succeeds, the result is reflected in the ResourceBinding (for a PP) and in the Work resources (for an OP).

The resource template in the Work is then synchronized to the member cluster. You can see the result of that synchronization in the Work status.
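The Work status check could be automated along these lines. This is a minimal sketch under two assumptions: that each Work carries an `Applied` condition in `status.conditions`, and that Works for a member cluster live in a namespace named `karmada-es-<cluster>`. The types and the `failedWorks` helper are hypothetical and defined only for this example; the input is the JSON of a Work list, such as the output of `kubectl get works -A -o json`.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// condition mirrors the subset of a Kubernetes status condition used here.
type condition struct {
	Type    string `json:"type"`
	Status  string `json:"status"`
	Message string `json:"message"`
}

// work is a trimmed-down, hypothetical stand-in for the Work object.
type work struct {
	Metadata struct {
		Name      string `json:"name"`
		Namespace string `json:"namespace"`
	} `json:"metadata"`
	Status struct {
		Conditions []condition `json:"conditions"`
	} `json:"status"`
}

type workList struct {
	Items []work `json:"items"`
}

// executionPrefix is the assumed namespace prefix for per-cluster Works.
const executionPrefix = "karmada-es-"

// failedWorks maps each cluster name to "workName: message" entries for
// Works whose Applied condition is missing or not "True".
func failedWorks(raw []byte) (map[string][]string, error) {
	var list workList
	if err := json.Unmarshal(raw, &list); err != nil {
		return nil, err
	}
	failed := map[string][]string{}
	for _, w := range list.Items {
		cluster := strings.TrimPrefix(w.Metadata.Namespace, executionPrefix)
		applied, msg := false, "Applied condition not reported yet"
		for _, c := range w.Status.Conditions {
			if c.Type == "Applied" {
				applied, msg = c.Status == "True", c.Message
			}
		}
		if !applied {
			failed[cluster] = append(failed[cluster], w.Metadata.Name+": "+msg)
		}
	}
	return failed, nil
}

func main() {
	// Made-up sample payload for illustration only.
	sample := []byte(`{"items":[
		{"metadata":{"name":"web","namespace":"karmada-es-member1"},
		 "status":{"conditions":[{"type":"Applied","status":"True"}]}},
		{"metadata":{"name":"web","namespace":"karmada-es-member2"},
		 "status":{"conditions":[{"type":"Applied","status":"False","message":"sample error"}]}}]}`)
	failed, err := failedWorks(sample)
	fmt.Println(failed, err)
}
```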

Which metrics to use to monitor whether all propagated components are working in all managed clusters?

You can check the value of the health field in the Work status:

Health ResourceHealth `json:"health,omitempty"`

Users can define how this value is computed with the InterpretHealth operation of the custom interpreter.
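To show how the health field might be consumed, here is a small sketch that lists every manifest in a Work whose health is not `Healthy`. It assumes the field sits under `status.manifestStatuses[].health` next to an `identifier` with `kind`, `namespace`, and `name`; the types and the `notHealthy` helper are hypothetical stand-ins for this example, and the payload is made up.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// manifestStatus is a trimmed-down, hypothetical stand-in for one entry
// of status.manifestStatuses in a Work.
type manifestStatus struct {
	Identifier struct {
		Kind      string `json:"kind"`
		Namespace string `json:"namespace"`
		Name      string `json:"name"`
	} `json:"identifier"`
	Health string `json:"health"`
}

type work struct {
	Status struct {
		ManifestStatuses []manifestStatus `json:"manifestStatuses"`
	} `json:"status"`
}

// notHealthy returns one line per manifest whose health is anything other
// than "Healthy", including manifests with no health reported.
func notHealthy(raw []byte) ([]string, error) {
	var w work
	if err := json.Unmarshal(raw, &w); err != nil {
		return nil, err
	}
	var out []string
	for _, m := range w.Status.ManifestStatuses {
		if m.Health != "Healthy" {
			out = append(out, fmt.Sprintf("%s %s/%s: %q",
				m.Identifier.Kind, m.Identifier.Namespace, m.Identifier.Name, m.Health))
		}
	}
	return out, nil
}

func main() {
	// Made-up sample payload for illustration only.
	sample := []byte(`{"status":{"manifestStatuses":[
		{"identifier":{"kind":"Deployment","namespace":"default","name":"web"},"health":"Unhealthy"},
		{"identifier":{"kind":"Service","namespace":"default","name":"web"},"health":"Healthy"}]}}`)
	lines, err := notHealthy(sample)
	fmt.Println(lines, err)
}
```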

@RainbowMango
Member

Just out of curiosity, are you evaluating Karmada? Do you have a timeline in mind? Knowing that, we can see how to support you better.

@kutsyk
Author

kutsyk commented May 6, 2024

Hi,
@XiShanYongYe-Chang, thanks for the clarification.

@RainbowMango, yes, I'm evaluating the tool against a set of criteria to better understand how it works and whether we should use it. We are at the end of our schedule, and the last points I need to understand are those described in my initial question.

It seems there is a way to monitor status, but only through objects and the values of their fields. No data is exposed as metrics; do I understand this correctly?

@RainbowMango
Member

We have some metrics exposed by the /metrics endpoint, but they do not include the items you want.
That said, metrics can be added whenever there is a need.

@kutsyk
Author

kutsyk commented May 8, 2024

@RainbowMango, @XiShanYongYe-Chang, do you have any idea how to identify whether an OverridePolicy failed and how to find the reasons for it?

Also, what happens if an override policy that should be applied to 5 clusters fails on 2 of them? How can I identify those clusters and the reason for the error?

Thanks

@XiShanYongYe-Chang
Member

Hi @kutsyk, do you mean that the OP fails to apply because of a problem in how the OP is written, or because synchronization to the member cluster fails?

@kutsyk
Author

kutsyk commented May 8, 2024 via email

@XiShanYongYe-Chang
Member

In my opinion, when resource synchronization fails, it is not easy to tell whether the OP caused the failure. In other words, the two are decoupled: the OP is applied to the Work resource, and synchronization to the member cluster is a later operation performed on that Work resource. Can you give an example of a failure caused by the OP?
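Whatever the root cause, the clusters where synchronization failed can usually be listed without knowing whether the OP is responsible. The sketch below assumes the ResourceBinding reports per-cluster results in `status.aggregatedStatus` with `clusterName`, `applied`, and `appliedMessage` fields; the types and the `notApplied` helper are hypothetical stand-ins for this example, and the five-cluster payload is made up.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// aggregatedItem is a trimmed-down, hypothetical stand-in for one entry
// of status.aggregatedStatus in a ResourceBinding.
type aggregatedItem struct {
	ClusterName    string `json:"clusterName"`
	Applied        bool   `json:"applied"`
	AppliedMessage string `json:"appliedMessage"`
}

type resourceBinding struct {
	Status struct {
		AggregatedStatus []aggregatedItem `json:"aggregatedStatus"`
	} `json:"status"`
}

// notApplied maps each cluster where the resource was not applied to the
// message reported for that cluster.
func notApplied(raw []byte) (map[string]string, error) {
	var rb resourceBinding
	if err := json.Unmarshal(raw, &rb); err != nil {
		return nil, err
	}
	failed := map[string]string{}
	for _, item := range rb.Status.AggregatedStatus {
		// A missing "applied" field decodes to false and counts as a failure.
		if !item.Applied {
			failed[item.ClusterName] = item.AppliedMessage
		}
	}
	return failed, nil
}

func main() {
	// Made-up sample: five target clusters, two of which failed.
	sample := []byte(`{"status":{"aggregatedStatus":[
		{"clusterName":"member1","applied":true},
		{"clusterName":"member2","applied":true},
		{"clusterName":"member3","applied":true},
		{"clusterName":"member4","appliedMessage":"sample error A"},
		{"clusterName":"member5","applied":false,"appliedMessage":"sample error B"}]}}`)
	failed, err := notApplied(sample)
	fmt.Println(failed, err)
}
```

The reported message describes why the apply failed in the member cluster; deciding whether an override produced the offending manifest still requires inspecting the rendered Work.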
