Drift detection and Continuous Reconciliation #4895

kutsyk · 2024-04-30T16:12:14Z

I'm trying to figure out how Karmada solves the problems of drift detection and continuous reconciliation.

It is not clear to me how to do the following:

How to check whether all propagation policies and override policies have been rolled out and applied correctly?
How to detect whether one of the managed clusters failed to apply a propagation/override policy?
What metrics to use to monitor whether all propagated components work in all managed clusters?

I believe these three questions are nothing new, and most people who use an orchestration tool should already have answered them, but I can't get my head around this with Karmada.

Thanks in advance for the help.

XiShanYongYe-Chang · 2024-05-06T03:03:19Z

How to check whether all propagation policies and override policies have been rolled out and applied correctly?

PropagationPolicy (PP) and OverridePolicy (OP) need to be treated separately. For a PP, you can check the status of the FullyApplied condition on the ResourceBinding. For an OP, you have to check yourself whether the overridden configuration has taken effect in the Work resource. In general, the current solution still falls short for automated checks, and the status cannot be read directly from the policy API.
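As a rough illustration of the ResourceBinding check described above, the sketch below inspects a ResourceBinding that has already been fetched as JSON (for example with `kubectl get resourcebinding <name> -o json` against the Karmada API server) and parsed into a dict. The per-cluster fields `status.aggregatedStatus[].clusterName`, `applied`, and `appliedMessage` are assumptions about the ResourceBinding API that are not stated in this thread, and the sample object and its message are made up, so verify the field names against the Karmada version in use.

```python
def binding_fully_applied(binding: dict) -> bool:
    """Return True if the FullyApplied condition is present with status "True"."""
    conditions = binding.get("status", {}).get("conditions") or []
    return any(
        c.get("type") == "FullyApplied" and c.get("status") == "True"
        for c in conditions
    )


def clusters_not_applied(binding: dict) -> dict:
    """Map cluster name to message for every cluster not reported as applied.

    Assumption: each aggregatedStatus item carries clusterName, applied,
    and appliedMessage. Check this against your Karmada API version.
    """
    items = binding.get("status", {}).get("aggregatedStatus") or []
    return {
        item.get("clusterName", "<unknown>"): item.get("appliedMessage", "")
        for item in items
        if not item.get("applied", False)
    }


# Made-up sample: two target clusters, one of which failed to apply.
sample_binding = {
    "status": {
        "conditions": [
            {"type": "Scheduled", "status": "True"},
            {"type": "FullyApplied", "status": "False"},
        ],
        "aggregatedStatus": [
            {"clusterName": "member1", "applied": True},
            {
                "clusterName": "member2",
                "applied": False,
                "appliedMessage": "example failure message",
            },
        ],
    }
}

print(binding_fully_applied(sample_binding))
print(clusters_not_applied(sample_binding))
```

A binding with no FullyApplied condition at all is reported as not fully applied, which is the conservative reading for an automated check.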

How to detect whether one of the managed clusters failed to apply a propagation/override policy?

A PP or OP either matches a resource or it does not. When a match succeeds, the result is reflected in the ResourceBinding (for a PP) and in the Work resources (for an OP).

Then, the resource template in the Work is synchronized to the member cluster. You can view the synchronization result in the Work status.
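To make the Work status check more concrete, here is a minimal sketch that scans a list of Work objects (parsed from JSON, for example from `kubectl get works -A -o json`) and reports the ones whose sync did not succeed. It relies on two assumptions that are not stated in this thread: that the sync result appears as a condition of type `Applied` in `status.conditions`, and that each Work lives in an execution namespace named `karmada-es-<cluster>`. The sample objects, reason, and message are made up.

```python
# Assumption: Work objects live in per-cluster namespaces with this prefix.
EXECUTION_NS_PREFIX = "karmada-es-"


def failed_works(works: list) -> list:
    """Return (cluster, work_name, reason, message) for every Work whose
    Applied condition is present but not "True".

    Works that have no Applied condition yet are skipped, not reported.
    """
    failures = []
    for work in works:
        meta = work.get("metadata", {})
        namespace = meta.get("namespace", "")
        if namespace.startswith(EXECUTION_NS_PREFIX):
            cluster = namespace[len(EXECUTION_NS_PREFIX):]
        else:
            cluster = namespace
        for cond in work.get("status", {}).get("conditions") or []:
            if cond.get("type") == "Applied" and cond.get("status") != "True":
                failures.append(
                    (
                        cluster,
                        meta.get("name", ""),
                        cond.get("reason", ""),
                        cond.get("message", ""),
                    )
                )
    return failures


# Made-up sample: the same workload synced to two clusters, one failing.
sample_works = [
    {
        "metadata": {"name": "nginx-abc", "namespace": "karmada-es-member1"},
        "status": {"conditions": [{"type": "Applied", "status": "True"}]},
    },
    {
        "metadata": {"name": "nginx-abc", "namespace": "karmada-es-member2"},
        "status": {
            "conditions": [
                {
                    "type": "Applied",
                    "status": "False",
                    "reason": "ExampleReason",
                    "message": "example failure message",
                }
            ]
        },
    },
]

for failure in failed_works(sample_works):
    print(failure)
```

Grouping the failures by cluster in this way gives the list of clusters where propagation did not land, together with whatever reason the Work status recorded.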

What metrics to use to monitor whether all propagated components work in all managed clusters?

You can check the value of the health field in Work status:

karmada/pkg/apis/work/v1alpha1/work_types.go, line 107 in 5e1191f:

Health ResourceHealth `json:"health,omitempty"`

Users can define this value through the InterpretHealth operation of a custom resource interpreter.
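Until such values are exposed as metrics, one option is to aggregate the health field yourself. The sketch below counts health values across Work objects parsed from JSON. It assumes the field sits at `status.manifestStatuses[].health` and takes values such as `Healthy`, `Unhealthy`, and `Unknown`; neither the path nor the value set is confirmed in this thread, and the sample data is made up.

```python
from collections import Counter


def health_counts(works: list) -> Counter:
    """Count health values over all manifest statuses of the given Works.

    Assumption: health is reported per manifest under
    status.manifestStatuses[].health. A missing or empty value is
    counted as "Unknown".
    """
    counts = Counter()
    for work in works:
        statuses = work.get("status", {}).get("manifestStatuses") or []
        for manifest_status in statuses:
            counts[manifest_status.get("health") or "Unknown"] += 1
    return counts


# Made-up sample: three Works with four manifests in total.
sample_works = [
    {"status": {"manifestStatuses": [{"health": "Healthy"}, {"health": "Healthy"}]}},
    {"status": {"manifestStatuses": [{"health": "Unhealthy"}]}},
    {"status": {"manifestStatuses": [{}]}},
]

print(dict(health_counts(sample_works)))
```

The resulting counts could be fed into whatever monitoring system is already in place, for example as one gauge per health value.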

RainbowMango · 2024-05-06T10:52:08Z

Just out of curiosity, are you evaluating Karmada? Do you have a schedule or something? We can see how to support you better.

kutsyk · 2024-05-06T11:56:23Z

Hi,
@XiShanYongYe-Chang , thanks for the clarification.

@RainbowMango, yes, I'm evaluating the tool against a set of points, to better understand how it works and whether we should use it. We are at the end of our schedule, and the last points I have to understand are those described in my initial question.

It seems there is a way to monitor statuses, but only through objects and the values in their fields. There is no data exposed as metrics, do I understand this correctly?

RainbowMango · 2024-05-06T13:17:17Z

We have some metrics exposed by the /metrics endpoint, but they do not include the items you want.
I would say that metrics can be added at any time when there is a need.

kutsyk · 2024-05-08T09:40:48Z

@RainbowMango, @XiShanYongYe-Chang, do you have any idea how to identify whether an OverridePolicy failed and how to find the reasons for it?

Also, what will happen if an override policy that should be applied to 5 clusters fails on 2? How can I identify those clusters and the reason for the error?

Thanks

XiShanYongYe-Chang · 2024-05-08T10:21:35Z

Hi @kutsyk, do you mean that the OP apply failure is caused by a problem in how the OP is written, or by a failure of synchronization to the member cluster?

kutsyk · 2024-05-08T10:23:48Z

Hi,

The failure of synchronization to the member cluster.

Kutsyk Vasyl


XiShanYongYe-Chang · 2024-05-09T01:35:12Z

In my opinion, if the resource synchronization fails, it is not easy to determine whether the failure is caused by the OP. In other words, the two are decoupled. The OP is applied to the Work resource, and synchronization to the member cluster is a later step that operates on that Work resource. Can you give an example of a failure caused by the OP?

kutsyk added the kind/question label on Apr 30, 2024

kutsyk mentioned this issue May 2, 2024

How to detect deviation from the baseline and alert based on it #4861

Closed
