Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Checkable: Don't recalculate next_check for remotely generated cr #10011

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

yhabteab
Copy link
Member

@yhabteab yhabteab commented Feb 29, 2024

Currently, when processing a CheckResult, it will first trigger an OnNextCheckChanged event, which is sent to all connected endpoints. Then, when Checkable#ProcessCheckResult() returns, an OnNewCheckResult event is fired, which is of course also sent to all connected endpoints.

Next, the other endpoints receive the event::SetNextCheck cluster event followed by event::CheckResultand invoke
checkable#SetNextCheck() and Checkable#ProcessCheckResult() with the newly received check. So they also try to recalculate the next check themselves and invalidate the previously received next check timestamp from the source endpoint. Since each endpoint calculates it relative to time#now (recomputing the next check relative to the last check/cr#schedule_end does not work for active checks either, as each endpoint randomly initialises its own scheduling offset), the recalculated next check will always differ by a split second/millisecond on each of them. As a consequence, two Icinga DB HA instances will generate two different checksums for the same state and causes the state histories to be fully resynchronised after a takeover/Icinga 2 reload.

Before

~ diff <(curl -sSku root:icinga 'https://localhost:5666/v1/objects/hosts/icinga2?pretty=1') <(curl -sSku root:icinga 'https://localhost:5665/v1/objects/hosts/icinga2?pretty=1')
94,95c94,95
<                 "next_check": 1709202335.3899999,
<                 "next_update": 1709202350.4348836,
---
>                 "next_check": 1709202334.593121,
>                 "next_update": 1709202349.6380048,
99,100c99,100
<                 "package": "_cluster",
<                 "paused": false,
---
>                 "package": "_etc",
>                 "paused": true,
110c110
<                     "path": "/Users/yhabteab/Workspace/icinga2-2/prefix/var/lib/icinga2/api/zones/master/_etc/hosts.conf"
---
>                     "path": "/Users/yhabteab/Workspace/icinga2/prefix/etc/icinga2/zones.d/master/hosts.conf"

After

~ diff <(curl -sSku root:icinga 'https://localhost:5666/v1/objects/hosts/icinga2?pretty=1') <(curl -sSku root:icinga 'https://localhost:5665/v1/objects/hosts/icinga2?pretty=1')
99,100c99,100
<                 "package": "_cluster",
<                 "paused": false,
---
>                 "package": "_etc",
>                 "paused": true,
110c110
<                     "path": "/Users/yhabteab/Workspace/icinga2-2/prefix/var/lib/icinga2/api/zones/master/_etc/hosts.conf"
---
>                     "path": "/Users/yhabteab/Workspace/icinga2/prefix/etc/icinga2/zones.d/master/hosts.conf"

@yhabteab yhabteab added bug Something isn't working area/distributed Distributed monitoring (master, satellites, clients) area/checks Check execution and results labels Feb 29, 2024
@cla-bot cla-bot bot added the cla/signed label Feb 29, 2024
@yhabteab yhabteab added this to the 2.15.0 milestone Feb 29, 2024
@yhabteab yhabteab requested a review from Al2Klimov April 2, 2024 14:33
@Al2Klimov Al2Klimov removed their request for review April 2, 2024 15:14
@yhabteab yhabteab force-pushed the next-check-cluster-sync-issue branch from dac3b95 to 503c2d2 Compare April 4, 2024 09:25
…enrated check

Currently, when processing a `CheckResult`, it will first trigger an
`OnNextCheckChanged` event, which is sent to all connected endpoints.
Then, when `Checkable::ProcessCheckResult()` returns, an `OnCheckResult`
event is fired, which is of course also sent to all connected endpoints.

Next, the other endpoints receive the `event::SetNextCheck` cluster
event followed by `event::CheckResult`and invoke
`checkable#SetNextCheck()` and `Checkable#CheckResult()` with the newly
received check. So they also try to recalculate the next check
themselves and invalidate the previously received next check timestamp
from the source endpoint. Since each endpoint randomly initialises its
own scheduling offset, the recalculated next check will always differ by
a split second/millisecond on each of them. As a consequence, two Icinga
DB HA instances will generate two different checksums for the same state
and causes the state histories to be fully resynchronised after a
takeover/Icinga 2 reload.
@yhabteab yhabteab force-pushed the next-check-cluster-sync-issue branch from 503c2d2 to c3f27e6 Compare April 4, 2024 09:26
@yhabteab yhabteab requested a review from Al2Klimov April 4, 2024 09:27
Copy link
Member

@Al2Klimov Al2Klimov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this work with CRs from command endpoints?

@yhabteab
Copy link
Member Author

yhabteab commented Apr 4, 2024

Does this work with CRs from command endpoints?

How do these CRs differ from any other remote generated ones? I don't see why they shouldn't work.

@Al2Klimov
Copy link
Member

A command endpoint is NOT in the zone of the checkable it checks. So it would send a 'next check changed' message along with the CR, but the master would ignore it: Discarding 'next check changed' message for checkable '...' from '...': Unauthorized access And the master wouldn't update the next check by itself as !origin is false. The CR's origin is the command endpoint.

@julianbrost
Copy link
Contributor

A command endpoint is NOT in the zone of the checkable it checks.

It may or may not be in that zone. You can also use command_endpoint to pin the check execution to a particular node within a HA zone.

@yhabteab
Copy link
Member Author

yhabteab commented Apr 4, 2024

A command endpoint is NOT in the zone of the checkable it checks.

It may or may not be in that zone. You can also use command_endpoint to pin the check execution to a particular node within a HA zone.

Why should the master discard the updates? If it generally does not accept updates from that zone, how do you think the expected CR will be processed then? When that particular endpoint is responsible for executing the checks and generating CRs of a given checkable, under no circumstances will the master reject these updates.

@Al2Klimov
Copy link
Member

  • ClusterEvents::CheckResultAPIHandler() allows checkable zone, its parents and the command endpoint:
    if (origin->FromZone && !origin->FromZone->CanAccessObject(checkable) && endpoint != checkable->GetCommandEndpoint()) {
  • ClusterEvents::NextCheckChangedAPIHandler() allows checkable zone and its parents:
    if (origin->FromZone && !origin->FromZone->CanAccessObject(checkable)) {

@yhabteab
Copy link
Member Author

yhabteab commented Apr 5, 2024

  • ClusterEvents::CheckResultAPIHandler() allows checkable zone, its parents and the command endpoint:

But that receiver doesn't pass the origin to ProcessCheckResult() if the message is a result of a command endpoint.

if (!checkable->IsPaused() && Zone::GetLocalZone() == checkable->GetZone() && endpoint == checkable->GetCommandEndpoint())
checkable->ProcessCheckResult(cr);

@Al2Klimov
Copy link
Member

Please test it with out-of-zone command endpoint, just to be sure.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/checks Check execution and results area/distributed Distributed monitoring (master, satellites, clients) bug Something isn't working cla/signed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants