Checkable: Don't recalculate `next_check` for remotely generated `cr` #10011

yhabteab · 2024-02-29T10:37:52Z

Currently, processing a CheckResult first triggers an OnNextCheckChanged event, which is sent to all connected endpoints. Then, when Checkable#ProcessCheckResult() returns, an OnNewCheckResult event is fired, which is also sent to all connected endpoints.

Next, the other endpoints receive the event::SetNextCheck cluster event followed by event::CheckResult and invoke Checkable#SetNextCheck() and Checkable#ProcessCheckResult() with the newly received check result. In doing so, they also recalculate the next check themselves and overwrite the next check timestamp previously received from the source endpoint. Since each endpoint calculates it relative to time#now (recomputing the next check relative to the last check/cr#schedule_end does not work for active checks either, as each endpoint randomly initialises its own scheduling offset), the recalculated next check always differs by a fraction of a second between endpoints. As a consequence, two Icinga DB HA instances generate different checksums for the same state, which causes the state histories to be fully resynchronised after a takeover/Icinga 2 reload.
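The idea behind the change can be illustrated with a minimal sketch. Everything here is a simplified, hypothetical model (`CheckableModel`, `MessageOrigin`, and the scheduling formula are assumptions for illustration, not the real Icinga 2 classes or the actual diff): the next check is only recalculated when the check result has no origin, i.e. when it was generated locally.

```cpp
#include <cassert>

// Hypothetical stand-in: a non-null origin means the result came from another endpoint.
struct MessageOrigin {};

// Hypothetical, heavily simplified model of a checkable.
struct CheckableModel {
    double nextCheck = 0.0;         // Unix timestamp of the next scheduled check
    double checkInterval = 60.0;    // seconds
    double schedulingOffset = 0.0;  // differs per endpoint (random in the real daemon)

    void SetNextCheck(double ts) { nextCheck = ts; }

    // `now` is the local time at which this endpoint processes the result.
    void ProcessCheckResult(double now, const MessageOrigin* origin)
    {
        assert(checkInterval > 0);

        if (!origin) {
            // Locally generated result: this endpoint owns the schedule.
            // (Assumed formula, only meant to show the dependency on local time and offset.)
            nextCheck = now + checkInterval + schedulingOffset;
        }
        // Remotely generated result: keep the timestamp that was synced via
        // SetNextCheck(), so every endpoint ends up with the same value.
    }
};
```

With this guard, a peer that first applies the synced timestamp and then processes the same result with an origin keeps exactly the value computed by the source endpoint, regardless of its own clock or scheduling offset.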

Before

~ diff <(curl -sSku root:icinga 'https://localhost:5666/v1/objects/hosts/icinga2?pretty=1') <(curl -sSku root:icinga 'https://localhost:5665/v1/objects/hosts/icinga2?pretty=1')
94,95c94,95
<                 "next_check": 1709202335.3899999,
<                 "next_update": 1709202350.4348836,
---
>                 "next_check": 1709202334.593121,
>                 "next_update": 1709202349.6380048,
99,100c99,100
<                 "package": "_cluster",
<                 "paused": false,
---
>                 "package": "_etc",
>                 "paused": true,
110c110
<                     "path": "/Users/yhabteab/Workspace/icinga2-2/prefix/var/lib/icinga2/api/zones/master/_etc/hosts.conf"
---
>                     "path": "/Users/yhabteab/Workspace/icinga2/prefix/etc/icinga2/zones.d/master/hosts.conf"

After

~ diff <(curl -sSku root:icinga 'https://localhost:5666/v1/objects/hosts/icinga2?pretty=1') <(curl -sSku root:icinga 'https://localhost:5665/v1/objects/hosts/icinga2?pretty=1')
99,100c99,100
<                 "package": "_cluster",
<                 "paused": false,
---
>                 "package": "_etc",
>                 "paused": true,
110c110
<                     "path": "/Users/yhabteab/Workspace/icinga2-2/prefix/var/lib/icinga2/api/zones/master/_etc/hosts.conf"
---
>                     "path": "/Users/yhabteab/Workspace/icinga2/prefix/etc/icinga2/zones.d/master/hosts.conf"

lib/icinga/checkable-check.cpp

…enrated check Currently, when processing a `CheckResult`, it will first trigger an `OnNextCheckChanged` event, which is sent to all connected endpoints. Then, when `Checkable::ProcessCheckResult()` returns, an `OnNewCheckResult` event is fired, which is also sent to all connected endpoints. Next, the other endpoints receive the `event::SetNextCheck` cluster event followed by `event::CheckResult` and invoke `Checkable#SetNextCheck()` and `Checkable#ProcessCheckResult()` with the newly received check result. So they also try to recalculate the next check themselves and invalidate the previously received next check timestamp from the source endpoint. Since each endpoint randomly initialises its own scheduling offset, the recalculated next check will always differ by a fraction of a second on each of them. As a consequence, two Icinga DB HA instances will generate two different checksums for the same state, which causes the state histories to be fully resynchronised after a takeover/Icinga 2 reload.

Al2Klimov

Does this work with CRs from command endpoints?

yhabteab · 2024-04-04T11:36:00Z

> Does this work with CRs from command endpoints?

How do these CRs differ from any other remotely generated ones? I don't see why they shouldn't work.

Al2Klimov · 2024-04-04T14:13:09Z

A command endpoint is NOT in the zone of the checkable it checks. So it would send a 'next check changed' message along with the CR, but the master would ignore it:

Discarding 'next check changed' message for checkable '...' from '...': Unauthorized access

And the master wouldn't update the next check by itself, as !origin is false: the CR's origin is the command endpoint.

julianbrost · 2024-04-04T14:23:03Z

> A command endpoint is NOT in the zone of the checkable it checks.

It may or may not be in that zone. You can also use command_endpoint to pin the check execution to a particular node within a HA zone.

yhabteab · 2024-04-04T15:14:56Z

> > A command endpoint is NOT in the zone of the checkable it checks.
>
> It may or may not be in that zone. You can also use command_endpoint to pin the check execution to a particular node within a HA zone.

Why should the master discard the updates? If it generally does not accept updates from that zone, how do you think the expected CR will be processed then? When that particular endpoint is responsible for executing the checks and generating CRs of a given checkable, under no circumstances will the master reject these updates.

Al2Klimov · 2024-04-04T16:01:22Z

ClusterEvents::CheckResultAPIHandler() allows checkable zone, its parents and the command endpoint:

icinga2/lib/icinga/clusterevents.cpp

Line 171 in 9e31b8b

if (origin->FromZone && !origin->FromZone->CanAccessObject(checkable) && endpoint != checkable->GetCommandEndpoint()) {

ClusterEvents::NextCheckChangedAPIHandler() allows checkable zone and its parents:

icinga2/lib/icinga/clusterevents.cpp

Line 236 in 9e31b8b

if (origin->FromZone && !origin->FromZone->CanAccessObject(checkable)) {
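The difference between the two conditions quoted above can be modelled in a few lines. This is only a sketch under stated assumptions: `ZoneModel`, `Sender`, `Target`, `CanAccess`, and both `Accepts…` functions are hypothetical names, and `CanAccess` assumes that a zone may access objects in its own zone and in its child zones.

```cpp
#include <cassert>
#include <string>

// Hypothetical, simplified zone: only a name and a parent pointer.
struct ZoneModel {
    std::string name;
    const ZoneModel* parent;
};

// Assumed semantics: `from` may access the checkable if it is the
// checkable's zone or one of that zone's parents.
bool CanAccess(const ZoneModel* from, const ZoneModel& checkableZone)
{
    for (const ZoneModel* z = &checkableZone; z; z = z->parent) {
        if (z == from)
            return true;
    }
    return false;
}

struct Sender { const ZoneModel* zone; std::string endpoint; };
struct Target { const ZoneModel* zone; std::string commandEndpoint; };

// Mirrors the first quoted condition: the command endpoint is an extra exception.
bool AcceptsCheckResult(const Sender& s, const Target& t)
{
    assert(s.zone && t.zone);
    return CanAccess(s.zone, *t.zone) || s.endpoint == t.commandEndpoint;
}

// Mirrors the second quoted condition: no exception for the command endpoint.
bool AcceptsNextCheckChanged(const Sender& s, const Target& t)
{
    assert(s.zone && t.zone);
    return CanAccess(s.zone, *t.zone);
}
```

For a checkable in a master zone whose command endpoint lives in a child zone, this model accepts the check result from that endpoint but rejects its 'next check changed' message, which is the situation described in the thread.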

yhabteab · 2024-04-05T07:15:51Z

> ClusterEvents::CheckResultAPIHandler() allows checkable zone, its parents and the command endpoint:

But that receiver doesn't pass the origin to ProcessCheckResult() if the message is a result of a command endpoint.

icinga2/lib/icinga/clusterevents.cpp

Lines 178 to 179 in 9e31b8b

    
if (!checkable->IsPaused() && Zone::GetLocalZone() == checkable->GetZone() && endpoint == checkable->GetCommandEndpoint())
	checkable->ProcessCheckResult(cr);
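A small model of this dispatch may make the consequence clearer. The types and the `else` branch are assumptions for illustration (only the `if` branch is quoted above): when the sender is the checkable's command endpoint, the result is processed without an origin, so the receiving node treats it like a locally generated result and computes the next check itself.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the message origin.
struct OriginModel { std::string fromEndpoint; };

// Hypothetical, simplified checkable.
struct HostModel {
    std::string zone;
    std::string commandEndpoint;
    bool paused;
    bool rescheduledLocally;  // records whether the next check was recalculated here

    void ProcessCheckResult(const OriginModel* origin)
    {
        // In this sketch, "no origin" is what allows local rescheduling.
        rescheduledLocally = (origin == nullptr);
    }
};

void HandleCheckResult(HostModel& host, const std::string& localZone, const OriginModel& origin)
{
    assert(!origin.fromEndpoint.empty());

    if (!host.paused && localZone == host.zone && origin.fromEndpoint == host.commandEndpoint) {
        // Result of a command endpoint check: origin is dropped on purpose.
        host.ProcessCheckResult(nullptr);
    } else {
        // Assumed fallback: any other remote result keeps its origin.
        host.ProcessCheckResult(&origin);
    }
}
```

Under these assumptions, a rejected 'next check changed' message from an out-of-zone command endpoint does not leave the schedule stale, because the active node in the checkable's zone reschedules on its own.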

Al2Klimov · 2024-04-05T08:02:57Z

Please test it with an out-of-zone command endpoint, just to be sure.

yhabteab added the bug, area/distributed, and area/checks labels Feb 29, 2024

yhabteab requested review from julianbrost and Al2Klimov February 29, 2024 10:37

cla-bot bot added the cla/signed label Feb 29, 2024

yhabteab added this to the 2.15.0 milestone Feb 29, 2024

Al2Klimov reviewed Mar 4, 2024

yhabteab requested a review from Al2Klimov April 2, 2024 14:33

Al2Klimov removed their request for review April 2, 2024 15:14

yhabteab force-pushed the next-check-cluster-sync-issue branch from dac3b95 to 503c2d2 Compare April 4, 2024 09:25

yhabteab force-pushed the next-check-cluster-sync-issue branch from 503c2d2 to c3f27e6 Compare April 4, 2024 09:26

yhabteab requested a review from Al2Klimov April 4, 2024 09:27

Al2Klimov reviewed Apr 4, 2024
