Execute global health checks only on the local instance #225

Open
s4heid opened this issue Dec 24, 2022 · 0 comments
Labels: enhancement (New feature or request)

@s4heid (Contributor) commented Dec 24, 2022

Is there an existing issue for this?

  • I have searched the existing issues

The Problem

When NeonBee runs clustered, health checks are divided into node-specific and global health checks. If health information is requested via the HealthCheckHandler, node-specific checks are executed on every single node and the results are consolidated in the HealthCheckRegistry. The current implementation, however, also executes the global checks in the same way, because there is no differentiation between the two types of checks. All results of a global check are compared in the consolidateResults(...) method of the HealthCheckRegistry, and only the first one is added to the list of consolidated checks (which is returned by the HealthCheckHandler).

This is a problem because, in a large cluster, global checks are executed on every single node in parallel. Depending on the type of check (e.g. a health request to an external service), this can cause high load on the external service.

Another issue with this implementation is that the NeonBee nodes do not necessarily share the same configuration and thus might not all be able to perform the health check. In that case, the consolidateResults method has to deal with redundant and potentially conflicting data. Therefore, it needs to be clear to the user on which node a global check is performed, so that the required configuration can be set up there.
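
To illustrate the consolidation behaviour described above, here is a minimal sketch of a "first result wins" merge over the per-node results. The class, method and parameter names are illustrative and not the actual HealthCheckRegistry code; only the Vert.x JSON types are real:

import io.vertx.core.json.JsonArray;
import io.vertx.core.json.JsonObject;

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ConsolidationSketch {

    /**
     * Merges the check results returned by every HealthCheckVerticle. For each
     * check id only the first result encountered is kept, so for a global check
     * the reported status depends on which verticle registered first.
     */
    static JsonArray consolidateResults(List<JsonArray> perNodeResults) {
        Map<String, JsonObject> consolidated = new LinkedHashMap<>();
        for (JsonArray nodeResults : perNodeResults) {
            for (Object entry : nodeResults) {
                JsonObject check = (JsonObject) entry;
                // putIfAbsent: later (possibly conflicting) results for the same id are discarded
                consolidated.putIfAbsent(check.getString("id"), check);
            }
        }
        return new JsonArray(List.copyOf(consolidated.values()));
    }
}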

Desired Solution

A better implementation would execute a global check only once. It would be sufficient to execute the check on the local node which invokes the health check handler. I think, for now, we do not have to make it configurable on which node the check is executed, but this might be something to keep in mind for the future in case there is demand.
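
To illustrate the idea, here is a minimal sketch of the dispatch decision (purely illustrative; the global flag and all names below are assumptions, not the actual NeonBee API): node-specific checks are still executed on every node, but a global check only runs on the node that serves the health request.

import java.util.List;
import java.util.stream.Collectors;

public class GlobalCheckDispatchSketch {

    /** Illustrative model of a registered health check. */
    record HealthCheck(String id, boolean global) { }

    /**
     * Determines which of the registered checks a node should execute for an
     * incoming health request: all node-specific checks, plus the global checks
     * only if this node is the one handling the request locally.
     */
    static List<HealthCheck> checksToExecute(List<HealthCheck> registered,
            boolean handlesRequestLocally) {
        return registered.stream()
                .filter(check -> !check.global() || handlesRequestLocally)
                .collect(Collectors.toList());
    }
}

With this split, a global check such as service.feature-flags.health would hit the external service once per health request instead of once per node.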

Alternative Solutions

No response

Additional Context

To give more detail about the current implementation, here is some log output that would be generated in HealthCheckRegistry.sendDataRequests(...) when logging the data object returned by the data request sent to each HealthCheckVerticle. Assuming there are three verticles in a cluster with a global check service.feature-flags.health:

Retrieved check from neonbee/_healthCheckVerticle-5404d541-7fe0-44a8-bed4-c60af882453b with data: [ {
  "id" : "cluster.hazelcast",
  "status" : "UP",
  "data" : {
    "clusterState" : "ACTIVE",
    "clusterSize" : 3,
    "lifecycleServiceState" : "ACTIVE"
  }
}, {
  "id" : "service.feature-flags.health",
  "status" : "UP",
  "data" : {
    "statusCode" : 200,
    "latencyMillis" : 23,
    "statusMessage" : "UP"
  }
} ]

Retrieved check from neonbee/_healthCheckVerticle-0ac2c75e-7e63-4e29-b919-1304b023521b with data: [ {
  "id" : "cluster.hazelcast",
  "status" : "UP",
  "data" : {
    "clusterState" : "ACTIVE",
    "clusterSize" : 3,
    "lifecycleServiceState" : "ACTIVE"
  }
}, {
  "id" : "service.feature-flags.health",
  "status" : "UP",
  "data" : {
    "statusCode" : 200,
    "latencyMillis" : 26,
    "statusMessage" : "UP"
  }
} ]

Retrieved check from neonbee/_healthCheckVerticle-69fc018f-eaaf-499f-acce-82a0752dc919 with data: [ {
  "id" : "cluster.hazelcast",
  "status" : "UP",
  "data" : {
    "clusterState" : "ACTIVE",
    "clusterSize" : 3,
    "lifecycleServiceState" : "ACTIVE"
  }
}, {
  "id" : "service.feature-flags.health",
  "status" : "DOWN",
  "data" : {
    "cause" : "Could not fetch credentials for basic authentication"
  }
} ]

Here, NeonBee would always report the status of the HealthCheckVerticle that registered first in the shared map; the other check results are discarded. Also notice that the node running neonbee/_healthCheckVerticle-69fc018f-eaaf-499f-acce-82a0752dc919 is not set up to authenticate against the service. If this verticle registered first, its failing status would always be returned.
