WIP: cluster-health API path and patronictl command #1452
base: master
Conversation
if m.name == leader_name:
    continue
if m.data.get('timeline', '') != leader_tl or int(m.data.get('lag', 0)) > maximum_lag_on_failover:
    logger.warning('cluster is not healthy: timeline mismatch in member %s', m.name)
The warning is misleading: the member could in fact be on the same timeline while replication is merely lagging. It is also possible that the replica is tagged with `nofailover` and `recovery_min_apply_delay` is set to a high value.
About the former, I would say the cluster isn't very healthy in this case, but the warning message could be improved to mention the maximum lag.
About the latter, I will need to think a bit more about this; maybe in this case the user cannot expect this feature to work correctly.
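A minimal sketch of the clearer warning suggested here (logger setup and values are illustrative stand-ins, not the actual Patroni code):

```python
import logging

logger = logging.getLogger(__name__)

# Illustrative stand-ins for the variables used in the real check.
member_name, lag, maximum_lag_on_failover = 'pg-node-2', 20000000, 1048576

# Distinguish a lag problem from a timeline mismatch and include the limit.
logger.warning('cluster is not healthy: replication lag %s of member %s '
               'exceeds maximum_lag_on_failover (%s)',
               lag, member_name, maximum_lag_on_failover)
```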
I tried to check for the `nofailover` tag now, and if it is set, replication lag on that node does not render the cluster unhealthy. However, I had a hard time testing this, as Patroni apparently uses `sent_lsn` or `flush_lsn` for the lag and not `apply_lsn` (which can easily be simulated via `recovery_min_apply_delay`).
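For reference, the various positions can be inspected per replica on the leader; a minimal sketch using psycopg2 (connection parameters are made up):

```python
import psycopg2

# recovery_min_apply_delay only holds back replay_lsn (the apply position);
# sent_lsn and flush_lsn keep advancing, so a lag based on them stays small.
conn = psycopg2.connect(host='127.0.0.1', dbname='postgres', user='postgres')
with conn.cursor() as cur:
    cur.execute('SELECT application_name, sent_lsn, flush_lsn, replay_lsn '
                'FROM pg_stat_replication')
    for name, sent, flush, replay in cur.fetchall():
        print(name, sent, flush, replay)
conn.close()
```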
Also, during testing, it seemed like Patroni would assign Sync Standby to nodes with the tag `nofailover: true`; is that possible? Maybe I broke something in my branch, I'll test it on master.
This adds an API path /cluster_health that returns 200 if the overall cluster is considered healthy and 503 if not. Also, a `cluster-health` command is added that exposes this in patronictl. If the cluster is not healthy, patronictl exits with exit code 1.
It could be that a replica is not replicating properly from the leader; however, this was not directly caught so far (only when the lag was getting too large). To improve the cluster-health check, get the leader's `member` API response and check how many rows the `replication` object has. If that number is different from the number of replicas, consider the cluster as being unhealthy. In passing, also log the reason for the failing cluster-health at warning level.
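A rough sketch of that replication-count check (URL, endpoint, and the expected count are assumptions for illustration):

```python
import requests

# The leader's member API response contains one entry per streaming
# replica in its 'replication' list (derived from pg_stat_replication).
info = requests.get('http://leader-host:8008/patroni', timeout=5).json()
streaming = len(info.get('replication', []))
expected = 2  # hypothetical: number of non-leader members known to the DCS
if streaming != expected:
    print('cluster is not healthy: only %d of %d replicas are streaming'
          % (streaming, expected))
```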
Only set it to replica if the node is in archive recovery, set it to empty otherwise. In cluster_as_json(), return the actual role and not just replica for all non-leader nodes. Finally, explicitly print replica nodes as 'Follower' in patronictl list.
If a leader exists, but not all members are healthy, exit with exit status 1. If no leader exists, exit with exit status 2. This is supposed to be analogous to Nagios' WARNING (exit status 1) and CRITICAL (exit status 2).
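A minimal sketch of that exit-code mapping (function and variable names are made up for illustration):

```python
import sys

def health_exit_code(has_leader, all_members_healthy):
    """Map cluster state to Nagios-style exit codes."""
    if not has_leader:
        return 2  # CRITICAL: no leader
    if not all_members_healthy:
        return 1  # WARNING: leader exists, but some member is unhealthy
    return 0      # OK

if __name__ == '__main__':
    sys.exit(health_exit_code(has_leader=True, all_members_healthy=False))
```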
…ss them on in the API
Hrm, that merge looks weird
That looks better after a force-push
patroni/utils.py (Outdated)
if 'tags' in m.data and 'nofailover' in m.data.get('tags'):#, {}).items():
    nofailover = True
if int(m.data.get('lag', 0)) > maximum_lag_on_failover and not nofailover:
    logger.warning('cluster is not healthy: replication lag in member %s', m.name)
    return 500
follower_roles = ("replica", "sync_standby")
if m.data.get('role', '') not in follower_roles or m.data.get('state', '') != 'running':
    logger.warning('cluster is not healthy: member %s does not have follower role or is not in state running', m.name)
    return 500
There is no `lag` in Member objects. It seems that you mixed something up with the cluster object built by cluster_as_json():
Lines 412 to 418 in 47ebda0
if m.name == leader_name:
    config = cluster.config.data if cluster.config and cluster.config.modify_index else {}
    role = 'standby_leader' if is_standby_cluster(config.get('standby_cluster')) else 'leader'
elif m.name in cluster.sync.members:
    role = 'sync_standby'
else:
    role = 'replica'
The same applies to `role`. In the Member object it is never "sync_standby", but cluster_as_json() sets it for non-leaders:
Lines 431 to 436 in 47ebda0
if lsn is None:
    member['lag'] = 'unknown'
elif cluster_lsn >= lsn:
    member['lag'] = cluster_lsn - lsn
else:
    member['lag'] = 0
About `lag`, I tried to address that in credativ@259fe82.
About role being `sync_standby`: you complained about role possibly being `sync_standby` in an earlier review in March 2020. Should I revert credativ@dc087e7?
Force-pushed from 259fe82 to ef428e7
This fixes the call to get_dcs() to add Citus groups and adds documentation, as well as updating the method definition along with the other commands.
This PR tries to improve monitoring by defining an overall cluster-health, similar to e.g. `etcdctl cluster-health`. This considers a cluster healthy if (i) there is a leader, (ii) every follower is replicating from it on the same timeline, and (iii) the replication lag is smaller than `maximum_lag_on_failover`.

This adds an API path `/cluster_health` that returns 200 if the overall cluster is considered healthy and 5xx if not: if no leader exists, the return code is 503, otherwise 500.
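As an illustration, a monitoring script could consume the new endpoint like this (host and port are assumptions for a local REST API):

```python
import requests

resp = requests.get('http://127.0.0.1:8008/cluster_health', timeout=5)
if resp.status_code == 200:
    print('cluster is healthy')
elif resp.status_code == 503:
    print('cluster has no leader')
else:  # 500 per the description: leader exists, but cluster is unhealthy
    print('cluster is unhealthy')
```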
Also, a `cluster-health` command is added that exposes this in patronictl. If the cluster is healthy, patronictl exits with exit code 0. If a leader exists but the cluster is otherwise unhealthy, the exit status is 1. If no leader exists, the exit status is 2.
Examples:
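A hypothetical session (only the command and exit codes are specified above; the output wording is illustrative):

```
$ patronictl -c postgres.yml cluster-health; echo "exit status: $?"
cluster is healthy
exit status: 0
```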