WIP: cluster-health API path and patronictl command #1452

Open
wants to merge 28 commits into base: master
Conversation

@mbanck (Contributor) commented Mar 19, 2020

This PR tries to improve monitoring by defining an overall cluster health status, similar to e.g. etcdctl cluster-health. It considers a cluster healthy if: (i) there is a leader, (ii) every follower is replicating from it on the same timeline, and (iii) the replication lag is smaller than maximum_lag_on_failover.

This adds an API path /cluster_health that returns 200 if the overall cluster
is considered healthy and 5xx if not. If no leader exists the return code is
503, otherwise 500.

Also, a `cluster-health` command is added that exposes this in patronictl.
If the cluster is healthy, patronictl exits with exit code 0. If a leader exists
but the cluster is otherwise unhealthy, the exit status is 1. If no leader exists,
the exit status is 2.
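
In pseudo-code, the intended mapping from the API response to the patronictl exit status is roughly the following. This is an illustrative sketch only, not the PR's implementation; the HTTP client and the port (8008 is Patroni's default REST API port) are assumptions, and the wording of the "no leader" message is a guess.

import sys
import requests  # assumption: any HTTP client would do

status = requests.get('http://localhost:8008/cluster_health').status_code
if status == 200:
    print('cluster is healthy')
    sys.exit(0)
elif status == 500:
    print('cluster has leader but is not healthy')
    sys.exit(1)
else:  # 503: no leader
    print('cluster is not healthy: no leader')
    sys.exit(2)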

Examples:


# patronictl -c /etc/patroni/11-test.yml list
+---------+--------+-----------------+----------+---------+----+-----------+
| Cluster | Member |       Host      |   Role   |  State  | TL | Lag in MB |
+---------+--------+-----------------+----------+---------+----+-----------+
| 11-test |  pg1   | 192.168.122.206 |  Leader  | running |  3 |           |
| 11-test |  pg2   |  192.168.122.71 | Follower | running |  3 |         0 |
| 11-test |  pg3   | 192.168.122.225 | Follower | running |  3 |         0 |
+---------+--------+-----------------+----------+---------+----+-----------+
# patronictl -c /etc/patroni/11-test.yml cluster-health
cluster is healthy

# patronictl -c /etc/patroni/11-test.yml list
+---------+--------+-----------------+---------------+---------+----+-----------+
| Cluster | Member |       Host      |      Role     |  State  | TL | Lag in MB |
+---------+--------+-----------------+---------------+---------+----+-----------+
| 11-test |  pg1   | 192.168.122.206 | Uninitialized | stopped |    |   unknown |
| 11-test |  pg2   |  192.168.122.71 |     Leader    | running | 14 |           |
| 11-test |  pg3   | 192.168.122.225 | Uninitialized | stopped |    |   unknown |
+---------+--------+-----------------+---------------+---------+----+-----------+
# patronictl -c /etc/patroni/11-test.yml cluster-health
cluster has leader but is not healthy
# echo $?
1

Resolved review threads: patroni/postgresql/__init__.py (outdated), patroni/api.py (outdated), patroni/ctl.py (outdated), patroni/utils.py
if m.name == leader_name:
continue
if m.data.get('timeline', '') != leader_tl or int(m.data.get('lag', 0)) > maximum_lag_on_failover:
logger.warning('cluster is not healthy: timeline mismatch in member %s', m.name)
Collaborator:

The warning is misleading: the member could in fact be on the same timeline while replication is lagging.
It is also possible that the replica is tagged with nofailover and recovery_min_apply_delay is set to some high value.
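
For example, the combined condition from the diff above could be split so that each failure logs its own reason (illustrative sketch only; the variable names follow the diff and this is not a tested implementation):

if m.data.get('timeline', '') != leader_tl:
    logger.warning('cluster is not healthy: timeline mismatch in member %s', m.name)
    return 500
if int(m.data.get('lag', 0)) > maximum_lag_on_failover:
    logger.warning('cluster is not healthy: replication lag in member %s', m.name)
    return 500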

Contributor Author:

About the former, I would say the cluster isn't very healthy in this case, but the warning message could be improved to mention the maximum lag as well.

About the latter, I will need to think a bit more about this; maybe in this case the user cannot expect this feature to work correctly.

Contributor Author:

I tried to check for the nofailover tag now and, if it is set, not let replication lag on that node render the cluster unhealthy. However, I had a hard time testing this, as Patroni apparently uses sent_lsn or flush_lsn for the lag and not apply_lsn (which can easily be simulated via recovery_min_apply_delay).

Also, during testing, it seemed like Patroni would assign Sync Standby to nodes with the tag nofailover: true, is that possible? Maybe I broke something in my branch, I'll test it on master.

Contributor Author:

The second paragraph is addressed by #2089/#2108

Resolved review thread (outdated): patroni/utils.py
mbanck and others added 21 commits October 16, 2021 10:33
This adds an API path /cluster_health that returns 200 if the overall cluster
is considered healthy and 503 if not. Also, a `cluster-health' command is
added that exposes this in patronictl. If the cluster is not healthy,
patronictl exits with exit code 1.
It could be that a replica is not replicating properly from the leader;
however, this was not directly caught so far (only once the lag was getting too
large). To improve the cluster-health check, get the leader's `member' API
response and check how many rows the `replication' object has. If that number
differs from the number of replicas, consider the cluster as being unhealthy
(a sketch of this check follows the commit list below).

In passing, also log the reason for the failing cluster-health at warning
level.
Only set it to replica if the node is in archive recovery, set it to empty
otherwise. In cluster_as_json(), return the actual role and not just replica
for all non-leader nodes. Finally, explicitly print replica nodes as 'Follower'
in patronictl list.
If a leader exists, but not all members are healthy, exit with exit status 1.
If no leader exists, exit with exit status 2. This is supposed to be analogous
to Nagios' WARNING (exit status 1) and CRITICAL (exit status 2).
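
A rough sketch of the replication-count check described in the commit messages above (illustrative only; leader_response stands for the parsed JSON of the leader's member API response, and the surrounding names mirror the diff elsewhere in this PR rather than a tested implementation):

replication = leader_response.get('replication', [])  # streaming connections the leader reports
replicas = [m for m in cluster.members if m.name != leader_name]
if len(replication) != len(replicas):
    logger.warning('cluster is not healthy: %s of %s replicas are replicating from the leader',
                   len(replication), len(replicas))
    return 500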
@mbanck (Contributor Author) commented Oct 17, 2021

Hrm, that merge looks weird

@mbanck (Contributor Author) commented Oct 17, 2021

> Hrm, that merge looks weird

That looks better after a force-push

patroni/utils.py Outdated
Comment on lines 576 to 1127
if 'tags' in m.data and 'nofailover' in m.data.get('tags'):
nofailover = True
if int(m.data.get('lag', 0)) > maximum_lag_on_failover and not nofailover:
logger.warning('cluster is not healthy: replication lag in member %s', m.name)
return 500
follower_roles = ("replica", "sync_standby")
if m.data.get('role', '') not in follower_roles or m.data.get('state', '') != 'running':
logger.warning('cluster is not healthy: member %s does not have follower role or is not in state running', m.name)
return 500
Collaborator:

There is no lag in Member objects. It seems that you mixed something up with the cluster object built by cluster_as_json():

patroni/utils.py (lines 412 to 418 in 47ebda0):

if m.name == leader_name:
config = cluster.config.data if cluster.config and cluster.config.modify_index else {}
role = 'standby_leader' if is_standby_cluster(config.get('standby_cluster')) else 'leader'
elif m.name in cluster.sync.members:
role = 'sync_standby'
else:
role = 'replica'

The same applies to role: in the Member object it is never "sync_standby", but cluster_as_json() sets it for non-leader members:

patroni/utils.py (lines 431 to 436 in 47ebda0):

if lsn is None:
member['lag'] = 'unknown'
elif cluster_lsn >= lsn:
member['lag'] = cluster_lsn - lsn
else:
member['lag'] = 0
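
In other words, a health check built on these fields would need to consume the member dicts produced by cluster_as_json() rather than Member.data. A minimal sketch, assuming cluster_as_json() returns a dict with a 'members' list carrying the 'role', 'lag' and 'name' keys set in the quoted code (not a tested implementation):

cluster_json = cluster_as_json(cluster)
for member in cluster_json['members']:
    if member['role'] in ('replica', 'sync_standby'):
        lag = member.get('lag', 'unknown')
        if lag == 'unknown' or lag > maximum_lag_on_failover:
            logger.warning('cluster is not healthy: replication lag in member %s', member['name'])
            return 500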

Contributor Author:

About lag, I tried to address that in credativ@259fe82

About role being sync_standby: you complained about role possibly being sync_standby in an earlier review in March 2020; should I revert credativ@dc087e7?

This fixes the call to get_dcs() to add Citus groups and adds documentation, as
well as updating the method definition in line with the other commands
@coveralls commented:

Pull Request Test Coverage Report for Build 6718999066

  • 8 of 75 (10.67%) changed or added relevant lines in 3 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.5%) to 99.361%

Changes Missing Coverage   Covered Lines   Changed/Added Lines   %
patroni/api.py             1               2                     50.0%
patroni/ctl.py             5               20                    25.0%
patroni/utils.py           2               53                    3.77%

Totals (Coverage Status):
Change from base Build 6691679870: -0.5%
Covered Lines: 13212
Relevant Lines: 13297

💛 - Coveralls
