WIP: cluster-health API path and patronictl command #1452

Open
wants to merge 28 commits into base: master
Conversation

@mbanck (Contributor) commented Mar 19, 2020

This PR tries to improve monitoring by defining an overall cluster health status, similar to e.g. etcdctl cluster-health. It considers a cluster healthy if: (i) there is a leader, (ii) every follower is replicating from it on the same timeline, and (iii) the replication lag is smaller than maximum_lag_on_failover.

This adds an API path /cluster_health that returns 200 if the overall cluster
is considered healthy and 5xx if not. If no leader exists the return code is
503, otherwise 500.

Also, a `cluster-health` command is added that exposes this in patronictl.
If the cluster is healthy, patronictl exits with exit code 0. If a leader exists
but the cluster is otherwise unhealthy, the exit status is 1. If no leader exists,
the exit status is 2.
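
In pseudo-code, the intended mapping from the API response to the patronictl exit status is roughly the following. This is an illustrative sketch only, not the PR's implementation; the HTTP client and the port (8008 is Patroni's default REST API port) are assumptions, and the wording of the "no leader" message is a guess.

import sys
import requests  # assumption: any HTTP client would do

status = requests.get('http://localhost:8008/cluster_health').status_code
if status == 200:
    print('cluster is healthy')
    sys.exit(0)
elif status == 500:
    print('cluster has leader but is not healthy')
    sys.exit(1)
else:  # 503: no leader
    print('cluster is not healthy: no leader')
    sys.exit(2)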

Examples:


# patronictl -c /etc/patroni/11-test.yml list
+---------+--------+-----------------+----------+---------+----+-----------+
| Cluster | Member |       Host      |   Role   |  State  | TL | Lag in MB |
+---------+--------+-----------------+----------+---------+----+-----------+
| 11-test |  pg1   | 192.168.122.206 |  Leader  | running |  3 |           |
| 11-test |  pg2   |  192.168.122.71 | Follower | running |  3 |         0 |
| 11-test |  pg3   | 192.168.122.225 | Follower | running |  3 |         0 |
+---------+--------+-----------------+----------+---------+----+-----------+
# patronictl -c /etc/patroni/11-test.yml cluster-health
cluster is healthy

# patronictl -c /etc/patroni/11-test.yml list
+---------+--------+-----------------+---------------+---------+----+-----------+
| Cluster | Member |       Host      |      Role     |  State  | TL | Lag in MB |
+---------+--------+-----------------+---------------+---------+----+-----------+
| 11-test |  pg1   | 192.168.122.206 | Uninitialized | stopped |    |   unknown |
| 11-test |  pg2   |  192.168.122.71 |     Leader    | running | 14 |           |
| 11-test |  pg3   | 192.168.122.225 | Uninitialized | stopped |    |   unknown |
+---------+--------+-----------------+---------------+---------+----+-----------+
# patronictl -c /etc/patroni/11-test.yml cluster-health
cluster has leader but is not healthy
# echo $?
1

Resolved review threads: patroni/postgresql/__init__.py (outdated), patroni/api.py (outdated), patroni/ctl.py (outdated), patroni/utils.py
if m.name == leader_name:
continue
if m.data.get('timeline', '') != leader_tl or int(m.data.get('lag', 0)) > maximum_lag_on_failover:
logger.warning('cluster is not healthy: timeline mismatch in member %s', m.name)
Collaborator:

The warning is misleading: the member could in fact be on the same timeline while replication is lagging.
It is also possible that the replica is tagged with nofailover and recovery_min_apply_delay is set to some high value.
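
For example, the combined condition from the diff above could be split so that each failure logs its own reason (illustrative sketch only; the variable names follow the diff and this is not a tested implementation):

if m.data.get('timeline', '') != leader_tl:
    logger.warning('cluster is not healthy: timeline mismatch in member %s', m.name)
    return 500
if int(m.data.get('lag', 0)) > maximum_lag_on_failover:
    logger.warning('cluster is not healthy: replication lag in member %s', m.name)
    return 500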

Contributor Author:

About the former, I would say the cluster isn't very healthy in this case, but the warning message could be improved to mention the maximum lag as well.

About the latter, I will need to think a bit more about this; maybe in this case the user cannot expect this feature to work correctly.

Contributor Author:

I tried to check for the nofailover tag now and, if it is set, not let replication lag on that node render the cluster unhealthy. However, I had a hard time testing this, as Patroni apparently uses sent_lsn or flush_lsn for the lag and not apply_lsn (which can easily be simulated via recovery_min_apply_delay).

Also, during testing, it seemed like Patroni would assign Sync Standby to nodes with the tag nofailover: true, is that possible? Maybe I broke something in my branch, I'll test it on master.

Contributor Author:

The second paragraph is addressed by #2089/#2108

Resolved review thread (outdated): patroni/utils.py
mbanck and others added 21 commits October 16, 2021 10:33
This adds an API path /cluster_health that returns 200 if the overall cluster
is considered healthy and 503 if not. Also, a `cluster-health' command is
added that exposes this in patronictl. If the cluster is not healthy,
patronictl exits with exit code 1.
It could be that a replica is not replicating properly from the leader;
however, this was not directly caught so far (only once the lag was getting too
large). To improve the cluster-health check, get the leader's `member' API
response and check how many rows the `replication' object has. If that number
differs from the number of replicas, consider the cluster as being unhealthy
(a sketch of this check follows the commit list below).

In passing, also log the reason for the failing cluster-health at warning
level.
Only set it to replica if the node is in archive recovery, set it to empty
otherwise. In cluster_as_json(), return the actual role and not just replica
for all non-leader nodes. Finally, explicitly print replica nodes as 'Follower'
in patronictl list.
If a leader exists, but not all members are healthy, exit with exit status 1.
If no leader exists, exit with exit status 2. This is supposed to be analogous
to Nagios' WARNING (exit status 1) and CRITICAL (exit status 2).
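
A rough sketch of the replication-count check described in the commit messages above (illustrative only; leader_response stands for the parsed JSON of the leader's member API response, and the surrounding names mirror the diff elsewhere in this PR rather than a tested implementation):

replication = leader_response.get('replication', [])  # streaming connections the leader reports
replicas = [m for m in cluster.members if m.name != leader_name]
if len(replication) != len(replicas):
    logger.warning('cluster is not healthy: %s of %s replicas are replicating from the leader',
                   len(replication), len(replicas))
    return 500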
@mbanck (Contributor Author) commented Oct 17, 2021

Hrm, that merge looks weird

@mbanck (Contributor Author) commented Oct 17, 2021

> Hrm, that merge looks weird

That looks better after a force-push

patroni/utils.py Outdated
Comment on lines 576 to 1127
if 'tags' in m.data and 'nofailover' in m.data.get('tags'):
nofailover = True
if int(m.data.get('lag', 0)) > maximum_lag_on_failover and not nofailover:
logger.warning('cluster is not healthy: replication lag in member %s', m.name)
return 500
follower_roles = ("replica", "sync_standby")
if m.data.get('role', '') not in follower_roles or m.data.get('state', '') != 'running':
logger.warning('cluster is not healthy: member %s does not have follower role or is not in state running', m.name)
return 500
Collaborator:

There is no lag in Member objects. It seems that you mixed something up with the cluster object built by cluster_as_json():

patroni/utils.py (lines 412 to 418 in 47ebda0):

if m.name == leader_name:
config = cluster.config.data if cluster.config and cluster.config.modify_index else {}
role = 'standby_leader' if is_standby_cluster(config.get('standby_cluster')) else 'leader'
elif m.name in cluster.sync.members:
role = 'sync_standby'
else:
role = 'replica'

The same applies to role: in the Member object it is never "sync_standby", but cluster_as_json() sets it for non-leader members:

patroni/utils.py (lines 431 to 436 in 47ebda0):

if lsn is None:
member['lag'] = 'unknown'
elif cluster_lsn >= lsn:
member['lag'] = cluster_lsn - lsn
else:
member['lag'] = 0
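
In other words, a health check built on these fields would need to consume the member dicts produced by cluster_as_json() rather than Member.data. A minimal sketch, assuming cluster_as_json() returns a dict with a 'members' list carrying the 'role', 'lag' and 'name' keys set in the quoted code (not a tested implementation):

cluster_json = cluster_as_json(cluster)
for member in cluster_json['members']:
    if member['role'] in ('replica', 'sync_standby'):
        lag = member.get('lag', 'unknown')
        if lag == 'unknown' or lag > maximum_lag_on_failover:
            logger.warning('cluster is not healthy: replication lag in member %s', member['name'])
            return 500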

Contributor Author:

About lag, I tried to address that in credativ@259fe82

About role being sync_standby: you complained about role possibly being sync_standby in an earlier review in March 2020; should I revert credativ@dc087e7?

This fixes the call to get_dcs() to add Citus groups and adds documentation, as
well as updating the method definition in line with the other commands
@coveralls commented:

Pull Request Test Coverage Report for Build 6718999066

  • 8 of 75 (10.67%) changed or added relevant lines in 3 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.5%) to 99.361%

Changes Missing Coverage   Covered Lines   Changed/Added Lines   %
patroni/api.py             1               2                     50.0%
patroni/ctl.py             5               20                    25.0%
patroni/utils.py           2               53                    3.77%

Totals (Coverage Status):
Change from base Build 6691679870: -0.5%
Covered Lines: 13212
Relevant Lines: 13297

💛 - Coveralls
