Add check for writeability on master #665
Comments
So, first, do you understand why most users don't actually want this kind of check? Second, I think we have a hook location for adding your own extra healthchecks. @CyberDem0n ?
Yes. Not everyone likes someone messing around with the database with RW requests. We have a custom test for a simple RW check: insert/update/select/delete in a special table which is pretty small but still exists. So I prefer the second option, where the user can specify an additional healthcheck/reaction if required. It becomes the user's headache to define which cases require a failover and which are just maintenance. Maybe some of these checks and reactions could be shipped with Patroni but disabled by default.
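The insert/update/select/delete probe described above can be sketched against any DB-API connection. This is illustrative only: the table name `rw_probe` is made up, and the demo uses an in-memory SQLite connection, whereas against a Patroni-managed Postgres you would pass a psycopg2 connection instead.

```python
import sqlite3

def rw_check(conn) -> bool:
    """Probe read-write availability via a tiny dedicated table:
    insert a row, update it, read it back, delete it, commit.
    Any failure (read-only mode, full disk, ...) yields False."""
    try:
        cur = conn.cursor()
        cur.execute(
            "CREATE TABLE IF NOT EXISTS rw_probe (id INTEGER PRIMARY KEY, v INTEGER)"
        )
        cur.execute("INSERT INTO rw_probe (v) VALUES (1)")
        cur.execute("UPDATE rw_probe SET v = 2")
        cur.execute("SELECT v FROM rw_probe")
        ok = cur.fetchone()[0] == 2
        cur.execute("DELETE FROM rw_probe")
        conn.commit()
        return ok
    except Exception:
        conn.rollback()
        return False

# Demo against an in-memory SQLite database (stand-in for Postgres):
conn = sqlite3.connect(":memory:")
```

Note that, as the discussion below points out, a successful INSERT still doesn't prove the data was durably persisted (fsync problems are invisible to this probe).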
Can you describe some specific failure scenarios where the master is not responding correctly and failing over would help the situation? In the full-disk example, in a typical deployment all nodes have the same amount of disk, so failing over is unlikely to improve availability. More likely you will then have a full disk on all nodes and, if implemented correctly, a cascade of failovers across all nodes; if not, a neverending cycle of failovers. Partial hardware failures (I/O errors or slowness, etc.) could, and I think should, be monitored and handled outside of Patroni, probably by removing the whole node from the cluster. To turn a node read-only, an external agent could already set Patroni's nofailover tag and trigger a failover.
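An external agent along these lines could trigger the failover through Patroni's REST API. A minimal sketch, with the caveat that the host/port and node names are hypothetical placeholders; check the Patroni REST API documentation for the exact body your version expects:

```python
import json
import urllib.request

def build_failover_request(api_base: str, leader: str, candidate: str = None):
    """Prepare (but do not send) a POST to Patroni's /failover endpoint.
    api_base, leader and candidate are placeholders for your cluster;
    the caller (e.g. an external monitoring agent) decides when to fire it."""
    body = {"leader": leader}
    if candidate:
        body["candidate"] = candidate
    return urllib.request.Request(
        api_base.rstrip("/") + "/failover",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Example (request is built but not sent):
req = build_failover_request("http://127.0.0.1:8008", "pg-node-1", "pg-node-2")
```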
In some cases (logging configuration errors, core dumps, leftover parts of a backup, another application's errors, WALs stuck because of an abandoned replication slot, etc.) we can see a full disk only on the master. We do check whether the slave has enough free space, to prevent failing over from a full master to a full slave.
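The guard described above (only switch over when the master cannot serve writes AND the candidate replica actually has headroom) can be sketched as a pure decision function. The function name and the 90% threshold are illustrative, not Patroni code:

```python
def should_switchover(master_rw_ok: bool,
                      master_disk_used_pct: float,
                      replica_disk_used_pct: float,
                      threshold_pct: float = 90.0) -> bool:
    """Switch over only if the master is unwritable, its disk is
    (almost) full, and the replica still has free space. Otherwise a
    failover would just move the problem onto the replica."""
    master_full = master_disk_used_pct >= threshold_pct
    replica_full = replica_disk_used_pct >= threshold_pct
    return (not master_rw_ok) and master_full and not replica_full
```

This also makes the cascade argument from the thread concrete: when both nodes are near the threshold, the function refuses to fail over rather than bouncing the master between two equally full machines.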
In general, I agree with @ants. I don't really like the idea of turning Patroni into a monitoring agent which also performs reactive feedback actions.
@CyberDem0n we don't want this as a core feature. However, is there a hook for adding things to the availability check without forking Patroni? If not, we should add one. |
@jberkus But I really don't like having such functionality, because it would become the biggest foot-shotgun ever... It seems that we are putting together a lot of different things here:
Right now, on every iteration of the HA loop, Patroni already verifies the health of the node. Most of the checks for the above-mentioned conditions would have to be executed not only on the master but on all nodes in the cluster; otherwise you can really get a cascade of failovers. How are you going to run DDL/DML on replicas? Even the ability to perform an INSERT doesn't really give us a guarantee that the data will be persisted to disk (problems with fsync). In the vast majority of cases (except disk space issues), a "failing" node must be removed from the cluster and replaced with a new one. That action should be done either by some person or by external tooling, rather than by Patroni.

And regarding disk space issues: there could be dozens of different reasons why disk space gets eaten, and in my opinion none of them deserves an "automated" switchover, because in the end you'll get into the same situation on the replica and have yet another switchover...
Today I had an issue with failover: someone forgot to consume data from a replication slot, and Postgres started acting funny after the disk filled up. It starts and tries to enter recovery mode; after that it fails because it cannot create temporary files. Patroni kept trying to start it for two days, until it generated enough garbage to prevent Postgres from creating a lock file during start. Only after that did a failover occur and a new master get elected. This case has two sides:
We have custom scripts which check every 30 seconds whether Postgres is available for RW operations, and perform a switchover if Postgres cannot serve RW requests and the disk is almost full. Patroni does not check whether the master is available for RW operations, and it keeps the same master even though Postgres cannot perform a single insert.
So I am wondering whether Patroni could check if Postgres is available for RW operations and handle some scenarios itself (like a full disk), or allow users to specify a callback when certain events occur.
In the first option, Patroni would check whether Postgres is available for RW operations, consider other parameters (someone put Postgres in recovery mode manually for a backup or something like that), and perform a switchover onto a healthy replica if possible.
In the second option, Patroni would check whether Postgres is available for RW operations and call a callback with parameters like a connection to the DB, occurrence count, state transition (RW>RO / RO>RW), and uptime.
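The second option's callback could carry exactly the parameters listed above. A hypothetical interface sketch; none of these names exist in Patroni, this is only what the proposal might look like:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class RWEvent:
    """Hypothetical payload for a user-supplied RW-availability callback."""
    dsn: str               # connection info for the probed database
    occurrence_count: int  # consecutive times this state was observed
    transition: str        # "RW>RO" or "RO>RW"
    uptime_sec: float      # how long postgres has been up

def dispatch(event: RWEvent, callback: Callable[[RWEvent], None]) -> None:
    # Patroni-side glue would build the event and hand it to the user hook.
    callback(event)

# Example: a user hook that just records events it receives.
seen = []
dispatch(RWEvent("host=db1", 3, "RW>RO", 7200.0), seen.append)
```

Passing an `occurrence_count` lets the user hook debounce, e.g. only react after N consecutive RW failures instead of on the first blip.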