Issue encountered while adding script for split-brain prevention #846

Open
seunofk opened this issue Feb 21, 2024 · 1 comment

seunofk commented Feb 21, 2024

Hello.

I want to create a script that lets repmgrd detect a network interface card (NIC) failure and then perform follow-up actions.

If the primary DB loses its NIC connection for 10 seconds, I want the primary DB to be forcibly stopped and the standby DB to be promoted to take over.

However, although the standby is promoted, the primary DB does not stop.

Is the repmgrd daemon unable to detect a NIC disconnection?

  1. repmgr version: 5.3.3
  2. PostgreSQL version: 15.3

Script:

#!/bin/bash

PRIMARY_IP="10.12.30.191"
STANDBY_IP="10.12.30.192"
REPMGR_CONFIG="/postgres15/app/postgres/etc/repmgr.conf"
PGLOG="/pglog/repmgrd.log"

function echodate() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')]"
}

# Function to stop PostgreSQL on primary server
function stop_primary_db() {
    echo "$(echodate) [FAILOVER] Stopping primary PostgreSQL database" >> "$PGLOG"
    repmgr -f "$REPMGR_CONFIG" node service --action=stop
}

# Check if primary server needs to be shut down
ping -c 1 -W 10 "$PRIMARY_IP" > /dev/null 2>&1
ping_exit_code=$?
if [ $ping_exit_code -ne 0 ]; then
    # Ping to primary server timed out or failed, stop PostgreSQL and exit
    stop_primary_db
    exit 0
fi

# No failover condition met, exit
echo "$(echodate) [FAILOVER] No failover condition met, continuing normal operation" >> "$PGLOG"
exit 0

repmgr.conf:

election_rerun_interval=10
# =============================================================================
# Required configuration items
# =============================================================================
node_id=2
node_name='postgresdb192'
conninfo='host=postgresdb192 user=repmgr dbname=postgres connect_timeout=2'
data_directory='/postgres15/data'

#------------------------------------------------------------------------------
# Replication settings
#------------------------------------------------------------------------------
use_replication_slots=yes

#------------------------------------------------------------------------------
# Logging settings
#------------------------------------------------------------------------------
log_level=INFO
log_facility=STDERR
log_file='/pglog/repmgrd.log'

#------------------------------------------------------------------------------
# Environment/command settings
#------------------------------------------------------------------------------
pg_bindir='/postgres15/app/postgres/bin'

#------------------------------------------------------------------------------
# external command options
#------------------------------------------------------------------------------
pg_ctl_options='-s -l /dev/null'
ssh_options='-q -o ConnectTimeout=10'

#------------------------------------------------------------------------------
# Standby follow settings
#------------------------------------------------------------------------------
primary_follow_timeout=60

#------------------------------------------------------------------------------
# Failover and monitoring settings (repmgrd)
#------------------------------------------------------------------------------
failover=automatic
priority=100
reconnect_attempts=3
reconnect_interval=5
promote_command='repmgr standby promote -f /postgres15/app/postgres/etc/repmgr.conf --log-to-file'
follow_command='repmgr standby follow -f /postgres15/app/postgres/etc/repmgr.conf -W --upstream-node-id=%n --log-to-file'
monitoring_history=true
failover_validation_command='/postgres15/app/postgres/etc/failover.sh'
election_rerun_interval=10
#degraded_monitoring_timeout=-1
stephan-hahn commented

Hi,
How do you execute your script?
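
The reason for asking: as far as I understand the repmgr documentation, repmgrd runs failover_validation_command on the promotion candidate (the standby), and `repmgr node service` acts on the local node, so a stop issued from that script would not reach the primary. A check that stops the primary has to run on the primary host itself. Below is a minimal sketch of such a check, assuming it is invoked on the primary by something like cron or a systemd timer. The function names, `PEER_IP`, and `CHECK_ATTEMPTS` are hypothetical and not part of repmgr; the probe is a plain ping, and the timing is only approximately 10 seconds.

```shell
#!/bin/bash
# Sketch only: meant to run on the PRIMARY host, not as
# failover_validation_command on the standby.
# PEER_IP, CHECK_ATTEMPTS and all function names are illustrative assumptions.

PEER_IP="${PEER_IP:-10.12.30.192}"        # standby address from the original script
CHECK_ATTEMPTS="${CHECK_ATTEMPTS:-10}"    # consecutive failed probes required
REPMGR_CONFIG="${REPMGR_CONFIG:-/postgres15/app/postgres/etc/repmgr.conf}"

# Default probe: a single ping with a 1-second timeout.
probe_peer() {
    ping -c 1 -W 1 "$PEER_IP" > /dev/null 2>&1
}

# Return 0 only if the probe fails on every one of $1 attempts.
# $2 is the probe command (default: probe_peer).
# $3 is the pause in seconds between attempts (default: 1).
peer_unreachable_for() {
    local attempts="$1" probe="${2:-probe_peer}" pause="${3:-1}" i=0
    while [ "$i" -lt "$attempts" ]; do
        if "$probe"; then
            return 1    # peer answered at least once: do not fence
        fi
        sleep "$pause"
        i=$((i + 1))
    done
    return 0
}

# Stop the local PostgreSQL instance if the peer stayed unreachable throughout.
fence_if_isolated() {
    if peer_unreachable_for "$CHECK_ATTEMPTS"; then
        repmgr -f "$REPMGR_CONFIG" node service --action=stop
    fi
}

# Example invocation (commented out so that sourcing this file has no side effects):
# fence_if_isolated
```

Pinging only the standby cannot tell a local NIC failure apart from a standby outage, so a real check would probably probe a second address such as the default gateway before stopping anything.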
You could also use child_nodes_connected_min_count to manage more types of failures.
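
For illustration, the related settings would go into repmgr.conf on the primary, roughly as sketched below. The values are examples only and the script path is hypothetical. As I read the repmgr documentation, repmgrd on the primary runs child_nodes_disconnect_command once fewer than child_nodes_connected_min_count child nodes have been connected for child_nodes_disconnect_timeout seconds, so please check the exact semantics against the documentation for your version.

```ini
# repmgr.conf on the primary node (sketch; values are illustrative)
child_nodes_check_interval=5
child_nodes_connected_min_count=1
child_nodes_disconnect_timeout=10
# Hypothetical script that stops the local PostgreSQL instance
child_nodes_disconnect_command='/postgres15/app/postgres/etc/fence_primary.sh'
```

This only works if repmgrd is also running on the primary node.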

Stephan
