
Spurious streaming state if rogue Postgres server is running #2792

Open

mbanck opened this issue Jul 31, 2023 · 2 comments
@mbanck
Contributor

mbanck commented Jul 31, 2023

What happened?

While trying out some ways to break Patroni and restore from backup, I ended up in a state where an old PostgreSQL process from a previous Patroni instance was still running, so Patroni could not start its own Postgres:

Jul 31 18:35:03 ld-pg-06 patroni@15-test[24131]: 2023-07-31 18:35:03,687 INFO: Lock owner: ld-pg-04; I am ld-pg-06
Jul 31 18:35:03 ld-pg-06 patroni@15-test[24131]: 2023-07-31 18:35:03,792 INFO: starting as a secondary
Jul 31 18:35:03 ld-pg-06 patroni@15-test[24131]: 2023-07-31 18:35:03,794 INFO: Process 16496 is not postmaster, too much difference between PID file start time 1690476426.67 and process start time 1690476423
Jul 31 18:35:03 ld-pg-06 patroni@15-test[24131]: 2023-07-31 18:35:03,794 INFO: closed patroni connection to the postgresql cluster
Jul 31 18:35:03 ld-pg-06 patroni@15-test[24131]: 2023-07-31 18:35:03,795 INFO: Process 16496 is not postmaster, too much difference between PID file start time 1690476426.67 and process start time 1690476423
Jul 31 18:35:04 ld-pg-06 patroni@15-test[24131]: 2023-07-31 18:35:04,097 INFO: Process 16496 is not postmaster, too much difference between PID file start time 1690476426.67 and process start time 1690476423
Jul 31 18:35:04 ld-pg-06 patroni@15-test[24131]: 2023-07-31 18:35:04,098 INFO: Telling pg_ctl that it is safe to ignore postmaster.pid for process 16496
Jul 31 18:35:04 ld-pg-06 patroni@15-test[26419]: 2023-07-31 18:35:04.247 CEST [26419] FATAL:  pre-existing shared memory block (key 21902726, ID 11) is still in use
Jul 31 18:35:04 ld-pg-06 patroni@15-test[26419]: 2023-07-31 18:35:04.247 CEST [26419] HINT:  Terminate any old server processes associated with data directory "/var/lib/postgresql/15/test".

The patronictl list output oscillates between crashed and streaming, the latter because Patroni can log in to the rogue Postgres instance locally via the unix socket and query pg_stat_get_wal_receiver().
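
For illustration, a minimal sketch (not Patroni's actual code; it assumes psycopg2 and the default /var/run/postgresql socket directory) of why such a local check can be fooled:

import psycopg2

# Connecting via the unix socket reaches whichever postmaster owns that
# socket, including a rogue one left over from a previous instance.
conn = psycopg2.connect(host='/var/run/postgresql', port=5432, dbname='postgres')
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute('SELECT status FROM pg_catalog.pg_stat_get_wal_receiver()')
    row = cur.fetchone()
    # 'streaming' only proves that some server on this socket has a WAL
    # receiver, not that it is the instance Patroni manages.
    print(row[0] if row else None)
conn.close()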

This is probably a rare corner case, but I wanted to mention it in case something can be done about it, e.g. via some global state that knows Postgres was not started successfully, so streaming cannot be true?

Note that the second standby did not recover from the torture below, so it might be timing-related:

Jul 31 18:21:02 ld-pg-05 patroni@15-test[22620]: 2023-07-31 18:21:02,605 ERROR: No initialize key in DCS and PostgreSQL is running as replica, aborting start
Jul 31 18:21:02 ld-pg-05 patroni@15-test[22620]: 2023-07-31 18:21:02,605 ERROR: Please first start Patroni on the node running as primary

How can we reproduce it (as minimally and precisely as possible)?

  1. Put Patroni into maintenance mode
  2. Remove the Patroni cluster from the DCS via patronictl remove
  3. Stop Patroni on all nodes
  4. Remove the dynamic JSON Patroni cache from the data directory of the primary
  5. Start Patroni on all nodes

What did you expect to happen?

Not sure, but I would expect that the new streaming state is only shown if we are positively sure that our own Postgres instance is, in fact, streaming.

Patroni/PostgreSQL/DCS version

  • Patroni version: 3.0.4
  • PostgreSQL version: 15
  • DCS (and its version): etcd 3.3.25

Patroni configuration file

.

patronictl show-config

loop_wait: 10
maximum_lag_on_failover: 1048576
postgresql:
  create_replica_methods:
  - pgbackrest
  - basebackup
  parameters:
    max_connections: 100
    max_wal_senders: 10
  pg_hba:
  - local   all             all                                     peer
  - host    all             all             127.0.0.1/32            scram-sha-256
  - host    all             all             ::1/128                 scram-sha-256
  - host    all             all             10.0.3.216/24               scram-sha-256
  - local   replication     all                                     peer
  - host    replication     all             127.0.0.1/32            scram-sha-256
  - host    replication     all             ::1/128                 scram-sha-256
  - host    replication     all             10.0.3.216/24               scram-sha-256
  - host    all             postgres_rewind 10.0.3.216/24               scram-sha-256
  pgbackrest:
    command: /usr/bin/pgbackrest --stanza=15-test --log-level-file=detail --delta restore
    keep_data: true
    no_params: true
  recovery_conf:
    restore_command: pgbackrest --stanza=15-test archive-get %f %p
  remove_data_directory_on_diverged_timelines: true
  remove_data_directory_on_rewind_failure: true
  use_pg_rewind: true
retry_timeout: 10
ttl: 30

Patroni log files

see above

PostgreSQL log files

see above

Have you tried to use GitHub issue search?

  • Yes

Anything else we need to know?

No response

mbanck added the bug label Jul 31, 2023
@CyberDem0n
Collaborator

Jul 31 18:35:03 ld-pg-06 patroni@15-test[24131]: 2023-07-31 18:35:03,794 INFO: Process 16496 is not postmaster, too much difference between PID file start time 1690476426.67 and process start time 1690476423

I think this behavior was already discussed a few times (for example #811).
Patroni needs to know whether Postgres is running for a given PGDATA.
There are two different cases:

  1. When Patroni starts Postgres on its own, it doesn't use pg_ctl but executes the postgres binary directly, so it knows the postmaster PID and later just checks whether exactly this process is still alive.
  2. If Postgres is already running, things get more complex:
    • It needs to check whether $PGDATA/postmaster.pid exists
    • from the file it gets the PID of the postmaster process
    • it checks whether a process with this PID exists in the system
    • and, most importantly, it needs to identify whether that process really is a running Postgres

For the last step we compare the start time from postmaster.pid (it is written there by postgres) with the creation time of the process.
If the difference is bigger than 3 seconds, we assume it might not be Postgres but something else (quite possible in containerized environments), and as a result Patroni attempts to start Postgres, which obviously fails.
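
A rough sketch of that check, assuming the psutil library (the names below are hypothetical and simplified for illustration, not Patroni's actual code):

import os
import psutil

TOLERANCE = 3.0  # allowed skew in seconds between the two timestamps

def looks_like_postmaster(pgdata):
    """Return True if postmaster.pid points at a plausible postmaster."""
    try:
        with open(os.path.join(pgdata, 'postmaster.pid')) as f:
            lines = f.read().splitlines()
        pid = int(lines[0])              # line 1: postmaster PID
        pidfile_start = float(lines[2])  # line 3: start time written by postgres
        proc = psutil.Process(pid)       # raises NoSuchProcess if the PID is gone
        # The OS reports when the process was created; postgres reports when
        # it thinks it started. For a real postmaster they should be close.
        return abs(pidfile_start - proc.create_time()) <= TOLERANCE
    except (OSError, ValueError, psutil.NoSuchProcess):
        return False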

But right after starting Postgres, Patroni already tries to open a connection and run monitoring queries, which obviously succeeds against the rogue instance, therefore you sometimes get streaming.

Maybe you have a good idea of how we can better identify a running Postgres (the solution should be cross-platform)?

@CyberDem0n
Collaborator

CyberDem0n commented Aug 10, 2023

@mbanck it seems that in your environment it is fairly easy to reproduce the problem.
Could you please compare the postmaster start time from the postmaster.pid file (it is on the third line) with the value returned by the pg_catalog.pg_postmaster_start_time() function?
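
For example, one possible way to collect the two values (a hypothetical helper, assuming psycopg2 and a local socket connection; the PGDATA path is taken from the logs above):

import psycopg2

pgdata = '/var/lib/postgresql/15/test'
with open(pgdata + '/postmaster.pid') as f:
    pidfile_start = float(f.read().splitlines()[2])  # third line: epoch start time

conn = psycopg2.connect(host='/var/run/postgresql', dbname='postgres')
with conn.cursor() as cur:
    cur.execute('SELECT extract(epoch FROM pg_catalog.pg_postmaster_start_time())')
    server_start = float(cur.fetchone()[0])
conn.close()

print('postmaster.pid start time:', pidfile_start)
print('pg_postmaster_start_time():', server_start)
print('difference:', server_start - pidfile_start)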
