
DB is corrupted if the slave is on a different timeline from the master but has the same timeline ID #1251

Open
Vanav opened this issue Oct 25, 2019 · 4 comments

Comments

Vanav commented Oct 25, 2019

Steps to reproduce

  1. db1 is the master, db2 is a slave. Both are on timeline ID 1 (db1).

  2. A split brain happens: network links degrade, the cluster loses majority on both db1 and db2 and is not accessible. db1 demotes itself to read-only.

  3. I am required to do a manual failover to db2. I stop Patroni on db2 and manually promote db2 to be the new master. A new timeline ID 2 (db2) is created. I have to stop Patroni on db2 because I don't want Patroni to convert db2 back to a slave later, after the connection is restored.

  4. The connection is restored and the cluster becomes available. db1 is promoted back to master. A new timeline ID 2 (db1) is created.
    Now I have two masters on different timelines but with the same timeline ID: db2, the real new master after the manual failover, and db1, which now needs to be converted to a slave.

  5. I stop Patroni on db1 (to release the cluster leader key), start Patroni on db2 (it grabs the leader key), and start Patroni on db1 (it will become a slave).

  6. Patroni on db1 compares the timeline IDs, sees no difference, and starts replication, which corrupts db1. This happens because the timelines really are different, but Patroni and Postgres cannot detect it.

Workaround

At step 5, before starting Patroni on db1, I have to manually demote db1 and then promote it again to create a new timeline ID 3 (db1), as sketched below. Now Patroni notices that the timelines are different and runs pg_rewind, which correctly synchronizes db1.
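
A minimal sketch of this manual demote/promote on Postgres 9.6, assuming Patroni is already stopped on db1; the data directory path is hypothetical:

$ PGDATA=/var/lib/postgresql/9.6/main                     # hypothetical path
$ pg_ctl -D "$PGDATA" stop -m fast                        # stop the old master cleanly
$ echo "standby_mode = 'on'" > "$PGDATA/recovery.conf"    # restart it as a standby
$ pg_ctl -D "$PGDATA" start
$ pg_ctl -D "$PGDATA" promote                             # promotion always picks a new timeline ID

Promotion selects the next unused timeline ID, so after this db1 is on timeline ID 3 and no longer collides with db2's timeline ID 2.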

Questions

  1. Is there anything Patroni could improve at step 6? Is there a way to differentiate timeline ID 2 (db1) from timeline ID 2 (db2)?

  2. Do I need to stop Patroni on db2 at step 2? Will the cluster accept the new master db2, or will it force the old master db1 to stay?

  3. Is there a better way to start a new timeline manually (see the workaround)?

May be related to #890.
Patroni 1.6.0, Postgres 9.6.

Logs and details

db2 timelines:

103     1512/B7711B88   no recovery target specified ← “timeline ID 1 (db1)”, step 2
104     1512/B7A83DE0   no recovery target specified ← “timeline ID 2 (db2)”, step 3

db1 timelines:

103     1512/B7711B88   no recovery target specified ← “timeline ID 1 (db1)”, step 2
104     1512/B7A95EA0   no recovery target specified ← “timeline ID 2 (db1)”, step 4

(Note that the “104” entries carry different LSNs on the two nodes, 1512/B7A83DE0 vs 1512/B7A95EA0: the histories have already diverged even though the timeline IDs match.)

pg_rewind on db1 at step 6:

source and target cluster are on the same timeline
no rewind required

pg_rewind on db1 after manual timeline creation (workaround):

servers diverged at WAL position 1512/B7A83DE0 on timeline 104
rewinding from last common checkpoint at 1512/B79F3630 on timeline 104

Error messages of corrupted db1:

LOG:  invalid resource manager ID 48 at 1512/B7B35648
LOG:  record with incorrect prev-link 100E100/100F000 at 14E7/C4E3E98
PANIC:  could not locate a valid checkpoint record

Comparison of pg_controldata on db2 and db1 at step 4 (only the differences):

[screenshot of the pg_controldata diff]

CyberDem0n (Collaborator) commented

Hi @Vanav,

Yes, this is a problem. When both instances are on the same timeline, pg_rewind will not do anything (even on pg12):

$ /usr/lib/postgresql/12/bin/pg_rewind -D data/pg12.1 -n -P --source-server="dbname=postgres port=5433"
pg_rewind: connected to server
pg_rewind: source and target cluster are on the same timeline
pg_rewind: no rewind required

Patroni knows this, and therefore doesn't even try to execute pg_rewind.

If you look into the history file, it becomes clear that the timelines are diverging:

$ diff -u data/pg12.{1,2}/pg_wal/00000002.history 
--- data/pg12.1/pg_wal/00000002.history 2019-10-25 08:11:30.438195187 +0200
+++ data/pg12.2/pg_wal/00000002.history 2019-10-25 08:10:53.758962843 +0200
@@ -1 +1 @@
-1      0/3011A10       no recovery target specified
+1      0/30000D8       no recovery target specified

In theory, we could implement a workaround in Patroni, i.e. always compare the history files and advance the node that must be rewound to a new timeline. But I am not sure we should do that, because you got into this situation artificially, by stopping Patroni on db2 and manually promoting Postgres. Your actions were not based on quorum, so you also had to take care of fencing the old primary (db1). That wasn't done, and you ended up in this weird situation.
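
For illustration only, here is a minimal sketch of such a check in shell, comparing the full history files instead of only the latest timeline ID (the paths and file name are hypothetical; 00000068 is timeline 104 in hex):

$ diff -u db1/pg_xlog/00000068.history db2/pg_xlog/00000068.history \
    && echo "histories match" \
    || echo "histories diverge despite equal timeline IDs: rewind or reclone needed"

In Patroni this would mean first fetching the primary's history file, e.g. via the replication protocol's TIMELINE_HISTORY command, and refusing to start replication when the files differ.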

Vanav (Author) commented Oct 25, 2019

Patroni knows this, and therefore doesn't even try to execute pg_rewind.

That is OK. But instead of running pg_rewind, Patroni starts replication from a foreign timeline, and that corrupts the DB.

I think it would be a good idea to add a timeline comparison to Patroni, like the one pg_rewind does (just skip the step where it compares the latest timeline IDs first, and do the full comparison of the timeline histories, as the pg_rewind code does).

I know that I had to do manual disaster recovery and that I am responsible for all of this, but I would like Patroni to help me with recovery and automate some of its stages. At the very least it should fail rather than corrupt the DB.

I can imagine a second, slightly different scenario:

  1. db1 is isolated, the cluster is not accessible from db1, and it is demoted to read-only.
  2. The cluster is accessible from db2, so Patroni promotes it to master.
  3. I am required to manually fail over back to db1. I manually promote it to master.
  4. After the connection is restored, I want to keep the master on db1. I stop Patroni on db2, start Patroni on db1, then start Patroni on db2.
  5. Patroni on db2 tries to start replication from db1, and this corrupts the DB.

kripper commented May 23, 2020

Maybe create another issue with a (non-critical) feature request and close this one. The title is scary, and the issue has been open for too long.

@Vanav should confirm that he forced Patroni to skip quorum and caused the DB corruption accidentally.

Vanav (Author) commented May 23, 2020

I agree with you: this issue is not critical. But I do have real-life cases where I need to skip quorum for disaster recovery and then start Patroni, which leads to corruption. I suggest adding an extra safety check that compares the full timeline histories (not just the TimelineID).
