Document what happens if you start pgbackrest with an old copy of the archive folder #2229

SystemParadox · 2023-11-21T12:21:20Z

Hi, sorry if this has been answered somewhere but I haven't been able to find any information about it...

We're wondering how robust pgbackrest is in terms of being able to resume operations if it gets started with an old copy of the archive. For example:

1. Consider a running system with pgbackrest archiving working and a number of backups already created
2. Whilst the system is running (and actively writing to the database), rsync -a /var/lib/pgbackrest /mnt/backup/pgbackrest
3. Leave the system running for a while so that /var/lib/pgbackrest is ahead of the copy in /mnt/backup/pgbackrest
4. Delete /var/lib/pgbackrest and mount /mnt/backup/pgbackrest in its place (and restart postgres to ensure everything is updated)

What happens?

I was expecting pgbackrest verify to complain about missing archive segments. But when I tested this it all verified fine and pgbackrest seems perfectly happy with no warnings whatsoever.

Can anyone provide any more information about this? Why did pgbackrest verify work without any warnings? What would cause it to fail? Does pgbackrest have any ability to fetch missing information from postgres if for whatever reason it's missing data that was pushed to archive-push?

I feel like it would be helpful if the documentation included a section about this.

Thanks!


pgstef · 2023-11-21T13:45:52Z

I was expecting pgbackrest verify to complain about missing archive segments. But when I tested this it all verified fine and pgbackrest seems perfectly happy with no warnings whatsoever.

You're using an old "repo" so the verify command checks what is inside this repo (backup files compared to manifest files,...).
Nothing is "missing", it is just old.

That's the difference between verifying the content of the repo and monitoring it to make sure it is up-to-date.
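
To make the monitoring side concrete, here is a minimal sketch of a freshness check. It assumes you have already obtained two WAL segment names yourself: the last segment PostgreSQL reports as archived (for example from the last_archived_wal column of pg_stat_archiver) and the newest segment present in the repo (for example from the archive max shown by pgbackrest info). The function names and the 16 MiB segment size default are assumptions for illustration, not part of pgBackRest.

```python
from typing import Tuple

def parse_wal_name(name: str) -> Tuple[int, int, int]:
    """Split a 24-hex-digit WAL segment name into (timeline, log, segment)."""
    if len(name) != 24:
        raise ValueError("expected a 24-character WAL segment name: %r" % name)
    return int(name[0:8], 16), int(name[8:16], 16), int(name[16:24], 16)

def segments_behind(repo_latest: str, pg_last_archived: str,
                    wal_segment_size: int = 16 * 1024 * 1024) -> int:
    """Number of segments PostgreSQL has archived beyond the repo's newest one.

    A positive result means the repo is older than what PostgreSQL believes
    it has already pushed, which is the situation described in this issue.
    """
    r_tl, r_log, r_seg = parse_wal_name(repo_latest)
    p_tl, p_log, p_seg = parse_wal_name(pg_last_archived)
    if r_tl != p_tl:
        raise ValueError("timelines differ; names are not directly comparable")
    # Each "log" covers 4 GiB, so with 16 MiB segments there are 256 per log.
    per_log = 0x100000000 // wal_segment_size
    return (p_log * per_log + p_seg) - (r_log * per_log + r_seg)

# Hypothetical names: the repo copy stops at ...FE, PostgreSQL archived up to 1/01.
gap = segments_behind("0000000100000000000000FE", "000000010000000100000001")
print(gap)  # 3
```

A check like this is separate from verify: verify only inspects what is inside the repo, whereas this compares the repo against what the cluster reports.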

Does pgbackrest have any ability to fetch missing information from postgres if for whatever reason it's missing data that was pushed to archive-push?

Where would you find the WAL segments that PG has removed/recycled after successfully pushing them to the pgBackRest repository? That's how PG works: once the archive_command has run successfully for a WAL segment, PG won't keep that segment forever.

If you lose your most up-to-date repository, you'll have to take a new backup but won't be able to perform PITR between the last wal archive in your "old" repo and your new backup.

I feel like it would be helpful if the documentation included a section about this.

Hm, I'm not sure what we could document here exactly. I mean, that's generic PG behavior. If you lose your backups and archives, PG won't be able to provide them again a few days/weeks later. That's also the whole point of having multiple repositories defined.

SystemParadox · 2023-11-21T15:08:26Z

Ok so I realise that some of this might be "assumed knowledge" about "how PG works", but it's worth pointing out that unless you've worked with replication you're not going to know this and even then I'd argue that the docs don't fully explain how that relates to pgbackrest. I'm reasonably familiar with replication from mysql and I've really struggled to work out exactly what's going on. I get the principle, but some of the specifics have been rather elusive.

Some things that would be helpful to add that I think are missing from the docs, some of which I'm still fuzzy on:

- Exactly what data pgbackrest backup uses and when. For example, does everything always come from the WAL archive, or can it also pull data directly from postgres and if so how/when?
- The fact that you can't just move/copy the pgbackrest folder and switch to the new folder while the system is running without first doing pgbackrest stop to prevent missing segments
- Some notes about the consequences of missing (or corrupt?) WAL segments, if for example you make a copy like I described or temporarily disable archive_command or some segments are randomly deleted or something.
  - Does everything work but you just can't PITR restore within that time frame?
  - Does it force the next backup to be a full rather than a diff/incremental?
  - Can this ever cause pgbackrest backup to fail and if so how do you fix it?
- What data pgbackrest restore uses and when. In particular it took me ages to work out what the relationship was with backups vs WAL archive data (see my notes below about the use of the term "consistent")
- The fact that the archive folder should be considered part of the backup. I honestly had no idea and for some time was only backing up the backup folder.
  - and/or the consequences of only backing up the backup folder and omitting the archive folder
  - How to recover (both the postgres data and get pgbackrest working again) if the backup folder is all you have

Usage of the term "consistent"

The documentation uses the term "consistent" in a very confusing/misleading way. It took me ages to work out what it meant by statements like "Expiring archive will never remove WAL segments that are required to make a backup consistent". If you just say "consistent" then that means consistent with itself, as in a transactionally valid snapshot. What the documentation usually means is "bring it right up to date using the WAL data" which is not at all clear.

The term "consistent" makes some sense if you're working with a replication cluster as you're making it "consistent with the rest of the cluster", but even then I would argue that this usage is confusing. And again, if you're not working with replication then it's just outright misleading.

pgstef · 2023-11-21T15:22:42Z

You're simply confusing replication with PG continuous archiving and point-in-time recovery backups.
Consistency point is very commonly used in the PG docs, which pretty clearly indicates the link between WAL segments and data files (https://www.postgresql.org/docs/current/continuous-archiving.html).

Imho, we shouldn't replicate PG docs into pgBackRest docs, that would add too much detail for something very nicely explained somewhere else. But other contributors might have other opinions.

Anyway, if you want to add something to the docs, you can always submit a PR to start a discussion about it ;-)

sfrost · 2023-11-22T15:31:54Z

Some things that would be helpful to add that I think are missing from the docs, some of which I'm still fuzzy on:

Exactly what data pgbackrest backup uses and when. For example, does everything always come from the WAL archive, or can it also pull data directly from postgres and if so how/when?

pgbackrest backup does not back up anything in the WAL; it copies the appropriate data files from the PostgreSQL data directory (note that some things, such as unlogged tables, don't get copied because copying them would be pointless: PG will just delete them on restore). The archive_command and pgbackrest archive-push are what handle the copying of the WAL files to the pgbackrest repo. The pgbackrest backup command does verify that the WAL necessary for the backup got pushed to the pgbackrest repo.

The fact that you can't just move/copy the pgbackrest folder and switch to the new folder while the system is running without first doing pgbackrest stop to prevent missing segments

The pgbackrest repo can be copied at any time. You don't need to use pgbackrest stop. No segments will go missing, but of course the copied repo will only have the segments in it that were there when the copy was done. If you decide to move the pgbackrest repo then you'll need to update the pgbackrest config, and until you do that the archive-push command will fail. That's fine, though: PG will handle that gracefully and will just retry archiving the segment later. Again, no WAL is going to be lost.
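
The retry behaviour can be pictured with a small simulation. This is only a sketch of the idea, not PostgreSQL's archiver code: run_archiver and flaky_push are made-up names, and the real archiver also sleeps between attempts. The point it shows is that a segment stays queued (and on disk) until the archive command reports success for it.

```python
from collections import deque

def run_archiver(ready_segments, archive_command, max_attempts):
    """Simulate an archiver loop: retry the oldest segment until it succeeds.

    archive_command is any callable returning True on success, standing in
    for archive_command / pgbackrest archive-push. Returns (archived, pending).
    """
    pending = deque(ready_segments)
    archived = []
    attempts = 0
    while pending and attempts < max_attempts:
        attempts += 1
        if archive_command(pending[0]):
            archived.append(pending.popleft())
        # On failure the segment stays at the head of the queue and is
        # retried on the next pass; nothing is dropped.
    return archived, list(pending)

# Hypothetical scenario: the repo is unreachable for the first three attempts
# (for example while the config still points at the old location).
calls = {"n": 0}

def flaky_push(segment):
    calls["n"] += 1
    return calls["n"] > 3

archived, pending = run_archiver(["seg_A", "seg_B", "seg_C"], flaky_push, 10)
print(archived, pending)
```

In this run every segment ends up archived in order once the push starts succeeding, which matches the "no WAL is going to be lost" statement above.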

Some notes about the consequences of missing (or corrupt?) WAL segments, if for example you make a copy like I described or temporarily disable archive_command or some segments are randomly deleted or something.

If segments go missing or get corrupted then pgbackrest verify will fail, for those segments which are part of a backup or which are in the repo. In the case you describe, what you're doing is actually a completely supported operation and everything is just fine, but of course you only have the data from when the copy was done and pgbackrest has no idea that there were ever any additional WAL segments.

Does everything work but you just can't PITR restore within that time frame?

If there's missing WAL then you can't do PITR through that time frame, yes.

Does it force the next backup to be a full rather than a diff/incremental?

No, because it doesn't need to be a full backup. A new incremental backup will work just fine. This is because the 'incremental' in pgbackrest is copying the data files that have changed since the full backup and isn't working from the WAL. There will be WAL generated as part of whatever new backup is done and that WAL will be verified by pgbackrest to exist at the end of the backup.
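
As a rough illustration of why missing WAL does not force a full backup, here is a sketch of file-based change selection. It is not pgBackRest's implementation: the manifest layout (path mapped to a size and modification time pair) and the function name files_to_copy are assumptions chosen for the example. Note that no WAL is consulted anywhere.

```python
def files_to_copy(prior_manifest, current_files):
    """Return paths that are new or whose (size, mtime) differ from the prior backup.

    Both arguments map a relative path to a (size, mtime) tuple.
    """
    return sorted(
        path
        for path, meta in current_files.items()
        if prior_manifest.get(path) != meta
    )

# Hypothetical state recorded by the prior backup.
prior = {
    "base/1/100": (8192, 10),
    "base/1/200": (8192, 10),
}
# Hypothetical state of the data directory now.
current = {
    "base/1/100": (8192, 10),    # unchanged, so it is skipped
    "base/1/200": (16384, 20),   # grew and was modified
    "base/1/300": (8192, 30),    # created after the prior backup
}

changed = files_to_copy(prior, current)
print(changed)  # ['base/1/200', 'base/1/300']
```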

Can this ever cause pgbackrest backup to fail and if so how do you fix it?

Not sure what "this" is referring to here, but if you mean the copying of the repo and the restoration of it later and then have it be used by a new pgbackrest backup command then the short answer is 'no, that should not ever cause pgbackrest backup to fail'.

What data pgbackrest restore uses and when. In particular it took me ages to work out what the relationship was with backups vs WAL archive data (see my notes below about the use of the term "consistent")

The pgbackrest restore will use the data from whatever backups are referenced by the backup set which is restored (eg: if you have an incremental X that you are restoring, it may depend on incremental Y, differential Z, and full ZZ, and pgbackrest will pull the appropriate files from each of X, Y, Z and ZZ). pgbackrest will also configure PG to have a restore_command which will fetch, when PG starts up, the necessary WAL (all of the WAL which was generated during the X incremental backup, following from the above example, and then any WAL after the end of the X backup until either the end of the WAL or until the recovery target has been reached).
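
The X, Y, Z, ZZ example can be sketched as a lookup over the backup chain. This is an illustration only: the data layout (each backup listed with the set of files it physically stored) and the name resolve_sources are assumptions, not pgBackRest's manifest format.

```python
def resolve_sources(manifest, chain):
    """For each file in the restored backup's manifest, find which backup holds it.

    chain is ordered newest first: the backup being restored, then each
    backup it depends on, ending with the full backup.
    """
    sources = {}
    for path in manifest:
        for label, stored_files in chain:
            if path in stored_files:
                sources[path] = label  # newest backup that stored this file
                break
        else:
            raise LookupError("no backup in the chain contains %s" % path)
    return sources

# Hypothetical chain: incremental X, incremental Y, differential Z, full ZZ.
chain = [
    ("X", {"a"}),
    ("Y", {"b"}),
    ("Z", {"c"}),
    ("ZZ", {"a", "b", "c", "d"}),
]
sources = resolve_sources(["a", "b", "c", "d"], chain)
print(sources)  # {'a': 'X', 'b': 'Y', 'c': 'Z', 'd': 'ZZ'}
```

The WAL fetched through restore_command is a separate step that happens when PG starts up, after these files are in place.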

The fact that the archive folder should be considered part of the backup. I honestly had no idea and for some time was only backing up the backup folder.

You should be backing up the entire pgbackrest repo. Do not think you can back up only a part of it. If that's unclear in the pgbackrest documentation then please point out to us where it's unclear and we can work on improving that.

and/or the consequences of only backing up the backup folder and omitting the archive folder

Only backing up part of the pgbackrest repo is not a supported operation. If you do that, you won't be able to restore.

How to recover (both the postgres data and get pgbackrest working again) if the backup folder is all you have

There is no way to get back to a consistent state if you've thrown away the absolutely required WAL archive data.

Usage of the term "consistent"

The documentation uses the term "consistent" in a very confusing/misleading way. It took me ages to work out what it meant by statements like "Expiring archive will never remove WAL segments that are required to make a backup consistent". If you just say "consistent" then that means consistent with itself, as in a transactionally valid snapshot. What the documentation usually means is "bring it right up to date using the WAL data" which is not at all clear.

The term "consistent" makes some sense if you're working with a replication cluster as you're making it "consistent with the rest of the cluster", but even then I would argue that this usage is confusing. And again, if you're not working with replication then it's just outright misleading.

A restored PG cluster is not in a consistent state until all of the WAL which was generated during a backup has been replayed. If all you have are the backups and none of the WAL then all you have is an inconsistent PG system which has whatever happened to be copied at the time that pgbackrest got to it. There are ways to recover some of this inconsistent data from the PG cluster but it won't be consistent- foreign keys won't be valid, primary keys won't be valid, etc, etc, etc.
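
The distinction between "WAL needed for consistency" and "WAL needed to roll further forward" can be sketched as follows. Segment identifiers are simplified to plain integers and recovery_reach is a made-up name; this is a model of the reasoning, not how pgBackRest or PostgreSQL track WAL.

```python
def recovery_reach(backup_start, backup_stop, available):
    """Return (consistent, last_replayable_segment).

    consistent is True only if every segment generated during the backup
    (backup_start..backup_stop inclusive) is available. Replay can then
    continue past backup_stop until the first missing segment.
    """
    have = set(available)
    if not all(seg in have for seg in range(backup_start, backup_stop + 1)):
        return False, None  # the backup cannot be made consistent
    last = backup_stop
    while last + 1 in have:
        last += 1
    return True, last

# Backup spanned segments 10..12; segment 15 is missing from the archive.
print(recovery_reach(10, 12, [10, 11, 12, 13, 14, 16]))  # (True, 14)

# A segment from during the backup is missing: no consistent restore.
print(recovery_reach(10, 12, [10, 12, 13]))  # (False, None)
```

In the first case the restore works but PITR cannot go past segment 14; in the second, which is what keeping only the backup folder amounts to, the restored cluster never reaches consistency.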

SystemParadox · 2024-02-09T22:29:17Z

Thank you both for the replies.

Imho, we shouldn't replicate PG docs into pgBackRest docs, that would add too much detail for something very nicely explained somewhere else. But other contributors might have other opinions.

I would also agree. Assumed knowledge is fine, but the docs should have a clear statement that says something like "pgbackrest is mostly just a convenient interface around the postgres continuous archiving mechanism... these docs assume you already know about that - go read about that first [here]". I read the whole pgbackrest docs several times trying to work all this out and never got the slightest hint. There's even a "concepts" section but it doesn't mention this.

Anyway, if you want to add something to the docs, you can always submit a PR to start a discussion about it ;-)

I can't do that until I actually understand it! :)

You should be backing up the entire pgbackrest repo. Do not think you can back up only a part of it. If that's unclear in the pgbackrest documentation then please point out to us where it's unclear and we can work on improving that.

My understanding so far was that the archive was only needed if you wanted to bring a backup right up to date or do PITR, but what you seem to be saying is that if you don't have the archive you can't restore at all?

I can't point to anything that's unclear in the docs because that's precisely the point - there isn't anything that explains what the pgbackrest output files are for. There's a folder literally called backup but there isn't anything to point out that this isn't "the backup" and that you need the archive as well.

A restored PG cluster is not in a consistent state until all of the WAL which was generated during a backup has been replayed. If all you have are the backups and none of the WAL then all you have is an inconsistent PG system which has whatever happened to be copied at the time that pgbackrest got to it. There are ways to recover some of this inconsistent data from the PG cluster but it won't be consistent- foreign keys won't be valid, primary keys won't be valid, etc, etc, etc.

Ok I think this is probably the key point I've been missing - you need to replay the WAL from during the backup in order to make it consistent, which would mean consistent in the way that I'm expecting.

You can also replay the rest of the WAL after the end of the backup (which I think pgbackrest does by default?) to bring it right up to date, but this isn't what the docs are referring to when they say consistent - which I definitely think needs to be clarified.

So to summarise what I think we need to add:

- A clear statement+link to background knowledge required
- Explanations of what files pgbackrest generates and where this data comes from
- When/what these files are used for
  - Including a clear note that you need the archive folder as well as the backup folder with the explanation of why
- Rather than just saying "WAL" everywhere, state much more clearly what WAL we're talking about - before, during, between backups, or after the backup
  - On a related note, the docs mention that you can do aggressive WAL expiry to drop all PITR support but don't say how - you have to combine several options to achieve this, and it took me quite a while to work out

dwsteele assigned pgstef Nov 21, 2023

dwsteele added the question label Nov 21, 2023
