Remediation of MISSING_LOST segments. #1590

Open
josephglanville opened this issue Nov 1, 2023 · 3 comments

@josephglanville

Hi,

We are using wal-g with PostgreSQL and it's been great thus far.

Unfortunately, we ran into repeated OOM events that caused unclean shutdowns and also prevented some WAL segments from being archived. Those segments are now long lost.

We know these segments are gone and aren't recoverable at this stage. What is the best way to get wal-verify integrity reporting SUCCESS again?

For posterity here is the output of the integrity job:

+-----+--------------------------+--------------------------+----------------+--------------+
| TLI | START                    | END                      | SEGMENTS COUNT |       STATUS |
+-----+--------------------------+--------------------------+----------------+--------------+
| 111 | 0000006F00000B94000000B7 | 0000006F00000C89000000F5 |          62783 |        FOUND |
| 112 | 0000007000000C89000000F6 | 0000007000000CB40000005B |          10854 |        FOUND |
| 113 | 0000007100000CB40000005C | 0000007100000CBE000000D6 |           2683 |        FOUND |
| 114 | 0000007200000CBE000000D7 | 0000007200000CBF00000016 |             64 |        FOUND |
| 114 | 0000007200000CBF00000017 | 0000007200000CBF00000017 |              1 | MISSING_LOST |
| 115 | 0000007300000CBF00000018 | 0000007300000CBF00000066 |             79 |        FOUND |
| 115 | 0000007300000CBF00000067 | 0000007300000CBF00000067 |              1 | MISSING_LOST |
| 116 | 0000007400000CBF00000068 | 0000007400000CBF000000AA |             67 |        FOUND |
| 116 | 0000007400000CBF000000AB | 0000007400000CBF000000AB |              1 | MISSING_LOST |
| 117 | 0000007500000CBF000000AC | 0000007500000CC10000003E |            403 |        FOUND |
| 118 | 0000007600000CC10000003F | 0000007600000CC40000007E |            832 |        FOUND |
+-----+--------------------------+--------------------------+----------------+--------------+
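The rows above can be cross-checked mechanically. A minimal sketch, assuming the default 16 MiB wal_segment_size (256 segments per log file) and the table layout shown; the function names are hypothetical and not part of wal-g. It decodes the 24-hex-digit segment names into timeline, log and segment numbers, recomputes the SEGMENTS COUNT column, and lists the MISSING_LOST segment names:

```python
import re

# Assumption: default 16 MiB wal_segment_size, so each log file holds 256 segments.
SEGMENTS_PER_LOG = 256


def parse_wal_name(name):
    """Split a 24-hex-digit WAL file name into (timeline, log, segment)."""
    if not re.fullmatch(r"[0-9A-F]{24}", name):
        raise ValueError(f"not a WAL segment name: {name!r}")
    return int(name[0:8], 16), int(name[8:16], 16), int(name[16:24], 16)


def segment_count(start, end):
    """Number of segments in the inclusive range start..end on one timeline."""
    tli_s, log_s, seg_s = parse_wal_name(start)
    tli_e, log_e, seg_e = parse_wal_name(end)
    if tli_s != tli_e:
        raise ValueError("start and end are on different timelines")
    return (log_e - log_s) * SEGMENTS_PER_LOG + (seg_e - seg_s) + 1


def missing_lost(report):
    """Return every segment name covered by a MISSING_LOST row of the report."""
    names = []
    for line in report.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) != 5 or cells[4] != "MISSING_LOST":
            continue  # border, header, or a row with another status
        tli, log, seg = parse_wal_name(cells[1])
        first = log * SEGMENTS_PER_LOG + seg
        for n in range(first, first + segment_count(cells[1], cells[2])):
            names.append(
                f"{tli:08X}{n // SEGMENTS_PER_LOG:08X}{n % SEGMENTS_PER_LOG:08X}"
            )
    return names
```

For example, the first 8 hex digits of 0000007200000CBF00000017 are 00000072, which is timeline 114, matching the TLI column.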

Thanks in advance!

@x4m
Collaborator

x4m commented Nov 2, 2023

Hi! You can archive an empty file under each of the missing names: 0000007200000CBF00000017, 0000007300000CBF00000067 and 0000007400000CBF000000AB.
I understand that this does not sound like a proper solution... Maybe we could add functionality like "mute the absence of specific files" or something like that...
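A rough sketch of that workaround, not an official procedure. It creates zero-byte placeholder files in a staging directory and builds one wal-push command per file, reusing the envdir invocation from the reporter's archive command. The staging directory, the helper names and the dry-run default are assumptions made for illustration. Note that an empty placeholder is not a valid WAL segment, so this only fills the gap in the listing; recovery that needs to replay through these segments would still not work.

```python
import subprocess
from pathlib import Path

# The three segments reported as MISSING_LOST in this issue.
MISSING = [
    "0000007200000CBF00000017",
    "0000007300000CBF00000067",
    "0000007400000CBF000000AB",
]


def make_placeholders(staging_dir, names):
    """Create one zero-byte file per segment name; fail if a file already exists."""
    staging = Path(staging_dir)
    staging.mkdir(parents=True, exist_ok=True)
    paths = []
    for name in names:
        path = staging / name
        path.touch(exist_ok=False)  # never overwrite a real segment by accident
        paths.append(path)
    return paths


def push_commands(paths, envdir="/run/etc/wal-e.d/env"):
    """Build the same envdir + wal-g wal-push invocation PostgreSQL was using."""
    return [["envdir", envdir, "wal-g", "wal-push", str(p)] for p in paths]


def run(commands, dry_run=True):
    """Print the commands by default; execute them only when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
```

Usage would look like `run(push_commands(make_placeholders("/tmp/wal-placeholders", MISSING)))`, where the path is hypothetical. Reviewing the printed commands before rerunning with `dry_run=False` seems prudent, since the placeholders must be created outside pg_wal.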

@x4m
Collaborator

x4m commented Nov 2, 2023

Meanwhile, I'd appreciate it if you could describe in more detail how you lost the WALs. We should consider whether WAL-G could prevent this.

@josephglanville
Author

Hi @x4m!

What happened was that we made some adjustments to a Patroni-managed cluster in order to band-aid some application problems. Namely, we increased max_connections fairly substantially. We also increased the memory allocation, but apparently not by a sufficient amount.

The k8s pods then reached a near-OOM state, and PostgreSQL failed to archive segments with this message:

DETAIL:  The failed archive command was: envdir "/run/etc/wal-e.d/env" wal-g wal-push "pg_wal/0000007300000CBF00000067"

This was followed by one of the backends getting OOM-killed, and then the postmaster shutting down and killing the archive task.

After failover I thought these segments would get uploaded, but they were not. It does not appear that PostgreSQL even tried to upload them, but I am not that familiar with what happens after WAL archiving fails, or whether the status of not-yet-archived segments is persisted to the secondary.
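For anyone checking this on their own nodes: PostgreSQL tracks archiving state per segment with marker files in pg_wal/archive_status, named NAME.ready while the segment is waiting for archive_command to succeed and NAME.done once it has. A minimal sketch for inspecting that directory on a given node; the helper names are hypothetical and the data directory path is whatever your deployment uses:

```python
from pathlib import Path


def pending_segments(pg_wal_dir):
    """Names of WAL files this node still considers waiting to be archived."""
    status_dir = Path(pg_wal_dir) / "archive_status"
    # Path.stem drops only the final ".ready" suffix.
    return sorted(p.stem for p in status_dir.glob("*.ready"))


def archive_state(pg_wal_dir, segment):
    """Return "ready", "done", or "unknown" for one segment on this node."""
    status_dir = Path(pg_wal_dir) / "archive_status"
    if (status_dir / f"{segment}.ready").exists():
        return "ready"
    if (status_dir / f"{segment}.done").exists():
        return "done"
    return "unknown"
```

Running `archive_state(pg_wal, "0000007300000CBF00000067")` on both the old primary and the promoted node would show whether either of them still had the segment queued for archiving.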

There was a trace from wal-g, but only because it got a SIGQUIT from postgres, likely during the ungraceful shutdown; other than that, there was nothing to indicate what went wrong. As far as I can tell, wal-g wasn't killed by the OOM killer, as I have no logs of it being killed.
