A missing wal in S3 storage after wal-push connection timeout #1578

Open
M1ha-Shvn opened this issue Oct 12, 2023 · 3 comments
@M1ha-Shvn

Database name

PostgreSQL 9.6.24
wal-g version v2.0.1 b7d53dd 2022.08.25_09:34:18 PostgreSQL

Issue description

wal-push did not save a WAL file to S3 after a connection timeout error

Describe your problem

I use wal-push to upload WALs to Amazon S3 storage. Yesterday I ran a full backup using the backup-push command. The backup was successful. After that I tried to restore it to a replica and started Postgres. What I got was:

LOG:  started streaming WAL from primary at B7AC/E9000000 on timeline 8
FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000080000B7AC000000E9 has already been removed

I started digging, found out that this file is indeed absent from the bucket, and found this in the log:

ERROR: 2023/10/11 04:54:10.401599 failed to upload 'shard_2/wal_005/000000080000B7AC000000E9.lz4' to bucket 'db-backup': RequestError: send request failed
caused by: Put "https://db-backup.s3.dualstack.eu-west-1.amazonaws.com/shard_2/wal_005/000000080000B7AC000000E9.lz4": write tcp 172.16.200.144:57186->3.5.70.154:443: write: connection timed out
INFO: 2023/10/11 04:54:10.401626 FILE PATH: 000000080000B7AC000000E9.lz4
ERROR: 2023/10/11 04:54:10.401641 upload: could not Upload 'pg_xlog/000000080000B7AC000000E9'
: failed to upload 'shard_2/wal_005/000000080000B7AC000000E9.lz4' to bucket 'db-backup': RequestError: send request failed
caused by: Put "https://db-backup.s3.dualstack.eu-west-1.amazonaws.com/shard_2/wal_005/000000080000B7AC000000E9.lz4": write tcp 172.16.200.144:57186->3.5.70.154:443: write: connection timed out
INFO: 2023/10/11 04:54:28.710598 FILE PATH: 000000080000B7AC000000EA.lz4

No other log entries about the missing WAL were found.

I don't understand why Postgres hasn't retried the upload. Did archive_command return a zero exit code? But how could that happen? If I run the command manually with a non-existent file, it gives me exit code 1, not 0. I'm not sure how to reproduce the timeout situation.

Configuration

postgresql.conf:

archive_mode = 'on'
archive_command = '/usr/local/scripts/walg-push.sh %p &> /dev/null'
archive_timeout = '60'
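
One detail that may be worth ruling out here (a sketch of a check, not a confirmed cause): PostgreSQL runs archive_command through /bin/sh -c, and &> is a bash-specific redirection. If /bin/sh on the server is not bash (for example dash on Debian/Ubuntu), 'cmd &> /dev/null' is parsed as 'cmd &' followed by '> /dev/null', so the shell backgrounds the script and reports exit status 0 to the archiver regardless of what wal-push actually returned.

# Demonstration of the parsing difference (not specific to wal-g):
dash -c 'false &> /dev/null'; echo "dash: $?"   # prints 0: 'false' is backgrounded
bash -c 'false &> /dev/null'; echo "bash: $?"   # prints 1: '&>' redirects and waits

# Reproduce what PostgreSQL actually executes, with a segment name that cannot succeed
# (script path taken from archive_command above), and check the status the archiver would see:
sh -c '/usr/local/scripts/walg-push.sh pg_xlog/000000080000FFFF000000FF &> /dev/null'
echo "archive_command exit status: $?"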

walg-push.sh file:

set -o noclobber  # Avoid overwriting existing files (echo "hi" > foo)
set -o errexit    # Used to exit upon error, avoiding cascading errors
set -o nounset    # Exposes unset variables
set -o pipefail   # Unveils hidden failures

set -o allexport
source /etc/default/wal-g
set +o allexport

/usr/local/bin/wal-g wal-push $1

env file:

# AWS S3 setup
AWS_ACCESS_KEY_ID=my-key
AWS_SECRET_ACCESS_KEY=my-key
AWS_REGION=eu-west-1
AWS_ENDPOINT=https://s3.dualstack.eu-west-1.amazonaws.com
AWS_S3_FORCE_PATH_STYLE=False
WALG_S3_PREFIX=s3://db-backup/shard_2
WALG_S3_STORAGE_CLASS=STANDARD_IA
WALG_S3_MAX_RETRIES=5

# PostgreSQL setup
PGHOST=/var/run/postgresql

# Wal-G setup
TOTAL_BG_UPLOADED_LIMIT=32
WALG_COMPRESSION_METHOD=lz4
WALG_DELTA_MAX_STEPS=0
WALG_DELTA_ORIGIN=LATEST
WALG_DOWNLOAD_CONCURRENCY=16
WALG_UPLOAD_CONCURRENCY=16
WALG_UPLOAD_DISK_CONCURRENCY=6
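
For reference, a quick way to see what actually reached the bucket with this env file (a sketch using commands shipped with wal-g v2.0.1; wal-verify connects to the running instance via PGHOST above):

set -o allexport
source /etc/default/wal-g
set +o allexport

/usr/local/bin/wal-g backup-list            # base backups present in storage
/usr/local/bin/wal-g wal-verify integrity   # reports gaps in the archived WAL history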
@x4m
Collaborator

x4m commented Oct 12, 2023

Hi!
Yes, wal-g will return a non-zero exit code upon failure, and Postgres will retry archiving the file. Please check:

  1. what is in pg_stat_archiver
  2. change the AWS endpoint and run your archive command manually. Does it return a non-zero exit code?
  3. your restore_command. The error in the log you posted is not from wal-fetch, it is from streaming replication
  4. the contents of pg_wal/archive_status. There are .ready and .done files for segments that are ready for archiving and done with archiving (a command sketch of these checks follows below).
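
Roughly, those checks as shell commands (the script path and segment name are taken from this issue; the data directory path is an assumption based on the archive_cleanup_command shown further down):

# 1. Archiver statistics: failure counters and the last archived segment.
psql -Xc "SELECT * FROM pg_stat_archiver;"

# 2. Re-run the archive script by hand, from the data directory, since %p is relative to it
#    (only possible while the segment still exists in pg_xlog), then check the exit status.
cd /srv/postgresql/9.6/main && /usr/local/scripts/walg-push.sh pg_xlog/000000080000B7AC000000E9; echo "exit: $?"

# 3. restore_command is configuration; it is shown further down in this thread.

# 4. Archiver bookkeeping: a .ready file means "still to be archived", .done means "considered archived".
ls -l /srv/postgresql/9.6/main/pg_xlog/archive_status/ | grep 000000080000B7AC000000E9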

@M1ha-Shvn
Author

M1ha-Shvn commented Oct 12, 2023

Hi, thanks for the quick reply.

  1. what is in pg_stat_archiver

Seems there are no errors?

archived_count |    last_archived_wal     |      last_archived_time       | failed_count | last_failed_wal | last_failed_time |          stats_reset
----------------+--------------------------+-------------------------------+--------------+-----------------+------------------+-------------------------------
          21949 | 000000080000B7EB00000004 | 2023-10-12 09:54:30.988245+00 |            0 |                 |                  | 2023-10-10 12:05:28.317102+00
  2. change the AWS endpoint and run your archive command manually. Does it return a non-zero exit code?

That was a sporadic timeout error; there are lots of WALs before and after the missing one that were archived successfully. I've tried sending another existing WAL to a non-existent host, and I've tried to send a non-existent WAL. In both cases I got exit code 1, which seems correct. I'm not sure how to reproduce the very same timeout error in order to check the exact situation.

  3. your restore_command. The error in the log you posted is not from wal-fetch, it is from streaming replication

The problem here is not with restore. The file was not uploaded to S3 by the wal-push command; I've already gone to the storage and checked that. Nevertheless, here are the restore commands:

# recovery.conf
restore_command = '/usr/local/scripts/walg-fetch.sh "%f" "%p"'
archive_cleanup_command = 'pg_archivecleanup /srv/postgresql/9.6/main/pg_xlog "%r"'

# /usr/local/scripts/walg-fetch.sh
set -o noclobber  # Avoid overwriting existing files (echo "hi" > foo)
set -o errexit    # Used to exit upon error, avoiding cascading errors
set -o nounset    # Exposes unset variables
set -o pipefail   # Unveils hidden failures

set -o allexport
source /etc/default/wal-g
set +o allexport

/usr/local/bin/wal-g wal-fetch $1 $2
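
For completeness, the absence of the segment in storage can also be confirmed from the command line (a sketch; wal-fetch should exit non-zero if the segment really is missing from the bucket):

set -o allexport
source /etc/default/wal-g
set +o allexport

/usr/local/bin/wal-g wal-fetch 000000080000B7AC000000E9 /tmp/000000080000B7AC000000E9
echo "wal-fetch exit status: $?"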

  4. the contents of pg_wal/archive_status. There are .ready and .done files for segments that are ready for archiving and done with archiving

The database is under heavy load. I've also checked that there is no such file on the master server now, in either pg_xlog or pg_xlog/archive_status: neither .ready nor .done. The bug in wal-push occurred yesterday; since it was not expected, it went unnoticed, and I found it only today while restoring from the backup. But the fact that the file has been removed from both pg_xlog and archive_status may mean that Postgres "thinks" it has been uploaded successfully.
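
If a copy of 000000080000B7AC000000E9 can still be found anywhere (another replica's pg_xlog, a filesystem snapshot, etc.), it should be possible to close the gap in the archive by pushing it manually; a sketch, with a hypothetical path to the recovered copy:

set -o allexport
source /etc/default/wal-g
set +o allexport

# /path/to/recovered/... is a placeholder for wherever a copy of the segment is found
/usr/local/bin/wal-g wal-push /path/to/recovered/000000080000B7AC000000E9
echo "wal-push exit status: $?"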

@M1ha-Shvn
Author

M1ha-Shvn commented Oct 13, 2023

I got a very similar error while doing a backup with the backup-push command. It is weird that there were no retries, even though I have WALG_S3_MAX_RETRIES=5.
Log:

ERROR: 2023/10/13 07:15:30.005112 failed to upload 'shard_2/basebackups_005/base_000000080000B806000000A8_D_000000080000B7A700000047/tar_partitions/part_090.tar.lz4' to bucket 'db-backup': MultipartUpload: upload multipart failed
        upload id: fiZmvFty6a_GcOq4q4iQ81KBySt3IvMSw_LOa9bjbGodNitPA1Hizb2EAubMRTZgT.puzOTzJBTV5G8ZMJL4fnlvejD0_9yMb1j5DwEGg0irJG257jKTlkhZpiV2jo60
caused by: RequestError: send request failed
caused by: Put "https://db-backup.s3.dualstack.eu-west-1.amazonaws.com/shard_2/basebackups_005/base_000000080000B806000000A8_D_000000080000B7A700000047/tar_partitions/part_090.tar.lz4?partNumber=10&uploadId=fiZmvFty6a_GcOq4q4iQ81KBySt3IvMSw_LOa9bjbGodNitPA1Hizb2EAubMRTZgT.puzOTzJBTV5G8ZMJL4fnlvejD0_9yMb1j5DwEGg0irJG257jKTlkhZpiV2jo60": write tcp 172.16.200.144:49530->52.218.108.216:443: write: connection timed out
ERROR: 2023/10/13 07:15:30.005127 upload: could not upload 'base_000000080000B806000000A8_D_000000080000B7A700000047/tar_partitions/part_090.tar.lz4'
ERROR: 2023/10/13 07:15:30.005144 failed to upload 'shard_2/basebackups_005/base_000000080000B806000000A8_D_000000080000B7A700000047/tar_partitions/part_090.tar.lz4' to bucket 'db-backup': MultipartUpload: upload multipart failed
        upload id: fiZmvFty6a_GcOq4q4iQ81KBySt3IvMSw_LOa9bjbGodNitPA1Hizb2EAubMRTZgT.puzOTzJBTV5G8ZMJL4fnlvejD0_9yMb1j5DwEGg0irJG257jKTlkhZpiV2jo60
caused by: RequestError: send request failed
caused by: Put "https://db-backup.s3.dualstack.eu-west-1.amazonaws.com/shard_2/basebackups_005/base_000000080000B806000000A8_D_000000080000B7A700000047/tar_partitions/part_090.tar.lz4?partNumber=10&uploadId=fiZmvFty6a_GcOq4q4iQ81KBySt3IvMSw_LOa9bjbGodNitPA1Hizb2EAubMRTZgT.puzOTzJBTV5G8ZMJL4fnlvejD0_9yMb1j5DwEGg0irJG257jKTlkhZpiV2jo60": write tcp 172.16.200.144:49530->52.218.108.216:443: write: connection timed out
ERROR: 2023/10/13 07:15:30.005154 Unable to continue the backup process because of the loss of a part 90.
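
Since backup-push aborts entirely when a part is lost (as in the log above), the backup has to be started again after such a transient network error; as far as I understand, wal-g does not resume a failed multipart upload. A sketch of the re-run, with the data directory path assumed from the archive_cleanup_command earlier in this thread:

set -o allexport
source /etc/default/wal-g
set +o allexport

/usr/local/bin/wal-g backup-push /srv/postgresql/9.6/main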
