Fix PiR if (CLONE_)WALG_DOWNLOAD_CONCURRENCY not set #781

Open · wants to merge 2 commits into base: master
Conversation

@hau21um commented Sep 29, 2022

If the WALG_DOWNLOAD_CONCURRENCY env var is not set, a restore/clone will fail right after the base backup, while attempting Point-in-Time Recovery.

The POOL_SIZE script variable is populated from WALG_DOWNLOAD_CONCURRENCY.

The cause of the failure is that the restore_command.sh script calls wal-fetch with -p $POOL_SIZE; since POOL_SIZE has no value, the call fails with a usage error.

This MR arranges that "-p" is passed only if POOL_SIZE is defined and non-empty.
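
For illustration, a minimal sketch of the guard described above, assuming a bash restore_command.sh and the wal-fetch wrapper seen in the logs below (the real script wires in more options):

```bash
#!/bin/bash
# Sketch only: a simplified restore_command, not the actual Spilo script.
# $1 = WAL segment name, $2 = destination path (as in the usage line
# printed by /scripts/wal-e-wal-fetch.sh in the logs).
POOL_SIZE="$WALG_DOWNLOAD_CONCURRENCY"

# Build the "-p" option only when POOL_SIZE is defined and non-empty.
# With an unset WALG_DOWNLOAD_CONCURRENCY, an unquoted "-p $POOL_SIZE"
# expands to a bare "-p", which consumes the WAL segment argument and
# triggers the "Usage: ... wal-fetch ..." error shown in the logs.
POOL_OPT=()
if [[ -n "$POOL_SIZE" ]]; then
    POOL_OPT=(-p "$POOL_SIZE")
fi

exec /scripts/wal-e-wal-fetch.sh wal-fetch "${POOL_OPT[@]}" "$1" "$2"
```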

If for any reason WALG_DOWNLOAD_CONCURRENCY is not set, POOL_SIZE is also left empty. During a restore/clone this leads to errors like the following, right after the base backup tar files have been restored successfully:

------------------------------
(stdout):
wal_e.worker.s3.s3_worker INFO     MSG: beginning partition download
        DETAIL: The partition being downloaded is part_00000343.tar.lzo.
        HINT: The absolute S3 key is spilo/xxxxxxx/wal/13/basebackups_005/base_00000021000000EE00000009_00000040/tar_partitions/part_00000343.tar.lzo.
        STRUCTURED: time=2022-09-29T17:45:57.018619-00 pid=75
2022-09-29 17:46:01,904 INFO: Lock owner: None; I am xxxxxxxx-cluster-0
2022-09-29 17:46:01,904 INFO: not healthy enough for leader race
2022-09-29 17:46:01,904 INFO: bootstrap in progress
.
.
2022-09-29 17:46:27,465 maybe_pg_upgrade INFO: No PostgreSQL configuration items changed, nothing to reload.
2022-09-29 17:46:27,492 maybe_pg_upgrade INFO: Cluster version: 13, bin version: 13
2022-09-29 17:46:27,492 maybe_pg_upgrade INFO: Trying to perform point-in-time recovery
2022-09-29 17:46:27,493 maybe_pg_upgrade INFO: Running custom bootstrap script: true
2022-09-29 17:46:27,510 maybe_pg_upgrade WARNING: Removing unexpected parameter=slow_query_log value=off from the config
2022-09-29 17:46:27.716 UTC user= db= pid=4046 LOG:  Auto detecting pg_stat_kcache.linux_hz parameter...
2022-09-29 17:46:27.717 UTC user= db= pid=4046 LOG:  pg_stat_kcache.linux_hz is set to 500000
2022-09-29 17:46:27,721 maybe_pg_upgrade INFO: postmaster pid=4046
/var/run/postgresql:5432 - no response
2022-09-29 17:46:27.889 UTC user= db= pid=4046 LOG:  redirecting log output to logging collector process
2022-09-29 17:46:27.889 UTC user= db= pid=4046 HINT:  Future log output will appear in directory "../pg_log".
2022-09-29 17:46:28,746 maybe_pg_upgrade ERROR: postmaster is not running
Traceback (most recent call last):
  File "/scripts/maybe_pg_upgrade.py", line 51, in perform_pitr
    raise Exception('Point-in-time recovery failed')
Exception: Point-in-time recovery failed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/scripts/maybe_pg_upgrade.py", line 140, in <module>
    main()
  File "/scripts/maybe_pg_upgrade.py", line 87, in main
    perform_pitr(upgrade, cluster_version, bin_version, config['bootstrap'])
  File "/scripts/maybe_pg_upgrade.py", line 70, in perform_pitr
    raise Exception('Point-in-time recovery failed.\nLOGS:\n--\n' + logs)
Exception: Point-in-time recovery failed.

(postgresql-X.csv):
2022-09-29 17:46:27.889 UTC,,,4046,,6335d9f3.fce,1,,2022-09-29 17:46:27 UTC,,0,LOG,00000,"ending log output to stderr",,"Future log output will go to log destination ""csvlog"".",,,,,,,"","postmaster"
2022-09-29 17:46:27.889 UTC,,,4046,,6335d9f3.fce,2,,2022-09-29 17:46:27 UTC,,0,LOG,00000,"starting PostgreSQL 13.7 (Ubuntu 13.7-1.pgdg18.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0, 64-bit",,,,,,,,,"","postmaster"
2022-09-29 17:46:27.889 UTC,,,4046,,6335d9f3.fce,3,,2022-09-29 17:46:27 UTC,,0,LOG,00000,"listening on IPv4 address ""0.0.0.0"", port 5432",,,,,,,,,"","postmaster"
2022-09-29 17:46:27.889 UTC,,,4046,,6335d9f3.fce,4,,2022-09-29 17:46:27 UTC,,0,LOG,00000,"listening on IPv6 address ""::"", port 5432",,,,,,,,,"","postmaster"
2022-09-29 17:46:27.894 UTC,,,4046,,6335d9f3.fce,5,,2022-09-29 17:46:27 UTC,,0,LOG,00000,"listening on Unix socket ""/var/run/postgresql/.s.PGSQL.5432""",,,,,,,,,"","postmaster"
2022-09-29 17:46:27.901 UTC,,,4049,,6335d9f3.fd1,1,,2022-09-29 17:46:27 UTC,,0,LOG,00000,"database system was interrupted; last known up at 2022-09-28 13:10:06 UTC",,,,,,,,,"","startup"
2022-09-29 17:46:27.901 UTC,,,4049,,6335d9f3.fd1,2,,2022-09-29 17:46:27 UTC,,0,LOG,00000,"creating missing WAL directory ""pg_wal/archive_status""",,,,,,,,,"","startup"
2022-09-29 17:46:28.192 UTC,,,4049,,6335d9f3.fd1,3,,2022-09-29 17:46:27 UTC,,0,LOG,00000,"starting point-in-time recovery to 2022-09-29 00:00:00.666+00",,,,,,,,,"","startup"
2022-09-29 17:46:28.222 UTC,,,4049,,6335d9f3.fd1,4,,2022-09-29 17:46:27 UTC,,0,LOG,00000,"invalid checkpoint record",,,,,,,,,"","startup"
2022-09-29 17:46:28.222 UTC,,,4049,,6335d9f3.fd1,5,,2022-09-29 17:46:27 UTC,,0,FATAL,XX000,"could not locate required checkpoint record",,"If you are restoring from a backup, touch ""/home/postgres/pgdata/pgroot/data/recovery.signal"" and add required recovery options.
If you are not restoring from a backup, try removing the file ""/home/postgres/pgdata/pgroot/data/backup_label"".
Be careful: removing ""/home/postgres/pgdata/pgroot/data/backup_label"" will result in a corrupt cluster if restoring from a backup.",,,,,,,"","startup"
2022-09-29 17:46:28.224 UTC,,,4046,,6335d9f3.fce,6,,2022-09-29 17:46:27 UTC,,0,LOG,00000,"startup process (PID 4049) exited with exit code 1",,,,,,,,,"","postmaster"
2022-09-29 17:46:28.224 UTC,,,4046,,6335d9f3.fce,7,,2022-09-29 17:46:27 UTC,,0,LOG,00000,"aborting startup due to startup process failure",,,,,,,,,"","postmaster"
2022-09-29 17:46:28.229 UTC,,,4046,,6335d9f3.fce,8,,2022-09-29 17:46:27 UTC,,0,LOG,00000,"database system is shut down",,,,,,,,,"","postmaster"

(postgresql-X.log):
2022-09-29 17:46:27.889 UTC user= db= pid=4046 LOG:  ending log output to stderr
2022-09-29 17:46:27.889 UTC user= db= pid=4046 HINT:  Future log output will go to log destination "csvlog".
Thu Sep 29 17:46:28 UTC 2022
Usage: /scripts/wal-e-wal-fetch.sh wal-fetch [--prefetch PREFETCH] WAL_SEGMENT WAL_DESTINATION
Thu Sep 29 17:46:28 UTC 2022
Usage: /scripts/wal-e-wal-fetch.sh wal-fetch [--prefetch PREFETCH] WAL_SEGMENT WAL_DESTINATION
Thu Sep 29 17:46:28 UTC 2022
Usage: /scripts/wal-e-wal-fetch.sh wal-fetch [--prefetch PREFETCH] WAL_SEGMENT WAL_DESTINATION
------------------------------

Without this patch, setting the CLONE_WALG_DOWNLOAD_CONCURRENCY env var in the PG manifest is a workaround.
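
For reference, a sketch of that workaround; this assumes a Zalando postgres-operator cluster manifest whose spec supports custom env entries, and both the cluster name and the value shown are illustrative:

```yaml
# Hypothetical manifest excerpt; field support and the value are assumptions.
apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  name: xxxxxxxx-cluster
spec:
  env:
    # Gives POOL_SIZE a value during cloning, so "-p" is never passed empty.
    - name: CLONE_WALG_DOWNLOAD_CONCURRENCY
      value: "10"   # illustrative value
```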