Bootstrapping multi-terabyte cluster too slow with pg_dump #4221
Replies: 6 comments
-
Hi @rochecompaan, just one clarification first: I am not very familiar with WAL-G, so I apologize if I give you wrong information. Per my understanding, WAL-G is a physical backup and recovery technology, similar to Barman (which I started many years ago) or pgBackRest.

Supporting what you are asking (restoring from any physical backup) is not an easy task, and unfortunately it is not guaranteed to work under several circumstances, due to potential (and unfixable on our end) differences in assumptions about the way Postgres is used. For example, think about the name of the predefined user for streaming replication, or some fixed settings that the operator requires and that might not be compatible with the origin system (be it bare metal, a VM, or a different operator).

But you are not alone in this use case. Last week I was at Google Next and talked to several people who wanted to move their database from outside Kubernetes (including solutions like RDS) into Kubernetes on Postgres managed by CloudNativePG, without any cutover time. I personally think that the road of physical backup and recovery is long and winding. As time passes, I believe more and more that the approach to follow is the one I recently summed up in a blog article: https://www.enterprisedb.com/blog/current-state-major-postgresql-upgrades-cloudnativepg-kubernetes

Please let me know if that resonates with you. Cheers,
-
@gbartolini, thanks for your feedback! Yes, WAL-G is a physical backup tool, just like Barman, but let me clarify how pg_dump comes into play. I only used WAL-G to set up a temporary Postgres instance running inside the Kubernetes cluster: I created a Kubernetes Job that uses WAL-G to restore to a persistent volume, and once the job completed, I mounted the persistent volume inside the Postgres instance. At this point I had a running instance that one can use to bootstrap a new CNPG cluster via the `import` bootstrap.
As per the docs, the operator runs pg_dump and pg_restore under the hood during the import bootstrap. So, I'm referring to how the operator uses pg_dump to bootstrap the cluster, not my direct use. I had a look at the options used, and the invocation doesn't support running multiple processes: https://github.com/cloudnative-pg/cloudnative-pg/blob/main/pkg/management/postgres/logicalimport/database.go#L102

I'm heading over to read your article next; I just wanted to clarify my use of pg_dump first.
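For context, here is a minimal sketch of the kind of `Cluster` manifest I mean, following the import bootstrap documented by CNPG; the host, secret, database, and cluster names are placeholders for my setup:

```yaml
# Sketch only: a CNPG Cluster bootstrapped via the documented `import`
# mechanism, pointing at the temporary Postgres instance restored by WAL-G.
# All names below are placeholders.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: imported-cluster
spec:
  instances: 3
  storage:
    size: 8Ti
  bootstrap:
    initdb:
      import:
        type: microservice
        databases:
          - app
        source:
          externalCluster: temp-postgres
  externalClusters:
    - name: temp-postgres
      connectionParameters:
        host: temp-postgres.default.svc  # temporary instance restored by WAL-G
        user: postgres
        dbname: app
      password:
        name: temp-postgres-superuser  # Secret holding the password
        key: password
```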
-
@gbartolini great article! Using Postgres publications and subscriptions seems very promising as a way to migrate an existing database into CNPG! I am a little concerned that sequences are not migrated, but I'm sure that can be managed by treading carefully. That said, I don't think this method will help bootstrap a big database in a reasonable time. I imagine restoring a full base backup from a primary and replaying WAL files afterwards will still be significantly faster than any alternative, purely because restoring the base backup can be sped up by running concurrent processes.
-
If you are not interested in the migration part, but just in starting a replica from an existing cluster, have you looked into the new volume snapshot feature that we will introduce in version 1.21? You can create a replica cluster from an existing volume snapshot that you make available in another region, and bootstrap it in a shorter time by fetching WAL files from an object store. See commit c2ca044. If you are interested, you can already test this feature by installing the latest available snapshot.
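A rough sketch of what that could look like follows; snapshot, secret, and bucket names are placeholders, and details may still change before the 1.21 release:

```yaml
# Sketch only: bootstrap a replica cluster from an existing VolumeSnapshot,
# then keep it in sync by fetching WAL files from an object store.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: replica-from-snapshot
spec:
  instances: 3
  storage:
    size: 8Ti
  bootstrap:
    recovery:
      source: origin
      volumeSnapshots:
        storage:
          name: pgdata-snapshot            # placeholder snapshot name
          kind: VolumeSnapshot
          apiGroup: snapshot.storage.k8s.io
  replica:
    enabled: true
    source: origin
  externalClusters:
    - name: origin
      barmanObjectStore:
        destinationPath: s3://my-backups/origin  # placeholder bucket
        s3Credentials:
          accessKeyId:
            name: aws-creds
            key: ACCESS_KEY_ID
          secretAccessKey:
            name: aws-creds
            key: ACCESS_SECRET_KEY
```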
-
This is more of a discussion than an issue now; converting it into a discussion.
-
Good idea. Also, I now suggest this article as a reference for this use case: https://gabrielebartolini.it/articles/2024/03/cloudnativepg-recipe-5-how-to-migrate-your-postgresql-database-in-kubernetes-with-~0-downtime-from-anywhere/
-
I set up a temporary Postgres deployment to bootstrap a new cluster according to #2638. The total size of the database is 6.2 terabytes. After roughly one day, only 300 GB had been restored. For comparison, the WAL-G restore took about 6 hours, and that was only possible because I could increase the concurrency to 100. For CNPG, the spec doesn't currently allow increasing the number of pg_dump jobs, e.g., `pg_dump -j njobs`. I suspect that even increasing the number of pg_dump jobs won't restore a database of this size in a reasonable time (hours instead of days), but I would be happy to try if such an option becomes available.
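To make the ask concrete, here is a purely hypothetical sketch of such an option; the `jobs` field below does not exist in CNPG today and is only meant to illustrate the idea:

```yaml
# Hypothetical only: CNPG does not currently expose this knob.
bootstrap:
  initdb:
    import:
      type: microservice
      databases:
        - app
      source:
        externalCluster: temp-postgres
      jobs: 8  # hypothetical setting mapping to pg_dump -j / pg_restore -j
```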
I used an unorthodox workaround that might be useful to someone else trying to bootstrap a new cluster from a big database: the WAL-G restore Job and temporary instance described earlier in this thread (a rough sketch follows).
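This sketch only captures the shape of the Job; image, bucket, and PVC names are placeholders:

```yaml
# Sketch only: a Job that uses WAL-G to restore the latest base backup
# onto a PVC, which is then mounted by a temporary Postgres instance.
apiVersion: batch/v1
kind: Job
metadata:
  name: walg-restore
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: wal-g
          image: my-registry/wal-g:latest  # placeholder image with wal-g installed
          command: ["wal-g", "backup-fetch", "/restore/pgdata", "LATEST"]
          env:
            - name: WALG_S3_PREFIX
              value: s3://my-backups/primary  # placeholder bucket
            - name: WALG_DOWNLOAD_CONCURRENCY
              value: "100"  # the concurrency that made the ~6h restore possible
          volumeMounts:
            - name: restore
              mountPath: /restore
      volumes:
        - name: restore
          persistentVolumeClaim:
            claimName: temp-postgres-pgdata  # placeholder PVC, later mounted by the temp instance
```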
It would be great if the docs could address bootstrapping a new cluster from a physical backup in the future, but allowing one to increase the number of pg_dump jobs would already be a big help.