Fast PG catchup #1629

x4m · 2024-01-21T18:09:44Z

In this issue, I want to suggest some enhancements to the existing PG catchup feature.

Problem Statement

If you have a streaming replication cluster, standby servers might lag behind the primary. If this lag is on the order of days, a reasonable solution is to recreate the standby from a fresh backup.

However, retrieving the whole backup can take a long time, especially in heavily loaded clusters. For this situation, WAL-G provides PG catchup, a special type of backup that can be applied on top of an existing, stopped standby server. WAL-G handles two distinct worlds: the currently running Postgres instance and the storage. Today the catchup backup must first be pushed to storage and then fetched by another instance. Going through storage doubles the number of bytes that need to be transferred, which can make the catchup process roughly twice as slow. Speed is crucial for catchup. That is why we need a faster catchup that pushes the backup directly to the standby node instead of to storage.

Here are my proposed solutions:

Proposed Solution 1: Creating a Special Storage for Catchup

To avoid pushing the backup to storage, we can create a special storage for catchup called CatchupStorage (CS). CS has several settings, including the number of concurrent connections (concurrency) and the standby hostname and port. When wal-g catchup-fetch is run against CS, it knows the concurrency setting, so it can return a list of expected tars to CS. When configured, CS opens a port and accepts incoming connections from wal-g catchup-push. ObjectPut() is implemented as an outgoing connection that transfers the name and contents of the object. When catchup wants to download a file from CS, CS accepts all incoming connections and reads the object names from them. ObjectGet() then picks the connection that corresponds to the requested object name. Data transferred between the CS source and destination is encrypted using the existing tar encryption; tar names are hardcoded and do not need encryption.
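To make the connection-per-object idea concrete, here is a minimal sketch in Python (not WAL-G's Go code). Everything in it is an assumption for illustration: the wire framing (length-prefixed name, then length-prefixed body), the `CatchupReceiver` class, the function names and the tar names are hypothetical and do not reflect any existing WAL-G interface. A local socket pair stands in for the TCP connection between primary and standby.

```python
import socket
import struct


def _recv_exact(conn, n):
    """Read exactly n bytes from conn or raise if the peer closed early."""
    buf = bytearray()
    while len(buf) < n:
        chunk = conn.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("stream closed before %d bytes arrived" % n)
        buf.extend(chunk)
    return bytes(buf)


def object_put(conn, name, data):
    """Sender side (hypothetical): one connection carries one object.

    Framing assumed here: 2-byte name length, name, 8-byte body length, body.
    """
    encoded = name.encode("utf-8")
    conn.sendall(struct.pack(">H", len(encoded)) + encoded)
    conn.sendall(struct.pack(">Q", len(data)) + data)


class CatchupReceiver:
    """Receiver side (hypothetical): indexes accepted connections by object name."""

    def __init__(self, conns):
        self._by_name = {}
        for conn in conns:
            # Read only the name header now; the body stays in the socket.
            (name_len,) = struct.unpack(">H", _recv_exact(conn, 2))
            name = _recv_exact(conn, name_len).decode("utf-8")
            self._by_name[name] = conn

    def object_get(self, name):
        """Pick the connection that announced this name and read its body."""
        conn = self._by_name.pop(name)
        (size,) = struct.unpack(">Q", _recv_exact(conn, 8))
        return _recv_exact(conn, size)


def demo():
    """Send two small objects, then fetch them in the opposite order."""
    objects = {"part_001.tar": b"alpha", "part_002.tar": b"beta"}
    receiver_ends = []
    for name, data in objects.items():
        sender, receiver = socket.socketpair()
        object_put(sender, name, data)
        sender.close()
        receiver_ends.append(receiver)
    store = CatchupReceiver(receiver_ends)
    result = {name: store.object_get(name) for name in reversed(list(objects))}
    for conn in receiver_ends:
        conn.close()
    return result
```

The sketch reads every name header up front, which mirrors "accepts all incoming connections and reads the names of objects"; a real implementation would also need timeouts and back-pressure, which are left out here.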

The implementation of this solution is relatively simple, but its overall design is complex.

Proposed Solution 2: Implementing New Commands for Catchup Transfer

We can implement new commands that perform catchup without using storage at all: wal-g catchup-send and wal-g catchup-receive. They do the same work as the existing catchup commands, but transfer the data directly to the standby.

Open questions

What if catchup was not applied successfully? Can we retry this operation?
Can we have only one TCP connection between the primary and the standby? I think we need many connections to utilize the network efficiently. But one of the objectives might be to avoid starving the primary, so maybe one connection is reasonable too.
If we have many TCP connections, we must ensure that the data from all streams was actually read.
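One possible way to address the last question is to let each stream declare its length up front and finish with a checksum, so the receiver can tell a fully read stream from a truncated one. The sketch below is only an illustration under that assumption; the framing, the `IncompleteStream` exception and the function names are hypothetical, and `io.BytesIO` stands in for a TCP connection.

```python
import hashlib
import io
import struct


class IncompleteStream(Exception):
    """Raised when a stream is missing, truncated, or fails its checksum."""


def frame_stream(payload):
    """Sender: 8-byte length, payload, then a SHA-256 digest as a trailer."""
    return struct.pack(">Q", len(payload)) + payload + hashlib.sha256(payload).digest()


def read_stream(stream):
    """Receiver: return the payload only if the whole frame arrived intact."""
    header = stream.read(8)
    if len(header) != 8:
        raise IncompleteStream("missing length header")
    (size,) = struct.unpack(">Q", header)
    payload = stream.read(size)
    if len(payload) != size:
        raise IncompleteStream("payload truncated")
    digest = stream.read(32)
    if digest != hashlib.sha256(payload).digest():
        raise IncompleteStream("trailer missing or checksum mismatch")
    return payload


def receive_all(streams, expected_count):
    """Fail unless every expected stream is present and fully read."""
    if len(streams) != expected_count:
        raise IncompleteStream(
            "expected %d streams, got %d" % (expected_count, len(streams))
        )
    return [read_stream(s) for s in streams]
```

For example, `receive_all([io.BytesIO(frame_stream(b"abc"))], 1)` returns the payload, while a frame cut short by a few bytes raises `IncompleteStream` instead of being silently accepted.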

@usernamedt What do you think? Which path to take?

vbp1 · 2024-01-24T15:28:48Z

What if catchup was not applied successfully? Can we retry this operation?

Retrying a failed or canceled catchup from the point where it was interrupted is very important: it would let us avoid retransmitting data.
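A rough sketch of how such a retry could skip already transferred files, assuming file-level granularity and assuming the retry continues the same backup attempt, so that files received earlier are still usable. The `ReceiveJournal` class, the journal file and the example paths are hypothetical and not part of WAL-G.

```python
import os


class ReceiveJournal:
    """Hypothetical receiver-side journal: one completed file name per line."""

    def __init__(self, path):
        self._path = path

    def completed(self):
        """Return the names recorded so far, ignoring a torn final line."""
        if not os.path.exists(self._path):
            return set()
        with open(self._path, encoding="utf-8") as journal:
            return {line.rstrip("\n") for line in journal if line.endswith("\n")}

    def mark_completed(self, name):
        """Record a file only after it has been fully received and written."""
        with open(self._path, "a", encoding="utf-8") as journal:
            journal.write(name + "\n")
            journal.flush()
            os.fsync(journal.fileno())


def files_to_send(all_files, journal):
    """On retry, keep the original order but skip files already completed."""
    done = journal.completed()
    return [name for name in all_files if name not in done]
```

On the first attempt the journal is empty and every file is sent; after an interruption, the sender would ask the receiver for the journal contents and transmit only the remaining files.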

Catchup-send and catchup-receive commands

Per design docs in #1629, but with significant changes (2nd approach). This PR allows to use catchup without pushing to storage.

* Remove debug stuff
* Some errors handling
* Fix review issues
* Minor refactoring
* Remove unnecesary files
* Refactor sending file
* typo
* Compression implementation
* Calm goling
* Fix unit test
* Fix one more test
* Refactor
* Encryption
* Enable diff back
* Refactor a bit
* Formatting
* Minimal docs

Co-authored-by: Andrey M. Borodin <x4mmm@night.local>
Co-authored-by: Andrey M. Borodin <x4mmm@172.25.72.30-ekb.dhcp.yndx.net>

x4m pushed a commit to x4m/wal-g that referenced this issue Mar 4, 2024

Catchup-send and catchup-receive commands

a976790

Per design docs in wal-g#1629, but with significant changes (2nd approach). This PR allows to use catchup without pushing to storage.

x4m mentioned this issue Mar 4, 2024

Catchup-send and catchup-receive commands #1652

Merged
