
Postgres-cdc: the size of WAL for a replication slot accumulated for a long time #16697

Open
StrikeW opened this issue May 10, 2024 · 9 comments · May be fixed by #16888
Labels
type/bug Something isn't working
Milestone
release-1.10
Comments

@StrikeW
Contributor

StrikeW commented May 10, 2024

Describe the bug

A customer reported that the replication slot on the upstream PG gradually grows in size even though there has been no activity on the source tables for a long time.

// pg_replication_slots
-[ RECORD 1 ]-------+-----------------------
slot_name           | risingwave_slot
plugin              | pgoutput
slot_type           | logical
datoid              | 16399
database            | test_db
temporary           | f
active              | t
active_pid          | 25233
xmin                | 
catalog_xmin        | 789484419
restart_lsn         | 62A/ABCB4068
confirmed_flush_lsn | 62A/F2E66B38
wal_status          | extended
safe_wal_size       | 
two_phase           | f

-[ RECORD 1 ]------+-------------
pg_current_wal_lsn | 62F/2C000248

SELECT slot_name, pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS raw, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(),restart_lsn)) AS replicationSlotLag, active FROM pg_replication_slots;
-[ RECORD 1 ]------+---------------
slot_name          | risingwave_slot
raw                | 20203225568
replicationslotlag | 19 GB
active             | t

// pg_stat_replication
-[ RECORD 1 ]----+----------------------------------
pid              | 25233
usesysid         | 16446
usename          | myuser
application_name | Debezium Streaming
client_addr      | 0.0.0.0
client_hostname  | 
client_port      | 42455
backend_start    | 2024-05-07 14:42:41.812152+00
backend_xmin     | 
state            | streaming
sent_lsn         | 62F/2C000248
write_lsn        | 62F/2C000248
flush_lsn        | 62A/F2E66B38
replay_lsn       | 62A/F2E66B38
write_lag        | 00:00:05.10841
flush_lag        | 00:05:09.198305
replay_lag       | 00:05:09.198305
sync_priority    | 0
sync_state       | async
reply_time       | 1999-12-21 01:31:29.294857+00

We can see that the difference between pg_current_wal_lsn and restart_lsn is about 19 GB.
The strange thing is that the offset emitted in the Debezium heartbeat messages doesn't increase at all. We rely on the heartbeat message to acknowledge the source offset to the upstream PG so that the WAL can be reclaimed (#16058).
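
As a side check on the upstream side, confirmed_flush_lsn should keep advancing if the acknowledgment takes effect. A monitoring sketch (using the slot name from the output above):

-- If the heartbeat acknowledgment lands, confirmed_flush_lsn advances and the
-- retained WAL stays bounded.
SELECT slot_name,
       restart_lsn,
       confirmed_flush_lsn,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
WHERE slot_name = 'risingwave_slot';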

2024-05-08 13:43:19.051 {"timestamp":"2024-05-08T05:43:19.051443187Z","level":"INFO","fields":{"message":"identity cache miss occurred; added new identity (took Ok(4.478µs))"},"target":"aws_smithy_runtime::client::identity::cache::lazy","spans":[{"name":"lazy_load_identity"},{"name":"lazy_load_identity"}]}
2024-05-08 13:43:19.113 {"timestamp":"2024-05-08T05:43:19.113843166Z","level":"INFO","fields":{"message":"identity cache miss occurred; added new identity (took Ok(62.534918ms))"},"target":"aws_smithy_runtime::client::identity::cache::lazy","spans":[{"name":"lazy_load_identity"}]}
2024-05-08 13:43:19.168 {"timestamp":"2024-05-08T05:43:19.168124103Z","level":"INFO","fields":{"message":"Engine#11: committed offset DebeziumOffset{isHeartbeat=true, sourcePartition={server=RW_CDC_11}, sourceOffset={transaction_id=null, lsn_proc=6781533580088, messageType=COMMIT, lsn_commit=6781533580088, lsn=6781533580088, txId=790011769, ts_usec=1715100747672533}}","thread":"Thread-55573","class":"com.risingwave.connector.source.core.JniDbzSourceHandler"},"target":"risingwave_connector_node"}
2024-05-08 13:48:20.102 {"timestamp":"2024-05-08T05:48:20.102139457Z","level":"INFO","fields":{"message":"Engine#11: committed offset DebeziumOffset{isHeartbeat=true, sourcePartition={server=RW_CDC_11}, sourceOffset={transaction_id=null, lsn_proc=6781533580088, messageType=COMMIT, lsn_commit=6781533580088, lsn=6781533580088, txId=790011769, ts_usec=1715100747672533}}","thread":"Thread-55875","class":"com.risingwave.connector.source.core.JniDbzSourceHandler"},"target":"risingwave_connector_node"}

P.S. 6781533580088 => 62A/F2E66B38
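
(For reference, a pg_lsn is just a 64-bit position printed as two hex halves; a small conversion sketch in Postgres confirms the mapping:)

-- 6781533580088 split into high/low 32-bit halves, printed in hex:
SELECT to_hex(6781533580088 >> 32) || '/' || to_hex(6781533580088 & 4294967295) AS lsn;  -- 62a/f2e66b38
-- and the other direction:
SELECT '62A/F2E66B38'::pg_lsn - '0/0'::pg_lsn AS lsn_as_decimal;                         -- 6781533580088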

The Debezium documentation has some guidelines related to WAL disk consumption. The following is the most likely one:

The PostgreSQL instance contains multiple databases and one of them is a high-traffic database. Debezium captures changes in another database that is low-traffic in comparison to the other database.

But the customer has confirmed that there is only one database in their Postgres instance and all tables are located in that database, so there is no separate high-traffic database.

Fortunately, when there is activity on the source tables, the WAL can be reclaimed since we acknowledge a newer offset to PG.
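
For example (t_source below is a placeholder for any captured table, just to illustrate): even a no-op committed write produces a WAL record and a change event, after which the connector commits a newer offset and the retained WAL shrinks.

-- Hypothetical example: touch one row of a captured table to move the offset forward.
UPDATE t_source SET id = id WHERE id = 1;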

Error message/log

No response

To Reproduce

No response

Expected behavior

No response

How did you deploy RisingWave?

No response

The version of RisingWave

No response

Additional context

No response

@StrikeW StrikeW added the type/bug Something isn't working label May 10, 2024
@github-actions github-actions bot added this to the release-1.10 milestone May 10, 2024
@yuhao-su
Contributor

yuhao-su commented May 10, 2024

The PostgreSQL instance contains multiple databases and one of them is a high-traffic database. Debezium captures changes in another database that is low-traffic in comparison to the other database.

Isn't it already solved by the heartbeat message?

@fuyufjh
Contributor

fuyufjh commented May 13, 2024

The PostgreSQL instance contains multiple databases and one of them is a high-traffic database. Debezium captures changes in another database that is low-traffic in comparison to the other database.

This sounds like the most likely cause. I wonder whether the customer is mistaken about something.

@StrikeW
Contributor Author

StrikeW commented May 13, 2024

I am trying to reproduce the issue; it seems AWS RDS generates WAL behind the scenes:

For users on AWS RDS with PostgreSQL, a situation similar to the high traffic/low traffic scenario can occur in an idle environment. AWS RDS causes writes to its own system tables to be invisible to clients on a frequent basis (5 minutes). Again, regularly emitting events solves the problem.
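
A quick way to confirm this on an otherwise idle RDS instance is to sample the WAL position a few minutes apart (sketch):

-- Run twice with a few minutes in between; on RDS the LSN keeps moving even with
-- no user activity, because RDS periodically writes to its own system tables.
SELECT now(), pg_current_wal_lsn();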

@StrikeW
Contributor Author

StrikeW commented May 13, 2024

I have reproduced the issue and confirmed that it is caused by AWS RDS generating writes into its hidden database. More details can be found in this blog.

Solution: users can configure the heartbeat.action.query via the WITH options:

CREATE SOURCE rds_pg WITH (
    connector = 'postgres-cdc',
    ...
    debezium.heartbeat.action.query = 'SELECT pg_logical_emit_message(false, ''heartbeat'', now()::varchar)'
);
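
Note that Debezium only executes heartbeat.action.query on its heartbeat schedule, so this assumes a positive heartbeat interval; judging from the ~5-minute heartbeat commits in the log above, the connector already sets one (presumably adjustable via debezium.heartbeat.interval.ms). pg_logical_emit_message writes a small logical-decoding message into the WAL of the connected database without touching any table, which gives the heartbeat a newer offset to acknowledge; after enabling it, the slot-lag query from the report should show the retained WAL stop growing.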

@agrobbin

agrobbin commented May 13, 2024

@StrikeW according to this post, it looks like recent versions (v42.7.0+) of the PgJDBC driver may have resolved this issue (pgjdbc/pgjdbc#2941). Is it possible the version RW depends on needs to be updated?

@StrikeW
Contributor Author

StrikeW commented May 13, 2024

@StrikeW according to this post, it looks like recent versions (v42.7.0+) of the PgJDBC driver may have resolved this issue (pgjdbc/pgjdbc#2941). Is it possible the version RW depends on needs to be updated?

Upgrading the JDBC driver may introduce other issues, as pointed out in pgjdbc/pgjdbc#3138. I suggest we leverage the heartbeat.action.query mechanism provided by Debezium to periodically emit traffic to the captured database.

@agrobbin

Ohhh good to know! @StrikeW is that something we should do in our own source configuration, or do you plan to do it globally across all PG CDC source connectors?

@StrikeW
Contributor Author

StrikeW commented May 13, 2024

Ohhh good to know! @StrikeW is that something we should do in our own source configuration, or do you plan to do it globally across all PG CDC source connectors?

I think we can address the problem in the PG CDC connector internally (via the heartbeat.action.query mechanism), so that users don't need to worry about their disk space. And it does no harm to periodically emit a tiny message to the WAL. cc @hzxa21

@agrobbin

Fantastic! Will patiently await a fix upstream. ❤️
