You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Add support for pg_stat_replication.(write_lag|flush_lag|replay_lag)
Use case. Why is this important?
For simple primary:replica setups it would be convenient to be able to monitor standby lag in seconds from the primary. pg_stat_replication already includes this information on postgres>=10, but the exporter does not parse it. AFAIK, workarounds would include
monitoring byte offset from pg_stat_replication.write_lsn|flush_lsn|replay_lsn), but this does not quite capture issues where the replica has not replayed a business critical change however small in bytes.
monitoring delay on the replica, but this would require spinning up a separate monitor only to monitor a single value on the replica.
Notes
I'm assuming the exporter does not support parsing the interval data type, which is why these metrics are marked as DISCARDhere. I wonder if the following, explicit approach would be appropriate in this case:
diff --git a/cmd/postgres_exporter/queries.go b/cmd/postgres_exporter/queries.go
index fa0b5c2..e28d7b4 100644
--- a/cmd/postgres_exporter/queries.go+++ b/cmd/postgres_exporter/queries.go@@ -53,7 +53,10 @@ var queryOverrides = map[string][]OverrideQuery{
SELECT *,
(case pg_is_in_recovery() when 't' then null else pg_current_wal_lsn() end) AS pg_current_wal_lsn,
(case pg_is_in_recovery() when 't' then null else pg_wal_lsn_diff(pg_current_wal_lsn(), pg_lsn('0/0'))::float end) AS pg_current_wal_lsn_bytes,
- (case pg_is_in_recovery() when 't' then null else pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)::float end) AS pg_wal_lsn_diff+ (case pg_is_in_recovery() when 't' then null else pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)::float end) AS pg_wal_lsn_diff,+ (case pg_is_in_recovery() when 't' then null else extract(epoch from write_lag) end) as write_lag_seconds,+ (case pg_is_in_recovery() when 't' then null else extract(epoch from flush_lag) end) as flush_lag_seconds,+ (case pg_is_in_recovery() when 't' then null else extract(epoch from replay_lag) end) as replay_lag_seconds
FROM pg_stat_replication
`,
},
That way the deviation from the official fields is not in conflict with the source data.
The text was updated successfully, but these errors were encountered:
Just a note for anyone looking into this. epoch should be extracted from the interval instead of seconds, since extracting seconds will only extract the seconds "component" of the interval:
(case pg_is_in_recovery() when 't' then null else extract(epoch from write_lag) end) as write_lag_seconds,
(case pg_is_in_recovery() when 't' then null else extract(epoch from flush_lag) end) as flush_lag_seconds,
(case pg_is_in_recovery() when 't' then null else extract(epoch from replay_lag) end) as replay_lag_seconds
Just a note for anyone looking into this. epoch should be extracted from the interval instead of seconds, since extracting seconds will only extract the seconds "component" of the interval:
Add support for
pg_stat_replication.(write_lag|flush_lag|replay_lag)
Use case. Why is this important?
For simple primary:replica setups it would be convenient to be able to monitor standby lag in seconds from the primary.
pg_stat_replication
already includes this information on postgres>=10, but the exporter does not parse it. AFAIK, workarounds would includepg_stat_replication.write_lsn|flush_lsn|replay_lsn)
, but this does not quite capture issues where the replica has not replayed a business critical change however small in bytes.Notes
I'm assuming the exporter does not support parsing the
interval
data type, which is why these metrics are marked asDISCARD
here. I wonder if the following, explicit approach would be appropriate in this case:That way the deviation from the official fields is not in conflict with the source data.
The text was updated successfully, but these errors were encountered: