DB Console: Add Replication Lag chart #120652

dt · 2024-03-18T16:04:11Z

We should add observability for replication lag in PCR, defined as the current time minus the replicated time (the time as of which the src and dst clusters are exactly the same). Since replication lag is essentially the RPO for customers, we should surface this metric, ideally as a chart on the cluster replication dashboard.

This will require rendering the computed difference between the wall time and the reported replicated time, since we can only reliably export the replicated time, not the difference. We do not export the computed difference because doing so can fail in the worst possible way if the code updating the metric fails, gets stuck, or crashes. If replication stops, the replicated time stops advancing and the observed lag should increase; but if the metric keeps exporting the last lag value it reported before getting stuck, the chart would misleadingly look fine.
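A minimal sketch of the reasoning above, not the DB Console implementation: the lag is derived at read time from the exported replicated time, so a frozen replicated time shows up as growing lag. The function name `replicationLagSeconds` and the use of Unix seconds are assumptions for illustration; this issue does not state the metric's actual name or unit.

```typescript
// Hypothetical helper: derive lag when rendering, from the exported
// replicated time, rather than exporting a precomputed lag value.
// Both arguments are assumed to be Unix timestamps in seconds.
function replicationLagSeconds(replicatedTimeSec: number, nowSec: number): number {
  // Clamp at zero so small clock skew never yields a negative lag.
  return Math.max(0, nowSec - replicatedTimeSec);
}

// Simulate a stuck stream: the replicated time stays frozen at t = 1000
// while the wall clock keeps advancing.
const frozenReplicatedTimeSec = 1000;
const lagSoonAfter = replicationLagSeconds(frozenReplicatedTimeSec, 1010);
const lagLater = replicationLagSeconds(frozenReplicatedTimeSec, 1100);

console.log(lagSoonAfter, lagLater);
```

Because the lag is recomputed against the current wall time on every read, the second value is larger than the first even though nothing updated the metric. An exported lag gauge would have stayed at its last value instead.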

Jira issue: CRDB-36680

Jira issue: CRDB-36805

blathers-crl · 2024-03-18T16:04:56Z

cc @cockroachdb/disaster-recovery

There currently does not exist observability for replication lag in PCR in the DB Console. As replication lag is essentially RPO for customers, this metric should be made available to them in the dashboard. This commit adds the metric as the difference between the wall time and the reported replicated time.
Fixes cockroachdb#120652
Release note (ui change): Added observability for PCR replication lag to the metrics dashboard

123208: ui: add line graph for replication lag metric r=xinhaoz a=kev-cao
There currently does not exist observability for replication lag in PCR in the DB Console. As replication lag is essentially RPO for customers, this metric should be made available to them in the dashboard. This commit adds the metric as the difference between the wall time and the reported replicated time.
Fixes #120652
Release note (ui change): Added observability for PCR replication lag to the metrics dashboard
Co-authored-by: Kevin Cao <kevin.cao@cockroachlabs.com>

Replication lag metric would report absurdly high lag for multinode clusters as it would take the average of the reported timestamps, and as some nodes may report 0, this would cause extremely low replicated times. Patched by taking the highest replicated time of all the nodes. Also stop reporting replication lag when ingesting has stopped (e.g. cutover or job cancel/fail).
Informs cockroachdb#120652
Release note (ui change): fix replication lag metric reporting for multinode clusters and cutover

Replication lag metric would report absurdly high lag for multinode clusters as it would take the average of the reported timestamps, and as some nodes may report 0, this would cause extremely low replicated times. To resolve this, the metric should pick the maximum time reported by all of the nodes. Additionally, on cutover or job fail/cancellation, the replicated time is no longer reported, to avoid falsely reporting high replication lag.
Informs cockroachdb#120652
Release note (ui change): fix replication lag metric reporting for multinode clusters and cutover

123510: ui: fix replication lag metric for multinode clusters and cutover r=msbutler a=kev-cao
Replication lag metric would report absurdly high lag for multinode clusters as it would take the average of the reported timestamps, and as some nodes may report 0, this would cause extremely low replicated times. To resolve this, the metric should pick the maximum time reported by all of the nodes. Additionally, on cutover or job fail/cancellation, the replicated time is no longer reported, to avoid falsely reporting high replication lag.
Informs #120652
Release note (ui change): fix replication lag metric reporting for multinode clusters and cutover
Co-authored-by: Kevin Cao <kevin.cao@cockroachlabs.com>
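The multinode fix can be illustrated with a small sketch. This is an assumption-laden illustration rather than the code from the PR: the function names are hypothetical, timestamps are assumed to be Unix seconds, and "every node reports 0" is used as a stand-in for "ingestion has stopped", which the commit message does not spell out.

```typescript
// Hypothetical aggregation: pick the highest replicated time reported by
// any node. A value of 0 is assumed to mean "this node has nothing to report".
function clusterReplicatedTimeSec(perNodeSec: number[]): number | null {
  const reported = perNodeSec.filter((t) => t > 0);
  if (reported.length === 0) {
    // No node is reporting (e.g. after cutover or job cancel/fail):
    // return null so the chart plots no lag point at all.
    return null;
  }
  return Math.max(...reported);
}

function clusterLagSeconds(perNodeSec: number[], nowSec: number): number | null {
  const replicated = clusterReplicatedTimeSec(perNodeSec);
  return replicated === null ? null : Math.max(0, nowSec - replicated);
}

// Three-node cluster where only one node reports a replicated time.
const perNode = [0, 1000, 0];
const nowSec = 1010;

// Averaging drags the replicated time toward 0 and inflates the lag.
const averaged = perNode.reduce((sum, t) => sum + t, 0) / perNode.length;
const lagFromAverage = nowSec - averaged;

// Taking the maximum recovers the real lag.
const lagFromMax = clusterLagSeconds(perNode, nowSec);

// Once ingestion has stopped, no lag is reported.
const lagAfterCutover = clusterLagSeconds([0, 0, 0], nowSec);

console.log(lagFromAverage, lagFromMax, lagAfterCutover);
```

In this example the averaged replicated time is roughly a third of the real one, so the derived lag is hundreds of seconds instead of 10. Returning `null` when nothing is reported avoids the opposite failure, where a stopped job would appear as ever-growing lag.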

dt added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Mar 18, 2024

dt added this to Backlog in Disaster Recovery Backlog via automation Mar 18, 2024

blathers-crl bot added the T-disaster-recovery label Mar 18, 2024

dt added P-2 Issues/test failures with a fix SLA of 3 months and removed T-disaster-recovery labels Mar 18, 2024

kev-cao mentioned this issue Apr 29, 2024

ui: add line graph for replication lag metric #123208

Merged

craig bot closed this as completed in a9dfc64 Apr 30, 2024

Disaster Recovery Backlog automation moved this from Backlog to Done Apr 30, 2024

kev-cao mentioned this issue Apr 30, 2024

release-24.1: ui: add line graph for replication lag metric #123285

Merged

kev-cao mentioned this issue May 2, 2024

ui: fix replication lag metric for multinode clusters and cutover #123510

Merged

blathers-crl bot mentioned this issue May 3, 2024

release-24.1: ui: fix replication lag metric for multinode clusters and cutover #123585

Merged

blathers-crl bot mentioned this issue May 3, 2024

release-24.1.0-rc: ui: fix replication lag metric for multinode clusters and cutover #123586

Closed
