DB Console: Add Replication Lag chart #120652

Closed
dt opened this issue Mar 18, 2024 · 1 comment · Fixed by #123208
Labels
C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
P-2: Issues/test failures with a fix SLA of 3 months

Comments

dt commented Mar 18, 2024

We should add observability for the replication lag in PCR (current time minus replicated time, where the replicated time is the timestamp as of which the source and destination clusters are exactly the same). Since replication lag is essentially the RPO for customers, we should surface this metric, ideally as a chart on the cluster replication dashboard.

This will require rendering the computed difference between the wall time and the reported replicated time, since we can only reliably export the replicated time, not the difference. We do not export the computed difference because doing so can fail in the worst possible way if the code updating the metric fails, gets stuck, or crashes: once replication stops, the replicated time no longer advances and the observed lag should increase, but if the metric keeps exporting the last lag value reported before it got stuck, the lag would misleadingly look fine.
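To make the failure mode concrete, here is a minimal sketch in Python. It is not the DB Console implementation; `ExportedMetrics`, `lag_from_replicated_time`, and the timestamps are hypothetical names and values chosen for illustration. It compares a lag value written by the updater with a lag value derived at render time from the exported replicated time, after the updater gets stuck.

```python
from dataclasses import dataclass


@dataclass
class ExportedMetrics:
    """Hypothetical snapshot of what a metrics scrape would see."""
    replicated_time: float  # seconds since epoch, last value written by the updater
    exported_lag: float     # lag computed by the updater itself, last value written


def lag_from_replicated_time(wall_time: float, replicated_time: float) -> float:
    """Compute the lag at render time: wall time minus replicated time."""
    return wall_time - replicated_time


# The updater last ran at wall time 1000, when the replicated time was 995.
snapshot = ExportedMetrics(replicated_time=995.0, exported_lag=5.0)

# The updater then gets stuck, so the snapshot never changes again.
for wall_time in (1000.0, 1060.0, 1600.0):
    computed = lag_from_replicated_time(wall_time, snapshot.replicated_time)
    print(
        f"wall={wall_time:.0f} "
        f"exported_lag={snapshot.exported_lag:.0f}s "
        f"computed_lag={computed:.0f}s"
    )
```

In this sketch the exported lag stays at its last written value while the lag derived from the replicated time keeps growing with the wall clock, which is the behavior the chart needs.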

Jira issue: CRDB-36680

Jira issue: CRDB-36805

@dt dt added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Mar 18, 2024
@dt dt added this to Backlog in Disaster Recovery Backlog via automation Mar 18, 2024

blathers-crl bot commented Mar 18, 2024

cc @cockroachdb/disaster-recovery

@dt dt added P-2 Issues/test failures with a fix SLA of 3 months and removed T-disaster-recovery labels Mar 18, 2024
kev-cao added a commit to kev-cao/cockroach that referenced this issue Apr 29, 2024
The DB Console currently has no observability for replication lag in
PCR. Since replication lag is essentially the RPO for customers, this
metric should be available to them in the dashboard. This commit adds
the metric as the difference between the wall time and the reported
replicated time.

Fixes cockroachdb#120652

Release note (ui change): Added observability for PCR replication lag
to the metrics dashboard
craig bot pushed a commit that referenced this issue Apr 30, 2024
123208: ui: add line graph for replication lag metric r=xinhaoz a=kev-cao

The DB Console currently has no observability for replication lag in PCR. Since replication lag is essentially the RPO for customers, this metric should be available to them in the dashboard. This commit adds the metric as the difference between the wall time and the reported replicated time.

Fixes #120652

Release note (ui change): Added observability for PCR replication lag to the metrics dashboard



Co-authored-by: Kevin Cao <kevin.cao@cockroachlabs.com>
@craig craig bot closed this as completed in a9dfc64 Apr 30, 2024
Disaster Recovery Backlog automation moved this from Backlog to Done Apr 30, 2024
kev-cao added a commit to kev-cao/cockroach that referenced this issue Apr 30, 2024
kev-cao added a commit to kev-cao/cockroach that referenced this issue May 2, 2024
The replication lag metric reported absurdly high lag for multinode
clusters because it averaged the reported timestamps, and since some
nodes may report 0, the averaged replicated time came out extremely
low. Patched by taking the highest replicated time across all nodes.
Also stop reporting replication lag when ingestion has stopped (e.g.
cutover or job cancel/fail).

Informs cockroachdb#120652

Release note (ui change): fix replication lag metric reporting for multinode
clusters and cutover
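
The multinode problem described in the commit message above can be illustrated with a small Python sketch. It is not the code from the commit; the function names and sample timestamps are hypothetical, and treating an all-zero report as "no lag to show" is an assumption about what "stop reporting" means for the chart.

```python
from typing import Optional, Sequence


def average_replicated_time(per_node: Sequence[float]) -> float:
    """Old behavior in this sketch: average the per-node values (needs at least one node)."""
    return sum(per_node) / len(per_node)


def max_replicated_time(per_node: Sequence[float]) -> Optional[float]:
    """Return the highest replicated time, or None when no node reports a nonzero value."""
    latest = max(per_node, default=0.0)
    return latest if latest > 0 else None


def replication_lag(wall_time: float, per_node: Sequence[float]) -> Optional[float]:
    """Lag is wall time minus the highest replicated time; None means nothing to plot."""
    replicated = max_replicated_time(per_node)
    if replicated is None:
        # Assumption: ingestion has stopped (e.g. cutover), so no lag is shown.
        return None
    return wall_time - replicated


wall_time = 1_000_000.0
# One node reports the replicated time; the other two report 0.
per_node = [999_990.0, 0.0, 0.0]

print("lag using average:", wall_time - average_replicated_time(per_node))
print("lag using maximum:", replication_lag(wall_time, per_node))
print("lag after ingestion stops:", replication_lag(wall_time, [0.0, 0.0, 0.0]))
```

With these sample values the averaged replicated time is dragged toward zero by the nodes that report 0, so the derived lag is enormous, while the maximum reflects the one node that actually reports progress.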
kev-cao added a commit to kev-cao/cockroach that referenced this issue May 3, 2024
The replication lag metric reported absurdly high lag for multinode
clusters because it averaged the reported timestamps, and since some
nodes may report 0, the averaged replicated time came out extremely
low. To resolve this, the metric takes the maximum time reported by
all of the nodes. Additionally, on cutover or job failure/cancellation,
the replicated time is no longer reported, to avoid falsely reporting
high replication lag.

Informs cockroachdb#120652

Release note (ui change): fix replication lag metric reporting for multinode
clusters and cutover
craig bot pushed a commit that referenced this issue May 3, 2024
123510: ui: fix replication lag metric for multinode clusters and cutover r=msbutler a=kev-cao

The replication lag metric reported absurdly high lag for multinode clusters because it averaged the reported timestamps, and since some nodes may report 0, the averaged replicated time came out extremely low. To resolve this, the metric takes the maximum time reported by all of the nodes. Additionally, on cutover or job failure/cancellation, the replicated time is no longer reported, to avoid falsely reporting high replication lag.

Informs #120652

Release note (ui change): fix replication lag metric reporting for multinode clusters and cutover

Co-authored-by: Kevin Cao <kevin.cao@cockroachlabs.com>
blathers-crl bot pushed a commit that referenced this issue May 3, 2024
The replication lag metric reported absurdly high lag for multinode
clusters because it averaged the reported timestamps, and since some
nodes may report 0, the averaged replicated time came out extremely
low. To resolve this, the metric takes the maximum time reported by
all of the nodes. Additionally, on cutover or job failure/cancellation,
the replicated time is no longer reported, to avoid falsely reporting
high replication lag.

Informs #120652

Release note (ui change): fix replication lag metric reporting for multinode
clusters and cutover