DB Console: Add Replication Lag chart #120652
Labels
C-enhancement
Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
P-2
Issues/test failures with a fix SLA of 3 months
dt added the C-enhancement label on Mar 18, 2024
cc @cockroachdb/disaster-recovery
dt added the P-2 label and removed the T-disaster-recovery label on Mar 18, 2024
kev-cao added a commit to kev-cao/cockroach that referenced this issue on Apr 29, 2024
There currently does not exist observability for replication lag in PCR in the DB Console. As replication lag is essentially RPO for customers, this metric should be made available to them in the dashboard. This commit adds the metric as the difference between the wall time and the reported replicated time. Fixes cockroachdb#120652 Release note (ui change): Added observability for PCR replication lag to the metrics dashboard
craig bot pushed a commit that referenced this issue on Apr 30, 2024
123208: ui: add line graph for replication lag metric r=xinhaoz a=kev-cao There currently does not exist observability for replication lag in PCR in the DB Console. As replication lag is essentially RPO for customers, this metric should be made available to them in the dashboard. This commit adds the metric as the difference between the wall time and the reported replicated time. Fixes #120652 Release note (ui change): Added observability for PCR replication lag to the metrics dashboard Co-authored-by: Kevin Cao <kevin.cao@cockroachlabs.com>
kev-cao added a commit to kev-cao/cockroach that referenced this issue on Apr 30, 2024 (same commit message as the Apr 29, 2024 commit)
kev-cao added a commit to kev-cao/cockroach that referenced this issue on May 2, 2024
Replication lag metric would report absurdly high lag for multinode clusters as it would take the average of the reported timestamps, and as some nodes may report 0, this would cause extremely low replicated times. Patched by taking the highest replicated time of all the nodes. Also stop reporting replication lag when ingesting has stopped (e.g. cutover or job cancel/fail). Informs cockroachdb#120652 Release note (ui change): fix replication lag metric reporting for multinode clusters and cutover
kev-cao added a commit to kev-cao/cockroach that referenced this issue on May 3, 2024
Replication lag metric would report absurdly high lag for multinode clusters as it would take the average of the reported timestamps, and as some nodes may report 0, this would cause extremely low replicated times. To resolve this, the metric should pick the maximum time reported by all of the nodes. Additionally, on cutover or job fail/cancellation, replicated time has stopped being reported to avoid falsely reporting high replication lag. Informs cockroachdb#120652 Release note (ui change): fix replication lag metric reporting for multinode clusters and cutover
craig bot pushed a commit that referenced this issue on May 3, 2024
123510: ui: fix replication lag metric for multinode clusters and cutover r=msbutler a=kev-cao Replication lag metric would report absurdly high lag for multinode clusters as it would take the average of the reported timestamps, and as some nodes may report 0, this would cause extremely low replicated times. To resolve this, the metric should pick the maximum time reported by all of the nodes. Additionally, on cutover or job fail/cancellation, replicated time has stopped being reported to avoid falsely reporting high replication lag. Informs #120652 Release note (ui change): fix replication lag metric reporting for multinode clusters and cutover Co-authored-by: Kevin Cao <kevin.cao@cockroachlabs.com>
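The aggregation problem described in this commit message can be illustrated with a small sketch. This is not the DB Console code; the function names are hypothetical, and it assumes that a node with nothing to report shows up as a replicated time of 0 (in Unix seconds). It compares averaging the per-node timestamps with taking their maximum, and returns no lag at all when no node reports a replicated time.

```python
from typing import Optional, Sequence


def cluster_replicated_time(per_node_s: Sequence[float]) -> Optional[float]:
    """Pick the highest replicated time reported by any node.

    Assumption: a value of 0 means the node is not reporting a
    replicated time (for example, it is not running ingestion).
    """
    reported = [t for t in per_node_s if t > 0]
    return max(reported) if reported else None


def replication_lag(now_s: float, per_node_s: Sequence[float]) -> Optional[float]:
    """Lag in seconds, or None when nothing is being reported
    (for example after cutover or a cancelled/failed job)."""
    replicated = cluster_replicated_time(per_node_s)
    if replicated is None:
        return None
    return now_s - replicated


def naive_average_lag(now_s: float, per_node_s: Sequence[float]) -> float:
    """The problematic variant: zeros drag the average timestamp down."""
    return now_s - sum(per_node_s) / len(per_node_s)


# Hypothetical three-node cluster where only one node reports a timestamp.
now = 1_000_005.0
nodes = [0.0, 0.0, 1_000_000.0]

print("average-based lag:", naive_average_lag(now, nodes))
print("max-based lag:    ", replication_lag(now, nodes))
print("after cutover:    ", replication_lag(now, [0.0, 0.0, 0.0]))
```

With these made-up inputs the max-based lag is 5 seconds, while the average-based lag is several days' worth of seconds because two of the three samples are 0.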
blathers-crl bot pushed a commit that referenced this issue on May 3, 2024
Replication lag metric would report absurdly high lag for multinode clusters as it would take the average of the reported timestamps, and as some nodes may report 0, this would cause extremely low replicated times. To resolve this, the metric should pick the maximum time reported by all of the nodes. Additionally, on cutover or job fail/cancellation, replicated time has stopped being reported to avoid falsely reporting high replication lag. Informs #120652 Release note (ui change): fix replication lag metric reporting for multinode clusters and cutover
We should add observability for replication lag in PCR, defined as the current time minus the replicated time (the time as of which the src and dst clusters are exactly the same). Since replication lag is essentially the RPO for customers, we should surface this metric, ideally as a chart on the cluster replication dashboard.
This will require rendering the computed difference between the wall time and the reported replicated time, since we can only reliably export the replicated time, not the difference. We do not export the computed difference because doing so would fail in the worst possible way if the code updating the metric fails, gets stuck, or crashes. If replication stopped, the replicated time would stop advancing and the observed lag should increase; but if the metric kept exporting the last lag value reported before the updater got stuck, it would misleadingly look fine.
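A minimal sketch of the reasoning above, using made-up numbers and hypothetical names rather than anything from the CockroachDB codebase: it compares a lag value that was exported by an updater that then got stuck with a lag value derived at render time from the exported replicated time. Clamping the result at zero is an added assumption to absorb small clock skew.

```python
def lag_at_render(now_s: float, replicated_time_s: float) -> float:
    """Lag derived by the reader from the exported replicated time.

    Clamping at zero is an assumption, not something the issue specifies.
    """
    return max(0.0, now_s - replicated_time_s)


# Hypothetical timeline: the updater last ran at t=100 and then got stuck.
LAST_UPDATE_S = 100.0
exported_replicated_time_s = 95.0                             # timestamp gauge
exported_lag_s = LAST_UPDATE_S - exported_replicated_time_s   # frozen lag gauge

for now_s in (100.0, 160.0, 400.0):
    derived = lag_at_render(now_s, exported_replicated_time_s)
    print(
        f"t={now_s:>5.0f}s  exported lag={exported_lag_s:>5.1f}s  "
        f"derived lag={derived:>5.1f}s"
    )
```

The exported lag stays at its last value no matter how much time passes, whereas the derived lag keeps growing with the wall clock, which is the failure mode the chart is meant to expose.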
Jira issue: CRDB-36680
Jira issue: CRDB-36805