v0.8.1 - VolSync Source Alerts not clearing after multiple trigger runs #1167

Open
reefland opened this issue Mar 12, 2024 · 0 comments
Labels
bug Something isn't working

Describe the bug
While not frequent, I have had the VolSyncVolumeOutOfSync alert raised with role="source" and it does not clear on its own. I have to restart the volsync application pod to clear the alert.

Steps to reproduce
I have a Prometheus alert defined as:

        - alert: VolSyncVolumeOutOfSync
          annotations:
            summary: >-
              {{ $labels.obj_namespace }}/{{ $labels.obj_name }} volume
              is out of sync.
          expr: |
            volsync_volume_out_of_sync == 1
          for: 15m
          labels:
            severity: critical
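
To confirm the raw value independent of the alert rule, the gauge can be queried directly against the Prometheus HTTP API (the service URL here is an assumption; adjust for your cluster):

$ curl -sG 'http://prometheus-operated.monitoring:9090/api/v1/query' \
    --data-urlencode 'query=volsync_volume_out_of_sync{obj_namespace="unifi"}'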

I didn't notice exactly when the alert was raised. I suspect the initial run was delayed by a Restic repository secret issue, but the alert was definitely raised before the job's second run:

volsync_volume_out_of_sync{container="kube-rbac-proxy", endpoint="https", instance="10.42.0.186:8443", job="volsync-metrics", method="restic", namespace="volsync-system", obj_name="unifi", obj_namespace="unifi", pod="volsync-6b546cdf59-5knxk", role="source", service="volsync-metrics"} 1

Upon noticing the alert, I checked the ReplicationSource.
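The status below comes from something like this (the resource name and namespace are taken from the controller logs further down):

$ kubectl -n unifi describe replicationsource unifi-controller

The initial run looks fine: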

  Last Sync Duration:      24.08989161s
  Last Sync Time:          2024-03-12T14:14:05Z
  Latest Mover Status:
    Logs:  no parent snapshot found, will read all files
Added to the repository: 390.739 MiB (282.820 MiB stored)
processed 697 files, 944.763 MiB in 0:09
snapshot c3b827c6 saved
Restic completed in 13s
    Result:        Successful
  Next Sync Time:  2024-03-12T16:00:00Z

And the next sync time has not been reached yet:

$ date -u +"%Y-%m-%dT%H-%M-%SZ"
2024-03-12T15-38-06Z

I expected the alert to clear after the next run. I waited for the next run, which was also successful, but the alert still did not clear:

  Last Sync Duration:      52.147254329s
  Last Sync Time:          2024-03-12T16:00:52Z
  Latest Mover Status:
    Logs:  using parent snapshot c3b827c6
Added to the repository: 31.496 MiB (9.310 MiB stored)
processed 697 files, 935.255 MiB in 0:04
snapshot ee5e4fe3 saved
Restic completed in 5s
    Result:        Successful
  Next Sync Time:  2024-03-12T20:00:00Z
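
At this point the alert still shows as firing in the Prometheus alerts API (same assumed service URL as above):

$ curl -s 'http://prometheus-operated.monitoring:9090/api/v1/alerts' \
    | jq '.data.alerts[] | select(.labels.alertname == "VolSyncVolumeOutOfSync")'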

Expected behavior

I was expecting the VolSyncVolumeOutOfSync alert to clear after the next trigger run.

Actual results
The alert did not clear until I manually restarted the volsync application pod. Once restarted, the alert immediately cleared and stayed cleared.

Additional context
Not sure what is relevant in the volsync pod log.
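The logs were collected with something like this (the deployment name volsync is an assumption based on the pod name in the metric labels):

$ kubectl -n volsync-system logs deploy/volsync | grep unifi

These are the logs from before the restart, filtered on the keyword unifi, the namespace with the raised alert: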

2024-03-12T16:00:52.121Z INFO controllers.ReplicationSource job completed {"replicationsource": {"name":"unifi-controller","namespace":"unifi"}, "method": "Restic", "job": {"name":"volsync-src-unifi-controller","namespace":"unifi"}}
2024-03-12T16:00:52.126Z INFO controllers.ReplicationSource Getting logs for pod {"replicationsource": {"name":"unifi-controller","namespace":"unifi"}, "method": "Restic", "jobName": "volsync-src-unifi-controller", "podName": "volsync-src-unifi-controller-q6x49", "pod": {"namespace": "unifi", "name": "volsync-src-unifi-controller-q6x49"}}
2024-03-12T16:00:52.147Z DEBUG controllers.ReplicationSource transitioning to cleanup state {"replicationsource": {"name":"unifi-controller","namespace":"unifi"}}
2024-03-12T16:00:52.168Z INFO controllers.ReplicationSource Namespace allows volsync privileged movers {"replicationsource": {"name":"unifi-controller","namespace":"unifi"}, "namespace": "unifi", "Annotation": "volsync.backube/privileged-movers", "Annotation value": "true"}
2024-03-12T16:00:52.168Z INFO controllers.ReplicationSource deleting temporary objects {"replicationsource": {"name":"unifi-controller","namespace":"unifi"}, "method": "Restic", "owned-by": "ada06fe5-dbdf-4b17-a5d1-52defa9e9cd7"}
2024-03-12T16:00:52.264Z INFO controllers.ReplicationSource Namespace allows volsync privileged movers {"replicationsource": {"name":"unifi-controller","namespace":"unifi"}, "namespace": "unifi", "Annotation": "volsync.backube/privileged-movers", "Annotation value": "true"}
2024-03-12T16:00:52.264Z INFO controllers.ReplicationSource deleting temporary objects {"replicationsource": {"name":"unifi-controller","namespace":"unifi"}, "method": "Restic", "owned-by": "ada06fe5-dbdf-4b17-a5d1-52defa9e9cd7"}
2024-03-12T16:00:52.305Z INFO controllers.ReplicationSource Namespace allows volsync privileged movers {"replicationsource": {"name":"unifi-controller","namespace":"unifi"}, "namespace": "unifi", "Annotation": "volsync.backube/privileged-movers", "Annotation value": "true"}
2024-03-12T16:00:52.305Z INFO controllers.ReplicationSource deleting temporary objects {"replicationsource": {"name":"unifi-controller","namespace":"unifi"}, "method": "Restic", "owned-by": "ada06fe5-dbdf-4b17-a5d1-52defa9e9cd7"}
2024-03-12T16:00:52.565Z INFO controllers.ReplicationSource Namespace allows volsync privileged movers {"replicationsource": {"name":"unifi-controller","namespace":"unifi"}, "namespace": "unifi", "Annotation": "volsync.backube/privileged-movers", "Annotation value": "true"}
2024-03-12T16:00:52.565Z INFO controllers.ReplicationSource deleting temporary objects {"replicationsource": {"name":"unifi-controller","namespace":"unifi"}, "method": "Restic", "owned-by": "ada06fe5-dbdf-4b17-a5d1-52defa9e9cd7"}
2024-03-12T16:01:00.220Z INFO controllers.ReplicationSource Namespace allows volsync privileged movers {"replicationsource": {"name":"unifi-controller","namespace":"unifi"}, "namespace": "unifi", "Annotation": "volsync.backube/privileged-movers", "Annotation value": "true"}
2024-03-12T16:01:00.220Z INFO controllers.ReplicationSource deleting temporary objects {"replicationsource": {"name":"unifi-controller","namespace":"unifi"}, "method": "Restic", "owned-by": "ada06fe5-dbdf-4b17-a5d1-52defa9e9cd7"}
2024-03-12T16:14:29.403Z INFO controllers.ReplicationSource Namespace allows volsync privileged movers {"replicationsource": {"name":"unifi-controller","namespace":"unifi"}, "namespace": "unifi", "Annotation": "volsync.backube/privileged-movers", "Annotation value": "true"}
2024-03-12T16:14:29.403Z INFO controllers.ReplicationSource deleting temporary objects {"replicationsource": {"name":"unifi-controller","namespace":"unifi"}, "method": "Restic", "owned-by": "ada06fe5-dbdf-4b17-a5d1-52defa9e9cd7"}
2024-03-12T16:14:29.403Z DEBUG events Populator finished {"type": "Normal", "object": {"kind":"PersistentVolumeClaim","namespace":"unifi","name":"unifi-controller","uid":"ed9f9bce-ee0c-48d9-a6f3-45ee74a4978a","apiVersion":"v1","resourceVersion":"535052418"}, "reason": "VolSyncPopulatorFinished"}

After I restarted the volsync application pod, the alert immediately cleared and has not come back.
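The restart was equivalent to something like this (same assumption about the deployment name):

$ kubectl -n volsync-system rollout restart deploy/volsync

These logs, again filtered on unifi, are from after the restart: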

2024-03-12T17:29:05.990Z DEBUG events Populator finished {"type": "Normal", "object": {"kind":"PersistentVolumeClaim","namespace":"unifi","name":"unifi-controller","uid":"ed9f9bce-ee0c-48d9-a6f3-45ee74a4978a","apiVersion":"v1","resourceVersion":"535052418"}, "reason": "VolSyncPopulatorFinished"}
2024-03-12T17:29:06.099Z INFO controllers.ReplicationSource Namespace allows volsync privileged movers {"replicationsource": {"name":"unifi-controller","namespace":"unifi"}, "namespace": "unifi", "Annotation": "volsync.backube/privileged-movers", "Annotation value": "true"}
2024-03-12T17:29:06.099Z INFO controllers.ReplicationDestination Namespace allows volsync privileged movers {"replicationdestination": {"name":"unifi-controller-dst","namespace":"unifi"}, "namespace": "unifi", "Annotation": "volsync.backube/privileged-movers", "Annotation value": "true"}
2024-03-12T17:29:06.100Z INFO controllers.ReplicationSource deleting temporary objects {"replicationsource": {"name":"unifi-controller","namespace":"unifi"}, "method": "Restic", "owned-by": "ada06fe5-dbdf-4b17-a5d1-52defa9e9cd7"}
2024-03-12T17:29:06.104Z DEBUG controllers.ReplicationDestination removing snapshot annotations from pvc {"replicationdestination": {"name":"unifi-controller-dst","namespace":"unifi"}, "method": "Restic"}
2024-03-12T17:29:06.105Z INFO controllers.ReplicationDestination deleting temporary objects {"replicationdestination": {"name":"unifi-controller-dst","namespace":"unifi"}, "method": "Restic", "owned-by": "ad5c7686-eeb5-4f9b-be49-a614a3817cb2"}