A lot of reconcile events happen in a short time without any clear reason #944

Closed

dorshay6 opened this issue Jan 30, 2022 · 5 comments

Labels: bug · stale · closed-stale

Comments

dorshay6 commented Jan 30, 2022

Describe the bug

Neither the cluster nor any configuration related to it has changed in the last week, yet we suddenly started seeing reconcile logs every few seconds. This caused some downtime for one of the pods and its queues.

Help is needed, as we don't feel comfortable running the operator in production right now.

Finished reconciling |   | 2022-01-30T01:55:24.287622672Z
updated resource rabbitmq-server of Type   *v1.StatefulSet |   | 2022-01-30T01:55:24.287622672Z
updated resource rabbitmq-nodes of Type   *v1.Service |   | 2022-01-30T01:55:24.287622672Z
Start reconciling | {"replicas":4,"image":"rabbitmq:3.8.21-management","service":{"type":"LoadBalancer","annotations":{"cloud.google.com/load-balancer-type":"Internal"}},"persistence":{"storage":"500Gi"},"resources":{"limits":{"cpu":"2","memory":"4Gi"},"requests":{"cpu":"2","memory":"4Gi"}},"affinity":{"podAntiAffinity":{"requiredDuringSchedulingIgnoredDuringExecution":[{"labelSelector":{"matchExpressions":[{"key":"app.kubernetes.io/name","operator":"In","values":["production-ready"]}]},"topologyKey":"kubernetes.io/hostname"}]}},"rabbitmq":{"additionalConfig":"cluster_partition_handling   = pause_minority\nvm_memory_high_watermark_paging_ratio =   0.99\ndisk_free_limit.relative = 1.0\ncollect_statistics_interval =   10000\n"},"tls":{},"override":{"statefulSet":{"spec":{"template":{"spec":{"containers":[],"topologySpreadConstraints":[{"maxSkew":1,"topologyKey":"topology.kubernetes.io/zone","whenUnsatisfiable":"DoNotSchedule","labelSelector":{"matchLabels":{"app.kubernetes.io/name":"production-ready"}}}]}}}}},"terminationGracePeriodSeconds":604800,"secretBackend":{}} | 2022-01-30T01:55:24.287622672Z
Finished reconciling |   | 2022-01-30T01:55:14.317850095Z
updated resource rabbitmq-server of Type   *v1.StatefulSet |   | 2022-01-30T01:55:14.317850095Z
updated resource rabbitmq-nodes of Type   *v1.Service |   | 2022-01-30T01:55:14.317850095Z
Start reconciling | {"replicas":4,"image":"rabbitmq:3.8.21-management","service":{"type":"LoadBalancer","annotations":{"cloud.google.com/load-balancer-type":"Internal"}},"persistence":{"storage":"500Gi"},"resources":{"limits":{"cpu":"2","memory":"4Gi"},"requests":{"cpu":"2","memory":"4Gi"}},"affinity":{"podAntiAffinity":{"requiredDuringSchedulingIgnoredDuringExecution":[{"labelSelector":{"matchExpressions":[{"key":"app.kubernetes.io/name","operator":"In","values":["production-ready"]}]},"topologyKey":"kubernetes.io/hostname"}]}},"rabbitmq":{"additionalConfig":"cluster_partition_handling   = pause_minority\nvm_memory_high_watermark_paging_ratio =   0.99\ndisk_free_limit.relative = 1.0\ncollect_statistics_interval =   10000\n"},"tls":{},"override":{"statefulSet":{"spec":{"template":{"spec":{"containers":[],"topologySpreadConstraints":[{"maxSkew":1,"topologyKey":"topology.kubernetes.io/zone","whenUnsatisfiable":"DoNotSchedule","labelSelector":{"matchLabels":{"app.kubernetes.io/name":"production-ready"}}}]}}}}},"terminationGracePeriodSeconds":604800,"secretBackend":{}} | 2022-01-30T01:55:14.317850095Z
Finished reconciling |   | 2022-01-30T01:55:14.317850095Z
updated resource rabbitmq-server of Type   *v1.StatefulSet |   | 2022-01-30T01:55:14.317850095Z
updated resource rabbitmq-nodes of Type   *v1.Service |   | 2022-01-30T01:55:14.317850095Z
Start reconciling | {"replicas":4,"image":"rabbitmq:3.8.21-management","service":{"type":"LoadBalancer","annotations":{"cloud.google.com/load-balancer-type":"Internal"}},"persistence":{"storage":"500Gi"},"resources":{"limits":{"cpu":"2","memory":"4Gi"},"requests":{"cpu":"2","memory":"4Gi"}},"affinity":{"podAntiAffinity":{"requiredDuringSchedulingIgnoredDuringExecution":[{"labelSelector":{"matchExpressions":[{"key":"app.kubernetes.io/name","operator":"In","values":["production-ready"]}]},"topologyKey":"kubernetes.io/hostname"}]}},"rabbitmq":{"additionalConfig":"cluster_partition_handling   = pause_minority\nvm_memory_high_watermark_paging_ratio =   0.99\ndisk_free_limit.relative = 1.0\ncollect_statistics_interval =   10000\n"},"tls":{},"override":{"statefulSet":{"spec":{"template":{"spec":{"containers":[],"topologySpreadConstraints":[{"maxSkew":1,"topologyKey":"topology.kubernetes.io/zone","whenUnsatisfiable":"DoNotSchedule","labelSelector":{"matchLabels":{"app.kubernetes.io/name":"production-ready"}}}]}}}}},"terminationGracePeriodSeconds":604800,"secretBackend":{}} | 2022-01-30T01:55:14.317850095Z
updated resource rabbitmq-server of Type   *v1.StatefulSet |   | 2022-01-30T01:55:09.291413108Z
updated resource rabbitmq-nodes of Type   *v1.Service |   | 2022-01-30T01:55:09.291413108Z
Start reconciling | {"replicas":4,"image":"rabbitmq:3.8.21-management","service":{"type":"LoadBalancer","annotations":{"cloud.google.com/load-balancer-type":"Internal"}},"persistence":{"storage":"500Gi"},"resources":{"limits":{"cpu":"2","memory":"4Gi"},"requests":{"cpu":"2","memory":"4Gi"}},"affinity":{"podAntiAffinity":{"requiredDuringSchedulingIgnoredDuringExecution":[{"labelSelector":{"matchExpressions":[{"key":"app.kubernetes.io/name","operator":"In","values":["production-ready"]}]},"topologyKey":"kubernetes.io/hostname"}]}},"rabbitmq":{"additionalConfig":"cluster_partition_handling   = pause_minority\nvm_memory_high_watermark_paging_ratio =   0.99\ndisk_free_limit.relative = 1.0\ncollect_statistics_interval =   10000\n"},"tls":{},"override":{"statefulSet":{"spec":{"template":{"spec":{"containers":[],"topologySpreadConstraints":[{"maxSkew":1,"topologyKey":"topology.kubernetes.io/zone","whenUnsatisfiable":"DoNotSchedule","labelSelector":{"matchLabels":{"app.kubernetes.io/name":"production-ready"}}}]}}}}},"terminationGracePeriodSeconds":604800,"secretBackend":{}} | 2022-01-30T01:55:09.291413108Z
updated resource rabbitmq-server of Type   *v1.StatefulSet |   | 2022-01-30T01:54:54.293612815Z
updated resource rabbitmq-nodes of Type   *v1.Service |   | 2022-01-30T01:54:54.293612815Z
Start reconciling | {"replicas":4,"image":"rabbitmq:3.8.21-management","service":{"type":"LoadBalancer","annotations":{"cloud.google.com/load-balancer-type":"Internal"}},"persistence":{"storage":"500Gi"},"resources":{"limits":{"cpu":"2","memory":"4Gi"},"requests":{"cpu":"2","memory":"4Gi"}},"affinity":{"podAntiAffinity":{"requiredDuringSchedulingIgnoredDuringExecution":[{"labelSelector":{"matchExpressions":[{"key":"app.kubernetes.io/name","operator":"In","values":["production-ready"]}]},"topologyKey":"kubernetes.io/hostname"}]}},"rabbitmq":{"additionalConfig":"cluster_partition_handling   = pause_minority\nvm_memory_high_watermark_paging_ratio =   0.99\ndisk_free_limit.relative = 1.0\ncollect_statistics_interval =   10000\n"},"tls":{},"override":{"statefulSet":{"spec":{"template":{"spec":{"containers":[],"topologySpreadConstraints":[{"maxSkew":1,"topologyKey":"topology.kubernetes.io/zone","whenUnsatisfiable":"DoNotSchedule","labelSelector":{"matchLabels":{"app.kubernetes.io/name":"production-ready"}}}]}}}}},"terminationGracePeriodSeconds":604800,"secretBackend":{}} | 2022-01-30T01:54:54.293612815Z
updated resource rabbitmq-server of Type   *v1.StatefulSet |   | 2022-01-30T01:54:39.290037216Z
updated resource rabbitmq-nodes of Type   *v1.Service |   | 2022-01-30T01:54:39.290037216Z
Start reconciling | {"replicas":4,"image":"rabbitmq:3.8.21-management","service":{"type":"LoadBalancer","annotations":{"cloud.google.com/load-balancer-type":"Internal"}},"persistence":{"storage":"500Gi"},"resources":{"limits":{"cpu":"2","memory":"4Gi"},"requests":{"cpu":"2","memory":"4Gi"}},"affinity":{"podAntiAffinity":{"requiredDuringSchedulingIgnoredDuringExecution":[{"labelSelector":{"matchExpressions":[{"key":"app.kubernetes.io/name","operator":"In","values":["production-ready"]}]},"topologyKey":"kubernetes.io/hostname"}]}},"rabbitmq":{"additionalConfig":"cluster_partition_handling   = pause_minority\nvm_memory_high_watermark_paging_ratio =   0.99\ndisk_free_limit.relative = 1.0\ncollect_statistics_interval =   10000\n"},"tls":{},"override":{"statefulSet":{"spec":{"template":{"spec":{"containers":[],"topologySpreadConstraints":[{"maxSkew":1,"topologyKey":"topology.kubernetes.io/zone","whenUnsatisfiable":"DoNotSchedule","labelSelector":{"matchLabels":{"app.kubernetes.io/name":"production-ready"}}}]}}}}},"terminationGracePeriodSeconds":604800,"secretBackend":{}} | 2022-01-30T01:54:39.290037216Z
updated resource rabbitmq-server of Type   *v1.StatefulSet |   | 2022-01-30T01:54:24.290501390Z
updated resource rabbitmq-nodes of Type   *v1.Service |   | 2022-01-30T01:54:24.290501390Z
Start reconciling | {"replicas":4,"image":"rabbitmq:3.8.21-management","service":{"type":"LoadBalancer","annotations":{"cloud.google.com/load-balancer-type":"Internal"}},"persistence":{"storage":"500Gi"},"resources":{"limits":{"cpu":"2","memory":"4Gi"},"requests":{"cpu":"2","memory":"4Gi"}},"affinity":{"podAntiAffinity":{"requiredDuringSchedulingIgnoredDuringExecution":[{"labelSelector":{"matchExpressions":[{"key":"app.kubernetes.io/name","operator":"In","values":["production-ready"]}]},"topologyKey":"kubernetes.io/hostname"}]}},"rabbitmq":{"additionalConfig":"cluster_partition_handling   = pause_minority\nvm_memory_high_watermark_paging_ratio =   0.99\ndisk_free_limit.relative = 1.0\ncollect_statistics_interval =   10000\n"},"tls":{},"override":{"statefulSet":{"spec":{"template":{"spec":{"containers":[],"topologySpreadConstraints":[{"maxSkew":1,"topologyKey":"topology.kubernetes.io/zone","whenUnsatisfiable":"DoNotSchedule","labelSelector":

Expected behavior

Reconciliation should only happen for a clear reason.

Version and environment information

  • RabbitMQ: 3.8.21
  • RabbitMQ Cluster Operator: rabbitmqoperator/cluster-operator:1.10.0
  • Kubernetes: 1.20.12
  • Cloud provider or hardware configuration: GCP

Additional context

A Kubernetes node version upgrade happened around the same time.

Might be related to #826

Maintainer edit: log as a code block

dorshay6 added the bug label on Jan 30, 2022

Zerpet (Collaborator) commented Jan 31, 2022

Hey @dorshay6, can you share the manifest of your RabbitmqCluster?

The reconcile loop triggers when there's an event (create, update, delete) on the RabbitmqCluster or on any of its owned resources, i.e. the StatefulSet, Services, ConfigMap, Secrets, etc. Another event that triggers a reconcile is the Operator Pod being recreated.
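
For context, this watch wiring in a controller-runtime based operator typically looks like the sketch below. This is a minimal illustration, not the operator's actual source; the API import path and type names are assumptions based on the project layout.

package controllers

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"

	// Assumed import path for the RabbitmqCluster API types.
	rabbitmqv1beta1 "github.com/rabbitmq/cluster-operator/api/v1beta1"
)

type RabbitmqClusterReconciler struct{}

// Reconcile runs once per queued request; the real logic is elided here.
func (r *RabbitmqClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	return ctrl.Result{}, nil
}

// SetupWithManager registers the watches described above: reconcile fires
// on events for the RabbitmqCluster itself and for any resource it owns.
func (r *RabbitmqClusterReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&rabbitmqv1beta1.RabbitmqCluster{}). // create/update/delete of the CR itself
		Owns(&appsv1.StatefulSet{}).             // events on the owned StatefulSet
		Owns(&corev1.Service{}).                 // events on the owned Services
		Owns(&corev1.ConfigMap{}).               // events on the owned ConfigMap
		Owns(&corev1.Secret{}).                  // events on the owned Secrets
		Complete(r)
}

Note that status updates made by Kubernetes itself on those owned objects (for example, StatefulSet status changes during Pod churn) also count as update events, so reconcile log lines do not always correspond to a user-made change.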

Did RabbitMQ pods restart as a consequence of those reconcile events?

coro (Contributor) commented Jan 31, 2022

From looking at the reconcile logs here, it looks like the manifest was an adaptation of the production-ready example:

apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: production-ready
spec:
  replicas: 4
  image: rabbitmq:3.8.21-management
  service:
    type: LoadBalancer
    annotations:
      "cloud.google.com/load-balancer-type": Internal
  resources:
    requests:
      cpu: 2
      memory: 4Gi
    limits:
      cpu: 2
      memory: 4Gi
  rabbitmq:
    additionalConfig: |
      cluster_partition_handling = pause_minority
      vm_memory_high_watermark_paging_ratio = 0.99
      disk_free_limit.relative = 1.0
      collect_statistics_interval = 10000
  persistence:
    storage: "500Gi"
  terminationGracePeriodSeconds: 604800
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: app.kubernetes.io/name
              operator: In
              values:
              - production-ready
        topologyKey: kubernetes.io/hostname
  override:
    statefulSet:
      spec:
        template:
          spec:
            containers: []
            topologySpreadConstraints:
            - maxSkew: 1
              topologyKey: "topology.kubernetes.io/zone"
              whenUnsatisfiable: DoNotSchedule
              labelSelector:
                matchLabels:
                  app.kubernetes.io/name: production-ready

coro (Contributor) commented Jan 31, 2022

I've just attempted to reproduce this on GKE with the above manifest by triggering a rolling upgrade of the Kubernetes nodes, and I can't reproduce it. I do see that when a node hosting a RabbitmqCluster Pod goes down for an upgrade, several reconcile loops are triggered, as the operator receives the delete event for the Pod it owns and then recreates the Pod. From my tests, it looks like about 4 reconcile loops are triggered per Pod.

I agree with @Zerpet; I think it's important to know whether the reconciles happened at exactly the same time as the node restarts, as we would expect a handful of reconciles per restart. Also, did the reconciles continue to occur long after the node restarts? On my system, the logs stop completely once the node upgrades are finished.
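
As an aside on why a burst of events collapses into only a handful of reconciles: controller-runtime keys every event as a namespace/name request on a deduplicating workqueue, so events that arrive while a reconcile for the same object is already pending coalesce into a single follow-up run. A minimal illustration using client-go's workqueue package (an illustration of the mechanism, not the operator's code):

package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// The same queue type that backs controller-runtime's event handling.
	q := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
	q.Add("rabbitmq/production-ready") // e.g. a StatefulSet update event
	q.Add("rabbitmq/production-ready") // e.g. a Service update event, same key
	fmt.Println(q.Len())               // prints 1: duplicate keys are coalesced
}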

github-actions bot commented Apr 2, 2022

This issue has been marked as stale due to 60 days of inactivity. Stale issues will be closed after a further 30 days of inactivity; please remove the stale label to prevent this from occurring.

github-actions bot added the stale label on Apr 2, 2022

github-actions bot commented May 2, 2022

Closing stale issue due to further inactivity.

github-actions bot added the closed-stale label and closed this as completed on May 2, 2022