A lot of reconcile events happen in a short time without any clear reason #944

Closed

dorshay6 opened this issue Jan 30, 2022 · 5 comments

Labels: bug · stale · closed-stale

Comments

dorshay6 commented Jan 30, 2022

Describe the bug

Neither the cluster nor any configuration related to it has changed in the last week, yet we suddenly started seeing reconcile logs every few seconds. This caused some downtime for one of the pods and its queues.

Help is needed, as we don't feel comfortable running the operator in production right now.

Finished reconciling |   | 2022-01-30T01:55:24.287622672Z
updated resource rabbitmq-server of Type   *v1.StatefulSet |   | 2022-01-30T01:55:24.287622672Z
updated resource rabbitmq-nodes of Type   *v1.Service |   | 2022-01-30T01:55:24.287622672Z
Start reconciling | {"replicas":4,"image":"rabbitmq:3.8.21-management","service":{"type":"LoadBalancer","annotations":{"cloud.google.com/load-balancer-type":"Internal"}},"persistence":{"storage":"500Gi"},"resources":{"limits":{"cpu":"2","memory":"4Gi"},"requests":{"cpu":"2","memory":"4Gi"}},"affinity":{"podAntiAffinity":{"requiredDuringSchedulingIgnoredDuringExecution":[{"labelSelector":{"matchExpressions":[{"key":"app.kubernetes.io/name","operator":"In","values":["production-ready"]}]},"topologyKey":"kubernetes.io/hostname"}]}},"rabbitmq":{"additionalConfig":"cluster_partition_handling   = pause_minority\nvm_memory_high_watermark_paging_ratio =   0.99\ndisk_free_limit.relative = 1.0\ncollect_statistics_interval =   10000\n"},"tls":{},"override":{"statefulSet":{"spec":{"template":{"spec":{"containers":[],"topologySpreadConstraints":[{"maxSkew":1,"topologyKey":"topology.kubernetes.io/zone","whenUnsatisfiable":"DoNotSchedule","labelSelector":{"matchLabels":{"app.kubernetes.io/name":"production-ready"}}}]}}}}},"terminationGracePeriodSeconds":604800,"secretBackend":{}} | 2022-01-30T01:55:24.287622672Z
Finished reconciling |   | 2022-01-30T01:55:14.317850095Z
updated resource rabbitmq-server of Type   *v1.StatefulSet |   | 2022-01-30T01:55:14.317850095Z
updated resource rabbitmq-nodes of Type   *v1.Service |   | 2022-01-30T01:55:14.317850095Z
Start reconciling | {"replicas":4,"image":"rabbitmq:3.8.21-management","service":{"type":"LoadBalancer","annotations":{"cloud.google.com/load-balancer-type":"Internal"}},"persistence":{"storage":"500Gi"},"resources":{"limits":{"cpu":"2","memory":"4Gi"},"requests":{"cpu":"2","memory":"4Gi"}},"affinity":{"podAntiAffinity":{"requiredDuringSchedulingIgnoredDuringExecution":[{"labelSelector":{"matchExpressions":[{"key":"app.kubernetes.io/name","operator":"In","values":["production-ready"]}]},"topologyKey":"kubernetes.io/hostname"}]}},"rabbitmq":{"additionalConfig":"cluster_partition_handling   = pause_minority\nvm_memory_high_watermark_paging_ratio =   0.99\ndisk_free_limit.relative = 1.0\ncollect_statistics_interval =   10000\n"},"tls":{},"override":{"statefulSet":{"spec":{"template":{"spec":{"containers":[],"topologySpreadConstraints":[{"maxSkew":1,"topologyKey":"topology.kubernetes.io/zone","whenUnsatisfiable":"DoNotSchedule","labelSelector":{"matchLabels":{"app.kubernetes.io/name":"production-ready"}}}]}}}}},"terminationGracePeriodSeconds":604800,"secretBackend":{}} | 2022-01-30T01:55:14.317850095Z
Finished reconciling |   | 2022-01-30T01:55:14.317850095Z
updated resource rabbitmq-server of Type   *v1.StatefulSet |   | 2022-01-30T01:55:14.317850095Z
updated resource rabbitmq-nodes of Type   *v1.Service |   | 2022-01-30T01:55:14.317850095Z
Start reconciling | {"replicas":4,"image":"rabbitmq:3.8.21-management","service":{"type":"LoadBalancer","annotations":{"cloud.google.com/load-balancer-type":"Internal"}},"persistence":{"storage":"500Gi"},"resources":{"limits":{"cpu":"2","memory":"4Gi"},"requests":{"cpu":"2","memory":"4Gi"}},"affinity":{"podAntiAffinity":{"requiredDuringSchedulingIgnoredDuringExecution":[{"labelSelector":{"matchExpressions":[{"key":"app.kubernetes.io/name","operator":"In","values":["production-ready"]}]},"topologyKey":"kubernetes.io/hostname"}]}},"rabbitmq":{"additionalConfig":"cluster_partition_handling   = pause_minority\nvm_memory_high_watermark_paging_ratio =   0.99\ndisk_free_limit.relative = 1.0\ncollect_statistics_interval =   10000\n"},"tls":{},"override":{"statefulSet":{"spec":{"template":{"spec":{"containers":[],"topologySpreadConstraints":[{"maxSkew":1,"topologyKey":"topology.kubernetes.io/zone","whenUnsatisfiable":"DoNotSchedule","labelSelector":{"matchLabels":{"app.kubernetes.io/name":"production-ready"}}}]}}}}},"terminationGracePeriodSeconds":604800,"secretBackend":{}} | 2022-01-30T01:55:14.317850095Z
updated resource rabbitmq-server of Type   *v1.StatefulSet |   | 2022-01-30T01:55:09.291413108Z
updated resource rabbitmq-nodes of Type   *v1.Service |   | 2022-01-30T01:55:09.291413108Z
Start reconciling | {"replicas":4,"image":"rabbitmq:3.8.21-management","service":{"type":"LoadBalancer","annotations":{"cloud.google.com/load-balancer-type":"Internal"}},"persistence":{"storage":"500Gi"},"resources":{"limits":{"cpu":"2","memory":"4Gi"},"requests":{"cpu":"2","memory":"4Gi"}},"affinity":{"podAntiAffinity":{"requiredDuringSchedulingIgnoredDuringExecution":[{"labelSelector":{"matchExpressions":[{"key":"app.kubernetes.io/name","operator":"In","values":["production-ready"]}]},"topologyKey":"kubernetes.io/hostname"}]}},"rabbitmq":{"additionalConfig":"cluster_partition_handling   = pause_minority\nvm_memory_high_watermark_paging_ratio =   0.99\ndisk_free_limit.relative = 1.0\ncollect_statistics_interval =   10000\n"},"tls":{},"override":{"statefulSet":{"spec":{"template":{"spec":{"containers":[],"topologySpreadConstraints":[{"maxSkew":1,"topologyKey":"topology.kubernetes.io/zone","whenUnsatisfiable":"DoNotSchedule","labelSelector":{"matchLabels":{"app.kubernetes.io/name":"production-ready"}}}]}}}}},"terminationGracePeriodSeconds":604800,"secretBackend":{}} | 2022-01-30T01:55:09.291413108Z
updated resource rabbitmq-server of Type   *v1.StatefulSet |   | 2022-01-30T01:54:54.293612815Z
updated resource rabbitmq-nodes of Type   *v1.Service |   | 2022-01-30T01:54:54.293612815Z
Start reconciling | {"replicas":4,"image":"rabbitmq:3.8.21-management","service":{"type":"LoadBalancer","annotations":{"cloud.google.com/load-balancer-type":"Internal"}},"persistence":{"storage":"500Gi"},"resources":{"limits":{"cpu":"2","memory":"4Gi"},"requests":{"cpu":"2","memory":"4Gi"}},"affinity":{"podAntiAffinity":{"requiredDuringSchedulingIgnoredDuringExecution":[{"labelSelector":{"matchExpressions":[{"key":"app.kubernetes.io/name","operator":"In","values":["production-ready"]}]},"topologyKey":"kubernetes.io/hostname"}]}},"rabbitmq":{"additionalConfig":"cluster_partition_handling   = pause_minority\nvm_memory_high_watermark_paging_ratio =   0.99\ndisk_free_limit.relative = 1.0\ncollect_statistics_interval =   10000\n"},"tls":{},"override":{"statefulSet":{"spec":{"template":{"spec":{"containers":[],"topologySpreadConstraints":[{"maxSkew":1,"topologyKey":"topology.kubernetes.io/zone","whenUnsatisfiable":"DoNotSchedule","labelSelector":{"matchLabels":{"app.kubernetes.io/name":"production-ready"}}}]}}}}},"terminationGracePeriodSeconds":604800,"secretBackend":{}} | 2022-01-30T01:54:54.293612815Z
updated resource rabbitmq-server of Type   *v1.StatefulSet |   | 2022-01-30T01:54:39.290037216Z
updated resource rabbitmq-nodes of Type   *v1.Service |   | 2022-01-30T01:54:39.290037216Z
Start reconciling | {"replicas":4,"image":"rabbitmq:3.8.21-management","service":{"type":"LoadBalancer","annotations":{"cloud.google.com/load-balancer-type":"Internal"}},"persistence":{"storage":"500Gi"},"resources":{"limits":{"cpu":"2","memory":"4Gi"},"requests":{"cpu":"2","memory":"4Gi"}},"affinity":{"podAntiAffinity":{"requiredDuringSchedulingIgnoredDuringExecution":[{"labelSelector":{"matchExpressions":[{"key":"app.kubernetes.io/name","operator":"In","values":["production-ready"]}]},"topologyKey":"kubernetes.io/hostname"}]}},"rabbitmq":{"additionalConfig":"cluster_partition_handling   = pause_minority\nvm_memory_high_watermark_paging_ratio =   0.99\ndisk_free_limit.relative = 1.0\ncollect_statistics_interval =   10000\n"},"tls":{},"override":{"statefulSet":{"spec":{"template":{"spec":{"containers":[],"topologySpreadConstraints":[{"maxSkew":1,"topologyKey":"topology.kubernetes.io/zone","whenUnsatisfiable":"DoNotSchedule","labelSelector":{"matchLabels":{"app.kubernetes.io/name":"production-ready"}}}]}}}}},"terminationGracePeriodSeconds":604800,"secretBackend":{}} | 2022-01-30T01:54:39.290037216Z
updated resource rabbitmq-server of Type   *v1.StatefulSet |   | 2022-01-30T01:54:24.290501390Z
updated resource rabbitmq-nodes of Type   *v1.Service |   | 2022-01-30T01:54:24.290501390Z
Start reconciling | {"replicas":4,"image":"rabbitmq:3.8.21-management","service":{"type":"LoadBalancer","annotations":{"cloud.google.com/load-balancer-type":"Internal"}},"persistence":{"storage":"500Gi"},"resources":{"limits":{"cpu":"2","memory":"4Gi"},"requests":{"cpu":"2","memory":"4Gi"}},"affinity":{"podAntiAffinity":{"requiredDuringSchedulingIgnoredDuringExecution":[{"labelSelector":{"matchExpressions":[{"key":"app.kubernetes.io/name","operator":"In","values":["production-ready"]}]},"topologyKey":"kubernetes.io/hostname"}]}},"rabbitmq":{"additionalConfig":"cluster_partition_handling   = pause_minority\nvm_memory_high_watermark_paging_ratio =   0.99\ndisk_free_limit.relative = 1.0\ncollect_statistics_interval =   10000\n"},"tls":{},"override":{"statefulSet":{"spec":{"template":{"spec":{"containers":[],"topologySpreadConstraints":[{"maxSkew":1,"topologyKey":"topology.kubernetes.io/zone","whenUnsatisfiable":"DoNotSchedule","labelSelector":

Expected behavior

Reconciliation should only happen for a clear reason.

Version and environment information

  • RabbitMQ: 3.8.21
  • RabbitMQ Cluster Operator: rabbitmqoperator/cluster-operator:1.10.0
  • Kubernetes: 1.20.12
  • Cloud provider or hardware configuration: GCP

Additional context

A Kubernetes node version upgrade happened around the same time.

Might be related to #826

Maintainer edit: log as a code block

dorshay6 added the bug label on Jan 30, 2022

Zerpet (Collaborator) commented Jan 31, 2022

Hey @dorshay6, can you share the manifest of your RabbitmqCluster?

The reconcile loop triggers when there's an event (create, update, delete) on the RabbitmqCluster or on any of its owned resources, i.e. the StatefulSet, Services, ConfigMap, Secrets, etc. Another event that triggers a reconcile is the Operator Pod being recreated.
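
For context, this watch wiring in a controller-runtime based operator typically looks like the sketch below. This is a minimal illustration, not the operator's actual source; the API import path and type names are assumptions based on the project layout.

package controllers

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"

	// Assumed import path for the RabbitmqCluster API types.
	rabbitmqv1beta1 "github.com/rabbitmq/cluster-operator/api/v1beta1"
)

type RabbitmqClusterReconciler struct{}

// Reconcile runs once per queued request; the real logic is elided here.
func (r *RabbitmqClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	return ctrl.Result{}, nil
}

// SetupWithManager registers the watches described above: reconcile fires
// on events for the RabbitmqCluster itself and for any resource it owns.
func (r *RabbitmqClusterReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&rabbitmqv1beta1.RabbitmqCluster{}). // create/update/delete of the CR itself
		Owns(&appsv1.StatefulSet{}).             // events on the owned StatefulSet
		Owns(&corev1.Service{}).                 // events on the owned Services
		Owns(&corev1.ConfigMap{}).               // events on the owned ConfigMap
		Owns(&corev1.Secret{}).                  // events on the owned Secrets
		Complete(r)
}

Note that status updates made by Kubernetes itself on those owned objects (for example, StatefulSet status changes during Pod churn) also count as update events, so reconcile log lines do not always correspond to a user-made change.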

Did RabbitMQ pods restart as a consequence of those reconcile events?

coro (Contributor) commented Jan 31, 2022

From looking at the reconcile logs here, it looks like the manifest was an adaptation of the production-ready example:

apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: production-ready
spec:
  replicas: 4
  image: rabbitmq:3.8.21-management
  service:
    type: LoadBalancer
    annotations:
      "cloud.google.com/load-balancer-type": Internal
  resources:
    requests:
      cpu: 2
      memory: 4Gi
    limits:
      cpu: 2
      memory: 4Gi
  rabbitmq:
    additionalConfig: |
      cluster_partition_handling = pause_minority
      vm_memory_high_watermark_paging_ratio = 0.99
      disk_free_limit.relative = 1.0
      collect_statistics_interval = 10000
  persistence:
    storage: "500Gi"
  terminationGracePeriodSeconds: 604800
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: app.kubernetes.io/name
              operator: In
              values:
              - production-ready
        topologyKey: kubernetes.io/hostname
  override:
    statefulSet:
      spec:
        template:
          spec:
            containers: []
            topologySpreadConstraints:
            - maxSkew: 1
              topologyKey: "topology.kubernetes.io/zone"
              whenUnsatisfiable: DoNotSchedule
              labelSelector:
                matchLabels:
                  app.kubernetes.io/name: production-ready

coro (Contributor) commented Jan 31, 2022

I've just attempted to reproduce this on GKE with the above manifest by triggering a rolling upgrade of the Kubernetes nodes, and I can't reproduce it. I do see that when a node hosting a RabbitmqCluster Pod goes down for an upgrade, several reconcile loops are triggered, as the operator receives the delete event for the Pod it owns and then recreates the Pod. From my tests, it looks like about 4 reconcile loops are triggered per Pod.

I agree with @Zerpet; I think it's important to know whether the reconciles happened at exactly the same time as the node restarts, as we would expect a handful of reconciles per restart. Also, did the reconciles continue to occur long after the node restarts? On my system, the logs stop completely once the node upgrades are finished.
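
As an aside on why a burst of events collapses into only a handful of reconciles: controller-runtime keys every event as a namespace/name request on a deduplicating workqueue, so events that arrive while a reconcile for the same object is already pending coalesce into a single follow-up run. A minimal illustration using client-go's workqueue package (an illustration of the mechanism, not the operator's code):

package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// The same queue type that backs controller-runtime's event handling.
	q := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
	q.Add("rabbitmq/production-ready") // e.g. a StatefulSet update event
	q.Add("rabbitmq/production-ready") // e.g. a Service update event, same key
	fmt.Println(q.Len())               // prints 1: duplicate keys are coalesced
}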

github-actions bot commented Apr 2, 2022

This issue has been marked as stale due to 60 days of inactivity. Stale issues will be closed after a further 30 days of inactivity; please remove the stale label to prevent this from occurring.

github-actions bot added the stale label on Apr 2, 2022

github-actions bot commented May 2, 2022

Closing stale issue due to further inactivity.

github-actions bot added the closed-stale label and closed this as completed on May 2, 2022