Bug: Claudie failed to update machine types #1273

Closed
JKBGIT1 opened this issue Mar 13, 2024 · 3 comments
Labels: bug (Something isn't working), groomed (Task that everybody agrees to pass the gatekeeper)

JKBGIT1 commented Mar 13, 2024

Current Behaviour

I wanted to update the machine types of the Azure k8s worker nodes and of the Azure load balancer node. The workflow failed on InstallVPN in Ansibler with the following error in the ansibler logs:

2024-03-13T13:02:03Z WRN Retrying command ansible-playbook ../../ansible-playbooks/wireguard.yml -i inventory.ini -f 15 ... (4/5) module=ansibler
2024-03-13T13:07:05Z WRN Error encountered while executing ansible-playbook ../../ansible-playbooks/wireguard.yml -i inventory.ini -f 15  : exit status 2 module=ansibler
2024-03-13T13:07:05Z ERR failed to execute cmd: ansible-playbook ../../ansible-playbooks/wireguard.yml -i inventory.ini -f 15 : 
        azure-compute-lz9kdeo-1: failed
        task: Wait 300 seconds for target connection to become reachable/usable
        summary: timed out waiting for ping module test: [Errno None] Unable to connect to port 22 on 4.184.250.149 module=ansibler
2024-03-13T13:07:05Z INF Next retry in 160s... module=ansibler
2024-03-13T13:09:45Z WRN Retrying command ansible-playbook ../../ansible-playbooks/wireguard.yml -i inventory.ini -f 15 ... (5/5) module=ansibler
2024-03-13T13:14:46Z WRN Error encountered while executing ansible-playbook ../../ansible-playbooks/wireguard.yml -i inventory.ini -f 15  : exit status 2 module=ansibler
2024-03-13T13:14:46Z ERR Command ansible-playbook ../../ansible-playbooks/wireguard.yml -i inventory.ini -f 15  was not successful after 5 retries module=ansibler
2024-03-13T13:14:46Z ERR Error encountered while installing VPN error="error while running ansible for services/ansibler/server/clusters/jakub-ine9sp1-8i4gib3 : exit status 2:\n\tazure-compute-lz9kdeo-1: failed\n\ttask: Wait 300 seconds for target connection to become reachable/usable\n\tsummary: timed out waiting for ping module test: [Errno None] Unable to connect to port 22 on 4.184.250.149" cluster=jakub module=ansibler project=default-jakub
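For context, the failing task is a standard Ansible reachability check. A task of roughly this shape would produce the error above (a sketch only; the actual task in Claudie's wireguard.yml may differ):

# sketch of the kind of task that produces the timeout above
- name: Wait 300 seconds for target connection to become reachable/usable
  ansible.builtin.wait_for_connection:
    timeout: 300   # the 300-second wait referenced in the error summary

The timeout fires because SSH (port 22) on 4.184.250.149 never becomes reachable.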

Also, the builder pod was restarted after the failed run, so to see its error logs I had to use the --previous flag.
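Something like the following retrieves them (assuming the builder pod runs in the claudie namespace; the pod name below is illustrative):

# find the builder pod, then read logs from the previous (crashed) container
kubectl get pods -n claudie | grep builder
kubectl logs -n claudie builder-5d9c7f6b8-x2k4p --previous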

2024-03-13T12:34:39Z INF Config finished building module=builder project=default-jakub
2024-03-13T12:37:42Z INF Processing cluster cluster=jakub module=builder
2024-03-13T12:37:42Z INF Calling BuildInfrastructure on Terraformer cluster=jakub-ine9sp1 module=builder project=default-jakub
2024-03-13T12:39:29Z INF BuildInfrastructure on Terraformer finished successfully cluster=jakub-ine9sp1 module=builder project=default-jakub
2024-03-13T12:39:29Z INF Calling InstallVPN on Ansibler cluster=jakub-ine9sp1 module=builder project=default-jakub
2024-03-13T13:05:10Z INF Received signal terminated module=builder
2024-03-13T13:05:10Z INF Builder stopped checking for new configs module=builder
2024-03-13T13:05:10Z INF Waiting for already started configs to finish processing module=builder
2024-03-13T13:14:46Z ERR Failed to build cluster error="error in Ansibler for cluster jakub project default-jakub : error while calling InstallVPN on Ansibler: rpc error: code = Unknown desc = error encountered while installing VPN for cluster jakub project default-jakub : error while running ansible for services/ansibler/server/clusters/jakub-ine9sp1-8i4gib3 : exit status 2:\n\tazure-compute-lz9kdeo-1: failed\n\ttask: Wait 300 seconds for target connection to become reachable/usable\n\tsummary: timed out waiting for ping module test: [Errno None] Unable to connect to port 22 on 4.184.250.149" cluster=jakub module=builder
2024-03-13T13:14:46Z ERR Error encountered while processing config error="error in Ansibler for cluster jakub project default-jakub : error while calling InstallVPN on Ansibler: rpc error: code = Unknown desc = error encountered while installing VPN for cluster jakub project default-jakub : error while running ansible for services/ansibler/server/clusters/jakub-ine9sp1-8i4gib3 : exit status 2:\n\tazure-compute-lz9kdeo-1: failed\n\ttask: Wait 300 seconds for target connection to become reachable/usable\n\tsummary: timed out waiting for ping module test: [Errno None] Unable to connect to port 22 on 4.184.250.149" module=builder project=default-jakub
2024-03-13T13:14:46Z INF Stopping Builder : http: Server closed module=builder

Expected Behaviour

Claudie updates the machine types of the nodes in the running cluster without any issues.

Steps To Reproduce

  1. Apply this InputManifest (the apply command is sketched below the manifest):
apiVersion: claudie.io/v1beta1
kind: InputManifest
metadata:
  name: jakub
  namespace: default
spec:
  providers:
    - name: azure-1
      providerType: azure
      secretRef:
        name: azure-sponsorship-secret
        namespace: default
    - name: azure-2
      providerType: azure
      secretRef:
        name: azure-berops-secret
        namespace: default
  nodePools:
    dynamic:
      - name: azure-control
        providerSpec:
          name: azure-1
          region: Germany West Central
          zone: "1"
        count: 1
        serverType: Standard_B2ms
        image: Canonical:0001-com-ubuntu-minimal-jammy:minimal-22_04-lts:22.04.202212120         
      - name: azure-compute
        providerSpec:
          name: azure-1
          region: Germany West Central
          zone: "1"
        autoscaler:
          min: 1
          max: 12
        serverType: Standard_B2ms
        image: Canonical:0001-com-ubuntu-minimal-jammy:minimal-22_04-lts:22.04.202212120         
      - name: azure-lb
        providerSpec:
          name: azure-1
          region: Germany West Central
          zone: "1"
        count: 1
        serverType: Standard_B2ms
        image: Canonical:0001-com-ubuntu-minimal-jammy:minimal-22_04-lts:22.04.202212120         
  kubernetes:
    clusters:
      - name: jakub
        version: v1.26.3
        network: 192.168.2.0/24
        pools:
          control:
            - azure-control
          compute:
            - azure-compute
  loadBalancers:
    roles:
      - name: apiserver-lb
        protocol: tcp
        port: 6443
        targetPort: 6443
        target: k8sControlPlane
    clusters:
      - name: jakub
        roles:
          - apiserver-lb
        dns:
          dnsZone: azure.e2e.claudie.io
          provider: azure-2
          hostname: jakub
        targetedK8s: jakub
        pools:
          - azure-lb
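Assuming the manifest is saved as jakub-inputmanifest.yaml (the filename is illustrative), it can be applied with:

kubectl apply -f jakub-inputmanifest.yaml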
  2. When the workflow finishes, change the serverType of azure-compute to Standard_B4ms and the serverType of azure-lb to Standard_B2s. After the changes, the manifest should look like this:
apiVersion: claudie.io/v1beta1
kind: InputManifest
metadata:
  name: jakub
  namespace: default
spec:
  providers:
    - name: azure-1
      providerType: azure
      secretRef:
        name: azure-sponsorship-secret
        namespace: default
    - name: azure-2
      providerType: azure
      secretRef:
        name: azure-berops-secret
        namespace: default
  nodePools:
    dynamic:
      - name: azure-control
        providerSpec:
          name: azure-1
          region: Germany West Central
          zone: "1"
        count: 1
        serverType: Standard_B2ms
        image: Canonical:0001-com-ubuntu-minimal-jammy:minimal-22_04-lts:22.04.202212120         
      - name: azure-compute
        providerSpec:
          name: azure-1
          region: Germany West Central
          zone: "1"
        autoscaler:
          min: 1
          max: 12
        serverType: Standard_B4ms
        image: Canonical:0001-com-ubuntu-minimal-jammy:minimal-22_04-lts:22.04.202212120         
      - name: azure-lb
        providerSpec:
          name: azure-1
          region: Germany West Central
          zone: "1"
        count: 1
        serverType: Standard_B2s
        image: Canonical:0001-com-ubuntu-minimal-jammy:minimal-22_04-lts:22.04.202212120         
  kubernetes:
    clusters:
      - name: jakub
        version: v1.26.3
        network: 192.168.2.0/24
        pools:
          control:
            - azure-control
          compute:
            - azure-compute
  loadBalancers:
    roles:
      - name: apiserver-lb
        protocol: tcp
        port: 6443
        targetPort: 6443
        target: k8sControlPlane
    clusters:
      - name: jakub
        roles:
          - apiserver-lb
        dns:
          dnsZone: azure.e2e.claudie.io
          provider: azure-2
          hostname: jakub
        targetedK8s: jakub
        pools:
          - azure-lb
  3. Apply the InputManifest from the previous step.

Anything else to add

It may be worth checking whether this error also appears with other cloud providers.

JKBGIT1 added the bug label on Mar 13, 2024

JKBGIT1 commented Mar 13, 2024

One more thing: kubectl delete inputmanifests.claudie.io jakub didn't work either. It restarted the builder pod and didn't leave any specific logs.
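If the delete hangs like this, one generic way to check whether the resource is stuck on a finalizer (a debugging sketch, not Claudie-specific guidance):

# prints any finalizers still blocking deletion of the InputManifest
kubectl get inputmanifests.claudie.io jakub -o jsonpath='{.metadata.finalizers}'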

JKBGIT1 added the groomed label on Mar 15, 2024
bernardhalas commented

This is likely because Terraformer destroyed and re-created the nodes under different IPs.

Some cloud providers allow in-place modification of machine types, but not all.
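For example, Azure itself supports resizing an existing VM in place (the resource group and VM name below are placeholders; the resize typically restarts the VM):

az vm resize --resource-group my-rg --name my-vm --size Standard_B4ms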

Three options here:

  • we run destroy + create workflows for the affected nodepools
  • we disallow the modification of the machine type field and recommend that users always add new nodepools and delete the old ones (sketched below)
  • we allow the in-place modification of the machine type field only on providers which support in-place node type modifications
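For illustration, option 2 would look roughly like this in the nodePools section. azure-compute-v2 is a hypothetical replacement pool; it would also be added under kubernetes.clusters[].pools.compute, and azure-compute would be removed in a later apply once the new nodes are up:

nodePools:
  dynamic:
    # old pool, kept temporarily and removed in a follow-up apply
    - name: azure-compute
      providerSpec:
        name: azure-1
        region: Germany West Central
        zone: "1"
      autoscaler:
        min: 1
        max: 12
      serverType: Standard_B2ms
      image: Canonical:0001-com-ubuntu-minimal-jammy:minimal-22_04-lts:22.04.202212120
    # new pool with the desired machine type
    - name: azure-compute-v2
      providerSpec:
        name: azure-1
        region: Germany West Central
        zone: "1"
      autoscaler:
        min: 1
        max: 12
      serverType: Standard_B4ms
      image: Canonical:0001-com-ubuntu-minimal-jammy:minimal-22_04-lts:22.04.202212120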


Despire commented May 15, 2024

Closing this issue as resolved by immutable nodepools (#1378).

Despire closed this as completed on May 15, 2024