
[GCP Provider] Always clean-up on breaking errors when deploying K3s cluster on GCP #542

Open
LarsBingBong opened this issue May 15, 2023 · 7 comments

@LarsBingBong
Contributor

I'm using KCLI version: 99.0 commit: d4befb7 2023/05/15,

and if I run into a breaking error, e.g.:

<-------------> Checking whether the remote cluster - gcp-test - is already deployed <-------------> 
The remote gcp-test cluster is currently NOT deployed, continuing!
<-------------> Creating VM's & deploying K3s on the following client --- <------------->
Deleting directory /home/linuxlars/.kcli/clusters/gcp-test
Deleting loadbalancer api.gcp-test
Using keepalived virtual_router_id 73
Deploying Images...
Image ubuntu-minimal-2204-lts skipped!
Deploying Vms...
Traceback (most recent call last):
  File "/home/linuxbrew/.linuxbrew/bin/kcli", line 8, in <module>
    sys.exit(cli())
             ^^^^^
  File "/home/linuxbrew/.linuxbrew/opt/python@3.11/lib/python3.11/site-packages/kvirt/cli.py", line 5112, in cli
    args.func(args)
  File "/home/linuxbrew/.linuxbrew/opt/python@3.11/lib/python3.11/site-packages/kvirt/cli.py", line 1906, in create_k3s_kube
    create_kube(args)
  File "/home/linuxbrew/.linuxbrew/opt/python@3.11/lib/python3.11/site-packages/kvirt/cli.py", line 1882, in create_kube
    result = config.create_kube(cluster, kubetype, overrides=overrides)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/linuxbrew/.linuxbrew/opt/python@3.11/lib/python3.11/site-packages/kvirt/config.py", line 2755, in create_kube
    result = self.create_kube_k3s(cluster, overrides=overrides)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/linuxbrew/.linuxbrew/opt/python@3.11/lib/python3.11/site-packages/kvirt/config.py", line 2782, in create_kube_k3s
    return k3s.create(self, plandir, cluster, overrides)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/linuxbrew/.linuxbrew/opt/python@3.11/lib/python3.11/site-packages/kvirt/cluster/k3s/__init__.py", line 156, in create
    result = config.plan(plan, inputfile=f'{plandir}/bootstrap.yml', overrides=bootstrap_overrides)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/linuxbrew/.linuxbrew/opt/python@3.11/lib/python3.11/site-packages/kvirt/config.py", line 2203, in plan
    result = self.create_vm(name, profilename, overrides=currentoverrides, customprofile=profile, k=z,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/linuxbrew/.linuxbrew/opt/python@3.11/lib/python3.11/site-packages/kvirt/config.py", line 1051, in create_vm
    result = k.create(name=name, virttype=virttype, plan=plan, profile=profilename, flavor=flavor,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/linuxbrew/.linuxbrew/opt/python@3.11/lib/python3.11/site-packages/kvirt/providers/gcp/__init__.py", line 181, in create
    conn.disks().insert(zone=zone, project=project, body=info).execute()
  File "/home/linuxbrew/.linuxbrew/opt/python@3.11/lib/python3.11/site-packages/googleapiclient/_helpers.py", line 130, in positional_wrapper
    return wrapped(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/linuxbrew/.linuxbrew/opt/python@3.11/lib/python3.11/site-packages/googleapiclient/http.py", line 938, in execute
    raise HttpError(resp, content, uri=self.uri)
googleapiclient.errors.HttpError: <HttpError 409 when requesting https://compute.googleapis.com/compute/v1/projects/nem-kubernemlig-poc/zones/europe-west1-b/disks?alt=json returned "The resource 'projects/nem-kubernemlig-poc/zones/europe-west1-b/disks/gcp-test-ctlplane-0-disk1' already exists". Details: "[{'message': "The resource 'projects/nem-kubernemlig-poc/zones/europe-west1-b/disks/gcp-test-ctlplane-0-disk1' already exists", 'domain': 'global', 'reason': 'alreadyExists'}]">
chmod: cannot access '/home/linuxlars/.kcli/clusters/gcp-test/auth/kubeconfig': No such file or directory

This causes the deploy to stop/break because a previous deploy hit a breaking error ... apparently not all remnants of a broken deploy are automatically cleaned up by KCLI.

Having that done would be lovely.
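
For illustration, here is a minimal sketch (not kcli code) of the kind of clean-up being asked for, written against the same google-api-python-client Compute API the traceback shows (conn.disks()); the project, zone and disk names are placeholders. If the disk insert hits a 409 alreadyExists, the leftover disk from the previous broken deploy is deleted and the insert retried once:

# Sketch only: clean up a leftover disk when the insert conflicts with one
# left behind by a previous broken deploy, then retry the insert once.
from googleapiclient import discovery
from googleapiclient.errors import HttpError

def create_disk_with_cleanup(conn, project, zone, body):
    """Insert a disk; on a 409 conflict, delete the leftover disk and retry once."""
    try:
        return conn.disks().insert(project=project, zone=zone, body=body).execute()
    except HttpError as e:
        if e.resp.status != 409:
            raise
        # The disk already exists from a broken deploy: delete it and wait.
        op = conn.disks().delete(project=project, zone=zone, disk=body['name']).execute()
        conn.zoneOperations().wait(project=project, zone=zone, operation=op['name']).execute()
        return conn.disks().insert(project=project, zone=zone, body=body).execute()

conn = discovery.build('compute', 'v1')
create_disk_with_cleanup(conn, 'my-project', 'europe-west1-b',
                         {'name': 'gcp-test-ctlplane-0-disk1', 'sizeGb': '10'})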

@LarsBingBong LarsBingBong changed the title Always clean-up on breaking errors when deploying K3s cluster on GCP [GCP Provider] Always clean-up on breaking errors when deploying K3s cluster on GCP May 15, 2023
@karmab
Owner

karmab commented May 15, 2023

Let's revisit once the workflow is actually known to work

@karmab karmab closed this as completed May 15, 2023
@larssb
Contributor

larssb commented Apr 3, 2024

Can we please open this again? We have the workflow working now and are also using KCLI scale for autoscaling via an in-cluster workload. Sometimes, for any number of reasons, KCLI fails to provision a new worker into the cluster, with a failure such as:

clm-autoscaling-up-7hhdf-kl958 knl-clm [DEBUG]: The generated new nodes YAML is: test-test-worker-xx:
clm-autoscaling-up-7hhdf-kl958 knl-clm   nets:                                                       
clm-autoscaling-up-7hhdf-kl958 knl-clm     - name: knl-test                                          
clm-autoscaling-up-7hhdf-kl958 knl-clm       ip: 192.168.x.x                                 
clm-autoscaling-up-7hhdf-kl958 knl-clm       public: false                                           
clm-autoscaling-up-7hhdf-kl958 knl-clm   cmds:                                                       
clm-autoscaling-up-7hhdf-kl958 knl-clm     - bash /root/worker.sh --diskCount 2 --provider gcp       
clm-autoscaling-up-7hhdf-kl958 knl-clm   files:                                                      
clm-autoscaling-up-7hhdf-kl958 knl-clm     - path: /root/worker.sh                                   
clm-autoscaling-up-7hhdf-kl958 knl-clm       currentdir: True                                        
clm-autoscaling-up-7hhdf-kl958 knl-clm       origin: ~/iac-conductor/kubernetes/deploy/bootstrapping/worker.sh
clm-autoscaling-up-7hhdf-kl958 knl-clm     - path: /etc/udev/longhorn-data-disk-add.sh               
clm-autoscaling-up-7hhdf-kl958 knl-clm       currentdir: True                                        
clm-autoscaling-up-7hhdf-kl958 knl-clm       origin: ~/iac-conductor/kubernetes/deploy/cluster-configuration/storage/longhorn-data-disk-add.sh
clm-autoscaling-up-7hhdf-kl958 knl-clm     - path: /etc/logrotate.d/rsyslog                          
clm-autoscaling-up-7hhdf-kl958 knl-clm       currentdir: True                                        
clm-autoscaling-up-7hhdf-kl958 knl-clm       origin: ~/iac-conductor/kubernetes/deploy/bootstrapping/rsyslog  
clm-autoscaling-up-7hhdf-kl958 knl-clm   tags:                                                       
clm-autoscaling-up-7hhdf-kl958 knl-clm     - ssh-enabled-server                                      
clm-autoscaling-up-7hhdf-kl958 knl-clm     - knl-test-node                                           
clm-autoscaling-up-7hhdf-kl958 knl-clm [DEBUG]: The determined new worker count is going to be: 4.   
clm-autoscaling-up-7hhdf-kl958 knl-clm Exception in thread Thread-1 (threaded_create_vm):            
clm-autoscaling-up-7hhdf-kl958 knl-clm Traceback (most recent call last):                            
clm-autoscaling-up-7hhdf-kl958 knl-clm   File "/usr/local/lib/python3.11/threading.py", line 1038, in _bootstrap_inner                        
clm-autoscaling-up-7hhdf-kl958 knl-clm     self.run()                                                
clm-autoscaling-up-7hhdf-kl958 knl-clm   File "/usr/local/lib/python3.11/threading.py", line 975, in run      
clm-autoscaling-up-7hhdf-kl958 knl-clm     self._target(*self._args, **self._kwargs)                 
clm-autoscaling-up-7hhdf-kl958 knl-clm   File "/usr/local/lib/python3.11/site-packages/kvirt/config.py", line 3322, in threaded_create_vm     
clm-autoscaling-up-7hhdf-kl958 knl-clm     result = self.create_vm(name, profilename, overrides=currentoverrides, customprofile=profile, k=z, 
clm-autoscaling-up-7hhdf-kl958 knl-clm              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 
clm-autoscaling-up-7hhdf-kl958 knl-clm   File "/usr/local/lib/python3.11/site-packages/kvirt/config.py", line 956, in create_vm               
clm-autoscaling-up-7hhdf-kl958 knl-clm     result = k.create(name=name, virttype=virttype, plan=plan, profile=profilename, flavor=flavor,     
clm-autoscaling-up-7hhdf-kl958 knl-clm              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^     
clm-autoscaling-up-7hhdf-kl958 knl-clm   File "/usr/local/lib/python3.11/site-packages/kvirt/providers/gcp/__init__.py", line 285, in create  
clm-autoscaling-up-7hhdf-kl958 knl-clm     conn.disks().insert(zone=zone, project=project, body=info).execute()                               
clm-autoscaling-up-7hhdf-kl958 knl-clm   File "/usr/local/lib/python3.11/site-packages/googleapiclient/_helpers.py", line 130, in positional_wrapper
clm-autoscaling-up-7hhdf-kl958 knl-clm     return wrapped(*args, **kwargs)                           
clm-autoscaling-up-7hhdf-kl958 knl-clm            ^^^^^^^^^^^^^^^^^^^^^^^^                           
clm-autoscaling-up-7hhdf-kl958 knl-clm   File "/usr/local/lib/python3.11/site-packages/googleapiclient/http.py", line 938, in execute         
clm-autoscaling-up-7hhdf-kl958 knl-clm     raise HttpError(resp, content, uri=self.uri)              
clm-autoscaling-up-7hhdf-kl958 knl-clm googleapiclient.errors.HttpError: <HttpError 409 when requesting https://compute.googleapis.com/compute/v1/projects/kubernemlig-test/zones/europe

... which could happen when autoscaling, a little later, tries to scale the node in.

So, same suggestion and wish: it would be great if KCLI cleaned up in the cases where it fails to properly create a VM on the underlying HCI - in this case GCE on GCP.

Thank you.

@karmab karmab reopened this Apr 3, 2024
@larssb
Contributor

larssb commented Apr 12, 2024

Further info: we're usually seeing this when a VM previously existed and there are therefore dangling disks left on the underlying HCI - in this case GCE on GCP.

See the attached screenshot.

This results in the node never coming up, and in the logs Waiting 5s for 8 nodes to have a Pod CIDR assigned eventually times out ... which is a good thing - at least the Pod CIDR check is not an endless loop.


To counter this situation we'll have to implement some logic in our auto-scaling code that checks whether there are any remnants of the worker to be scaled in on the underlying HCI (e.g. the VM or its disks ...).
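
As a rough illustration of that check, here is a sketch against the Compute API (not kcli code; the helper name, project and zone are made up): for a given worker name it reports whether the instance itself still exists and which disks still look like they belong to it.

# Sketch only: report remnants (instance and/or disks) of a worker on GCE.
from googleapiclient import discovery
from googleapiclient.errors import HttpError

def worker_remnants(conn, project, zone, worker_name):
    """Return (instance_exists, leftover_disk_names) for worker_name."""
    try:
        conn.instances().get(project=project, zone=zone, instance=worker_name).execute()
        instance_exists = True
    except HttpError as e:
        if e.resp.status != 404:
            raise
        instance_exists = False
    leftovers = []
    request = conn.disks().list(project=project, zone=zone)
    while request is not None:
        response = request.execute()
        leftovers += [d['name'] for d in response.get('items', [])
                      if d['name'] == worker_name or d['name'].startswith(worker_name + '-')]
        request = conn.disks().list_next(previous_request=request, previous_response=response)
    return instance_exists, leftovers

conn = discovery.build('compute', 'v1')
print(worker_remnants(conn, 'my-project', 'europe-west1-b', 'test-test-worker-3'))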

@larssb
Contributor

larssb commented Apr 13, 2024

Hard to counter this when kcli -o yaml (or --output yaml) list disks | yq 'to_yaml' throws:

$ kcli --client gcp-test -o yaml list disks | yq 'to_yaml'
Error: bad file '-': yaml: control characters are not allowed

I'll try parsing the plain output from list disks ... with grep and so forth ...


Further, it would be great if there was a kcli get disk DISK_NAME - i.e. a lookup of one specific disk. Then I could check whether a single disk out of the set of disks previously used by a VM is still on the underlying HCI and get on with it.
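
In the meantime, a direct point lookup against the Compute API looks roughly like this (a sketch, not kcli; the names are placeholders) - a GET on one disk answers with 404 when it is gone, so there is no need to list every disk in the project:

# Sketch only: check whether one specific disk still exists on GCE.
from googleapiclient import discovery
from googleapiclient.errors import HttpError

def disk_exists(conn, project, zone, disk_name):
    """Return True if disk_name is still present in the given project/zone."""
    try:
        conn.disks().get(project=project, zone=zone, disk=disk_name).execute()
        return True
    except HttpError as e:
        if e.resp.status == 404:
            return False
        raise

conn = discovery.build('compute', 'v1')
print(disk_exists(conn, 'my-project', 'europe-west1-b', 'gcp-test-worker-0-disk1'))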

Thoughts @karmab ?

@larssb
Contributor

larssb commented Apr 13, 2024

Okay, filtering with grep was actually quite easy: kcli --client gcp-test list disks | grep "^|.*VM_NAME". It would still be great to have a command that gets the disks of a potentially still-existing VM, or of one that no longer exists but whose disks were not deleted. Assuming - and that's the important part - that implementing such a feature makes this kind of lookup more efficient than listing all disks in a specific HCI project, which in theory should become slower as more disks are added ... as clusters grow and what not.

Thanks

@karmab
Owner

karmab commented May 22, 2024

Sorry I didn't get back to this before...
I have mixed feelings about this. It seems the real issue is that when you scale, you have a worker whose disks already exist (maybe from a previous run or something). I think that's what needs to be addressed.
Trying to undo everything because something might break in the middle looks cumbersome to me. In such cases it's better to either delete the cluster altogether or at least the VM (which should take care of deleting any disks associated with it).
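
For reference, the "delete the VM" route taken directly against the Compute API looks roughly like this (placeholder names; a sketch rather than kcli's implementation). One caveat: deleting the instance only removes disks that were attached with autoDelete set, so a disk that was created but never attached (e.g. when a deploy broke between the disk insert and the instance insert) would still be left behind.

# Sketch only: delete a worker instance and wait for the operation to finish.
from googleapiclient import discovery

def delete_worker_vm(conn, project, zone, name):
    """Delete the instance; disks attached with autoDelete=true go with it."""
    op = conn.instances().delete(project=project, zone=zone, instance=name).execute()
    conn.zoneOperations().wait(project=project, zone=zone, operation=op['name']).execute()

conn = discovery.build('compute', 'v1')
delete_worker_vm(conn, 'my-project', 'europe-west1-b', 'gcp-test-worker-0')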

@larssb
Contributor

larssb commented May 22, 2024

We worked around this by:

  • checking whether disks already exist for a worker named x.y.z - and, if so, moving on to the next available worker name (roughly sketched below)

However, I still think it would be great if KCLI cleaned up after itself in the cases where it can't successfully create a node and dangling disks are left over.

Deleting the cluster altogether is pretty tough when the feature having issues is our auto-scaling feature. There's nothing wrong with the state of the cluster itself; the problem is rather that there are leftovers on the underlying HCI that the cluster runs on top of.
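
For completeness, a rough sketch of that name-skipping check (illustrative names only, not our actual autoscaling code): list the zone's disk names once, then pick the first <cluster>-worker-N that no existing disk belongs to.

# Sketch only: pick the first worker name with no remnant disks on GCE.
from googleapiclient import discovery

def next_clean_worker_name(conn, project, zone, cluster, max_tries=50):
    """Return the first '<cluster>-worker-N' that has no leftover disks."""
    existing = set()
    request = conn.disks().list(project=project, zone=zone)
    while request is not None:
        response = request.execute()
        existing.update(d['name'] for d in response.get('items', []))
        request = conn.disks().list_next(previous_request=request, previous_response=response)
    for i in range(max_tries):
        candidate = f'{cluster}-worker-{i}'
        if not any(n == candidate or n.startswith(candidate + '-') for n in existing):
            return candidate
    raise RuntimeError(f'no clean worker name found after {max_tries} tries')

conn = discovery.build('compute', 'v1')
print(next_clean_worker_name(conn, 'my-project', 'europe-west1-b', 'gcp-test'))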
