
[Other] Yaook SCS cluster debugging #556

Open
cah-hbaum opened this issue Apr 8, 2024 · 6 comments
Labels
SCS-VP10 Related to tender lot SCS-VP10


@cah-hbaum
Contributor

This issue collects information, problems, and data about debugging and working with the Yaook SCS cluster. It will be closed when the parent issue is closed.

See #426

@cah-hbaum cah-hbaum added the SCS-VP10 Related to tender lot SCS-VP10 label Apr 8, 2024
@cah-hbaum cah-hbaum self-assigned this Apr 8, 2024
@anjastrunk
Contributor

I suggest logging/fixing each bug/problem in a separate issue, as done in #557, and listing these issues in #426 in the section "bug fixing". @cah-hbaum What do you think?

@cah-hbaum
Contributor Author

No, I think that would be too much overhead for no gain.
I would rather log everything here in separate comments and link issues or similar things if they're created externally.

I could also have done this in the separate issues already available for each standard, but most (or rather all) of the bugs and problems are cluster-related and not specific to a single standard.

@cah-hbaum
Contributor Author

cah-hbaum commented Apr 9, 2024

08-04-2024
The virtualized Yaook cluster broke over the weekend. The exact reason is unknown, but the symptom was that multiple OpenStack volumes managed by the OpenStack Cinder CSI driver were not being detached correctly; they just hung around indefinitely. Since our OpenStack policy doesn't allow users to reset volume states, I would have needed to involve our Operations team for this.
The problem may have stemmed from the fact that one of the worker nodes wasn't in a Ready state, so the Ceph instance couldn't run on it, which probably prevented the volumes from being detached.
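For reference, a minimal sketch of how such stuck volumes could be inspected, and how an admin could reset them (our user-level policy forbids the last step, which is why Operations would have been needed; volume IDs are placeholders):

```shell
# List Cinder volumes stuck in a transitional state
openstack volume list --status detaching

# Check which VolumeAttachment objects the Kubernetes side still holds
kubectl get volumeattachments

# Check the state of the nodes (one worker was not Ready)
kubectl get nodes

# Admin-only: force the volume state back so it can be detached/deleted
# openstack volume set --state available <volume-id>
```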

I tried to reset the Kubernetes cluster with yaook/k8s; this ejected the worker node, since the process failed because of problems with two of the master nodes and couldn't finish rejoining the previously broken worker.
The master nodes had problems connecting to various Debian repositories, probably because of high resource usage on the nodes.

After losing a second master node, I decided to reset the cluster completely, meaning deletion of all resources and a fresh cluster setup.

@cah-hbaum
Contributor Author

cah-hbaum commented Apr 10, 2024

Had some problems with the new cluster: images seemingly couldn't be uploaded, neither from local files nor from a linked location.
This turned out to be a problem with Glance and its secret containing the connection information for Ceph. The secret wasn't copied correctly into the other namespace, resulting in an incorrect key being distributed to Glance, which then couldn't access Ceph.
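A quick way to verify this kind of secret-copy problem is to decode the key material in both namespaces and compare it. The namespace and secret names below are placeholders; the actual names depend on the Yaook deployment:

```shell
# Decode the Ceph client key from the source namespace...
kubectl -n <ceph-namespace> get secret <ceph-client-secret> \
  -o jsonpath='{.data.key}' | base64 -d; echo

# ...and from the namespace Glance reads it from
kubectl -n <glance-namespace> get secret <copied-ceph-secret> \
  -o jsonpath='{.data.key}' | base64 -d; echo
```

If the two keys differ, the copy step went wrong; deleting the stale copy so it gets recreated (or copying it again correctly) restores Glance's access to Ceph.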

@cah-hbaum
Contributor Author

Problems are fixed for now (the fixes were already applied on Friday). The problem initially seemed to come from incorrectly created roles for the neutron-ovn-operator. After I fixed those manually, the ovnagents turned out to be the problem: they were created without the status key, because it wasn't available in the CRD. I needed to manually update the CRD and fix the ovnagents. After that was done, the cluster was running correctly.
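The checks involved could look roughly like the following; the CRD name and role name are assumptions and would need to match the actual Yaook operator manifests:

```shell
# Check whether the ovnagents CRD defines the status subresource
# (CRD name is a guess based on the Yaook naming scheme)
kubectl get crd <ovnagents-crd-name> \
  -o jsonpath='{.spec.versions[*].subresources}'; echo

# Inspect the RBAC rules the operator was actually given
kubectl describe clusterrole <neutron-ovn-operator-role>

# Replace the CRD with the corrected definition, then fix the objects
kubectl replace -f <fixed-ovnagents-crd>.yaml
```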

@cah-hbaum
Contributor Author

cah-hbaum commented May 21, 2024

Addendum from last week (~15.05.2024):

I tried to set up yaook/k8s in order to test the Kubernetes standards on an independent cluster that isn't in use by an overlying setup like yaook/operator.

To do this, I updated my already existing yaook/k8s git repository and pulled the latest version available.
This version was released after the so-called core-split, which essentially reworked the structure of the repository as well as the cluster build process.

With this new version, everything went smoothly until the calico-apiserver was supposed to come up. This wasn't possible, because the NoSchedule taints weren't removed from the worker nodes. I couldn't find a reason why this was the case, so I removed them manually, which allowed the setup process to finish.
This setup was then tested via the test script, which went through without problems.
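The manual taint removal was along these lines; the exact taint key is an assumption (kubeadm-style bootstraps typically leave e.g. `node.kubernetes.io/not-ready:NoSchedule` behind), so it needs to be read off the node first:

```shell
# Show the taints currently set on all nodes
kubectl get nodes \
  -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints'

# Remove a NoSchedule taint from a worker (trailing "-" means remove)
kubectl taint nodes <worker-node> node.kubernetes.io/not-ready:NoSchedule-
```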

Projects
Status: Doing
Development

No branches or pull requests

2 participants