k8s-dqlite spiking cpu core to 100% #3227
I am facing the same issue on multiple AWS EC2s, each running a single-node MicroK8s instance. MicroK8s version: 1.23 classic. Enabled addons:
For example, here is a screenshot of htop on an EC2.XLarge (16 GB memory): MicroK8s was running smoothly until this week. On the other hand, instances running MicroK8s version 1.18 were not affected.
We found similar results, with the dqlite service on the leader node hitting 100% usage. As mentioned in a few other issues, dqlite is sensitive to slow disk performance. In other scenarios, such as a node drain, it took a while to write to the database. You should see something similar in the logs. We monitored the CPU and RAM utilisation for dqlite and etcd while idle.
I'm seeing something similar. I was running a 4-node HA cluster but it failed (see #3735), so I removed 2 nodes to disable HA mode and hopefully restore quorum; I'm now running 2 nodes, 1 of which is master. The master has a dqlite process rammed at 100% CPU. Running
The hardware isn't exactly a rocketship, but all my nodes are i5-6500T machines with 4 cores, 16 GB memory, and a 256 GB SSD, which should be adequate. Most of my workloads are not running at the moment, either.
Hi all. I work on dqlite, and I'm going to try to figure out what's causing these CPU usage spikes. If you're experiencing this issue on a continuing basis and are in a position to collect some diagnostic info, I could use your help! Short of a reproducer that I can run myself (which I realize is difficult for this kind of complex system), the data I'd find most helpful would be a sampling profiler report showing where the k8s-dqlite process is spending its time during one of these spikes. A separate report for the same process and workload during a period of nominal CPU usage would also be great, so I can compare the two and see if anything stands out. You can gather this information as follows:
If the spikes last long enough that you can notice one happening and have time to ssh in and gather data before it ends, do that. Otherwise, since it's probably not feasible to just leave perf running (perf.data would get too huge), you could have a script like
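A minimal sketch of such a script (assumptions: `perf` is installed, a single k8s-dqlite process is running, and a five-minute window is acceptable; paths and options are illustrative, not prescriptive):

```shell
#!/bin/sh
# Sketch: re-record in 5-minute windows so perf.data never grows unbounded.
# Only the most recent completed window is kept at /tmp/perf.data.
PID=$(pgrep -o k8s-dqlite) || exit 1
while true; do
    # timeout sends SIGTERM after 300s; perf record finalizes its output on TERM
    timeout 300 perf record -F 99 -g -p "$PID" -o /tmp/perf.data.new
    mv -f /tmp/perf.data.new /tmp/perf.data
done
```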
Or with a longer timeout, etc. Then if you notice after the fact that a spike has occurred, you hopefully still have the perf.data file for that time period around. Thanks for your reports -- let's get this issue fixed!
Hey @cole-miller, thanks for looking at this. I've done a
With those caveats, here's my attached
@cole-miller likewise, k8s-dqlite doesn't spike; it's just at 100% all the time.
It seems like a lot of people are affected by this issue.
As well as my prod cluster being affected by this, last week I quickly threw together a 3-node MicroK8s cluster on v1.26 in VirtualBox to test something. No workloads. Initially it worked normally, but then I shut down all 3 VMs. When I booted them up again later, I had the dqlite 100% CPU problem. I didn't have time to look into it as I was working on something else, but it does show that it can happen on a new cluster that hasn't been "messed with".
I understand that MicroK8s is free software, no guarantees, etc., but it is run by a prominent company like Canonical, so it is surprising/disappointing that there are quite a few serious, long-standing issues, affecting multiple users, that don't appear to be getting much attention from maintainers (for example this one, #3735 and #3204). I have been running MicroK8s since v1.17 I think, and it has generally been rock solid and only broken when I've fiddled with it. Since v1.25/v1.26 there seem to be chronic issues affecting its stability. I have personally lost data due to this dqlite CPU problem (long story short: I lost dqlite quorum, which broke the kube api-server, but I was using OpenEBS/cStor clustered storage, which depends on kube quorum for its own quorum. When it lost quorum and the kube api-server became silently read-only, the storage controller got itself into a bad state and volumes could not be mounted). A lot of people are talking about switching to k3s, and I don't want to be that guy who rants about switching, but it is something I will consider doing at the next refresh. I note that k3s ditched dqlite in favour of etcd in v1.19. I don't know what their reasons were, but it was probably a good move.
Hi Jonathan, I lead Kubernetes at Canonical and I want to firstly offer both an apology and a thank you for your help thus far. We know that there are situations where people like yourself are using MicroK8s and suddenly, unbeknownst to you, something goes wrong; not being able to easily solve that only compounds the problem. Our ambition for MicroK8s is to keep it as simple as possible, both in day-to-day operation and in upgrading. I want to thank you again for taking the time to engage with us and help us try and improve our projects. Projects plural, as dqlite is also something we build and maintain, hence our investment in building it as a low-ops K8s database backend. (That said, there is the option to run etcd with MicroK8s should you desire to.) This resilience issue is being taken extremely seriously, and we are configuring machines to try to reproduce your environment, as we understand it, to the best of our abilities, and to work with our dqlite team counterparts to resolve any performance issues. (Please do let us know what your storage configuration is, localpath etc.) I believe one thing that really sets MicroK8s apart from alternatives is that we have no secret agenda here. All that said, I will do the following immediately:
Thanks @AlexsJones, I really appreciate the detailed response. It's good to hear that there is more going on behind the scenes than I was aware of. I'll be happy to help with testing and providing inspection reports, etc. I've generally had really good experiences interacting with the Canonical team on previous issues (@ktsakalozos in particular has been really helpful). My specific environment is 4 identical hardware nodes, each with a SATA SSD which holds the OS (Ubuntu 22.04 LTS) and also the snap data, so that will include the dqlite database. Each node also has an M.2 NVMe which is claimed by OpenEBS/cStor for use as clustered storage. I don't use any localpath storage. I do also have an off-cluster TrueNAS server which provides NFS volumes via a RWX storage class. I'm actually part-way through a series of blog posts, and the first part covers my architecture. The third part was going to be about OpenEBS/cStor, but then it all went wrong, so I'm holding off on writing that!
Thanks for the additional detail. I've set up a bare-metal cluster (albeit at a lower scale than yours) and will look to install OpenEBS/cStor with rook-ceph. We will then conduct a soak test and a variety of interrupts to generate data.
Thanks. I installed my cStor from Helm directly, rather than using Rook. I've just made a gist with my values file so you can create a similar environment if you need to, although I believe the root cause of my problem was with dqlite rather than cStor.
It's still worth investigating, as there might be a disk-activity correlation. Will set this up now.
Are you using CephFS or RBD? If so, how's that interacting with the cStor SC?
I'm not using CephFS or RBD, only OpenEBS & cStor.
Okay, I saw it was mentioned at the top of the thread, thanks!
No worries. This isn't my thread originally, but I am affected by the same dqlite CPU spike.
Hi @djjudas21, @doctorpangloss -- thanks very much for uploading the perf files, they're quite useful for narrowing down the root of the problem. Here are the resulting flamegraphs:
As you can see, CPU usage is dominated by calls to sqlite3_step. Looking at just the children of that function, the big contributions are from SQLite code, with some contributions also from dqlite's custom VFS (about 14% of the grand total), much of which boils down to calls to memcpy (9%). So my preliminary conclusion is that most of the CPU cycles are spent in SQLite, in which case it's likely the route to fixing this problem lies in optimizing the requests sent by microk8s to dqlite (via the kine bridge). But I'll continue to investigate whether any parts of dqlite stand out as causing excessive CPU usage.
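For anyone who wants to render their own capture the same way, the usual pipeline (assuming Brendan Gregg's FlameGraph scripts; the file names are placeholders) looks like:

```shell
# Convert a perf capture into an interactive SVG flamegraph
git clone --depth 1 https://github.com/brendangregg/FlameGraph
perf script -i perf.data > out.stacks
./FlameGraph/stackcollapse-perf.pl out.stacks > out.folded
./FlameGraph/flamegraph.pl out.folded > k8s-dqlite.svg
```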
One possible issue: dqlite runs sqlite3_step in the libuv main thread, so if calls to sqlite3_step are taking quite a long time then we're effectively blocking the event loop -- which could have bad downstream consequences like Raft requests timing out and causing leadership churn. @djjudas21, @doctorpangloss, and anyone else who's experiencing this issue, it'd be very helpful if you could follow these steps to generate some data about the distribution of time spent in sqlite3_step:
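One way to collect such a distribution (a sketch, not necessarily the exact steps; it assumes bpftrace is installed, that the sqlite3_step symbol is visible in the binary, and that the path below matches your snap revision) is to probe sqlite3_step directly:

```shell
# Latency histogram (microseconds) for sqlite3_step inside k8s-dqlite.
# The binary path is an assumption; adjust it for your snap revision.
sudo bpftrace -e '
uprobe:/snap/microk8s/current/bin/k8s-dqlite:sqlite3_step { @t[tid] = nsecs; }
uretprobe:/snap/microk8s/current/bin/k8s-dqlite:sqlite3_step /@t[tid]/ {
    @usec = hist((nsecs - @t[tid]) / 1000);
    delete(@t[tid]);
}'
```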
Thanks again for your willingness to help debug this issue!
@cole-miller I could capture the requested histogram during about 3 minutes of such an event:
This is a 3-node cluster with two 12-CPU nodes and one 2-CPU node. The peculiarity is that there is a relatively high (~10 ms) latency between nodes 1, 2 and node 4. Below is the CPU load measurement during the event. This event was on the little node, but it can equally happen on the big nodes, sending CPU load to 300% (apparently due to iowait, but I am not sure).
As a side note, I have always wondered what dqlite is doing to consume 0.2 CPU when the cluster is otherwise idle, although I don't want to divert this thread if this is unrelated.
@sbidoul, @doctorpangloss -- thanks! It does look like a substantial number of our calls to sqlite3_step are taking milliseconds to tens of milliseconds, which is not outrageous, but is long enough that we should probably try to move those calls out of the main thread. (We will want to do this in any case for the experimental disk mode.) I will try to work up a patch that does that, and we'll make it available for you to try out when it's ready. (I don't know what to think about the small number of calls that took entire seconds to complete.)
Would it be helpful to have a reproducible environment? I can give you access to a hardware cluster reproduction of what is executing here.
@doctorpangloss Yes, that would be very helpful! My SSH keys are listed at https://github.com/cole-miller.keys, and you can contact me at cole.miller@canonical.com to share any private details.
@sbidoul I had some trouble generating a flamegraph. Re: your second graph, I'm a little confused, because it looks like CPU usage for dqlite goes down during the spike event, and it's the kubelite process that's responsible for the spike. Am I misinterpreting?
The main thing that dqlite has to do even in the steady state, where it's not receiving any client requests, is to exchange Raft "heartbeat" messages with other nodes so that they don't think it has crashed. If you can gather perf data for one of those idle periods, I'd be happy to try to interpret the results (it would be educational for me too, and we might uncover something unexpected).
@ole1986 I have another suggestion: you can connect to dqlite while the cluster is running
and check if there is too much to compact. Here is an example, with results from my cluster. The first one is too big; 999 is the minimum possible value.
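On MicroK8s, one way to open such a SQL session against the running datastore might look like this (the binary location and the backend paths are assumptions about the snap layout, so verify them on your system):

```shell
# Open a SQL shell against the "k8s" database served by k8s-dqlite
B=/var/snap/microk8s/current/var/kubernetes/backend
sudo /snap/microk8s/current/bin/dqlite \
  -s "file://$B/cluster.yaml" -c "$B/cluster.crt" -k "$B/cluster.key" k8s
# At the prompt, a query like the following shows which keys hold the
# most uncompacted history:
#   SELECT name, COUNT(*) FROM kine GROUP BY name ORDER BY 2 DESC LIMIT 10;
```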
@miro-balaz thank you. We already gave up on our cluster. But I could get into the database through the uncompressed snapshot file, and I also noticed redundant registry entries where often only a single byte in the value/old_value column changes for the same name. Example for /registry/masterleases/10.251.6.106 (first and second occurrence):
BTW: how do you know which records to keep while deleting? Simply the one with the highest
IMO there should be some cleanup procedure, or at least a quick guide on how to manually apply compaction or maintain the database.
I read through this thread looking for a solution and wondered if I could/should use etcd with microk8s instead, and whether that's supported. I found that I could disable the
Sharing in case others are in the same spot reading through.
@david-streamlio @snydergd I think it would be enough to compile k8s-dqlite with the new version of kine, or bump the kine version for older versions of dqlite and create new releases.
@miro-balaz Thanks for the suggestion, but I am not familiar enough with the snap package manager to figure out how to make microk8s use the newly compiled version of kine. It would be a lot easier if the newer version of k8s-dqlite was available on a snap release branch, edge perhaps? Secondly, how do I know that the new version of k8s-dqlite resolves this issue?
I have encountered this bug more than once, after running out of disk space, and in two cases I was able to examine the database, which contained overwhelmingly large gaps. The reason for the gaps might be that the SQLite autoincrement increments even on a failed insert. That means that during an out-of-disk-space condition, the autoincrement can still be incremented, and as a result, when conditions resolve, a record with a very high id is inserted. This causes kine to insert special records to fill all the gaps, but those records take a lot of space, and inserting them is time-consuming. After that, the database is slower and the compaction algorithm is too slow to keep up. After a certain period of time, the database is too slow, certain critical actions start to time out, and it cascades to an unusable cluster.

The new kine has a faster compaction algorithm that will mitigate this scenario. I am not sure if gaps are also deleted by compaction, but they can be deleted manually.

I will try to provide a Dockerfile with k8s-dqlite compiled against a higher version of kine.
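A sketch of that manual cleanup, assuming (based on a reading of kine's source, so verify against your version) that the filler rows are named gap-<id>, and that you are working on a plain-SQLite copy of the data with k8s-dqlite stopped:

```shell
# count_gaps DBFILE  -> prints the number of kine gap-filler rows
count_gaps() {
    sqlite3 "$1" "SELECT COUNT(*) FROM kine WHERE name LIKE 'gap-%';"
}

# delete_gaps DBFILE -> removes gap rows and reclaims the space.
# Run only against a plain-SQLite copy, with k8s-dqlite stopped.
delete_gaps() {
    sqlite3 "$1" "DELETE FROM kine WHERE name LIKE 'gap-%'; VACUUM;"
}
```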
@miro-balaz Thank you for this great analysis! My problem is consistent with what you said, because it began after my disk was nearly full (all my pods were being evicted). I have a backup of the dqlite folder, so I will probably try this later today, or as soon as I can get k8s-dqlite built with the newer kine and installed/running.
@miro-balaz If you happen to have time, I would be very grateful for a Dockerfile or forks of the repos with an updated kine working. I think I managed to get the latest version of kine (from github.com/k3s-io/kine) working with k8s-dqlite (the main challenge was that k8s-dqlite was setting up go-dqlite's "app" to create/register the driver with database/sql before starting kine, while kine was setting up its own -- and for some reason which I still don't understand, the "app" was registering "dqlite-" as the provider name instead of just "dqlite"). It was a lot harder than I thought, but in the end, running with debug only showed some polling queries happening rather than the compaction queries from before, and CPU usage was way, way lower. For some reason kubelite says it can't connect to it through the socket, though (complaints of EOF while reading, as well as some connection-refused messages), so I think I may be missing something still. I'm happy to try other things or show what I've got if that's helpful -- would love to try it out and share my results here. I have a tarball of
I ran into this issue, with my cluster becoming very unstable, dqlite going to 100% on the leader node, and kubelite constantly crashing. I ran the query:

The result was 50-100k records, and it was trending upwards slowly. I did some of the dqlite recovery suggestions from the blog post (copying the files from the most current dqlite node to the others) and that didn't seem to help. However, what did seem to help was stopping kubelite on all nodes and then just leaving dqlite running. The uncompacted record count gradually trended downwards. The dqlite process eventually dropped below 100% CPU and the NVMe went to 100%, so it became IO-bound instead of CPU-bound. The uncompacted record count dropped to a few thousand. For some reason the memory use on that process started climbing, and I had to restart the unit when it got to 28 GB of RAM. Then a new leader took over, went to 100% CPU, and the record count went all the way back up to 90k or so. However, it eventually dropped and seemed stable.

At this point I restarted kubelite, and for the most part it ran. API calls (e.g. kubectl get nodes) returned results very quickly; previously it would take maybe 10+ seconds to return (not normal in this cluster). However, I suspect there is some kind of damage within the database, as API calls return data that appears incorrect. For example, I'm running rook, and it looks like all my OSDs are running (ceph status reports this). However, a dump of the OSD pods from kubectl lists many as terminating/pending. The processes are clearly running. I might play around with this a bit more to see if I can resurrect the cluster. If not, I'll try to back up the rook state, probably follow the rook disaster recovery to get my OSDs working in k3s or something, and rebuild the cluster.

A bit of an update: after rebooting a few nodes I now have a consistent API that looks normal (pod states reflect reality). However, if I start increasing the load on the cluster (especially with monitoring: prometheus/loki/etc.), the dqlite leader saturates IO on its SSD and the record count starts creeping upwards again. Stopping kubelite on all nodes allows it to catch up. However, even with a fairly minimal load on the cluster, dqlite is running near 100% CPU on the leader. My theory is that something is wrong with the database, causing it to require more resources to do the same work over time. If it saturates IO or CPU, it turns into a positive feedback loop.
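For reference, an uncompacted-record count like the one described above can be computed from a plain-SQLite copy of the data. This sketch assumes (per a reading of kine's source, worth verifying for your version) that kine records its compact revision in rows named compact_rev_key:

```shell
# uncompacted DBFILE -> rows with id above kine's recorded compact revision,
# i.e. revisions the compactor has not yet processed
uncompacted() {
    sqlite3 "$1" "SELECT COUNT(*) FROM kine WHERE id > (
        SELECT COALESCE(MAX(prev_revision), 0)
        FROM kine WHERE name = 'compact_rev_key');"
}
```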
@snydergd @rich0 I just finished a blog post about the issue: https://miro-balaz.github.io/microk8s/sqlite/k8s-dqlite/2024/02/03/fixing-microk8s-database-overgrowth.html. It has more queries and details. It would be good if you mention the microk8s version that you are using.
@miro-balaz that was extremely helpful. I'm running v1.28.3 (mostly). I did roll back two nodes to 1.27 to see if an update a few days ago had caused the issue. That didn't seem to help, so I should probably get them back on 1.28. I had 147k gap records in the database. After deleting them I have only 4.6k records, with no name having more than 300 individually. CPU use did drop a bit, though with my reduced cluster size it wasn't too unstable. Honestly, I'm not eager to rebuild my entire cluster and recover Rook, so if this is salvageable, and ideally fixed in a future patch release, that would be wonderful!
@rich0 It seems there was a big change between 1.26 and 1.27, so if your situation is still bad you can try to downgrade once more, or just grab k8s-dqlite of version at most 1.1.2 somewhere and replace the binary. I am not sure which version of microk8s uses which version of k8s-dqlite; I think 1.27 might have used anything from 1.1.0 to 1.1.5 depending on the minor version. The big change is between 1.1.2 and 1.1.3. But there will be a fix for this later, so if you are fine you can wait.
So, this definitely fixed the immediate problem, but I noticed that, at least on my slower SSDs, dqlite is struggling to keep up if I have monitoring enabled. I suspect it is just too much event/API/etc. traffic, plus there are a lot of pods/daemonsets. So I suspect that maybe my cluster has just outgrown dqlite performance, at least in its current state and with my hardware (hardly high-end, but still, even cheap SSDs should be able to run a few dozen pods). I'm going to look to migrate to etcd after some planning, but this bought me some time. I can definitely endorse your blog, and I suspect that a mix of the performance of dqlite itself and its tendency to create gap records is causing issues. Stopping kubelite on most nodes will help it catch up once the gap records are purged, though obviously it will then no longer restart failing pods, etc.
Thank you for reporting this @CEDDM. If you could attach the inspection tarball, we may be able to figure out what is happening.
@ktsakalozos It's been a long time. microk8s has to switch back to exclusively etcd, especially for HA setups; it works flawlessly and is not challenging to deploy, as per k0s. The underlying problem is the lack of telemetry on workloads, which you are never going to be able to collect.
Hi @ktsakalozos
Here is a zip from the cleaned tarball of the other VM (Ubuntu 22.04, 32 GB, 8 vCPU) with microk8s 1.25 running but consuming all CPU:
Same issue. Today my microk8s completely crashed after just a few days of running with no real workloads; it ate up all memory and CPU. After a hard reset it just eats all the CPU/RAM again and won't start any container.
After months of no fixes I switched to k3s.
@cole-miller @ktsakalozos @AlexsJones this is you guys for the last 2 years
Has anyone tried v1.30 of MicroK8s to see if the mentioned improvement of dqlite has made a difference?
My cluster has been inoperable these past two weeks due to this bug. I have a few nodes running on 8 GB RPi 4 boards that are so impacted by this issue that I can't even ssh into them. In the few cases where I have been able to reach them, I'm actually seeing upwards of 200% CPU usage and 100% memory usage. Even my 32 GB RAM Mac mini node is running into this same issue. At this point my cluster has completely imploded due to this CPU/memory overutilization.
I am using 1.30/stable currently and saw no difference.
Same problem here: dqlite is using massive amounts of CPU on a 3-node Microk8s v1.30 cluster.
I ended up moving away from HA mode on my cluster and have been having better results.
Summary
I've set up a 4-node microk8s cluster on bare-metal machines. Every now and then the
/snap/microk8s/3204/bin/k8s-dqlite
process will spike one of the cores on one of my nodes to 100% usage, sending my fans into overdrive. All the other cores are running at <6% usage, and RAM is hardly used:
The specs of the machines are as follows:
The cluster has the metallb, dns, rbac, and storage addons enabled.
I've also deployed Rook-Ceph on the cluster.
What Should Happen Instead?
It shouldn't be using over 100% of a core.
Reproduction Steps
Introspection Report
inspection-report-20220608_143601.tar.gz