Very high CPU usage on Ceph OSDs (v1.0, v1.1) #3132
I see the same (but chart v0.9.3) - it will just freeze and system load will increase and increase.
I have tried with a new single-node cluster from scratch, this time with 4 cores from UpCloud, which also happens to have the fastest disks (by far) I've seen among the cloud providers I have tried, so it's unlikely to be a problem with disks. Well, exactly the same problem. :( After a little while downloading many largish files like videos, the server became totally unresponsive; I couldn't even SSH into it again. Like I said earlier, with the previous version of Rook I could do exactly the same operation (basically I am testing the migration of around 25GB of Nextcloud data from an old pre-Kubernetes server) even with servers with just 2 cores using Rook v0.9.3. I am going to try again with this version...
Also, since I couldn't SSH into the servers, I checked the web console from UpCloud and saw this: Not sure if it's helpful... I was also wondering whether there are issues using Rook v1.0 with K3S, since I've used K3S with these clusters (but also with v0.9.3, which was OK). Perhaps I should also try with standard Kubernetes just to see if there's a problem there. I'll do this now...
@vitobotta, I've seen this hung-task message when something like RBD or CephFS is unresponsive and a VM thinks that the I/O subsystem is hung. So the question then becomes: why is Ceph unresponsive? Is the Ceph cluster healthy when this happens? Check `ceph health detail`. Can you get a dump of your Ceph parameters using the admin socket, something like `ceph daemon osd.5 config show`? Does K8s show any Ceph pods in a bad state? You may want to pay attention to memory utilization by OSDs. What is the CGroup memory limit for rook.io OSD pods, and what is the ceph.conf-defined osd_memory_target set to? The default for osd_memory_target is 4 GiB, much higher than the default for the OSD pod resources limits. This can cause OSDs to exceed the CGroup limit. Can you do a `kubectl describe nodes` and look at what the memory limits for the different Ceph pods actually are? You may want to raise limits in cluster.yaml and/or lower osd_memory_target. Let me know if this helps. See this article on osd_memory_target
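Concretely, aligning the two limits might look like the following cluster.yaml and rook-config-override fragments. This is only a sketch; the 6Gi/4Gi values are illustrative, not tuned recommendations.

```yaml
# cluster.yaml excerpt (illustrative values): give OSD pods headroom
# above osd_memory_target so the CGroup limit is not exceeded.
spec:
  resources:
    osd:
      limits:
        memory: "6Gi"
      requests:
        memory: "4Gi"
---
# rook-config-override ConfigMap: pin osd_memory_target below the pod limit.
apiVersion: v1
kind: ConfigMap
metadata:
  name: rook-config-override
  namespace: rook-ceph
data:
  config: |
    [osd]
    osd_memory_target = 4294967296
```

The key point is the ordering: pod memory limit > osd_memory_target, with slack for allocator overhead, otherwise the kernel OOM-kills the OSD long before Ceph's own memory autotuning reacts.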
Hi @bengland2, yes the clusters (I have tried with several) were healthy etc when I was doing these tests. In the meantime I have recreated the whole thing again but this time with OpenEBS instead of Rook just to test, and while OpenEBS was slower I didn't have any issues at all, with load never above 4. With Rook, same test on same specs it reached 40 or even more until I had to forcefully reboot, and this happened more than once. I am going to try once again with OpenEBS to see if I was just lucky... |
@vitobotta Sounds like you are copying files to an RBD volume. Try lowering your kernel dirty pages way down (e.g. `sysctl -w vm.dirty_ratio=3 vm.dirty_background_ratio=1`) on your RBD client and see if that makes write response times more reasonable. Also, maybe you need to give your OSDs more RAM; in Rook this is done with the `resources:` parameter. A Bluestore OSD expects to have > 4 GiB of RAM by default. Older rook.io may not be doing this by default. Ask me if you need more details.
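For reference, checking and applying those writeback settings looks like this (the sysctl.d path is just the conventional drop-in location):

```shell
# Current writeback thresholds (percent of total RAM)
cat /proc/sys/vm/dirty_ratio
cat /proc/sys/vm/dirty_background_ratio

# On the RBD client, lower them so writeback starts earlier and less
# dirty data can pile up in front of a slow RBD device (needs root)
sysctl -w vm.dirty_ratio=3 vm.dirty_background_ratio=1 || true

# To persist across reboots:
printf 'vm.dirty_ratio = 3\nvm.dirty_background_ratio = 1\n' \
  > /etc/sysctl.d/90-rbd-writeback.conf || true
```

With the defaults (often 20/10), tens of gigabytes of dirty pages can accumulate on a big-RAM host before writeback even begins, which is exactly the large-file-copy pattern described in this thread.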
The weird thing is that I didn't seem to have these issues with the previous version, using the same specs and config. Not sure what RBD client you mean, I just mounted the /dev/rbdX device into a directory :)
@vitobotta by "RBD client" I meant the host where you mounted /dev/rbdX. Also I expect you are using Bluestore not Filestore OSDs. |
I think Filestore since I was using a directory on the main disk rather than additional disks. |
Filestore is basically in maintenance mode at this point; you should be using Bluestore, which has much more predictable write latency. Let us know if Bluestore is giving you trouble.
Hi @bengland2, I didn't read anywhere that Filestore (and thus directory support?) is not in active development, I must have missed it... I will try with additional disks instead of directories so I can test with Bluestore when I have time. Today I had a chance to repeat the test with that 25GB of mixed data on a new 3-node cluster with Rook 1.0 installed. The test started well until it was extracting/copying videos, at which point the load average once again climbed quite quickly to over 70 on one node and 40 on another, so I had to forcefully reboot the two nodes. I uninstalled/cleaned up Rook completely, and repeated the test with OpenEBS first, and Longhorn after that. OpenEBS was again very, very slow but worked, while Longhorn reached a load of max 12 when processing videos, but then it completed the task and I was able to move on. Also, this time I am running standard Kubernetes 1.13.5, not K3S, so I have excluded both that it could be a problem with K3S and that it could be a problem with the provider I was using before (Hetzner Cloud). I don't know what to say... I hoped I could use Rook because it's faster and I have heard good things, but from these tests it looks almost unusable for me when dealing with large files. At least that's the impression I have, unfortunately :( I will try with disks instead of directories when I have a chance. Thanks
I can't believe it! :D I decided to try Bluestore now because I want to understand what's going on, so I set up a new cluster this time with DigitalOcean (3x 4 cores, 8GB ram) and added volumes to the droplets, so to use these disks with Ceph instead of a directory on the main disk. I was able to complete the usual test and the load never went above 5 when extracting videos! I don't think it's because of DigitalOcean vs Hetzner Cloud/UpCloud, I guess the problem was as you suggested Filestore with directories. But out of curiosity why is there such a big difference in performance and CPU usage between Filestore and Bluestore? Thanks! I'm gonna try the experiment once again just in case, and if it works I will be very happy! :) |
Tried again and had the same problem. :( |
I believe this may be an issue with Ceph itself. It's my understanding that the Ceph OSDs with Bluestore can use a lot of CPU in some circumstances. I think this is especially true for clusters with many OSDs and clusters with very fast OSDs. Bluestore will generally result in better performance compared to Filestore, but the performance also comes with more CPU overhead. It's also my understanding that in today's hardware landscape, Ceph performance is often bottlenecked by CPU. Update: I created a Ceph issue here https://tracker.ceph.com/issues/40068 |
@BlaineEXE To see what they are doing about it, see project crimson. Ceph was designed in a world of HDDs, with 3 orders of magnitude less random IOPS per device. So yes it needs an overhaul, and they are doing that. Ceph is not the only application that is dealing with this. |
An update... as suggested by @BlaineEXE I did my usual test but using the latest Mimic image instead of Nautilus. It worked just fine with two clusters and managed to finish copying the test data with a very low CPU usage. I repeated this twice with two different clusters, successfully both times. For the third test, I just updated Ceph to Nautilus on the second cluster, and surprisingly the test finished ok again. But then I created a new cluster with Nautilus from the start and boom, usual problem. Started OK until I had to forcefully reboot the server. This is a single node cluster (4 cores, 16 GB of ram) with Rook 1.0.1 on Kubernetes 1.13.5 deployed with Rancher. There's a problem somewhere, I just wish I knew where. |
Is @sp98 still working on this issue? It would be great to see if there are any noticeable differences between how Rook starts up a Mimic cluster compared to how it starts up a Nautilus cluster to determine if Rook is the cause. We should also pay close attention to the behavior of ceph-volume, as the initial OSD prep using Mimic's c-v could be different than the prep using Nautilus' c-v. |
@BlaineEXE Yes, but I had to move to 2696. I will jump back to this one in a few days' time. Thanks for those updates above; I'll try that and update my findings here.
Just tried once again with a new cluster, again with the latest version from the start, same problem. As of now I am still unable to actually use Rook/Ceph :( It's not like there are things that I could do wrong because it's so easy to install etc... so I don't know where to look. This time the problem occurred very quickly after I started copying data into a volume. I was wondering, could it be something related to using a volume directly bypassing kubernetes? Not sure if it's helpful, but what I am trying to do is download some data from an existing server into a volume so that I can use that data with Nextcloud. In order to do this, because there are timeouts etc if I try to do it from inside a pod, this is what I do to use the volume directly:
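A plausible shape of that direct-mapping flow, assuming the pool and image names (`replicapool`, `nextcloud-data`) are placeholders and that a ceph.conf and client keyring are available on the host:

```shell
# Hypothetical reconstruction: pool/image names are placeholders.
# List the RBD images backing the PVs, then map one on the host:
rbd ls replicapool
rbd map replicapool/nextcloud-data    # prints the device, e.g. /dev/rbd3
mkdir -p /mnt/restore
mount /dev/rbd3 /mnt/restore          # use the volume directly
```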
which gives me the device name, e.g. /dev/rbd3
It starts downloading the data, and then at some random point, sooner or later, the load will climb very, very quickly up to 70-80 or more until I have to forcefully reboot the server. Since, as I said, I don't know where to look, I am really confused by this problem, and I even thought it may have something to do with the fact that I am extracting the archive while downloading it (I know, it doesn't make sense), but the problem occurs also if I just download the archive somewhere first and then extract it into the volume. I am new to all of this, so I wish I had more knowledge on how to investigate further :(
@vitobotta Can you confirm if you have had this issue when deploying mimic (v13) or only with nautilus (v14)? If you haven't tried mimic with rook 1.0, could you try that combination? It would be helpful to confirm if this is a nautilus issue, or if it's rook 1.0 that causes the issue and it happens on both mimic and nautilus. @markhpc @bengland2 Any other secrets up your sleeve for tracking down perf issues besides what has been mentioned? Thanks! |
Hi @travisn , I did a couple of tests with Mimic the other day and didn't have any problems with it. I just tried again with Mimic (v13.2.5-20190410) right now and all was good. Since I was always using Rook 1.0.1, it seems like it may be an issue with Nautilus? I am using Ubuntu 18.04 with 4.15.0-50-generic, if that helps somehow. Once I did a test with Fedora 29 (I think?) as suggested by @galexrt and it worked fine, I don't know if I was just lucky.... perhaps I can try again... to see if it happens only with Ubuntu. |
Hi all, I have done some more tests with interesting results. By "tests" I don't mean anything scientific since I lack deeper understanding of how this stuff works. I mean the usual download of data into a volume as described earlier. I have repeated the same test with multiple operating systems and these are the results:
I don't know enough about this stuff to jump to conclusions, but is it possible that there is a problem with Nautilus and the default Ubuntu 18.04 kernel? To exclude the possibility that it might be a problem with the customised kernel used by the provider, I have tried on Hetzner Cloud, UpCloud and DigitalOcean with the same result: the problem occurs with the default kernel but not with 5.0.0.15. Is there anyone so kind as to try and reproduce this? Please note that, as far as I remember, I haven't seen the problem copying small amounts of data. It always happens when I copy that 24-25 GB of data that I am trying to migrate, or sometimes when I run a benchmark on a volume with fio. Thanks a lot in advance if someone can reproduce this / look into it. :)
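For anyone trying to reproduce, a benchmark along those lines could look like this illustrative fio invocation (the mount point is a placeholder; 4 GiB of buffered sequential writes approximates the large-file copy pattern):

```shell
# Illustrative fio job against a mounted RBD volume
fio --name=bigwrite --directory=/mnt/volume --size=4G \
    --bs=1M --rw=write --ioengine=libaio --iodepth=16 \
    --direct=0 --numjobs=1 --group_reporting
```

Buffered writes (`--direct=0`) matter here: the hangs described in this thread appear tied to dirty page cache building up in front of the RBD device, which direct I/O would bypass.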
Guys... I tried again with the 5.0.0.15 kernel and it happened again :( The first test copying the data into the volume was fine, but then I did a backup with Velero followed by a restore and the system became unresponsive during the restore, as usual... |
I'm not a Ceph expert, but I do deal with lots of storage systems, architectures, and technologies. My first recommendation would be to consult the hardware recommendations to ensure that the software operates within the boundaries of the tested matrix: http://docs.ceph.com/docs/jewel/start/hardware-recommendations/ I would not recommend just going with the minimal requirements; double the numbers. Verify that your test environment conforms with that.
I'm running into this problem as well. It's causing my Kubernetes node to flap back and forth between NotReady and Ready; containers fail to start up, and even a web browser or the system monitor locks up. The system ends up with over 1,000 processes eventually, and I think it's also causing my VirtualBox to not be able to start. Currently on a bare metal single-node master: k8s 1.14.1, Rook deployed from the release-1.0 branch with storageclass-test.yml and cluster-test.yml (except that databaseSizeMB, journalSizeMB, and osdsPerDevice were commented out). The host is running Ubuntu 18.04.2 (currently the 4.18.0-20-generic kernel) and has 2x 10-core Xeons (20 cores, 40 threads total) with 96 GB of registered DDR4 running at 2133. 1 TB 970 EVO Plus NVMe drive. Suffice it to say, it should have plenty of CPU, RAM, and I/O speed...
I still encounter the issue with Rook 1.2.0 on Ubuntu 18.04 (fully updated) with XFS; the server totally freezes, only responding to ping, nothing more.
Hi @rofra, others
Out of interest, can you please give the specs of the machine you are using for rook? Can you show memory, core count, substrate (virtual or physical) and how many drives/OSD(s) are attached to the machine? An idea of the workload on the machine would also be useful. Cheers,
|
3 Hosts on Hetzner:
The OSDs are physical drives with Bluestore on a CephCluster using the FLEX driver, no CSI. Attached is a file with CPU usage; you will see the 100% CPU on the right, 15 hours after I launched the machine. After that, I just dropped the machine. I saw the same kernel entries in the log as @vitobotta. Regarding workload, I started a pod with high disk I/O and medium RAM I/O. No SSH connection, no login possible.
Hi Rofra, Although there could be other issues at play (i.e. kernel bugs), it's best to remember that Rook is just a wrapper for Ceph. There are many sources for recommended Ceph practices and requirements (manuals, blog posts, etc.), but a general rule of thumb:
It's debatable whether this should be one physical core or one vCPU/HT core. If you're running your OS + Ceph + container workloads, I can easily see the CPU being throttled. I would recommend using 4 vCPUs and potentially limiting the amount of CPU used by your containers, either per workload or per namespace. I don't really like the idea of using Rook on public clouds or virtualised environments, as you're essentially running two layers of abstraction for the same storage layer, and usually both have their own redundancy mechanism (Ceph is replicated, and your underlying hardware platform is probably replicated as well). The replication performed by Ceph is a CPU-hungry task, so my guess is that it could be caused by this. Cheers,
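The per-container CPU capping suggested here might look like this cluster.yaml fragment (a sketch only; the values are illustrative, not tuned recommendations):

```yaml
# cluster.yaml excerpt (illustrative): cap CPU per Ceph daemon so the
# OSDs, the OS, and workload pods are not all fighting for the same cores.
spec:
  resources:
    osd:
      limits:
        cpu: "2"
      requests:
        cpu: "1"
    mgr:
      limits:
        cpu: "500m"
```

A namespace-wide alternative would be a Kubernetes LimitRange or ResourceQuota on the workload namespace, which bounds containers that don't declare their own limits.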
|
I was able to consistently reproduce this problem running rook-ceph v1.0.6. In CephCluster, define placement so the Ceph daemons run on dedicated "rook-storage" nodes. In all the workloads using rook-ceph, avoid the "rook-storage" nodes. After this step, the cluster has become super stable. BTW, I don't see the same problem with a rook v0.9.3 cluster.
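One way to express that separation in cluster.yaml, assuming a `rook-storage` node label and taint (the label/taint name comes from the comment; the rest is an illustrative sketch, not the commenter's actual manifest):

```yaml
# cluster.yaml excerpt (illustrative): pin all Ceph daemons to nodes
# labeled rook-storage=true, and tolerate a matching taint that keeps
# ordinary workload pods off those nodes.
spec:
  placement:
    all:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: rook-storage
              operator: In
              values: ["true"]
      tolerations:
      - key: rook-storage
        operator: Exists
```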
@liejuntao001 Are you using proxmox ? |
For all current and future conversation participants, the version of Ceph used, the kernel version of the host, and the kernel version in the container may also be very important parts of this issue. Please include that information with comments/reports. |
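A quick way to gather that information before filing a report (the `rook-ceph` namespace is Rook's default; adjust if yours differs):

```shell
# Host kernel version (containers share the host kernel, so `uname -r`
# inside a Ceph pod reports the same thing)
uname -r

# Ceph version (run from the rook-ceph toolbox pod if ceph isn't on the host)
ceph version 2>/dev/null || true

# Pod state and the Ceph container images actually deployed
kubectl -n rook-ceph get pods -o wide 2>/dev/null || true
kubectl -n rook-ceph get deploy -o jsonpath='{..image}' 2>/dev/null || true
```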
I couldn't find any commit that would be related to this problem, either in the upstream Linux master branch or in the XFS development branch. In addition, I couldn't find any discussions that would be related to this problem on the linux-xfs mailing list.
From the kernel back traces provided by @Hugome, I found there would be a problem in the transaction commit of XFS. In both traces, the processes entered the 'D' state in https://github.com/torvalds/linux/blob/master/fs/xfs/xfs_log.c#L3366 These processes released the CPU voluntarily in the following line: https://github.com/torvalds/linux/blob/master/fs/xfs/xfs_log_priv.h#L549 These two processes should have been woken by another process after that. However, that didn't happen, since that process hung or was blocked too. Unfortunately, I'm not sure why it happens. The next step is reporting this issue to the linux-xfs ML. @BlaineEXE, I'd like to take on this task if you're OK with it. If it's better for maintainers to handle this kind of task, that's also OK.
@satoru-takeuchi please report this as an issue. Thanks for lending your kernel expertise to us. |
Reported this issue to linux-xfs ML. |
I got a comment from an XFS developer: https://marc.info/?l=linux-xfs&m=158018349207976&w=2 He said the root cause would be in the kernel RBD driver. I asked about this problem on the Ceph kernel side.
I got an answer from a Ceph kernel guy. Here is the summary from the user's point of view.
For more information, please refer to the following URL if you're interested in the detailed kernel logic: https://marc.info/?l=ceph-devel&m=158029603909623&w=2 @BlaineEXE So, how should we deal with this Rook issue? My ideas are the following:
|
@BlaineEXE Wouldn't it be possible to create a bug-reporting template on GitHub that contains this guidance?
This is perfect for the "common issues" section of the Rook documentation, https://rook.io/docs/rook/v1.2/ceph-common-issues.html. |
@leseb I'll send a PR. |
Great thanks! |
FWIW, I have a separate 4 node Ceph cluster manually deployed with ceph-deploy, and my CPU load is very low, ~0.13 - ~0.3 on all hosts. ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable) However, every node maxes out the RAM. I'm also using ext4 and there are ~100 LXC containers with mounted RBD volumes. I would assume having the RAM maxed out may complicate running/starting other containers. |
@BillSchumacher If you are using xfs you may hit the issue in the future so put monitoring in place. Also, did you try setting memory limits to help with the ram issue? @satoru-takeuchi Do you know if there is a Ceph bug report/mailing list thread discussing getting the fix into Ceph? |
@vitobotta mimic is pretty old, not sure that it works right with rook. I would try at least nautilus, which gets a lot more testing. Also, you have to distinguish between Ceph client and OSD. OSDs have memory limited by osd_memory_target and that works in nautilus latest. However, clients may do heavy client-side caching and use up the RAM unless you restrict their memory consumption in some way, this isn't different than any other Linux app - you may need to reclaim inactive pages sooner by changing vm.min_free_kbytes upward (triggers memory reclaim sooner) and do fsync() or equivalent to flush out dirty pages (so they can be reclaimed). Lowering memory limits in the container will not make you happy, if the container exceeds the memory limit, it is OOMkilled, this is standard K8S practice for a long time now. You need to ensure that the app itself does not get close to the limit. The app can use calls like fadvise() to state that it does not want certain files cached, for example - this works well when you know a big file is only going to be read once. You can also use O_DIRECT (some databases support this), although Linus won't like it ;-) HTH -ben |
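The flush-then-reclaim behaviour described here can be demonstrated with stock tools; a small sketch (values and paths are illustrative):

```shell
# Current early-reclaim threshold; raising it (needs root) makes the
# kernel start reclaiming inactive pages sooner:
cat /proc/sys/vm/min_free_kbytes

# Write a file and force the dirty pages out with fsync semantics so
# the cache can be reclaimed immediately (dd's conv=fsync does the
# flush before dd exits):
dd if=/dev/zero of=/tmp/flushdemo bs=1M count=8 conv=fsync
ls -l /tmp/flushdemo
rm -f /tmp/flushdemo
```

The same idea applies programmatically: `posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED)` after a one-shot read of a big file tells the kernel not to keep it cached, and `O_DIRECT` skips the page cache entirely.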
Hi @bengland2 Not sure if you actually meant to reply to me, I have not been using Rook for a long time now since I use the block storage offered by the cloud provider. :) |
@BillSchumacher I meant you in above reply, oops. |
@SerialVelocity Sorry, I don't know. |
For posterity: https://tracker.ceph.com/issues/43910 |
I am not sure where the problem is but I am seeing very high CPU usage since I started using v1.0.0. With three small clusters load average skyrockets to the 10s quite quickly making the nodes unusable. This happens while copying quite a bit of data to a volume mapped on the host bypassing k8s (to restore data from an existing non-k8s server). Nothing else is happening with the clusters at all. I am using low specs servers (2 cores, 8 GB of RAM) but I didn't see any of these high load issues with 0.9.3 on same-specs servers.
Has something changed about Ceph or else that might explain this? I've also tried with two providers, Hetzner Cloud and UpCloud. Same issue when actually using a volume.
Is it just me or is it happening to others as well? Thanks!