Very high CPU usage on Ceph OSDs (v1.0, v1.1) #3132
I see the same (but chart v0.9.3) - it will just freeze and system load will increase and increase.
I have tried with a new single-node cluster from scratch, this time with 4 cores from UpCloud, which also happens to have the fastest disks (by far) I've seen among the cloud providers I have tried, so it's unlikely to be a problem with disks. Well, exactly the same problem. :( After a little while downloading many largish files like videos, the server became totally unresponsive; I couldn't even SSH into it again. Like I said earlier, with the previous version of Rook I could do exactly the same operation (basically I am testing the migration of around 25GB of Nextcloud data from an old pre-Kubernetes server) even with servers with just 2 cores using Rook v0.9.3. I am going to try again with this version...
Also, since I couldn't SSH into the servers, I checked the web console from UpCloud and saw this: Not sure if it's helpful... I was also wondering whether there are issues using Rook v1.0 with K3S, since I've used K3S with these clusters (but also with v0.9.3, which was OK). Perhaps I should also try with standard Kubernetes just to see if there's a problem there. I'll do this now...
@vitobotta, I've seen this hung-task message when something like RBD or CephFS is unresponsive and a VM thinks that the I/O subsystem is hung. So the question then becomes: why is Ceph unresponsive? Is the Ceph cluster healthy when this happens? Check `ceph health detail`. Can you get a dump of your Ceph parameters using the admin socket, something like `ceph daemon osd.5 config show`? Does K8s show any Ceph pods in a bad state? You may want to pay attention to memory utilization by OSDs. What is the CGroup memory limit for rook.io OSD pods, and what is the ceph.conf-defined osd_memory_target set to? The default for osd_memory_target is 4 GiB, much higher than the default for the OSD pod resources limits. This can cause OSDs to exceed the CGroup limit. Can you do a `kubectl describe nodes` and look at what the memory limits for the different Ceph pods actually are? You may want to raise limits in cluster.yaml and/or lower osd_memory_target. Let me know if this helps. See this article on osd_memory_target
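Concretely, aligning the two limits might look like the following cluster.yaml and rook-config-override fragments. This is only a sketch; the 6Gi/4Gi values are illustrative, not tuned recommendations.

```yaml
# cluster.yaml excerpt (illustrative values): give OSD pods headroom
# above osd_memory_target so the CGroup limit is not exceeded.
spec:
  resources:
    osd:
      limits:
        memory: "6Gi"
      requests:
        memory: "4Gi"
---
# rook-config-override ConfigMap: pin osd_memory_target below the pod limit.
apiVersion: v1
kind: ConfigMap
metadata:
  name: rook-config-override
  namespace: rook-ceph
data:
  config: |
    [osd]
    osd_memory_target = 4294967296
```

The key point is the ordering: pod memory limit > osd_memory_target, with slack for allocator overhead, otherwise the kernel OOM-kills the OSD long before Ceph's own memory autotuning reacts.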
Hi @bengland2, yes the clusters (I have tried with several) were healthy etc when I was doing these tests. In the meantime I have recreated the whole thing again but this time with OpenEBS instead of Rook just to test, and while OpenEBS was slower I didn't have any issues at all, with load never above 4. With Rook, same test on same specs it reached 40 or even more until I had to forcefully reboot, and this happened more than once. I am going to try once again with OpenEBS to see if I was just lucky... |
@vitobotta Sounds like you are copying files to an RBD volume. Try lowering your kernel dirty pages way down (e.g. `sysctl -w vm.dirty_ratio=3 vm.dirty_background_ratio=1`) on your RBD client and see if that makes write response times more reasonable. Also, maybe you need to give your OSDs more RAM; in Rook this is done with the `resources:` parameter. A Bluestore OSD expects to have > 4 GiB of RAM by default. Older rook.io may not be doing this by default. Ask me if you need more details.
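For reference, checking and applying those writeback settings looks like this (the sysctl.d path is just the conventional drop-in location):

```shell
# Current writeback thresholds (percent of total RAM)
cat /proc/sys/vm/dirty_ratio
cat /proc/sys/vm/dirty_background_ratio

# On the RBD client, lower them so writeback starts earlier and less
# dirty data can pile up in front of a slow RBD device (needs root)
sysctl -w vm.dirty_ratio=3 vm.dirty_background_ratio=1 || true

# To persist across reboots:
printf 'vm.dirty_ratio = 3\nvm.dirty_background_ratio = 1\n' \
  > /etc/sysctl.d/90-rbd-writeback.conf || true
```

With the defaults (often 20/10), tens of gigabytes of dirty pages can accumulate on a big-RAM host before writeback even begins, which is exactly the large-file-copy pattern described in this thread.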
The weird thing is that I didn't seem to have these issues with the previous version, using the same specs and config. Not sure what RBD client you mean, I just mounted the /dev/rbdX device into a directory :)
@vitobotta by "RBD client" I meant the host where you mounted /dev/rbdX. Also I expect you are using Bluestore not Filestore OSDs. |
I think Filestore since I was using a directory on the main disk rather than additional disks. |
Filestore is basically in maintenance mode at this point; you should be using Bluestore, which has much more predictable write latency. Let us know if Bluestore is giving you trouble.
Hi @bengland2, I didn't read anywhere that Filestore (and thus directory support?) is not in active development, I must have missed it... I will try with additional disks instead of directories so I can test with Bluestore when I have time. Today I had a chance to repeat the test with that 25GB of mixed data on a new 3-node cluster with Rook 1.0 installed. The test started well until it was extracting/copying videos, at which point the load average once again climbed quite quickly to over 70 on one node and 40 on another, so I had to forcefully reboot the two nodes. I uninstalled/cleaned up Rook completely, and repeated the test with OpenEBS first, and Longhorn after that. OpenEBS was again very, very slow but worked, while Longhorn reached a load of max 12 when processing videos, but then it completed the task and I was able to move on. Also, this time I am running standard Kubernetes 1.13.5, not K3S, so I have excluded both that it could be a problem with K3S and that it could be a problem with the provider I was using before (Hetzner Cloud). I don't know what to say... I hoped I could use Rook because it's faster and I have heard good things, but from these tests it looks almost unusable for me when dealing with large files. At least that's the impression I have, unfortunately :( I will try with disks instead of directories when I have a chance. Thanks
I can't believe it! :D I decided to try Bluestore now because I want to understand what's going on, so I set up a new cluster this time with DigitalOcean (3x 4 cores, 8GB ram) and added volumes to the droplets, so to use these disks with Ceph instead of a directory on the main disk. I was able to complete the usual test and the load never went above 5 when extracting videos! I don't think it's because of DigitalOcean vs Hetzner Cloud/UpCloud, I guess the problem was as you suggested Filestore with directories. But out of curiosity why is there such a big difference in performance and CPU usage between Filestore and Bluestore? Thanks! I'm gonna try the experiment once again just in case, and if it works I will be very happy! :) |
Tried again and had the same problem. :( |
I believe this may be an issue with Ceph itself. It's my understanding that the Ceph OSDs with Bluestore can use a lot of CPU in some circumstances. I think this is especially true for clusters with many OSDs and clusters with very fast OSDs. Bluestore will generally result in better performance compared to Filestore, but the performance also comes with more CPU overhead. It's also my understanding that in today's hardware landscape, Ceph performance is often bottlenecked by CPU. Update: I created a Ceph issue here https://tracker.ceph.com/issues/40068 |
@BlaineEXE To see what they are doing about it, see project crimson. Ceph was designed in a world of HDDs, with 3 orders of magnitude less random IOPS per device. So yes it needs an overhaul, and they are doing that. Ceph is not the only application that is dealing with this. |
An update... as suggested by @BlaineEXE I did my usual test but using the latest Mimic image instead of Nautilus. It worked just fine with two clusters and managed to finish copying the test data with a very low CPU usage. I repeated this twice with two different clusters, successfully both times. For the third test, I just updated Ceph to Nautilus on the second cluster, and surprisingly the test finished ok again. But then I created a new cluster with Nautilus from the start and boom, usual problem. Started OK until I had to forcefully reboot the server. This is a single node cluster (4 cores, 16 GB of ram) with Rook 1.0.1 on Kubernetes 1.13.5 deployed with Rancher. There's a problem somewhere, I just wish I knew where. |
Is @sp98 still working on this issue? It would be great to see if there are any noticeable differences between how Rook starts up a Mimic cluster compared to how it starts up a Nautilus cluster to determine if Rook is the cause. We should also pay close attention to the behavior of ceph-volume, as the initial OSD prep using Mimic's c-v could be different than the prep using Nautilus' c-v. |
@BlaineEXE Yes, but I had to move to 2696. I will jump back to this one in a few days' time. Thanks for those updates above; I'll try that and update my findings here.
Just tried once again with a new cluster, again with the latest version from the start, same problem. As of now I am still unable to actually use Rook/Ceph :( It's not like there are things that I could do wrong because it's so easy to install etc... so I don't know where to look. This time the problem occurred very quickly after I started copying data into a volume. I was wondering, could it be something related to using a volume directly bypassing kubernetes? Not sure if it's helpful, but what I am trying to do is download some data from an existing server into a volume so that I can use that data with Nextcloud. In order to do this, because there are timeouts etc if I try to do it from inside a pod, this is what I do to use the volume directly:
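A plausible shape of that direct-mapping flow, assuming the pool and image names (`replicapool`, `nextcloud-data`) are placeholders and that a ceph.conf and client keyring are available on the host:

```shell
# Hypothetical reconstruction: pool/image names are placeholders.
# List the RBD images backing the PVs, then map one on the host:
rbd ls replicapool
rbd map replicapool/nextcloud-data    # prints the device, e.g. /dev/rbd3
mkdir -p /mnt/restore
mount /dev/rbd3 /mnt/restore          # use the volume directly
```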
which gives me the device name, e.g. /dev/rbd3
It starts downloading the data, and then at some random point, sooner or later, the load will climb very, very quickly up to 70-80 or more until I have to forcefully reboot the server. Since, as I said, I don't know where to look, I am really confused by this problem, and I even thought it may have something to do with the fact that I am extracting the archive while downloading it (I know, it doesn't make sense), but the problem occurs also if I just download the archive somewhere first and then extract it into the volume. I am new to all of this, so I wish I had more knowledge on how to investigate further :(
@vitobotta Can you confirm if you have had this issue when deploying mimic (v13) or only with nautilus (v14)? If you haven't tried mimic with rook 1.0, could you try that combination? It would be helpful to confirm if this is a nautilus issue, or if it's rook 1.0 that causes the issue and it happens on both mimic and nautilus. @markhpc @bengland2 Any other secrets up your sleeve for tracking down perf issues besides what has been mentioned? Thanks! |
Hi @travisn , I did a couple of tests with Mimic the other day and didn't have any problems with it. I just tried again with Mimic (v13.2.5-20190410) right now and all was good. Since I was always using Rook 1.0.1, it seems like it may be an issue with Nautilus? I am using Ubuntu 18.04 with 4.15.0-50-generic, if that helps somehow. Once I did a test with Fedora 29 (I think?) as suggested by @galexrt and it worked fine, I don't know if I was just lucky.... perhaps I can try again... to see if it happens only with Ubuntu. |
Hi all, I have done some more tests with interesting results. By "tests" I don't mean anything scientific since I lack deeper understanding of how this stuff works. I mean the usual download of data into a volume as described earlier. I have repeated the same test with multiple operating systems and these are the results:
I don't know enough about this stuff to jump to conclusions, but is it possible that there is a problem with Nautilus and the default Ubuntu 18.04 kernel? To exclude the possibility that it might be a problem with the customised kernel used by the provider, I have tried on Hetzner Cloud, UpCloud and DigitalOcean with the same result: the problem occurs with the default kernel but not with 5.0.0.15. Is there anyone so kind as to try and reproduce this? Please note that, as far as I remember, I haven't seen the problem copying small amounts of data. It always happens when I copy that 24-25 GB of data that I am trying to migrate, or sometimes when I run a benchmark on a volume with fio. Thanks a lot in advance if someone can reproduce this / look into it. :)
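For anyone trying to reproduce, a benchmark along those lines could look like this illustrative fio invocation (the mount point is a placeholder; 4 GiB of buffered sequential writes approximates the large-file copy pattern):

```shell
# Illustrative fio job against a mounted RBD volume
fio --name=bigwrite --directory=/mnt/volume --size=4G \
    --bs=1M --rw=write --ioengine=libaio --iodepth=16 \
    --direct=0 --numjobs=1 --group_reporting
```

Buffered writes (`--direct=0`) matter here: the hangs described in this thread appear tied to dirty page cache building up in front of the RBD device, which direct I/O would bypass.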
Guys... I tried again with the 5.0.0.15 kernel and it happened again :( The first test copying the data into the volume was fine, but then I did a backup with Velero followed by a restore and the system became unresponsive during the restore, as usual... |
I'm not a Ceph expert, but I do deal with lots of storage systems, architectures, and technologies. My first recommendation would be to consult the hardware recommendations to ensure that the software operates within the boundaries of the tested matrix: http://docs.ceph.com/docs/jewel/start/hardware-recommendations/ I would not recommend just going with the minimal requirements; double the numbers. Verify that your test environment conforms with that.
I'm running into this problem as well. It's causing my Kubernetes node to flap back and forth between NotReady and Ready; containers fail to start up, and even a web browser or the system monitor locks up. The system ends up with over 1,000 processes eventually, and I think it's also causing my VirtualBox to not be able to start. Currently on a bare metal single-node master: k8s 1.14.1, Rook deployed from the release-1.0 branch with storageclass-test.yml and cluster-test.yml (except that databaseSizeMB, journalSizeMB, and osdsPerDevice were commented out). The host is running Ubuntu 18.04.2 (currently the 4.18.0-20-generic kernel) and has 2x 10-core Xeons (20 cores, 40 threads total) with 96 GB of registered DDR4 running at 2133. 1 TB 970 EVO Plus NVMe drive. Suffice it to say, it should have plenty of CPU, RAM, and I/O speed...
I still encounter the issue with Rook 1.2.0 on Ubuntu 18.04 (fully updated) with XFS; the server totally freezes, only responding to ping, nothing more.
Hi @rofra, others
Out of interest, can you please give the specs of the machine you are using for rook? Can you show memory, core count, substrate (virtual or physical) and how many drives/OSD(s) are attached to the machine? An idea of the workload on the machine would also be useful. Cheers,
|
3 Hosts on Hetzner:
The OSDs are physical drives with Bluestore on a CephCluster using the FLEX driver, no CSI. Attached is a file with CPU usage; you will see the 100% CPU on the right, 15 hours after I launched the machine. After that, I just dropped the machine. I saw the same kernel entries in the log as @vitobotta. Regarding workload, I started a pod with high disk I/O and medium RAM I/O. No SSH connection, no login possible.
Hi Rofra, Although there could be other issues at play (i.e. kernel bugs), it's best to remember that Rook is just a wrapper for Ceph. There are many sources for recommended Ceph practices and requirements (manuals, blog posts, etc.), but a general rule of thumb:
It's debatable whether this should be one physical core or one vCPU/HT core. If you're running your OS + Ceph + container workloads, I can easily see the CPU being throttled. I would recommend using 4 vCPUs and potentially limiting the amount of CPU used by your containers, either per workload or per namespace. I don't really like the idea of using Rook on public clouds or virtualised environments, as you're essentially running two layers of abstraction for the same storage layer, and usually both have their own redundancy mechanism (Ceph is replicated, and your underlying hardware platform is probably replicated as well). The replication performed by Ceph is a CPU-hungry task, so my guess is that it could be caused by this. Cheers,
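The per-container CPU capping suggested here might look like this cluster.yaml fragment (a sketch only; the values are illustrative, not tuned recommendations):

```yaml
# cluster.yaml excerpt (illustrative): cap CPU per Ceph daemon so the
# OSDs, the OS, and workload pods are not all fighting for the same cores.
spec:
  resources:
    osd:
      limits:
        cpu: "2"
      requests:
        cpu: "1"
    mgr:
      limits:
        cpu: "500m"
```

A namespace-wide alternative would be a Kubernetes LimitRange or ResourceQuota on the workload namespace, which bounds containers that don't declare their own limits.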
|
I was able to consistently reproduce this problem running rook-ceph v1.0.6. In CephCluster, define placement so the Ceph daemons run on dedicated "rook-storage" nodes. In all the workloads using rook-ceph, avoid the "rook-storage" nodes. After this step, the cluster has become super stable. BTW, I don't see the same problem with a rook v0.9.3 cluster.
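One way to express that separation in cluster.yaml, assuming a `rook-storage` node label and taint (the label/taint name comes from the comment; the rest is an illustrative sketch, not the commenter's actual manifest):

```yaml
# cluster.yaml excerpt (illustrative): pin all Ceph daemons to nodes
# labeled rook-storage=true, and tolerate a matching taint that keeps
# ordinary workload pods off those nodes.
spec:
  placement:
    all:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: rook-storage
              operator: In
              values: ["true"]
      tolerations:
      - key: rook-storage
        operator: Exists
```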
@liejuntao001 Are you using proxmox ? |
For all current and future conversation participants, the version of Ceph used, the kernel version of the host, and the kernel version in the container may also be very important parts of this issue. Please include that information with comments/reports. |
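A quick way to gather that information before filing a report (the `rook-ceph` namespace is Rook's default; adjust if yours differs):

```shell
# Host kernel version (containers share the host kernel, so `uname -r`
# inside a Ceph pod reports the same thing)
uname -r

# Ceph version (run from the rook-ceph toolbox pod if ceph isn't on the host)
ceph version 2>/dev/null || true

# Pod state and the Ceph container images actually deployed
kubectl -n rook-ceph get pods -o wide 2>/dev/null || true
kubectl -n rook-ceph get deploy -o jsonpath='{..image}' 2>/dev/null || true
```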
I couldn't find any commit that would be related to this problem, either in the upstream Linux master branch or in the XFS development branch. In addition, I couldn't find any discussions that would be related to this problem on the linux-xfs mailing list.
From the kernel back traces provided by @Hugome, I found there would be a problem in the transaction commit of XFS. In both traces, the processes entered the 'D' state in https://github.com/torvalds/linux/blob/master/fs/xfs/xfs_log.c#L3366 These processes released the CPU voluntarily in the following line: https://github.com/torvalds/linux/blob/master/fs/xfs/xfs_log_priv.h#L549 These two processes should have been woken by another process after that. However, that didn't happen, since that process hung or was blocked too. Unfortunately, I'm not sure why it happens. The next step is reporting this issue to the linux-xfs ML. @BlaineEXE, I'd like to take on this task if you're OK with it. If it's better for maintainers to handle this kind of task, that's also OK.
@satoru-takeuchi please report this as an issue. Thanks for lending your kernel expertise to us. |
Reported this issue to linux-xfs ML. |
I got a comment from an XFS developer: https://marc.info/?l=linux-xfs&m=158018349207976&w=2 He said the root cause would be in the kernel RBD driver. I asked about this problem on the Ceph kernel side.
I got an answer from a Ceph kernel guy. Here is the summary from the user's point of view.
For more information, please refer to the following URL if you're interested in the detailed kernel logic: https://marc.info/?l=ceph-devel&m=158029603909623&w=2 @BlaineEXE So, how should we deal with this Rook issue? My ideas are the following:
|
@BlaineEXE Wouldn't it be possible to create a bug-reporting template on GitHub that contains this guidance?
This is perfect for the "common issues" section of the Rook documentation, https://rook.io/docs/rook/v1.2/ceph-common-issues.html. |
@leseb I'll send a PR. |
Great thanks! |
FWIW, I have a separate 4 node Ceph cluster manually deployed with ceph-deploy, and my CPU load is very low, ~0.13 - ~0.3 on all hosts. ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable) However, every node maxes out the RAM. I'm also using ext4 and there are ~100 LXC containers with mounted RBD volumes. I would assume having the RAM maxed out may complicate running/starting other containers. |
@BillSchumacher If you are using xfs you may hit the issue in the future so put monitoring in place. Also, did you try setting memory limits to help with the ram issue? @satoru-takeuchi Do you know if there is a Ceph bug report/mailing list thread discussing getting the fix into Ceph? |
@vitobotta mimic is pretty old, not sure that it works right with rook. I would try at least nautilus, which gets a lot more testing. Also, you have to distinguish between Ceph client and OSD. OSDs have memory limited by osd_memory_target and that works in nautilus latest. However, clients may do heavy client-side caching and use up the RAM unless you restrict their memory consumption in some way, this isn't different than any other Linux app - you may need to reclaim inactive pages sooner by changing vm.min_free_kbytes upward (triggers memory reclaim sooner) and do fsync() or equivalent to flush out dirty pages (so they can be reclaimed). Lowering memory limits in the container will not make you happy, if the container exceeds the memory limit, it is OOMkilled, this is standard K8S practice for a long time now. You need to ensure that the app itself does not get close to the limit. The app can use calls like fadvise() to state that it does not want certain files cached, for example - this works well when you know a big file is only going to be read once. You can also use O_DIRECT (some databases support this), although Linus won't like it ;-) HTH -ben |
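The flush-then-reclaim behaviour described here can be demonstrated with stock tools; a small sketch (values and paths are illustrative):

```shell
# Current early-reclaim threshold; raising it (needs root) makes the
# kernel start reclaiming inactive pages sooner:
cat /proc/sys/vm/min_free_kbytes

# Write a file and force the dirty pages out with fsync semantics so
# the cache can be reclaimed immediately (dd's conv=fsync does the
# flush before dd exits):
dd if=/dev/zero of=/tmp/flushdemo bs=1M count=8 conv=fsync
ls -l /tmp/flushdemo
rm -f /tmp/flushdemo
```

The same idea applies programmatically: `posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED)` after a one-shot read of a big file tells the kernel not to keep it cached, and `O_DIRECT` skips the page cache entirely.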
Hi @bengland2 Not sure if you actually meant to reply to me, I have not been using Rook for a long time now since I use the block storage offered by the cloud provider. :) |
@BillSchumacher I meant you in above reply, oops. |
@SerialVelocity Sorry, I don't know. |
For posterity: https://tracker.ceph.com/issues/43910 |
I am not sure where the problem is but I am seeing very high CPU usage since I started using v1.0.0. With three small clusters load average skyrockets to the 10s quite quickly making the nodes unusable. This happens while copying quite a bit of data to a volume mapped on the host bypassing k8s (to restore data from an existing non-k8s server). Nothing else is happening with the clusters at all. I am using low specs servers (2 cores, 8 GB of RAM) but I didn't see any of these high load issues with 0.9.3 on same-specs servers.
Has something changed about Ceph or else that might explain this? I've also tried with two providers, Hetzner Cloud and UpCloud. Same issue when actually using a volume.
Is it just me or is it happening to others as well? Thanks!