rawhide kernel 6.10.0 >=20240514 - podman update device-read-bps = 0 #22701
Failed again, kernel-6.10.0-0.rc0.20240516git3c999d1ae3c7.5.fc41
Can you create a simple reproducer? AFAIK the cgroup setup depends on podman -> crun -> systemd -> kernel, so maybe check whether the other components changed too.
That has been my goal, as you might have predicted. However,
@edsantiago there is no testing repo for Rawhide, so if an update fails gating there isn't really a proper repo to get it from, unfortunately; you have to get it from Koji. openQA does record logs, but we don't happen to pipe the output of this specific test command to a file at present. It would be easy to do that if it's useful, though. @Luap99 it's the kernel that is causing this. The same test is passing just fine on every other Rawhide update; it fails only on kernel updates, which means the kernel is the cause.
Thanks @AdamWill. I guess we then have to get a simple reproducer and file a kernel bug.
I'm being lazy again: the failure is on a 0514 kernel build. I see a 0517 Koji build and have not seen any openQA error emails about it. Until I have reason to suspect otherwise, I'll assume the problem is fixed. (And I'll save myself the time of pulling the kernel and looking for a reproducer.)
sigh... never mind. 0517 did fail in openQA. Reproducer:

```
# uname -r
6.9.0-0.rc7.20240510git448b3fe5a0ea.62.fc41.x86_64
# dnf -y install podman-tests
# podman run -d --name foo quay.io/libpod/testimage:20240123 sleep inf
<cid>
# podman exec foo cat /sys/fs/cgroup/io.max
# podman update --device-read-bps=/dev/zero:10mb foo
<cid>
# podman exec foo cat /sys/fs/cgroup/io.max
1:5 rbps=10485760 wbps=max riops=max wiops=max    <<<<< THIS IS GOOD
```

Then:

```
# wget https://kojipkgs.fedoraproject.org//packages/kernel/6.10.0/0.rc0.20240517gitea5f6ad9ad96.6.fc41/x86_64/kernel{,-core,-modules,-modules-core}-6.10.0-0.rc0.20240517gitea5f6ad9ad96.6.fc41.x86_64.rpm
# dnf install kern*rpm; reboot
```

Then:

```
# uname -r
6.10.0-0.rc0.20240517gitea5f6ad9ad96.6.fc41.x86_64
# podman rm -f -a
[repeat the podman run/update/exec from above]
1:5 rbps=0 wbps=0 riops=0 wiops=0    <<<<<< THIS IS NOT GOOD
```
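As a sanity check on the good output above, the rbps figure is just the `10mb` limit interpreted in binary (1024-based) units:

```shell
# 10mb is parsed as 10 * 1024 * 1024 bytes, which matches the
# rbps=10485760 seen in the working io.max output.
expected_rbps=$((10 * 1024 * 1024))
echo "$expected_rbps"
```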
Filed rhbz2281805
Does this still happen with 6.10 rc1? |
If by rc1 you mean 6.10.0-0.rc1.17, then yes |
A CLI reproducer should be something like this
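A sketch of such a reproducer, reusing the image, device, and limit from the commands earlier in the thread (the function name is just illustrative; needs root and podman installed):

```shell
# Hypothetical CLI reproducer for the device-read-bps regression.
reproduce() {
    podman run -d --name foo quay.io/libpod/testimage:20240123 sleep inf
    podman update --device-read-bps=/dev/zero:10mb foo
    # On an affected kernel this prints "... rbps=0 wbps=0 riops=0 wiops=0"
    # instead of "... rbps=10485760 wbps=max riops=max wiops=max".
    podman exec foo cat /sys/fs/cgroup/io.max
    podman rm -f foo
}
```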
I tried to get a Rawhide VM going to test the install myself, but it seems like something with dnf is terribly broken there, as I cannot install anything due to checksum errors. I tried several VMs; all fail in the same way...
huh, that seems odd? I'm running Rawhide here and not seeing anything like that, and our automated tests aren't either. |
On 1mt, a minute or two ago, I saw a ton of red checksum errors but
I do see this mail, which might be relevant. I hadn't updated to that yet. But openQA did pass its tests today... which includes doing quite a lot of package installs...
Yeah, it seems to be working again now; not sure what happened.
Tried 6.10.0-0.rc1.20240528git2bfcfd584ff5.18 and can reproduce with the shell commands above; you may need to add the io controller first on a fresh boot.
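On cgroups v2, "adding the io controller" means enabling it for child cgroups via cgroup.subtree_control; a minimal sketch, assuming the standard cgroup v2 mount at /sys/fs/cgroup (needs root):

```shell
# Enable the io controller for child cgroups so that io.max exists there.
# Assumes cgroup v2 mounted at /sys/fs/cgroup; run as root.
enable_io_controller() {
    echo "+io" > /sys/fs/cgroup/cgroup.subtree_control
}
```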
I think this must be reported to kernel upstream; I don't see this getting solved just sitting in the Fedora Bugzilla.
well, @jmflinuxtx - the Fedora kernel maintainer - is aware of the issue, so I was kinda leaving it to him to report it to the appropriate upstream venues. I find it pretty impossible to know where to send kernel issues. |
Yes, I am aware; I passed this on to Waiman Long. He thought there was a patch for it, but that turned out not to cover this case, so he was looking again. In the meantime, we just hit rc1, so bug fixes are coming in fast, and it is possible that someone else has a fix. Worst case, I can bisect later this week.
Commit bf20ab5 ("blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOW") attempts to revert the code change introduced by commit cd5ab1b ("blk-throttle: add .low interface"). However, it leaves behind the bps_conf[] and iops_conf[] fields in the throtl_grp structure, which aren't set anywhere in the new blk-throttle.c code but are still being used by tg_prfill_limit() to display the limits in io.max. Now io.max always displays the following values if a block queue is used:

```
<m>:<n> rbps=0 wbps=0 riops=0 wiops=0
```

Fix this problem by removing bps_conf[] and iops_conf[] and using bps[] and iops[] instead to complete the revert.

Fixes: bf20ab5 ("blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOW")
Reported-by: Justin Forbes <jforbes@redhat.com>
Closes: containers/podman#22701 (comment)
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240530134547.970075-1-longman@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Seen in openQA. No logs available; it's a weird thing that only records movies, and I don't have the desire to hand-type all the errors. It basically looks like
This is just a placeholder for now. Smells like a kernel bug to me, but it could also be a bug on our end (including in tests). If I see this blowing up (as measured by openQA emails) I will explore further. Until then, nothing to do.