Kernel dynamic memory is not released again #137

Open
makoONE opened this issue Feb 2, 2024 · 12 comments

makoONE commented Feb 2, 2024

Since moving to PVE 8.1 with kernel 6.5, I have noticed that kernel dynamic memory is not released again. Whenever I start a VM that has the GPU assigned and then shut it down, the host's memory usage stays at about the same value as if the VM were still running. A check with smem shows that the memory is no longer held by any process but is attributed to kernel dynamic memory.
With PVE 8.0 and kernel 6.2 I never experienced the described behavior.

Is anyone else here affected, or does anyone know a solution?

@brussig-tud

Just adding that I've experienced the same kernel memory leak with kernel 6.5. For me it only happens when the host uses the iGPU for graphics output (my use case is a regular-use desktop PC with a Windows VM, not a Proxmox server) while virtual functions are enabled – even when no VM is using any virtual functions.

I never planned on passing through my dedicated NVIDIA card to the VM, so I just made sure the host uses the NVIDIA GPU always and never accesses the iGPU. Then virtual functions work fine without leaking kernel memory. I've since moved to kernel 6.6 but I don't know if the issue still persists since I'm happy with my current setup.

devedse commented Mar 19, 2024

@makoONE @brussig-tud
Jeez, I think I've been struggling with the same issue for the past few weeks.
I initially thought this was an LXC problem, so I wrote a very elaborate post here:
https://discuss.linuxcontainers.org/t/lxc-container-in-proxmox-using-90-of-memory-with-all-processed-killed/19389/4

Could you read my post and tell me how to check whether I also have kernel dynamic memory allocated? (Which command do I use for this?)
With this I'd like to figure out whether my problem is the same one you are having.

And if this is the case, do you have any idea how to disable SR-IOV temporarily? Do I uninstall the DKMS module, or do I need to undo all the steps?

brussig-tud commented Mar 19, 2024

@devedse I'm not super knowledgeable about containers and containerizing things. But if I read your post correctly, then you have SR-IOV enabled using this driver, but you don't actually use any virtual functions since you're not passing them on to VMs. Instead, you only use the SR-IOV-enabled GPU from the host OS, since containers after all still technically run on the host.

So yeah, it very much sounds like you're facing the same issue. You can check your kernel dynamic memory usage using the smem utility:

sudo smem -twk

I don't have any output saved from when I tried, but my "kernel dynamic memory" value was 27GB once after just running a normal KDE desktop on the iGPU for about 2 hours with this module enabled.
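
If smem is not installed, a rough equivalent can be derived from /proc/meminfo. The sketch below assumes that "kernel dynamic memory" is approximately MemTotal minus MemFree minus userspace memory (AnonPages plus Mapped). That formula is an assumption based on a reading of how smem appears to compute the value, not an official definition, and the function name is made up for illustration.

```shell
# Estimate "kernel dynamic memory" in kB from /proc/meminfo-style input on stdin.
# Assumption: kernel dynamic ~= MemTotal - MemFree - (AnonPages + Mapped).
kernel_dynamic_kb() {
    awk '
        $1 == "MemTotal:"  { total  = $2 }
        $1 == "MemFree:"   { free   = $2 }
        $1 == "AnonPages:" { anon   = $2 }
        $1 == "Mapped:"    { mapped = $2 }
        END { print total - free - anon - mapped }
    '
}

# Usage on a live system:
#   kernel_dynamic_kb < /proc/meminfo
```

Running it before starting a VM and again after shutting it down should make a leak visible as a value that does not come back down.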

Just dkms remove'ing the module will be enough to disable virtual functions temporarily. I did not have to do anything else to get rid of the memory leak, which pretty much proves that the i915-sriov driver is the culprit. You can always just dkms install it again later on if you need virtual functions back.
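
The remove and reinstall steps could look roughly like the sketch below. The module name and version are placeholders, so take the real values from `dkms status`. The `run` helper is hypothetical and only prints the commands unless `DRY_RUN=0` is set.

```shell
# Print a command, and execute it only when DRY_RUN=0 (default is a dry run).
run() {
    echo "+ $*"
    [ "${DRY_RUN:-1}" = "1" ] || "$@"
}

# Remove the DKMS module (name and version as listed by `dkms status`)
# from every kernel it was built for.
disable_sriov_module() {
    run dkms remove -m "$1" -v "$2" --all
}

# Build and install the module again when virtual functions are needed.
enable_sriov_module() {
    run dkms install -m "$1" -v "$2"
}

# Example (placeholder name and version):
#   dkms status
#   DRY_RUN=0 disable_sriov_module i915-sriov-dkms <version>
```

Depending on the dkms version, a `dkms add` may be needed before `dkms install` works again, and a reboot is probably the simplest way to make sure the replacement driver is actually loaded.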

devedse commented Mar 19, 2024

@brussig-tud , that's exactly the answer I was looking for.

So I don't need to remove this from grub:

intel_iommu=on i915.enable_guc=3 i915.max_vfs=7

And also don't need to remove this file:

/etc/sysfs.conf

?

brussig-tud commented Mar 19, 2024

@devedse The "vanilla" i915 driver will ignore the max_vfs kernel boot parameter, and the sysfs entry will just silently fail if the driver does not provide the endpoints, so yeah, you can leave them in place.

I don't remember whether just not creating VFs via sysfs was enough to fix the memory leak, or if you also had to set max_vfs=0, or if you had to completely disable GuC scheduling altogether (which should also cause this driver to not leak memory). You can try narrowing it down further like this, but removing the DKMS module will surely prove or disprove the hypothesis that this driver is causing your memory leak and you can leave the other things there in case you need them later.
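
To narrow it down, it may help to see whether the loaded driver exposes the SR-IOV sysfs endpoints at all and how many VFs are currently created. A minimal sketch follows. `sriov_numvfs` and `sriov_totalvfs` are the standard PCI sysfs attributes, while the function name and the device address in the usage comment are assumptions.

```shell
# Report the SR-IOV state of a PCI device, given its sysfs directory.
sriov_state() {
    dev="$1"
    if [ ! -e "$dev/sriov_numvfs" ]; then
        # The bound driver does not provide the SR-IOV endpoint.
        echo "no-sriov-endpoint"
        return 0
    fi
    echo "numvfs=$(cat "$dev/sriov_numvfs") totalvfs=$(cat "$dev/sriov_totalvfs" 2>/dev/null)"
}

# Usage (0000:00:02.0 is the usual iGPU address, but verify it with lspci):
#   sriov_state /sys/bus/pci/devices/0000:00:02.0
# Removing all VFs at runtime, as root:
#   echo 0 > /sys/bus/pci/devices/0000:00:02.0/sriov_numvfs
```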

devedse commented Mar 19, 2024

@brussig-tud , Thanks for the explanation.

To keep things further on topic, do you know any place to more casually discuss this stuff further? IRC/Discord? I'm curious what you all use SRIOV for.

Edit:
Here's the output of smem -twk:

root@proxmox1:~# smem -twk
Area                           Used      Cache   Noncache 
firmware/hardware                 0          0          0 
kernel image                      0          0          0 
kernel dynamic memory         10.8G       6.5G       4.3G 
userspace memory              14.6G     774.0M      13.9G 
free memory                    5.7G       5.7G          0 
----------------------------------------------------------
                              31.1G      12.9G      18.2G

So indeed I also seem to be using quite some kernel dynamic memory.
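
When reading this output, the Noncache column is probably the more telling one, since the Cache part should mostly be reclaimable under memory pressure (this is an interpretation of smem's columns, not something confirmed in this thread). A small helper, with a made-up name, to pull that value out so it can be logged over time:

```shell
# Print the Noncache column of the "kernel dynamic memory" row
# from `smem -twk` output on stdin.
kernel_noncache() {
    awk '/^kernel dynamic memory/ { print $NF }'
}

# The relevant lines of the output quoted above, fed through the helper:
kernel_noncache <<'EOF'
Area                           Used      Cache   Noncache 
kernel dynamic memory         10.8G       6.5G       4.3G 
EOF
# prints 4.3G

# On a live system:
#   smem -twk | kernel_noncache
```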

brussig-tud commented Mar 19, 2024

@devedse "do you know any place to more casually discuss this stuff further? IRC/Discord?"
No idea, sorry... As for me, I just need a VM with a working virtualized GPU for cross-platform graphics development. But I don't want to pass through my whole NVIDIA GPU, and passing through the full iGPU usually doesn't work for Windows guests, whereas mapping a virtual function to the VM works really well.

In general, I think SR-IOV is mainly used on NICs as a sort of high-performance ethernet bridge for VMs.

devedse commented Mar 19, 2024

@brussig-tud , I just removed the dkms module and rebooted the system. Now the whole /dev/dri folder seems to be missing though. Am I missing the normal drivers or something to get the intel N100 working again?

I played around a bit and I found out that reverting to kernel 6.2 seems to solve the issue. Does the 6.5 kernel not actually have an i915 driver included?

brussig-tud commented Mar 20, 2024

@devedse I have actually no experience with Proxmox whatsoever, but that seems very unlikely to me (after all every other Debian-based distro usually packages the i915 driver for every officially available kernel version). You can try to modprobe i915 on the 6.5 kernel and see if it tells you something.

Edit: you should definitely check what driver is being assigned to the iGPU using lspci -nnk.
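
A small filter over the lspci output can reduce it to the one line that matters. This is a sketch with a hypothetical function name. It assumes the usual `lspci -nnk` layout, where each device header starts in column one and its details are indented.

```shell
# Print the kernel driver bound to each GPU-like device
# found in `lspci -nnk` output on stdin.
gpu_driver() {
    awk '
        # A non-indented line starts a new device; remember whether it is a GPU.
        /^[^ \t]/ { in_gpu = ($0 ~ /VGA compatible controller|Display controller|3D controller/) }
        in_gpu && /Kernel driver in use:/ { print $NF }
    '
}

# Usage:
#   lspci -nnk | gpu_driver
```

An empty result would suggest that no driver is bound to the iGPU, which would be consistent with a missing /dev/dri. Note that the DKMS driver presumably also registers under the name i915, so this check alone cannot tell the two drivers apart.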

If everything else fails, keeping the i915-sriov DKMS driver with num_vfs=0 (and potentially disabled GuC scheduling, i.e. enable_guc=2) might also get rid of the memory leak. If you want fully accelerated hardware media encoding you need HuC firmware loading, so no enable_guc=1 or lower, which would be the default if you omit the kernel parameter.
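
The enable_guc values are easier to reason about as a bitmask. To my understanding of the upstream i915 parameter, bit 0 selects GuC submission, bit 1 selects HuC firmware loading, and -1 means "auto". The decoder below is only an illustration of that reading, with a made-up function name.

```shell
# Decode an i915.enable_guc value into its two feature bits.
# Assumption: bit 0 = GuC submission, bit 1 = HuC load, -1 = auto.
describe_enable_guc() {
    v="$1"
    if [ "$v" = "-1" ]; then
        echo "auto"
        return 0
    fi
    guc="off"
    huc="off"
    [ $(( v & 1 )) -ne 0 ] && guc="on"
    [ $(( v & 2 )) -ne 0 ] && huc="on"
    echo "guc_submission=$guc huc_load=$huc"
}

# Usage, reading the value of the currently loaded module:
#   describe_enable_guc "$(cat /sys/module/i915/parameters/enable_guc)"
```

Under this reading, enable_guc=3 turns on both, and enable_guc=2 keeps HuC loading while leaving GuC submission off.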

devedse commented Mar 20, 2024

Apparently the problem was that the "i915.ko" file seemed to be missing in the modules folder.

I had to reinstall the kernel by doing the following:

dpkg --search /usr/lib/modules/<kernel version directory>

apt-get --reinstall install proxmox-kernel-6.5.13-1-pve-signed

That fixed my issues.
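
For anyone who wants to check for the same situation before reinstalling, a quick test for the module file could look like the sketch below. The function name is hypothetical, and the pattern also matches compressed variants such as i915.ko.xz or i915.ko.zst.

```shell
# Succeed if an i915 kernel module file exists below the given modules directory.
has_i915_module() {
    moddir="$1"
    [ -n "$(find "$moddir" -name 'i915.ko*' -print 2>/dev/null | head -n 1)" ]
}

# Usage for the running kernel:
#   has_i915_module "/lib/modules/$(uname -r)" && echo present || echo missing
```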

gfgjs commented Apr 8, 2024

Yes, I also encountered this issue, but after I rolled back the PVE kernel to 6.2.16-20-PVE, the memory usage was normal, and SR-IOV could also be used normally.
