Cmake output saying 'Configuring kernel module without CUDA' #28
Hi, It's strange that it reports having found the Nvidia driver, but the string appears empty. Regardless, it is possible to override the location using |
But I can't find the path to the driver source. In my /usr/src/ directory there is no nvidia-driver--.
I try to find it with the
|
I'm not really familiar with the Xavier, but you can try locating the directory manually by running |
You can attempt using the either Hope it works! |
Hi,
|
Hi, Yes, it's looking for the Module.symvers, but on the Tegra the Nvidia driver seems to be compiled into the kernel. You can either try modifying this line and removing the check for Module.symvers. If that doesn't work, you could try modifying this line and add The third option is perhaps modifying the generated Makefile for the kernel module after running CMake. |
It may actually be easier to just add You may need to include the header file location, so the line becomes: |
I think the problem comes from finding Nvidia driver symbols (lines 66 - 68). I tried to locate the
So I checked in CMake, and removed the third condition on line 144
The commands |
It appears that the symbols are wildly different on the Xavier/Tegra kernel than for x86. Regardless, my understanding is that the main system memory is shared by the on-board GPU on all of the Tegra SoCs, so it might be the case that those calls aren't really necessary to begin with. I need to do some more investigations into that. |
I made some modifications in the
After these modifications, the module has been successfully compiled without any error. However, when I tried the
|
Great!
If you have loaded the module, then you should invoke the Apologies for the documentation, it isn't really up to date. That being said, it seems to me that the DMA is not working properly. Additionally, I also believe that some of the Tegras are not cache coherent. If the Xavier isn't either, then you might need to add code to flush the cache in the queue functions. |
I have loaded the module
I run the
I also verified if the IOMMU is enabled with I looked for DMA faults with
|
This is not the character device created by the nvme module, this is the block device created by the built-in Linux NVMe driver. You need to unbind the driver for the NVMe and then reload the libnvm driver. |
Sorry, I don't know how to unbind the driver for the NVMe and then reload the libnvm driver |
No problem,
Reloading |
Thanks. I ran the echo command with sudo but I get a permission denied error.
When I plugged my SSD into the Xavier M.2 Key M slot, I formatted it as ext4, mounted it on a directory (located in root), and added this |
Yeah, it's most likely mounted. Try unmounting it (using |
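One way to confirm that the SSD is still mounted before trying to unbind the driver is to look it up in `/proc/mounts`. The sketch below is only illustrative: the helper name `is_mounted`, the `MOUNTS_FILE` override and the device name `/dev/nvme0n1` are assumptions, not something taken from this thread.

```shell
#!/bin/sh
# Sketch: print the mount points that use a given block device (or any of
# its partitions) and return non-zero when there are none.
# MOUNTS_FILE is a hypothetical override, useful for trying the function
# on a sample file; by default the kernel's /proc/mounts is read.
is_mounted() {
    dev_prefix="$1"   # e.g. /dev/nvme0n1 (assumed device name)
    mounts_file="${MOUNTS_FILE:-/proc/mounts}"
    # Each line of /proc/mounts is: device mountpoint fstype options dump pass
    awk -v dev="$dev_prefix" '
        index($1, dev) == 1 { print $2; found = 1 }
        END { exit (found ? 0 : 1) }' "$mounts_file"
}
```

Calling `is_mounted /dev/nvme0n1` lists every mount point backed by that device; if it prints anything, unmount those paths first and comment out the matching fstab entry so the device is not remounted at boot.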
My apologies for the late response, I was not working yesterday. So I have unmounted the SSD and commented the line in fstab, and rebooted the Xavier. Next, I tried to unbind the driver for NVMe with
but I still get a permission denied message:
|
I suspect that the problem is that sudo is not evaluating the pipe operator, so that only echo is run with elevated privileges. You could start a shell (using After unbinding the NVMe driver and loading the libnvm module, please confirm running |
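The permission problem described above comes from the calling shell: with `sudo echo ... > file` the redirection is opened by the unprivileged shell, and with `sudo echo ... | tee file` only `echo` is elevated. A common workaround, besides starting a root shell, is to run the writing command itself under sudo. The following is a minimal sketch; the function name `unbind_nvme`, the `SYSFS_ROOT` and `SUDO` overrides and the PCI address `0000:01:00.0` are assumptions for illustration, so substitute the address that `lspci` reports for the SSD.

```shell
#!/bin/sh
# Sketch: detach a PCI device from the built-in Linux nvme driver.
# SYSFS_ROOT and SUDO are hypothetical overrides for dry runs against a
# scratch directory; by default the real sysfs and sudo are used.
unbind_nvme() {
    bdf="$1"   # PCI address, e.g. 0000:01:00.0 (assumed value)
    unbind_file="${SYSFS_ROOT:-/sys}/bus/pci/drivers/nvme/unbind"
    # tee is the process that opens the sysfs file, so tee is what must
    # run with elevated privileges; printf itself needs none.
    printf '%s' "$bdf" | ${SUDO-sudo} tee "$unbind_file" > /dev/null
}
```

An equivalent form is `sudo sh -c 'echo -n 0000:01:00.0 > /sys/bus/pci/drivers/nvme/unbind'`, where the whole command line, including the redirection, is evaluated by a root shell.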
With
However, running the |
It sounds to me like the system is crashing. Is it possible to run |
I run Now, I want to run the CUDA example, so I run |
You need to run Look at the output from running with the |
It seems that this happens quite a while after loading the module, but I am unsure what causes it. I initially suspected that there was an issue with memory corruption, but it doesn't seem to be the case since it completed with what appears to be correct data in your screenshot above. I have access to a Xavier at work, but I doubt I will have time to look into this until over new years. Any input or experience you have testing this out is very valuable to me, so I appreciate it. |
I tried to reproduce the output seen in the screenshot above, but the system continues to crash. I saved the output of I am not familiar with kernel logs and drivers, so I will read some documentation as well as the NVMe specification and try to understand why this problem occurs. |
Out of interest, could you try unloading the libnvm module/rebooting and run the |
I rebooted the system, reloaded the libnvm and launched the |
Sorry, I meant not loading the libnvm module at all. Just unbinding the kernel NVMe driver, so that |
Hi, I rebooted the system and unbound the kernel NVMe driver. |
Thank you for testing this for me. I just realized that you have a SATA controller on the other PCI domain, so one last thing to test, just in case the BDF parsing is wrong, is to do the same as above but also You don't need to wait 30 minutes, I think if it freezes and does not immediately return, it's safe to assume that it has stalled. I'm really not sure what is going wrong. It seems really strange that it stalls the system like this, I will have to take a look at it over new years. We could try adding print statements between the individual steps in the identify userspace example ( Again, thank you so much for testing it. |
I did the same as above with |
I added some print statements in In the last line of the attached file The third file |
So if the IOMMU is enabled, that explains why it hangs in the However, previously the identify operation did succeed (when you used the kernel module), but I also see in those logs that the IOMMU most likely was on (which is strange). |
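To check whether the SSD actually sits behind an IOMMU, two quick probes are the device's `iommu_group` symlink in sysfs and the kernel log. This is a rough sketch, not part of the project: the helper names `iommu_group_of` and `iommu_messages`, the `SYSFS_ROOT` override and the example PCI address are assumptions.

```shell
#!/bin/sh
# Sketch: print the IOMMU group number of a PCI device, or fail if the
# device has no iommu_group link (i.e. no IOMMU translation is set up).
# SYSFS_ROOT is a hypothetical override for trying this on a scratch tree.
iommu_group_of() {
    bdf="$1"   # PCI address, e.g. 0000:01:00.0 (assumed value)
    link="${SYSFS_ROOT:-/sys}/bus/pci/devices/$bdf/iommu_group"
    [ -L "$link" ] || return 1
    # The link points at .../kernel/iommu_groups/<number>
    basename "$(readlink "$link")"
}

# Sketch: keep only kernel log lines that mention the SMMU/IOMMU.
# Reads from stdin, e.g.:  dmesg | iommu_messages
iommu_messages() {
    grep -i -E 'smmu|iommu'
}
```

If `iommu_group_of` prints a number, DMA from the device is being translated, so I/O addresses handed to the controller must come from a proper IOMMU mapping rather than raw physical addresses.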
I will try to disable the SMMU for the PCIe controller-0 (to which my SSD is connected, according to the screenshot showing the IOMMU context fault error message). Maybe that will solve the problem! |
Hi, To disable SMMU for PCIe controller 0, I modified the device tree and used instructions in comment #4 of https://devtalk.nvidia.com/default/topic/1043746/jetson-agx-xavier/pcie-smmu-issue/ and then reflashed my Xavier board with the new device tree binary. I verified that SMMU is disabled by extracting the current device tree on my Xavier and I found that there is no entry for the SMMU. Next, I tried to run the |
Hi, Did you have time to look at what is going wrong on the Xavier platform? I posted the problem on the Nvidia forum ( https://devtalk.nvidia.com/default/topic/1069024/jetson-agx-xavier/pcie-smmu-issues-on-pcie-c0-with-an-nvme-ssd-connected-to-the-m-2-key-m-slot/ ). It seems that Xavier AGX does not support the PCIe P2P protocol and this could explain the behavior observed. |
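The device tree check described above can be reproduced roughly as follows: dump the live tree with `dtc -I fs -O dts -o current.dts /proc/device-tree`, then look for an `iommus` property inside the PCIe controller node. The helper below is a sketch under assumptions: the function name `node_iommus` is made up, and the node name passed to it must be read from your own dump rather than taken from here.

```shell
#!/bin/sh
# Sketch: print any "iommus = ..." lines found inside one node of a
# device tree source file. Prints nothing when the node has no such
# property, which suggests the SMMU is not attached to that controller.
node_iommus() {
    node="$1"   # node name as it appears in the dump (caller supplied)
    dts="$2"    # path to the .dts produced by dtc
    awk -v node="$node" '
        !inside && index($0, node " {") { inside = 1; depth = 0 }
        inside {
            if ($0 ~ /iommus[ \t]*=/) print
            # Track brace depth so nested child nodes stay in scope
            depth += gsub(/[{]/, "{") - gsub(/[}]/, "}")
            if (depth <= 0) inside = 0
        }' "$dts"
}
```

Running it against the dump taken before and after reflashing shows whether the modified device tree binary is really the one in use.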
Hi, sorry for the late reply. Yes, I have discussed this with one of my colleagues that is more familiar with Tegras/Xavier than me, and yes, I don't think it is possible to disable the SMMU/IOMMU, which is going to disrupt peer-to-peer DMA. Some time in the future, I will look into using the IOMMU API for the kernel module (SmartIO/SISCI already supports this, which is why it is not prioritized), but don't expect this to be soon. If you can live with the limitations of not using peer-to-peer, @cooldavid implemented VFIO support for the identify controller example: #23 |
Hi, I tested the implementation mentioned in #23 but my Xavier still continues to freeze. I can live with the limitations of not using peer-to-peer. In fact, I want to be able to perform DMA transfers between the SSD and Xavier system memory (since access to the GPU memory is seemingly not possible on the Xavier). If GPU memory is not used, there is no need for the module (as it is the one that contains the Nvidia files using peer-to-peer). Would it be possible to modify the sources of the ssd-gpu-nvm project to perform DMA transfers between the CPU and the SSD without using peer-to-peer at all? I'm not familiar with driver writing, but I have read some documents on the NVMe specification. If I keep the default Nvidia nvm driver loaded, would it be possible to do something similar to the CUDA example, but without using GPU memory? I mean interacting with the NVMe controller and performing DMA transfers through the different stages that we see in the figure below. |
It should be possible, but this is exactly what the identify example does, and yet the system freezes. It may be your modifications to the kernel module; you can try recompiling it without CUDA support and the modified calls to the various Nvidia functions. The problem is the IOMMU/SMMU: if it isn't possible to disable, then there must be some code that sets up the correct mappings so that the I/O addresses are translated into the correct physical addresses. However, if the VFIO example also caused the system to freeze, there must be something else at fault. In this case, my kernel module should not be in use at all and the Linux kernel should be able to set up the correct IOMMU groups. I don't know what is wrong in this case. |
Hi,
I have a Jetson Xavier AGX kit board and I plugged an NVMe SSD into the M.2 Key M slot. Now, I'm trying to install your libnvm on my Xavier and I see the following message in the CMake output:
How can I force CMake to build with CUDA?
Thanks