amd = "No GPU to monitor" #250

Open
ehartford opened this issue Nov 2, 2023 · 12 comments

Comments

@ehartford

Hello, I get the error message "No GPU to monitor" even though my cards show up in rocm-smi and lspci.

(textgen) eric@quixi1:~/text-generation-webui$ rocm-smi --showproductname


========================= ROCm System Management Interface =========================
=================================== Product Info ===================================
GPU[0]          : Card series:          Arcturus GL-XL [Instinct MI100]
GPU[0]          : Card model:           0x0c34
GPU[0]          : Card vendor:          Advanced Micro Devices, Inc. [AMD/ATI]
GPU[0]          : Card SKU:             D3431401
GPU[1]          : Card series:          Arcturus GL-XL [Instinct MI100]
GPU[1]          : Card model:           0x0c34
GPU[1]          : Card vendor:          Advanced Micro Devices, Inc. [AMD/ATI]
GPU[1]          : Card SKU:             D3431401
GPU[2]          : Card series:          Arcturus GL-XL [Instinct MI100]
GPU[2]          : Card model:           0x0c34
GPU[2]          : Card vendor:          Advanced Micro Devices, Inc. [AMD/ATI]
GPU[2]          : Card SKU:             D3431401
GPU[3]          : Card series:          Arcturus GL-XL [Instinct MI100]
GPU[3]          : Card model:           0x0c34
GPU[3]          : Card vendor:          Advanced Micro Devices, Inc. [AMD/ATI]
GPU[3]          : Card SKU:             D3431401
GPU[4]          : Card series:          Arcturus GL-XL [Instinct MI100]
GPU[4]          : Card model:           0x0c34
GPU[4]          : Card vendor:          Advanced Micro Devices, Inc. [AMD/ATI]
GPU[4]          : Card SKU:             D3431401
GPU[5]          : Card series:          Arcturus GL-XL [Instinct MI100]
GPU[5]          : Card model:           0x0c34
GPU[5]          : Card vendor:          Advanced Micro Devices, Inc. [AMD/ATI]
GPU[5]          : Card SKU:             D3431401
GPU[6]          : Card series:          Arcturus GL-XL [Instinct MI100]
GPU[6]          : Card model:           0x0c34
GPU[6]          : Card vendor:          Advanced Micro Devices, Inc. [AMD/ATI]
GPU[6]          : Card SKU:             D3431401
GPU[7]          : Card series:          Arcturus GL-XL [Instinct MI100]
GPU[7]          : Card model:           0x0c34
GPU[7]          : Card vendor:          Advanced Micro Devices, Inc. [AMD/ATI]
GPU[7]          : Card SKU:             D3431401
====================================================================================
=============================== End of ROCm SMI Log ================================
(textgen) eric@quixi1:~/text-generation-webui$ lspci | egrep -i "display|vga"
03:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Arcturus GL-XL [Instinct MI100] (rev 01)
23:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Arcturus GL-XL [Instinct MI100] (rev 01)
26:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Arcturus GL-XL [Instinct MI100] (rev 01)
43:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Arcturus GL-XL [Instinct MI100] (rev 01)
64:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 41)
83:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Arcturus GL-XL [Instinct MI100] (rev 01)
a3:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Arcturus GL-XL [Instinct MI100] (rev 01)
c3:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Arcturus GL-XL [Instinct MI100] (rev 01)
c6:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Arcturus GL-XL [Instinct MI100] (rev 01)
(textgen) eric@quixi1:~/text-generation-webui$ nvtop 
No GPU to monitor.
@qwertychouskie
Contributor

It's not surprising that this isn't working, given that the card is based on CDNA rather than GCN or RDNA. It's quite possible that kernel APIs are missing, and even if not, I doubt any dev of nvtop has a test card available to them. I personally would be inclined to close this issue as wontfix, but @Syllo would know better than I would whether implementing support is possible.

@Syllo
Owner

Syllo commented Feb 23, 2024

If I had access to such a card, I could try to add support, provided there is a way to discover these GPUs. If it's not registering through the drm driver, I'm not surprised it's not showing up in nvtop.

@supernovae

Same problem with 7900xtx

bymiller@byron-X570:~$ nvtop
No GPU to monitor.

rocm-smi --showproductname

============================ ROCm System Management Interface ============================
====================================== Product Info ======================================
GPU[0] : Card series: 0x744c
GPU[0] : Card model: 0x2422
GPU[0] : Card vendor: Advanced Micro Devices, Inc. [AMD/ATI]
GPU[0] : Card SKU: EXT84765

================================== End of ROCm SMI Log ===================================

@supernovae

bymiller@byron-X570:~$ sudo dmesg | grep drm
[ 3.645163] ACPI: bus type drm_connector registered
[ 4.921387] [drm] amdgpu kernel modesetting enabled.
[ 4.921389] [drm] amdgpu version: 6.3.6
[ 4.921390] [drm] OS DRM version: 6.5.0
[ 4.935928] [drm] initializing kernel modesetting (IP DISCOVERY 0x1002:0x744C 0x148C:0x2422 0xC8).
[ 4.935939] [drm] register mmio base: 0xFCC00000
[ 4.935940] [drm] register mmio size: 1048576
[ 4.940610] [drm] add ip block number 0 <soc21_common>
[ 4.940612] [drm] add ip block number 1 <gmc_v11_0>
[ 4.940613] [drm] add ip block number 2 <ih_v6_0>
[ 4.940614] [drm] add ip block number 3
[ 4.940615] [drm] add ip block number 4
[ 4.940617] [drm] add ip block number 5
[ 4.940618] [drm] add ip block number 6 <gfx_v11_0>
[ 4.940619] [drm] add ip block number 7 <sdma_v6_0>
[ 4.940620] [drm] add ip block number 8 <vcn_v4_0>
[ 4.940621] [drm] add ip block number 9 <jpeg_v4_0>
[ 4.940622] [drm] add ip block number 10 <mes_v11_0>
[ 4.946689] [drm] VCN(0) encode/decode are enabled in VM mode
[ 4.946691] [drm] VCN(1) encode/decode are enabled in VM mode
[ 4.947687] amdgpu 0000:0a:00.0: [drm:jpeg_v4_0_early_init [amdgpu]] JPEG decode is enabled in VM mode
[ 4.949179] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
[ 4.949194] [drm] Detected VRAM RAM=24560M, BAR=32768M
[ 4.949196] [drm] RAM width 384bits GDDR6
[ 4.949283] [drm] amdgpu: 24560M of VRAM memory ready
[ 4.949284] [drm] amdgpu: 32107M of GTT memory ready.
[ 4.949299] [drm] GART: num cpu pages 131072, num gpu pages 131072
[ 4.949364] [drm] PCIE GART of 512M enabled (table at 0x0000008001300000).
[ 4.949718] [drm] Loading DMUB firmware via PSP: version=0x07002100
[ 4.950212] [drm] Found VCN firmware Version ENC: 1.16 DEC: 5 VEP: 0 Revision: 6
[ 5.020366] [drm] reserve 0x1300000 from 0x85fc000000 for PSP TMR
[ 5.349912] [drm] Display Core v3.2.255 initialized on DCN 3.2
[ 5.349914] [drm] DP-HDMI FRL PCON supported
[ 5.351765] [drm] DMUB hardware initialized: version=0x07002100
[ 5.569780] [drm] kiq ring mec 3 pipe 1 q 0
[ 5.577336] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[ 5.577955] amdgpu 0000:0a:00.0: [drm:jpeg_v4_0_hw_init [amdgpu]] JPEG decode initialized successfully.
[ 5.682527] [drm] ring gfx_32768.1.1 was added
[ 5.682796] [drm] ring compute_32768.2.2 was added
[ 5.683000] [drm] ring sdma_32768.3.3 was added
[ 5.683057] [drm] ring gfx_32768.1.1 ib test pass
[ 5.683107] [drm] ring compute_32768.2.2 ib test pass
[ 5.683132] [drm] ring sdma_32768.3.3 ib test pass
[ 5.684846] [drm] Initialized amdgpu 3.56.0 20150101 for 0000:0a:00.0 on minor 0
[ 5.691939] fbcon: amdgpudrmfb (fb0) is primary device
[ 5.691942] amdgpu 0000:0a:00.0: [drm] fb0: amdgpudrmfb frame buffer device
[ 5.788512] [drm] DSC precompute is not needed.
[ 6.253192] systemd[1]: Starting Load Kernel Module drm...
[ 6.258790] systemd[1]: modprobe@drm.service: Deactivated successfully.
[ 6.258990] systemd[1]: Finished Load Kernel Module drm.

@ehartford
Author

> If I had access to such a card, I could try to add support, provided there is a way to discover these GPUs. If it's not registering through the drm driver, I'm not surprised it's not showing up in nvtop.

Hey, I'm happy to give you access to my server.

@johncadengo

johncadengo commented Mar 30, 2024

I'm experiencing this issue as well

@numas

numas commented Apr 25, 2024

This also happens for me on a Radeon RX 7900 XTX (as well as on a Radeon RX 7900 XT)

====================================== ROCm System Management Interface ======================================
================================================ Concise Info ================================================
Device  [Model : Revision]    Temp    Power  Partitions      SCLK     MCLK   Fan  Perf  PwrCap  VRAM%  GPU%  
        Name (20 chars)       (Edge)  (Avg)  (Mem, Compute)                                                  
==============================================================================================================
0       [0x471e : 0xc8]       30.0°C  69.0W  N/A, N/A        1564Mhz  96Mhz  0%   auto  303.0W    0%   56%   
        0x744c                                                                                               
==============================================================================================================
============================================ End of ROCm SMI Log =============================================

@numas

numas commented May 3, 2024

Added the ids for RX 7900 XTX / XT myself to src/amdgpu_ids.h - it works now: #293

Regarding the MI100 card, I would guess the line would be:

{0x0C34, 0x01, "AMD Instinct MI100"},

@Umio-Yasuno

> Added the ids for RX 7900 XTX / XT myself to src/amdgpu_ids.h - it works now: #293
>
> Regarding the MI100 card, I would guess the line would be:
>
> {0x0C34, 0x01, "AMD Instinct MI100"},

Really? What you are adding is the SubDeviceID, not the DeviceID, and nvtop doesn't use the SubDeviceID.
Also, amdgpu_ids.h is only used to get the name.

@numas

numas commented May 4, 2024

> Really? What you are adding is the SubDeviceID, not the DeviceID, and nvtop doesn't use the SubDeviceID. Also, amdgpu_ids.h is only used to get the name.

Well, nvtop went from "No GPU to monitor" to this:

[Screenshot: nvtop_7900xtx]

This is the information on 7900 XTX from https://gitlab.freedesktop.org/mesa/drm/-/blob/main/data/amdgpu.ids

744C, C8, AMD Radeon RX 7900 XTX

I took 0x471e from rocm-smi, but is this the SubDeviceID? As you can see, some info is missing (N/A), so maybe that is the reason. I'll try with 0x744c, as I somehow missed that.

Just weird that the OP couldn't get nvtop to start with the MI100 as the DeviceID in amdgpu_ids.h should be correct.

UPDATE: Changed to DeviceID 0x744C in amdgpu_ids.h and the nvtop output is identical to the screenshot above. Is more code needed to support the 7900 XTX / XT in nvtop than adding the DeviceID?

@Umio-Yasuno

@numas
Hmm, have you tried the unpatched build?
nvtop gets the device name from libdrm_amdgpu, and falls back to the list in amdgpu_ids.h when that fails.
If the list does not contain the device name either, the driver name, such as "AMD GPU", is used instead.
So it seems strange that adding the device name to the list would make nvtop recognize the device.

@numas

numas commented May 4, 2024

Thank you @Umio-Yasuno !

You are correct, the unpatched build works (though still with some N/A info). I was comparing with the distro-provided nvtop, which is old (1.2.2 in Ubuntu 22.04), and went straight to hacking instead of checking a clean build first...

Sorry for the noise, I will remove the pull request.
