Control display is undefined #12

Open
rajamarwah opened this issue Dec 19, 2018 · 19 comments

@rajamarwah

rajamarwah commented Dec 19, 2018

Please help

Unable to init server: Could not connect: Connection refused

ERROR: The control display is undefined; please run `nvidia-settings --help` for usage information.

Number of Fans detected:
Unable to init server: Could not connect: Connection refused

ERROR: The control display is undefined; please run `nvidia-settings --help` for usage information.

Number of GPUs detected:
./temp.sh: line 184: [: : integer expression expected
Submit an issue on my GitHub page... happy to fix this :D
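
A note on the last error above: `integer expression expected` is what `[` prints when one side of a numeric comparison is an empty string, which is presumably what the failed nvidia-settings query left behind. Below is a minimal sketch of a guard for that case. The helper names `is_uint` and `count_or_zero` are hypothetical and are not taken from temp.sh.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only when $1 is a non-empty string of digits.
is_uint() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Hypothetical wrapper: print the value if it is a number, otherwise print 0,
# so a later `[ "$n" -gt 0 ]` never sees an empty operand.
count_or_zero() {
    if is_uint "$1"; then
        printf '%s\n' "$1"
    else
        printf '0\n'
    fi
}

# Example: an empty query result becomes 0 instead of breaking the test.
num_gpus=$(count_or_zero "")
if [ "$num_gpus" -gt 0 ]; then
    printf 'GPUs detected: %s\n' "$num_gpus"
else
    printf 'No GPUs detected\n'
fi
```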
@nan0s7
Owner

nan0s7 commented Dec 20, 2018

Hmm... can I have some more information about your setup? Like are you using the X display server, do you have coolbits enabled, etc.

Post the output of the two commands nvidia-settings -q dpys and nvidia-settings -q screens please.

It sounds like you have a strange display configuration, which doesn't use the default display :0.
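
For anyone collecting the same information, here is a small sketch that reports whether DISPLAY is set in the current shell and then runs the two queries mentioned above. The function name `describe_display` is made up for illustration; the nvidia-settings calls are the ones quoted in this comment.

```shell
#!/bin/sh
# Hypothetical helper: say which X display clients started from this shell would target.
describe_display() {
    if [ -n "${DISPLAY:-}" ]; then
        printf 'DISPLAY is set to %s\n' "$DISPLAY"
    else
        printf 'DISPLAY is unset\n'
    fi
}

describe_display

# Run the two queries requested above, but only if nvidia-settings is installed.
if command -v nvidia-settings >/dev/null 2>&1; then
    nvidia-settings -q dpys    || printf 'dpys query failed\n'
    nvidia-settings -q screens || printf 'screens query failed\n'
fi
```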

@rajamarwah
Author

I have a 6-GPU (1080 Ti) machine with an Asus Prime Z370-A motherboard and Linux 18.04. Both commands give the same result:
Unable to init server: Could not connect: Connection refused

ERROR: The control display is undefined; please run nvidia-settings --help for usage information.

Output of Nvidia-smi is:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.78       Driver Version: 410.78       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  On   | 00000000:01:00.0 Off |                  N/A |
|  0%   28C    P8    13W / 160W |     31MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  On   | 00000000:02:00.0 Off |                  N/A |
|  0%   24C    P8    10W / 160W |      9MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108...  On   | 00000000:04:00.0 Off |                  N/A |
|  0%   24C    P8    11W / 160W |      9MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 108...  On   | 00000000:05:00.0 Off |                  N/A |
|  0%   27C    P8    11W / 160W |      9MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  GeForce GTX 108...  On   | 00000000:06:00.0 Off |                  N/A |
|  0%   26C    P8    10W / 160W |      9MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  GeForce GTX 108...  On   | 00000000:08:00.0 Off |                  N/A |
|  0%   27C    P8     8W / 160W |      9MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1155      G   /usr/lib/xorg/Xorg                            21MiB |
|    0      1622      G   /usr/bin/gnome-shell                           7MiB |
|    1      1155      G   /usr/lib/xorg/Xorg                             6MiB |
|    2      1155      G   /usr/lib/xorg/Xorg                             6MiB |
|    3      1155      G   /usr/lib/xorg/Xorg                             6MiB |
|    4      1155      G   /usr/lib/xorg/Xorg                             6MiB |
|    5      1155      G   /usr/lib/xorg/Xorg                             6MiB |
+-----------------------------------------------------------------------------+

@nan0s7
Owner

nan0s7 commented Dec 23, 2018

This is weird; it seems like the NVIDIA drivers aren't finding your X server. This isn't a problem with my script, but I'm happy to help as much as I can.

What's the output of lspci -nnk? It might show that the wrong drivers are in use. Also, how are you controlling your machine? Do you have a desktop environment running? I can see gnome-shell is running, but I'm not sure whether that can happen in the background or something.
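
Since the full lspci -nnk listing is long, a filter that keeps only the display adapters and their bound kernel driver may be easier to read. This is a rough sketch; the function name `gpu_drivers` is hypothetical, and it assumes the default lspci layout where each device line starts with a `bus:device.function` address.

```shell
#!/bin/sh
# Hypothetical filter: read `lspci -nnk` output on stdin and print
# "<address> <driver>" for each VGA controller that has a driver bound.
gpu_drivers() {
    awk '
        # A new device entry starts with an address such as 01:00.0
        /^[0-9a-f]+:[0-9a-f]+\.[0-9a-f] / {
            dev = ($0 ~ /VGA compatible controller/) ? $1 : ""
        }
        # Report the driver only while inside a VGA device entry
        /Kernel driver in use:/ && dev != "" { print dev, $NF; dev = "" }
    '
}

# Only query the bus if lspci is available on this machine.
if command -v lspci >/dev/null 2>&1; then
    lspci -nnk | gpu_drivers
fi
```

On a working setup each NVIDIA card should be paired with the nvidia driver rather than nouveau.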

@rajamarwah
Author

Sincerely appreciate the help and support in troubleshooting.

I have an 18.04 desktop environment running, but the display is currently disabled (maybe due to my tweaking; I'm a noob).

Here's the output:

00:00.0 Host bridge [0600]: Intel Corporation Device [8086:3e1f] (rev 08)
Subsystem: ASUSTeK Computer Inc. Device [1043:8694]
00:01.0 PCI bridge [0604]: Intel Corporation Skylake PCIe Controller (x16) [8086:1901] (rev 08)
Kernel driver in use: pcieport
Kernel modules: shpchp
00:01.1 PCI bridge [0604]: Intel Corporation Skylake PCIe Controller (x8) [8086:1905] (rev 08)
Kernel driver in use: pcieport
Kernel modules: shpchp
00:02.0 VGA compatible controller [0300]: Intel Corporation Device [8086:3e91]
Subsystem: ASUSTeK Computer Inc. Device [1043:8694]
Kernel driver in use: i915
Kernel modules: i915
00:14.0 USB controller [0c03]: Intel Corporation 200 Series PCH USB 3.0 xHCI Controller [8086:a2af]
Subsystem: ASUSTeK Computer Inc. 200 Series PCH USB 3.0 xHCI Controller [1043:8694]
Kernel driver in use: xhci_hcd
00:16.0 Communication controller [0780]: Intel Corporation 200 Series PCH CSME HECI #1 [8086:a2ba]
Subsystem: ASUSTeK Computer Inc. 200 Series PCH CSME HECI [1043:8694]
Kernel driver in use: mei_me
Kernel modules: mei_me
00:17.0 SATA controller [0106]: Intel Corporation 200 Series PCH SATA controller [AHCI mode] [8086:a282]
Subsystem: ASUSTeK Computer Inc. 200 Series PCH SATA controller [AHCI mode] [1043:8694]
Kernel driver in use: ahci
Kernel modules: ahci
00:1b.0 PCI bridge [0604]: Intel Corporation 200 Series PCH PCI Express Root Port #17 [8086:a2e7] (rev f0)
Kernel driver in use: pcieport
Kernel modules: shpchp
00:1b.4 PCI bridge [0604]: Intel Corporation 200 Series PCH PCI Express Root Port #21 [8086:a2eb] (rev f0)
Kernel driver in use: pcieport
Kernel modules: shpchp
00:1c.0 PCI bridge [0604]: Intel Corporation 200 Series PCH PCI Express Root Port #1 [8086:a290] (rev f0)
Kernel driver in use: pcieport
Kernel modules: shpchp
00:1c.1 PCI bridge [0604]: Intel Corporation 200 Series PCH PCI Express Root Port #2 [8086:a291] (rev f0)
Kernel driver in use: pcieport
Kernel modules: shpchp
00:1c.4 PCI bridge [0604]: Intel Corporation 200 Series PCH PCI Express Root Port #5 [8086:a294] (rev f0)
Kernel driver in use: pcieport
Kernel modules: shpchp
00:1c.6 PCI bridge [0604]: Intel Corporation 200 Series PCH PCI Express Root Port #7 [8086:a296] (rev f0)
Kernel driver in use: pcieport
Kernel modules: shpchp
00:1d.0 PCI bridge [0604]: Intel Corporation 200 Series PCH PCI Express Root Port #9 [8086:a298] (rev f0)
Kernel driver in use: pcieport
Kernel modules: shpchp
00:1f.0 ISA bridge [0601]: Intel Corporation Device [8086:a2c9]
Subsystem: ASUSTeK Computer Inc. Device [1043:8694]
00:1f.2 Memory controller [0580]: Intel Corporation 200 Series PCH PMC [8086:a2a1]
Subsystem: ASUSTeK Computer Inc. 200 Series PCH PMC [1043:8694]
00:1f.4 SMBus [0c05]: Intel Corporation 200 Series PCH SMBus Controller [8086:a2a3]
Subsystem: ASUSTeK Computer Inc. 200 Series PCH SMBus Controller [1043:8694]
Kernel driver in use: i801_smbus
Kernel modules: i2c_i801
00:1f.6 Ethernet controller [0200]: Intel Corporation Ethernet Connection (2) I219-V [8086:15b8]
Subsystem: ASUSTeK Computer Inc. Ethernet Connection (2) I219-V [1043:8672]
Kernel driver in use: e1000e
Kernel modules: e1000e
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] [10de:1b06] (rev a1)
Subsystem: Gigabyte Technology Co., Ltd GP102 [GeForce GTX 1080 Ti] [1458:3751]
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
01:00.1 Audio device [0403]: NVIDIA Corporation GP102 HDMI Audio Controller [10de:10ef] (rev a1)
Subsystem: Gigabyte Technology Co., Ltd GP102 HDMI Audio Controller [1458:3751]
Kernel driver in use: snd_hda_intel
Kernel modules: snd_hda_intel
02:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] [10de:1b06] (rev a1)
Subsystem: ZOTAC International (MCO) Ltd. GP102 [GeForce GTX 1080 Ti] [19da:4471]
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
02:00.1 Audio device [0403]: NVIDIA Corporation GP102 HDMI Audio Controller [10de:10ef] (rev a1)
Subsystem: ZOTAC International (MCO) Ltd. GP102 HDMI Audio Controller [19da:4471]
Kernel driver in use: snd_hda_intel
Kernel modules: snd_hda_intel
04:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] [10de:1b06] (rev a1)
Subsystem: Gigabyte Technology Co., Ltd GP102 [GeForce GTX 1080 Ti] [1458:3751]
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
04:00.1 Audio device [0403]: NVIDIA Corporation GP102 HDMI Audio Controller [10de:10ef] (rev a1)
Subsystem: Gigabyte Technology Co., Ltd GP102 HDMI Audio Controller [1458:3751]
Kernel driver in use: snd_hda_intel
Kernel modules: snd_hda_intel
05:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] [10de:1b06] (rev a1)
Subsystem: Gigabyte Technology Co., Ltd GP102 [GeForce GTX 1080 Ti] [1458:3751]
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
05:00.1 Audio device [0403]: NVIDIA Corporation GP102 HDMI Audio Controller [10de:10ef] (rev a1)
Subsystem: Gigabyte Technology Co., Ltd GP102 HDMI Audio Controller [1458:3751]
Kernel driver in use: snd_hda_intel
Kernel modules: snd_hda_intel
06:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] [10de:1b06] (rev a1)
Subsystem: ZOTAC International (MCO) Ltd. GP102 [GeForce GTX 1080 Ti] [19da:2471]
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
06:00.1 Audio device [0403]: NVIDIA Corporation GP102 HDMI Audio Controller [10de:10ef] (rev a1)
Subsystem: ZOTAC International (MCO) Ltd. GP102 HDMI Audio Controller [19da:2471]
Kernel driver in use: snd_hda_intel
Kernel modules: snd_hda_intel
07:00.0 USB controller [0c03]: ASMedia Technology Inc. Device [1b21:2142]
Subsystem: ASUSTeK Computer Inc. Device [1043:8756]
Kernel driver in use: xhci_hcd
08:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] [10de:1b06] (rev a1)
Subsystem: ZOTAC International (MCO) Ltd. GP102 [GeForce GTX 1080 Ti] [19da:2471]
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
08:00.1 Audio device [0403]: NVIDIA Corporation GP102 HDMI Audio Controller [10de:10ef] (rev a1)
Subsystem: ZOTAC International (MCO) Ltd. GP102 HDMI Audio Controller [19da:2471]
Kernel driver in use: snd_hda_intel
Kernel modules: snd_hda_intel

@nan0s7
Owner

nan0s7 commented Dec 23, 2018

Yeah, I think that may be the issue: not having a display enabled. You should be able to run a faux (dummy) display if you're into that, which may convince NVIDIA enough that you can use my script.

There may be another way around this, where I could try controlling the fans without the use of nvidia-settings. However, I don't know how long it'd take to get that working... :P

A few threads I've found say that you need an X display running on each GPU for nvidia-settings to work.

This may help point you in the right direction:
https://devtalk.nvidia.com/default/topic/1024489/nvidia-settings-on-headless-server/
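
As a starting point for the "X display on each GPU" idea, the sketch below prints an nvidia-xconfig command that asks for an X screen on every GPU, allows X to start with no monitor attached, and sets the Coolbits bit for manual fan control (value 4). The flags are real nvidia-xconfig options, but whether this combination is enough on a given headless rig is an assumption, and the function name `headless_xconfig_cmd` is hypothetical. The command is only printed, because running it rewrites the X configuration and needs root.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) an nvidia-xconfig invocation.
#   $1 = Coolbits value; 4 is the bit for manual fan control.
headless_xconfig_cmd() {
    printf '%s\n' "nvidia-xconfig --enable-all-gpus --allow-empty-initial-configuration --cool-bits=$1"
}

# Show the command so it can be reviewed and then run manually with sudo.
headless_xconfig_cmd 4
```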

@rajamarwah
Author

Lol, I followed that thread myself just an hour ago and managed to get manual control over the NVIDIA GPUs, but I guess I can't use your script for now (which is a pity). Thanks again for all the help and guidance.

@nan0s7
Owner

nan0s7 commented Dec 23, 2018

No problem! Hope you get things how you would like them :)

I'll keep this issue open to remind myself to look into fan control without nvidia-settings to see if it's possible. If it is, it shouldn't be too hard to add in. :D

@dnovischi

Running this script manually works as expected; however, it seems it can't be used as a service because of the error shown in the following logs:

>$ systemctl --user status nfancurve.service
● nfancurve.service - Nfancurve service
Loaded: loaded (/etc/systemd/user/nfancurve.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Vi 2019-11-15 18:33:22 EET; 13s ago
Process: 14194 ExecStart=/bin/sh /opt/nvidia-fan-control/temp.sh (code=exited, status=1/FAILURE)
Main PID: 14194 (code=exited, status=1/FAILURE)

>$ journalctl _PID=14194
-- Logs begin at Vi 2019-11-15 16:22:07 EET, end at Vi 2019-11-15 18:34:46 EET. --
nov 15 18:34:28 dan-pc sh[14194]: ################################################################################
nov 15 18:34:28 dan-pc sh[14194]: # nan0s7's script for automatically managing GPU fan speed #
nov 15 18:34:28 dan-pc sh[14194]: ################################################################################
nov 15 18:34:28 dan-pc sh[14194]: Configuration file: /opt/nvidia-fan-control/config
nov 15 18:34:28 dan-pc sh[14194]: Failed to connect to Mir: Failed to connect to server socket: No such file or directory
nov 15 18:34:28 dan-pc sh[14194]: Unable to init server: Could not connect: Connection refused
nov 15 18:34:28 dan-pc sh[14194]: ERROR: The control display is undefined; please run nvidia-settings
nov 15 18:34:28 dan-pc sh[14194]: --help for usage information.
nov 15 18:34:28 dan-pc sh[14194]: No Fans detected

>$ nvidia-settings -q screens
1 X Screen on dan-pc:0

[0] dan-pc:0.0 (GeForce GTX 1080)

  Has the following name:
    SCREEN-0

PS: I have tried various modifications of the service file, all give the same error.

@nan0s7
Owner

nan0s7 commented Nov 18, 2019

Yeah this isn't a problem with the service file itself, but the way the script is run. By default, NVIDIA should set the display to ":0", but I guess since we're running it from another program, there's no default display set. You can fix this manually in your service file by adding a parameter to the execution of the script: -d 0. Not sure if it needs a colon (:0) though.

I was thinking of adding an option to set this via the config file... so I guess this is a good reason to put it in there! :P
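
Until that exists, another option might be to hand the display to the service through its environment instead of the script arguments. The sketch below writes a systemd drop-in containing an `Environment=DISPLAY=...` line. The function name `write_display_dropin` and the file name display.conf are hypothetical, and I haven't tested this with the script; it only shows the shape of the file.

```shell
#!/bin/sh
# Hypothetical helper: write a drop-in that sets DISPLAY for a systemd unit.
#   $1 = drop-in directory, e.g. ~/.config/systemd/user/nfancurve.service.d
#   $2 = X display to use, e.g. :0
write_display_dropin() {
    mkdir -p "$1" || return 1
    printf '[Service]\nEnvironment=DISPLAY=%s\n' "$2" > "$1/display.conf"
}

# Example: write into a scratch directory and show the result.
dropin_dir=$(mktemp -d)
write_display_dropin "$dropin_dir" ":0"
cat "$dropin_dir/display.conf"
```

If the drop-in is placed in the unit's real `.d` directory, `systemctl --user daemon-reload` followed by a restart of the service should pick it up.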

@cj360

cj360 commented Aug 22, 2020

I think I'm having a similar issue, with the systemctl user service not starting at boot due to:

Aug 21 15:46:18 danam4 sh[33353]: Unable to init server: Could not connect: Connection refused
Aug 21 15:46:18 danam4 sh[33353]: ERROR: The control display is undefined; please run nvidia-settings >
Aug 21 15:46:18 danam4 sh[10437]: Fan control set back to auto mode

Starting the script myself has no such issue. Should the -d 0 be in my .service file like:
ExecStart=/bin/sh /usr/bin/nfancurve -c -d 0 /etc/nfancurve.conf ?

@nan0s7
Owner

nan0s7 commented Aug 26, 2020

Sorry for the delay! Yeah, if you start the script manually with -d 0, then you would probably need the same arguments in the service file.

However, if you don't usually need to specify the display when running the script manually, this could be related to another currently open issue about the service file.

Try changing your service file to something like:

[Unit]
Description=Nfancurve service
After=graphical.target

[Service]
ExecStart=/bin/sh /usr/bin/nfancurve -c /etc/nfancurve.conf
KillSignal=SIGINT

[Install]
WantedBy=default.target

Let me know how that goes.
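
If ordering the unit alone turns out not to be enough, one possible workaround is to have the service wait until the X server's socket actually exists before starting the script. This is a sketch under assumptions: the function name `wait_for_x` is hypothetical, and it relies on the usual convention that display :N listens on /tmp/.X11-unix/XN.

```shell
#!/bin/sh
# Hypothetical helper: wait until the X11 socket for a display number exists.
#   $1 = display number (0 for :0)
#   $2 = maximum number of one-second retries
#   $3 = socket directory (optional, defaults to /tmp/.X11-unix)
wait_for_x() {
    num=$1
    tries=$2
    dir=${3:-/tmp/.X11-unix}
    while [ "$tries" -gt 0 ]; do
        [ -e "$dir/X$num" ] && return 0
        sleep 1
        tries=$((tries - 1))
    done
    # Final check after the last sleep; the result is the return status.
    [ -e "$dir/X$num" ]
}

# Possible use in a wrapper script started by the service:
#   wait_for_x 0 30 && exec /bin/sh /usr/bin/nfancurve -c /etc/nfancurve.conf
```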

@wojciechGaudnik

I have the same issue. When I test -d :0 and -d :1 directly from the command line, everything behaves as expected: :0 works and :1 doesn't.
My service file:
[Unit]
Description=Nfancurve service
After=default.target

[Service]
ExecStart=/bin/sh /usr/bin/nfancurve -l -d :0 -c /etc/nfancurve.conf
KillSignal=SIGINT

[Install]
WantedBy=default.target

When I run /bin/sh /usr/bin/nfancurve -l -d :0 -c /etc/nfancurve.conf from the console it works; the service doesn't.
Any suggestions are welcome.

@nan0s7
Owner

nan0s7 commented Feb 24, 2021

Are your logs the same as the above? What's the actual error?

@wojciechGaudnik

The error is exactly the same: Unable to init server. But I resolved my problem with:
ExecStart=xinit /opt/nfancurve/temp.sh 01:00.0 run_forever -- :1 -once
I don't need a monitor so for me it works.

@nan0s7
Owner

nan0s7 commented Mar 8, 2021

Interesting, I'll have to look into that.

@riaqn

riaqn commented Jan 6, 2022

Hello, is it possible to use this script without running Xorg on the card at all? I'm using the card for deep learning only.

@nan0s7
Owner

nan0s7 commented Jan 11, 2022

Good question. I personally haven't done any playing around with it so I am not sure. It just depends on whether you can get nvidia-settings to work without Xorg (or by using some sort of dummy display).

If you find anything or figure it out please let me know.

@Cabu

Cabu commented Dec 4, 2022

Like nan0s7, if you find out, I am interested too :)

@XinzeZhang

XinzeZhang commented Jan 4, 2023

I have the same issue when using the project remotely over SSH. The error is as follows:

$ sudo bash temp.sh
Configuration file: /home/xinze/Documents/Github/nfancurve/config
Unable to init server: Could not connect: Connection refused
ERROR: The control display is undefined; please run nvidia-settings --help for usage information.
No Fans detected

====
Finally, I found the reason and the solution to this problem. As pointed out in https://xinzezhang.github.io/2021/09/01/control-gpu.html, the NVIDIA control software generally requires being logged into the GUI desktop. To run temp.sh successfully, I simply supplement the command with the xauth credentials:

sudo DISPLAY=:0 XAUTHORITY=/run/user/110/gdm/Xauthority bash temp.sh

where 110 is the user ID of the 'gdm' user, obtained as described in the link above.
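
Since the gdm user ID can differ between machines, here is a small sketch that looks it up with `id -u gdm` and prints the matching command instead of hard-coding 110. The function name `gdm_xauthority_path` is hypothetical, and the /run/user/UID/gdm/Xauthority layout is simply the one from the command above; other display managers may keep the file elsewhere.

```shell
#!/bin/sh
# Hypothetical helper: build the Xauthority path used above for a given gdm user ID.
gdm_xauthority_path() {
    printf '/run/user/%s/gdm/Xauthority\n' "$1"
}

# Look up the gdm user's ID and print the command to run; nothing is executed.
if gdm_uid=$(id -u gdm 2>/dev/null); then
    printf 'sudo DISPLAY=:0 XAUTHORITY=%s bash temp.sh\n' "$(gdm_xauthority_path "$gdm_uid")"
else
    printf 'No gdm user found on this machine\n'
fi
```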
