Sometimes assigning new temperature show me "Unknown Error" #23

Open
saippuakauppias opened this issue Nov 26, 2019 · 14 comments

Comments

@saippuakauppias

saippuakauppias commented Nov 26, 2019

I run

DISPLAY=:0 XAUTHORITY=/run/user/120/gdm/Xauthority sh temp.sh

and get:

Configuration file: /home/sks/nfancurve/config
Number of Fans detected: 1
Number of GPUs detected: 1

  Attribute 'GPUFanControlState' (vidserv1:0[gpu:0]) assigned value 1.

Started process for 1 GPU and 1 Fan

  Attribute 'GPUTargetFanSpeed' (vidserv1:0[fan:0]) assigned value 25.

But after one or several hours (at completely random intervals) I get this error:

ERROR: Error assigning value 40 to attribute 'GPUTargetFanSpeed' (vidserv1:0[fan:0]) as specified in assignment
'[fan:0]/GPUTargetFanSpeed=40' (Unknown Error).

I don't understand whether this is a bug or whether my configuration is wrong.

I am now testing a simple patch (maybe it will be useful for you or anyone else):

I changed: https://github.com/nan0s7/nfancurve/blob/v019.2/temp.sh#L84-L86

to:

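set_speed() {
        # Quote the assignment so the shell cannot glob-expand the brackets.
        if ! $gpu_cmd -a "[fan:$fan]/GPUTargetFanSpeed=$cur_spd" $display; then
                echo 'error setting fan speed, trying to fix it'
                # Re-enable manual fan control before the next attempt.
                set_fan_control "$num_gpus_loop" "1"
        fi
}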
@nan0s7
Owner

nan0s7 commented Nov 28, 2019

Hmm yeah it doesn't look like it's a problem with the script itself, but I can't be sure.

You can try setting the DISPLAY variable via the script directly; so like temp.sh -d ":0", which sets the display when using a command instead of a global variable.

If that doesn't work, let me see the output of the script when you do the above command and with the extra log option of -l.

Let me know how that goes.

@saippuakauppias
Author

saippuakauppias commented Nov 29, 2019

My patch from the first post did not work :(


Log without the display variable set:

Unable to init server: Could not connect: Connection refused

ERROR: The control display is undefined; please run `nvidia-settings --help` for usage information.

temp.sh: 116: [: Illegal number:

Log with the display variable set:

 t=60 ot=55 td=5 s=7 gpu=0 fan=0 cd=6 nsp=55 osp=0 maxt=75 mint=25 otl=2
 t=60 ot=55 td=5 s=7 gpu=0 fan=0 cd=6 nsp=55 osp=0 maxt=75 mint=25 otl=2
 t=60 ot=55 td=5 s=7 gpu=0 fan=0 cd=6 nsp=55 osp=0 maxt=75 mint=25 otl=2


ERROR: Error assigning value 70 to attribute 'GPUTargetFanSpeed' (vidserv1:0[fan:0]) as specified in assignment
       '[fan:0]/GPUTargetFanSpeed=70' (Unknown Error).


 t=61 ot=55 td=0 s=7 gpu=0 fan=0 cd=6 nsp=70 osp=70 maxt=75 mint=25 otl=3
 t=61 ot=61 td=0 s=7 gpu=0 fan=0 cd=6 nsp=70 osp=0 maxt=75 mint=25 otl=3
 t=60 ot=61 td=1 s=7 gpu=0 fan=0 cd=6 nsp=70 osp=0 maxt=75 mint=25 otl=3

and

 t=63 ot=61 td=2 s=7 gpu=0 fan=0 cd=6 nsp=70 osp=0 maxt=75 mint=25 otl=3
 t=64 ot=61 td=3 s=7 gpu=0 fan=0 cd=6 nsp=70 osp=0 maxt=75 mint=25 otl=3
 t=63 ot=61 td=2 s=7 gpu=0 fan=0 cd=6 nsp=70 osp=0 maxt=75 mint=25 otl=3

(nvidia-settings:1704): dbind-WARNING **: 21:33:47.450: Error retrieving accessibility bus address: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
 t=64 ot=61 td=3 s=7 gpu=0 fan=0 cd=6 nsp=70 osp=0 maxt=75 mint=25 otl=3
 t=63 ot=61 td=2 s=7 gpu=0 fan=0 cd=6 nsp=70 osp=0 maxt=75 mint=25 otl=3
 t=63 ot=61 td=2 s=7 gpu=0 fan=0 cd=6 nsp=70 osp=0 maxt=75 mint=25 otl=3

and

 t=28 ot=30 td=2 s=7 gpu=0 fan=0 cd=6 nsp=25 osp=25 maxt=75 mint=25 otl=0
 t=28 ot=30 td=2 s=7 gpu=0 fan=0 cd=6 nsp=25 osp=25 maxt=75 mint=25 otl=0
 t=28 ot=30 td=2 s=7 gpu=0 fan=0 cd=6 nsp=25 osp=25 maxt=75 mint=25 otl=0
Unable to init server: Could not connect: Connection refused

ERROR: Unable to find display on any available system


ERROR: Unable to find display on any available system

temp.sh: 116: [: Illegal number:
 t= ot=30 td=2 s=7 gpu=0 fan=0 cd=6 nsp=25 osp=25 maxt=75 mint=25 otl=0
 t=32 ot=30 td=2 s=7 gpu=0 fan=0 cd=6 nsp=25 osp=25 maxt=75 mint=25 otl=0
 t=33 ot=30 td=3 s=7 gpu=0 fan=0 cd=6 nsp=25 osp=25 maxt=75 mint=25 otl=0

and

 t=64 ot=64 td=1 s=7 gpu=0 fan=0 cd=6 nsp=70 osp=70 maxt=75 mint=25 otl=3
 t=64 ot=64 td=1 s=7 gpu=0 fan=0 cd=6 nsp=70 osp=70 maxt=75 mint=25 otl=3
 t=64 ot=64 td=1 s=7 gpu=0 fan=0 cd=6 nsp=70 osp=70 maxt=75 mint=25 otl=3
Gdk-Message: 18:34:33.761: nvidia-settings: Fatal IO error 0 (Success) on X server :0.

temp.sh: 116: [: Illegal number:
 t= ot=64 td=1 s=7 gpu=0 fan=0 cd=6 nsp=70 osp=70 maxt=75 mint=25 otl=3
 t=64 ot=64 td=1 s=7 gpu=0 fan=0 cd=6 nsp=70 osp=70 maxt=75 mint=25 otl=3
 t=64 ot=64 td=1 s=7 gpu=0 fan=0 cd=6 nsp=70 osp=70 maxt=75 mint=25 otl=3
 t=65 ot=64 td=1 s=7 gpu=0 fan=0 cd=6 nsp=70 osp=70 maxt=75 mint=25 otl=3

@nan0s7
Owner

nan0s7 commented Nov 30, 2019

Yeah, that's definitely not an issue with my script. What kind of configuration are you working with? I notice that you're connecting remotely; if you're using some sort of headless setup, I'm not sure nvidia-settings supports such a configuration. Although I see you have GDK messages in the log, so I can't be certain. Take the part of the log that says:
Unable to init server: Could not connect: Connection refused
ERROR: Unable to find display on any available system
This indicates that somewhere along the line the GPU doesn't have an active display attached.

There are ways to force a display to be detected (even when there isn't a physical display detected) but that's outside of my field of expertise, aside from just connecting an old display or getting a fake display adapter.

The strange part of this is that it works sometimes. I guess that's only when you have a remote connection to the client active. Perhaps it has something to do with whatever power saving settings you have?

When the script reports the Illegal number error at line 116, it means the nvidia-settings command that reads the GPU temperature returned nothing (note the empty t= in your log), so the script was left comparing an empty value as a number.
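
A minimal sketch of a guard against that failure, assuming the temperature is read by running a command and capturing its output. The helper names is_uint and read_temp are hypothetical and not part of nfancurve; the idea is only to reject an empty or non-numeric reading before it reaches a numeric test such as the one at line 116.

```shell
#!/bin/sh
# Hypothetical helper (not part of nfancurve): succeeds only if $1 is a
# non-empty string made up entirely of decimal digits.
is_uint() {
        case "$1" in
                ''|*[!0-9]*) return 1 ;;
                *) return 0 ;;
        esac
}

# Hypothetical wrapper: run the given query command and print its output
# only if it is a valid number. Returns non-zero otherwise, so the caller
# can skip the iteration instead of comparing an empty string.
read_temp() {
        rt_value=$("$@" 2>/dev/null) || return 1
        is_uint "$rt_value" || return 1
        printf '%s\n' "$rt_value"
}
```

A loop iteration could call read_temp on its temperature query (for example nvidia-settings -q "[gpu:0]/GPUCoreTemp" -t, which is an assumption about what the script runs there) and skip the iteration when it fails, rather than printing Illegal number.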

@saippuakauppias
Author

saippuakauppias commented Dec 12, 2019

Yes, I use this server without a display. And, unfortunately, there is no way to connect something there.

With logging enabled and the display specified in the command, things became noticeably better, but problems still arise sometimes.

Do you know of a fully reliable way to avoid such errors so that fan control does not stop?

The fact is that I use your project on a server to train neural networks. I know a lot of people who do the same thing but just set the fans to maximum, although that is a questionable approach. I think they could run into the same problem if they used your solution ...

PS: This problem occurs even when I am disconnected from the remote server. I'm also not sure the problem is with the power-saving settings, because the rest of the processes are working properly.

@nan0s7
Owner

nan0s7 commented Dec 13, 2019

I found something that may help, so please try this and get back to me with the results. If it works I'll add it to my script so it works automatically.

With your original command DISPLAY=:0 XAUTHORITY=/run/user/120/gdm/Xauthority sh temp.sh, are there any other display files in that /run/user/120/gdm/ folder? Or is there perhaps something like /var/run/gdm/root/:0 ?

If not, have a look at this:
https://virtualgl.org/Documentation/HeadlessNV
I don't know if this works on every Nvidia GPU though.

Otherwise, I may need to know more about your setup to help, if there is even a solution. It does seem that it is possible to run nvidia-settings without a connected display though, it just depends on what programs you're using.

@saippuakauppias
Author

saippuakauppias commented Dec 20, 2019

root@vidserv1:~# ls -la /run/user/120/gdm/
total 4
drwx--x--x 2 gdm gdm  60 Dec 20 13:39 .
drwx------ 9 gdm gdm 200 Dec 20 14:19 ..
-rwx------ 1 gdm gdm 104 Dec 20 14:19 Xauthority

root@vidserv1:~# ls -la /var/run/gdm/root/:0
ls: cannot access '/var/run/gdm/root/:0': No such file or directory

root@vidserv1:~# ls -la /var/run/gdm3
gdm3/     gdm3.pid

root@vidserv1:~# ls -la /var/run/gdm
ls: cannot access '/var/run/gdm': No such file or directory

root@vidserv1:~# ls -la /var/run/gdm3
total 0
drwx--x--x  3 root gdm    60 Dec 19 19:58 .
drwxr-xr-x 32 root root 1120 Dec 20 14:12 ..
drwx------  2 gdm  gdm    40 Dec 19 19:58 greeter

I found the line DISPLAY=:0 XAUTHORITY=/run/user/120/gdm/Xauthority via:

  1. https://askubuntu.com/questions/967955/ubuntu-17-10-on-wayland-how-can-i-install-the-nvidia-drivers

  2. https://devtalk.nvidia.com/default/topic/1032741/linux/tuning-nvidia-settings-over-ssh-error/post/5254249/#5254249

Now:

root@vidserv1:~# ps a |grep X
 3906 tty1     Sl+    0:00 /usr/lib/xorg/Xorg vt1 -displayfd 3 -auth /run/user/120/gdm/Xauthority -background none -noreset -keeptty -verbose 3
  1. https://gist.github.com/lucidyan/4359b5973e5c3cee818595734c0ab7a9#gistcomment-2794677

root@vidserv1:~# nvidia-xconfig --query-gpu-info
Number of GPUs: 1

GPU #0:
  Name      : GeForce GTX 1080 Ti
  UUID      : GPU-62ed4e67-ea04-7a5d-153f-98841a97819f
  PCI BusID : PCI:1:0:0

  Number of Display Devices: 0

My setup is:

i3-9100F (without processor graphics) + ASUS Prime Z390-A + nVidia GTX 1080Ti

Ubuntu 18.04 with NVIDIA driver version 430.5 installed on the host.

I run nvidia-docker with container tensorflow/tensorflow:1.14.0-gpu-py3 and start training there (inside).

I run nfancurve on the host system (not in Docker).

@saippuakauppias
Author

I have not found a perfect solution; maybe you have an idea?

I now launch nfancurve from cron every 5 minutes with the following script:

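#!/bin/bash

# Find the PIDs of any running nfancurve instances.
PID=$(ps aux | grep nfancurve/temp.sh | grep -v grep | awk '{print $2}')

if [ -n "${PID}" ]
then
        # Left unquoted on purpose so several PIDs become separate arguments.
        kill -9 ${PID}
        sleep 5
fi

# Start a fresh instance with the display set explicitly and logging enabled.
DISPLAY=:0 XAUTHORITY=/run/user/120/gdm/Xauthority sh /home/fullusr/nfancurve/temp.sh -d ":0" -l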

But this seems like a poor solution to me.

@nan0s7
Owner

nan0s7 commented Mar 14, 2020

The fact that this works for you is quite interesting. If you like, I could make a "careful" mode, where the script checks that everything is still set as it was at the start of the program whenever it wants to change the fan speed. That may work, but it would require some back and forth testing between us. :)
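
As a rough sketch of what such a "careful" mode might look like, assuming the only state that needs restoring is GPUFanControlState: the function below re-asserts manual fan control before every speed change. The names careful_set_speed, nv_cmd and nv_display are made up for illustration and do not exist in temp.sh.

```shell
#!/bin/sh
# Hypothetical settings: point nv_cmd at a stub to dry-run without a GPU.
nv_cmd=${nv_cmd:-nvidia-settings}
nv_display=${nv_display:-:0}

# careful_set_speed GPU FAN SPEED
careful_set_speed() {
        cs_gpu=$1
        cs_fan=$2
        cs_speed=$3
        # Re-assert manual fan control first, in case the driver or the
        # X server reset it since the script started.
        "$nv_cmd" -c "$nv_display" -a "[gpu:$cs_gpu]/GPUFanControlState=1" >/dev/null || return 1
        # Only then request the new target speed.
        "$nv_cmd" -c "$nv_display" -a "[fan:$cs_fan]/GPUTargetFanSpeed=$cs_speed" >/dev/null
}
```

The cost is one extra nvidia-settings call per speed change, which should be negligible next to the polling interval.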

Edit: In fact, doing some reading about the nvidia-docker program, there may be something to alter. I'm reading through the documentation now.

Edit 2: Have you tried running the script from the docker command? I don't know if this could be a problem you're aware of, but https://github.com/NVIDIA/nvidia-docker/wiki/Frequently-Asked-Questions#why-is-nvidia-smi-inside-the-container-not-listing-the-running-processes
For example from the main page under "Usage" there's this example:
docker run --gpus all nvidia/cuda:10.0-base nvidia-smi
could you use a similar thing?
docker run --gpus all <image> sh /home/fullusr/nfancurve/temp.sh -d ":0" -l
May have to use the suggested --pid=host...
I'm not 100% sure what this actually does but hopefully I'll find more information soon.

@saippuakauppias
Author

saippuakauppias commented Mar 17, 2020

> I could make a "careful" mode

That would be great and interesting!

> Have you tried running the script from the docker command? I don't know if this could be a problem you're aware of, but...

I did not know about that; it all looks strange...

> For example from the main page under "Usage" there's this example

I tested 10.0-base, 10.0-runtime, 10.0-cudnn7-runtime, 10.0-devel and 10.0-cudnn7-devel and they all throw an error:

docker run --gpus all -v /home/fullusr/nfancurve/:/nf nvidia/cuda:10.0-cudnn7-devel sh /nf/temp.sh -d ":0" -l

Configuration file: /nf/config
/nf/temp.sh: 75: /nf/temp.sh: nvidia-settings: not found
No Fans detected

Adding --pid="host" to the command changes nothing.

But running docker run --gpus all nvidia/cuda:10.0-base nvidia-smi command works great (hmm.. but the process list is not displayed).

@nan0s7
Copy link
Owner

nan0s7 commented Mar 19, 2020

Hmm, I was reading through some of the NVIDIA documentation; could the errors you see with my script happen when whatever training you're doing finishes, and the trainer then resets the GPU or something like that?

Reading through our conversation a few times over it seems like after you get an error, most of the time the script continues on and works fine until the next error. Is that correct?

If so, I will work in a patch that will try and prevent the script from failing when you experience a problem with the GPU as you've told me.
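
One way such a patch could be sketched, purely as an assumption about the approach: wrap the failing command in a small retry helper so that a transient nvidia-settings error does not leave the fan stuck at the old speed. The retry function and its arguments are hypothetical.

```shell
#!/bin/sh
# retry MAX DELAY CMD [ARGS...]
# Hypothetical helper: run CMD until it succeeds, at most MAX times,
# sleeping DELAY seconds between attempts.
retry() {
        r_max=$1
        r_delay=$2
        shift 2
        r_try=1
        while ! "$@"; do
                if [ "$r_try" -ge "$r_max" ]; then
                        echo "giving up after $r_try attempts: $*" >&2
                        return 1
                fi
                r_try=$((r_try + 1))
                sleep "$r_delay"
        done
        return 0
}
```

For example, retry 3 2 nvidia-settings -a "[fan:0]/GPUTargetFanSpeed=40" would try up to three times, two seconds apart, before reporting failure.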

At the beginning, you mention that you know people who just set the fan speed to maximum when doing any training. Do you know how they do that? Is it a similar method to what my script uses? (like nvidia-settings -a [fan:0]/GPUTargetFanSpeed=100)

This information should help when I make the patch. Hope you're not too affected by the virus going around though! :)

@saippuakauppias
Author

The crontab task does not work properly, unfortunately.
Most likely, memory leaks in the neural network training cause the graphical interface (GNOME? Xorg?), which nfancurve needs in order to work, to crash.

When I try to manually start the task, I see:

Unable to init server: Could not connect: Connection refused
ERROR: Unable to find display on any available system

ERROR: Unable to find display on any available system

No Fans detected


> trainer resets the GPU or something like that?

To be honest, I don't know for sure, but I do reset the TensorFlow session; I use that to try to avoid frequent video memory leaks (though this cannot be fully fixed).

> Reading through our conversation a few times over it seems like after you get an error, most of the time the script continues on and works fine until the next error. Is that correct?

It used to be like that (what happens now is what I described at the very beginning).
Previously, the script tried to set the fan speed again and again, but it did not change because of the errors, and only restarting temp.sh solved the problem.

> Do you know how they do that?

Yes, here are their instructions:

First run:
sudo nvidia-xconfig -a --cool-bits=31 --allow-empty-initial-configuration --enable-all-gpus --separate-x-screens
Reboot, and start Xorg if it is not already running. Then run the following for each video card:
sudo DISPLAY=:0 XAUTHORITY=/var/run/lightdm/root/:0 nvidia-settings -a "[gpu:0]/GPUFanControlState=1"
sudo DISPLAY=:0 XAUTHORITY=/var/run/lightdm/root/:0 nvidia-settings -a "[fan:0]/GPUTargetFanSpeed=85"
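
The per-card step could be looped rather than typed once per card. A minimal sketch, assuming GPU N is cooled by fan N (which does not hold for every card) and that DISPLAY and XAUTHORITY are already exported as in the commands above; set_all_fans and nv_cmd are hypothetical names.

```shell
#!/bin/sh
# Hypothetical setting: point nv_cmd at a stub to dry-run without a GPU.
nv_cmd=${nv_cmd:-nvidia-settings}

# set_all_fans GPU_COUNT SPEED
# Enables manual fan control on each GPU and sets one fixed target speed.
# Assumes fan index == GPU index, which is only an illustration.
set_all_fans() {
        sf_count=$1
        sf_speed=$2
        sf_i=0
        while [ "$sf_i" -lt "$sf_count" ]; do
                "$nv_cmd" -a "[gpu:$sf_i]/GPUFanControlState=1" || return 1
                "$nv_cmd" -a "[fan:$sf_i]/GPUTargetFanSpeed=$sf_speed" || return 1
                sf_i=$((sf_i + 1))
        done
        return 0
}
```

Calling set_all_fans 2 85 would apply the two assignments to gpu:0/fan:0 and gpu:1/fan:1 and stop at the first failure.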

> Hope you're not too affected by the virus going around though! :)

I'm okay, thanks! I hope everything is calm with you too...
Sorry for my bad English.

@nan0s7
Owner

nan0s7 commented Mar 20, 2020

What do you mean by manually starting the task? Just in the shell?

I see that they use sudo for whatever reason. Have you tried running my script with sudo before? I doubt it'd do much but it might help with some of the errors.

I'll see if I can do some stuff to the script.

@saippuakauppias
Author

saippuakauppias commented Mar 20, 2020

> What do you mean by manually starting the task? Just in the shell?

Yes :)

> I see that they use sudo for whatever reason. Have you tried running my script with sudo before? I doubt it'd do much but it might help with some of the errors.

I run it as the root user.

PS: Regarding the crashes of the graphical shell (xorg/gdm): I think I should test automatically restarting this service when it crashes (for example, using monit).
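
A monit rule would be one option; the same check can also be sketched in plain shell. This assumes the X server listens on the usual Unix socket under /tmp/.X11-unix and that the display manager is the gdm3 systemd unit (a guess based on the /var/run/gdm3 directory listed earlier). Both function names are hypothetical.

```shell
#!/bin/sh
# x_display_up DISPLAY [SOCKET_DIR]
# Succeeds if the local X server socket for DISPLAY (for example ":0")
# exists. SOCKET_DIR defaults to /tmp/.X11-unix.
x_display_up() {
        xd_num=${1#:}          # ":0.0" -> "0.0"
        xd_num=${xd_num%%.*}   # "0.0"  -> "0"
        [ -S "${2:-/tmp/.X11-unix}/X$xd_num" ]
}

# ensure_x DISPLAY [SOCKET_DIR]
# Hypothetical watchdog step: restart the display manager when the X
# socket is gone. The unit name gdm3 is an assumption.
ensure_x() {
        x_display_up "$1" "$2" && return 0
        echo "X display $1 is down, restarting display manager" >&2
        systemctl restart gdm3
}
```

Run from cron as root ahead of the nfancurve restart, ensure_x :0 would only touch the display manager when the socket is actually missing.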

@nan0s7
Owner

nan0s7 commented Mar 23, 2020

Sorry I've been quite busy; I assume you've tried running it without root?

I'll let you know when I get the patch to try out :)
