Reliably repeating pytorch system crash/reboot when using imagenet examples #3022

Closed
castleguarders opened this issue Oct 8, 2017 · 75 comments


@castleguarders

So I have a 100% repeatable system crash (reboot) when trying to run the imagenet example (2012 dataset) with the resnet18 defaults. The crash seems to happen in Variable.py at torch.autograd.backward(...) (line 158).

I am able to run the basic mnist example successfully.

Setup: Ubuntu 16.04, 4.10.0-35-generic #39~16.04.1-Ubuntu SMP Wed Sep 13 09:02:42 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

python --version Python 3.6.2 :: Anaconda, Inc.

/usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176

nvidia-smi output.
Sat Oct 7 23:51:53 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81                 Driver Version: 384.81                     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:03:00.0  On |                  N/A |
| 14%   51C    P8    18W / 250W |    650MiB / 11170MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1335    G   /usr/lib/xorg/Xorg                            499MiB  |
|    0      2231    G   cinnamon                                       55MiB  |
|    0      3390    G   ...-token=C6DE372B6D9D4FCD6453869AF4C6B4E5     93MiB  |
+-----------------------------------------------------------------------------+

torch/vision was built locally on the machine from master. No issues at compile or install time, other than the normal compile time warnings...

Happy to help get further information..

@vadimkantorov
Contributor

I have experienced random system reboots before due to a motherboard/GPU incompatibility; it manifested during long training runs. Do other frameworks (e.g. Caffe) succeed in training on ImageNet?

@castleguarders
Author

Haven't tried that yet. However, I ran some long-running graphics benchmarks ;) with no problems. I could probably give other frameworks a shot; what's your recommendation, Caffe?

Do keep in mind that the crash I reported happens practically immediately (the mnist-cuda example runs to completion many times without an issue), so I doubt that it's a hardware incompatibility issue.

@apaszke
Contributor

apaszke commented Oct 8, 2017

Can you try triggering the crash once again and see if anything relevant is printed in /var/log/dmesg.0 or /var/log/kern.log?

@castleguarders
Author

Zero entries related to this in either dmesg or kern.log. The machine makes an audible click and resets, so I think it's hardware registers or memory being twiddled in a way it doesn't like, with no real notice for the kernel to log anything. It reboots at the same line of code each time, at least the few times I stepped through it.

@apaszke
Contributor

apaszke commented Oct 10, 2017

That's weird. To be honest I don't have any good ideas for debugging such issues. My guess would be that it's some kind of a hardware problem, but I don't really know.

@soumith
Member

soumith commented Oct 10, 2017

I also think it's definitely a hardware issue, whether at the nvidia driver level or a BIOS/hardware failure.
I'm closing the issue, as there's no action to be taken on the pytorch project side.

@soumith soumith closed this as completed Oct 10, 2017
@castleguarders
Author

For future reference, the issue was due to the steep power ramp of the 1080 Ti triggering the server power supply's over-voltage protection. Only some pytorch examples caused it to show up.

@yurymalkov

@castleguarders Have you figured out how to solve this issue? It seems that even a 1200W "platinum" power supply is not enough for just 2x 1080 Ti; it reboots from time to time.

@pmcrodrigues

@castleguarders I am having similar issues; how did you find out that that was the problem?

@castleguarders
Author

@pmcrodrigues There was an audible click whenever the issue happened. I used nvidia-smi to soft-limit the power draw; this let the tests run a bit longer, but they tripped the protection anyway. I switched to an 825W Delta power supply and it took care of the issue fully. FurMark makes easy work of testing this if you run Windows. I ran it fully pegged for a couple of days while driving the CPUs to 100% with a different script. Zero issues since then.

@yurymalkov I only have 1x 1080 Ti; I didn't dare to put in a second one.

@yurymalkov

@pmcrodrigues @castleguarders
I've also "solved" the problem by feeding the second GPU from a separate PSU (1000W + 1200W for 2x 1080 Ti). Reducing the power draw by 0.5x via nvidia-smi -pl also helped, but it killed the performance. I also tried different motherboards/GPUs, but that didn't help.

@pmcrodrigues

@castleguarders @yurymalkov Thank you both. I also tried reducing the power draw via nvidia-smi and it stopped crashing the system. But stress tests at full power draw, run simultaneously on my 2 Xeons (with http://people.seas.harvard.edu/~apw/stress/) and the 4 1080 Tis (with https://github.com/wilicc/gpu-burn), didn't make it crash. So far I have only seen this problem with pytorch. Maybe I need other stress tests?

@yurymalkov

@pmcrodrigues gpuburn seems to be a bad test for this, as it does not create steep power ramps.
I.e. a machine could pass gpuburn with 4 GPUs, but fail with 2 GPUs running a pytorch script.

The problem reproduces on some other frameworks (e.g. tensorflow), but it seems that pytorch scripts are the best test, probably because of their highly synchronous nature.

@gurkirt

gurkirt commented May 2, 2018

I am having the same issue. Has anybody found any soft solution to this?
I have a 4-GPU system with one CPU and a 1500W power supply. Using 3 out of 4 GPUs, or all 4, causes the reboot.
@castleguarders @yurymalkov @pmcrodrigues How do you reduce the power draw via nvidia-smi?

@pmcrodrigues

@gurkirt For now, I am only using 2 GPUs with my 1500W PSU. If you want to test reducing the power draw, you can use "nvidia-smi -pl X", where X is the new power limit. For my GTX 1080 Ti I used "nvidia-smi -pl 150", whereas the standard draw is 250W. I am waiting on a more potent PSU to test whether it solves the problem. I currently have a measuring device on the wall socket, and even when I am using 4 GPUs the draw does not pass 1000W. There could still be some sharp peaks that are not being registered, but something is off. Either way, we probably need to go with dual 1500W PSUs.
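For anyone looking for the exact commands, here is a minimal sketch (the 150 W value is just an example; query your own card's limits first):

# show the current, default, and maximum board power limits
nvidia-smi --query-gpu=power.limit,power.default_limit,power.max_limit --format=csv

# cap the board power at 150 W (needs root; resets on reboot)
sudo nvidia-smi -pl 150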

@gurkirt

gurkirt commented May 2, 2018

@pmcrodrigues thanks a lot for the quick response. I have another system with a 2000W PSU and 4 1080 Tis; that works just fine. I will try plugging that power supply into this machine and see if 2000W is enough here.

@gurkirt

gurkirt commented May 2, 2018

@pmcrodrigues did you find any log/warning/crash report anywhere?

@pmcrodrigues

@gurkirt None.

@lukepfister

I’m having a similar problem: an audible click, then a complete system shutdown.

It seems to occur only with BatchNorm layers in place. Does that match your experience?

@gurkirt

gurkirt commented Aug 8, 2018

I was using resnet at that time. It is a problem of an inadequate power supply, i.e. a hardware problem, and I needed to upgrade the power supply. According to my searches online, the power surge shows up particularly with pytorch. I upgraded the power supply from 1500W to 1600W. The problem still appears now and then, but only when the room temperature is a bit higher. I think there are two factors at play: room temperature, and, the major one, the power supply.

@dov

dov commented Aug 16, 2018

I have the same problem with a 550W power supply and a GTX 1070 graphics card. I start the training and about a second later the power cuts out.

But this got me thinking that perhaps it would be possible to trick/convince the PSU that everything is ok by creating a ramp-up function that e.g. mixes sleeps and gpu activity and gradually increases the load. Has anyone tried this? Does someone have minimal code that reliably triggers the power cut?
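For what it's worth, here is an untested sketch of that ramp-up idea using only nvidia-smi: start the job under a low GPU power cap and raise the cap in steps, so the PSU never sees the full load step at once (train.py and the wattage steps are placeholders for your own script and card):

# start well below the card's rated power (values are examples for a 250 W card)
sudo nvidia-smi -pl 150

# launch the training script in the background (hypothetical script name)
python train.py &
TRAIN_PID=$!

# raise the power cap in steps instead of letting the load jump straight to full draw
for pl in 175 200 225 250; do
    sleep 30
    sudo nvidia-smi -pl $pl
done

wait $TRAIN_PID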

@vwvolodya

Had the same issue with a GTX 1070, but the reboots were not random.
I had code that was able to make my PC reboot every time I ran it, after at most 1 epoch.
At first I thought it could be the PSU, since mine is only 500W. However, after closer investigation, and even after setting the max power consumption to lower values with nvidia-smi, I realized the issue was somewhere else.
It was not an overheating problem either, so I started to think that it might be because of the i7-7820X Turbo mode. After disabling Turbo mode in the BIOS settings of my Asus X299-A and changing Ubuntu's configuration as stated here, the issue seems to be gone.

What did NOT work:

  • Changing pin_memory for dataloaders.
  • Playing with batch size.
  • Increasing system shared memory limits.
  • Setting nvidia-smi -pl 150 out of 195 possible for my system.

Not sure if this is related to BIOS issues. I am running version 1203 while the latest is 3 releases ahead -- 1503 -- and they put

improved stability

into the description of each of those 3 releases (Asus X299-A BIOS versions). One of those releases also had

Updated Intel CPU microcode.

So there is a chance this is fixed in a newer BIOS.

@dov

dov commented Sep 6, 2018

For the record, my problem was a broken power supply. I diagnosed this by running https://github.com/wilicc/gpu-burn on Linux and then FurMark on Windows, under the assumption that unless I could reproduce the crash on Windows, they wouldn't talk to me at my computer shop. Both of these tests failed for me, whereupon I took the computer in for repair and got a new power supply. Since then, I have been running pytorch for hours without any crashes.

@DanielLongo

Has anyone found a way to fix this? I have a similar error where my computer restarts shortly after I start training. I have a 750W PSU and only 1 GPU (1080 Ti), so I don't think it is a power problem. Also, I did not see increased wattage going to my GPU before it restarted.

@yaynouche

If I can add some more information to vwvolodya's great comment: our motherboard/CPU configuration was an ASUS TUF X299 MARK 2 with an i9-7920X. The BIOS version was 1401. The only thing that could prevent the system from rebooting/shutting down was to turn off Turbo Mode.

For now, after updating to 1503, the problem seems to be solved with Turbo Mode activated.

Have a great day guys!

@zym1010
Contributor

zym1010 commented Jan 20, 2019

@yaynouche @vwvolodya similar issues happened on an ASUS WS-X299 SAGE with an i9-9920X. Turning off Turbo Mode is the only solution right now, with the latest BIOS (version 0905, which officially supports the i9-9920X).

UPDATE: it turns out I must enable turbo mode in the BIOS and use commands like echo "1" > /sys/devices/system/cpu/intel_pstate/no_turbo as in #3022 (comment) to disable the turbo via software. If I disable turbo mode in the BIOS, the machine will still reboot.

UPDATE 2: I think turning off Turbo Mode can only lower the chance of my issue, not eliminate it.
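For reference, the turbo toggles mentioned above can be flipped from userspace like this (a sketch: the first path assumes the intel_pstate driver, the second the acpi-cpufreq driver, and neither setting persists across reboots):

# check which cpufreq driver is in use
cat /sys/devices/system/cpu/cpufreq/policy0/scaling_driver

# intel_pstate: 1 disables Turbo Boost, 0 re-enables it
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo

# acpi-cpufreq: 0 disables boost, 1 re-enables it
echo 0 | sudo tee /sys/devices/system/cpu/cpufreq/boost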

@deerleo

deerleo commented Apr 7, 2019

Facing the same problem: 4 GTX 1080 Tis with a 1600W PSU (with redundancy). Tried gpu-burn to test it and it's stable as a rock.

@zym1010
Contributor

zym1010 commented Apr 7, 2019

@Suley personally I think this is more of a CPU problem; basically, pytorch makes the CPU execute a series of instructions that draws too much power from the motherboard.

@qwesdfok

I am not sure whether what I'm facing is the same problem. My computer uses a 1080 Ti, and if the GPU memory usage is close to 100%, i.e. almost 11GB, it will reboot. But if I reduce the batch size of the network to decrease the memory usage, the reboot does not happen, without any power supply upgrade. If you run into the reboot problem, I hope my case might help you.

@alpErenSari

I face the same problem with a 1080 Ti and a 450W PSU, and tried to reduce power consumption with "sudo nvidia-smi -pl X" as a temporary solution. However, this did not work on the first try. After that, I noticed that if I limit the power consumption first and run "nvidia-smi -lms 50" in another terminal to watch the power and memory usage of the GPU just before starting the training, then I can train the network without problems. I'm waiting for a new PSU right now as a permanent solution.
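A minimal way to watch power and memory while reproducing the crash (both flags are standard nvidia-smi options; the log file name is just an example):

# sample power draw, memory use and temperature every 50 ms, saving to a log
nvidia-smi --query-gpu=timestamp,power.draw,memory.used,temperature.gpu \
           --format=csv -lms 50 | tee gpu_power_log.csv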

@Caselles

I too had this issue and was able to reproduce it with a Pytorch script without using any GPUs (CPU only). So I agree with @zym1010: for me it's a CPU issue. I updated my BIOS (ASUS WS X299 SAGE LGA 2066 Intel X299) and it seems to have stopped the issue from happening. However, considering the comments in this thread, I'm not entirely sure the issue is fixed...

@soumith Don't you think Pytorch contributors should look into this issue rather than just closing it? Pytorch seems to stress the GPU/CPU in a way GPU/CPU stress tests do not. This is not expected behaviour, and the problem affects many people. It seems like a rather interesting issue as well!

@zym1010
Contributor

zym1010 commented Jul 16, 2019

@Caselles are you referring to BIOS version 1001? I saw it some time ago on the ASUS website, but it seems to have been taken down.

@Caselles

The BIOS I installed is this one: "WS X299 SAGE Formal BIOS 0905 Release".

@yurymalkov

In my experience, this issue comes with different Thermaltake PSUs. In the last case, changing the PSU from Thermaltake platinum 1500W to Corsair HX1200 solved the problem on a two-2080Ti setup.

@pengyu965

I have this issue with both CPU and GPU, meaning the reboot happens even when I physically remove the GPU and only train the network on the CPU without using a dataloader.

My power supply is an EVGA 850W Gold, CPU: i7-8700K, GPU: GTX 1080 Ti (just 1).

I also have an ECO switch on my power supply; if I switch it to "on", it happens more often.

Just like others said, the stress tests on both CPU and GPU pass.

So, to summarize:

  1. Reboots happen even when training only on the CPU, even after I removed the GPU physically.
  2. Turning on the ECO switch on the PSU results in more frequent reboots.
  3. i7-8700K + GTX 1080 Ti on an 850W power supply.
  4. It only appears while using Pytorch, even without a Dataloader.

@cognitiveRobot

cognitiveRobot commented Sep 9, 2019

My hardware details:

Motherboard: Asus WS X299 SAGE/10G 
CPU: Intel Core i9-9900X
GPU: Geforce RTX2080 TI - 11GB (4 of them)
Power supply: Masterwatt Maker - 1500Watts

BIOS version: 0905, then updated to 1201.
Turbo enabled in the BIOS and then set to 1 in /sys/devices/system/cpu/intel_pstate/no_turbo.
Tried other combinations too.

Tested using https://github.com/wilicc/gpu-burn. All gpus are ok.

Whenever I train maskrcnn_resnet50_fpn on the COCO dataset using 4 GPUs with batch size 4, the system reboots immediately. But when I use 3 GPUs with batch size 4, or 4 GPUs with batch size 2, it trains fine.

What could be the reason? The power supply?
I am dying to solve this. I appreciate your comments.
Thanks in advance,
Zulfi

@jeroneandrews-sony

I also have this issue using 4 x Geforce RTX 2080 Ti - 11GB and a 1600W EVGA SuperNOVA Platinum PSU (I also tried swapping the PSU for a 1600W EVGA SuperNOVA Gold PSU), and the issue still occurs when using PyTorch with the 4 GPUs.

@yaynouche

From my experience, reboots often occur when nvidia-persistenced is not installed and running.
Link: https://docs.nvidia.com/deploy/driver-persistence/index.html

Updating the BIOS is also a crucial part of the solution. Hope it helps.

Best regards,

Yassine
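For anyone trying this, a short sketch (the systemd unit name assumes your driver package installed one; enabling persistence with -pm alone does not survive a reboot):

# enable persistence mode on all GPUs for the current boot
sudo nvidia-smi -pm 1

# or enable the persistence daemon permanently, if the driver shipped its service unit
sudo systemctl enable --now nvidia-persistenced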

@jeroneandrews-sony

@gurkirt what are your other system specs?

I also have 4 x RTX 2080 Tis and a Corsair 1600i PSU, but my PC still shuts down after a while when using all 4 GPUs.

@sjdrc

sjdrc commented Mar 9, 2020

Hey, just FYI, I was experiencing this issue on multiple machines (all X299 with multiple 2080 Tis), and after trying 4 different PSUs the Corsair AX1600i is the only one with which I did not encounter reboots.

@theairbend3r

I have the same issue.
Machine config - Lenovo y540, RTX 2060, Ubuntu 18.04. I tried training a simple binary image classification model (4 conv layers with batchnorm). The model trained for 20 epochs (batch size = 8) and then my laptop shut down.

Output of nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.50       Driver Version: 430.50       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 2060    Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   47C    P8     3W /  N/A |     10MiB /  5934MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

The following is the log from just before the system crashed, I think. I found it with cat /var/log/kern.log.

Mar 10 17:05:01 maverick kernel: [    9.279289] audit: type=1400 audit(1583840101.525:10): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/snapd/snap-confine//mount-namespace-capture-helper" pid=837 comm="apparmor_parser"
Mar 10 17:05:01 maverick kernel: [    9.280042] audit: type=1400 audit(1583840101.529:11): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/sbin/dhclient" pid=828 comm="apparmor_parser"
Mar 10 17:05:01 maverick kernel: [    9.325087] intel_rapl_common: Found RAPL domain package
Mar 10 17:05:01 maverick kernel: [    9.325092] intel_rapl_common: Found RAPL domain core
Mar 10 17:05:01 maverick kernel: [    9.325096] intel_rapl_common: Found RAPL domain uncore
Mar 10 17:05:01 maverick kernel: [    9.325100] intel_rapl_common: Found RAPL domain dram
Mar 10 17:05:01 maverick kernel: [    9.355748] input: HDA Intel PCH Mic as /devices/pci0000:00/0000:00:1f.3/sound/card0/input13
Mar 10 17:05:01 maverick kernel: [    9.355987] input: HDA Intel PCH Headphone as /devices/pci0000:00/0000:00:1f.3/sound/card0/input14
Mar 10 17:05:01 maverick kernel: [    9.356199] input: HDA Intel PCH HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:1f.3/sound/card0/input15
Mar 10 17:05:01 maverick kernel: [    9.356895] input: HDA Intel PCH HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:1f.3/sound/card0/input16
Mar 10 17:05:01 maverick kernel: [    9.357074] input: HDA Intel PCH HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:1f.3/sound/card0/input17
Mar 10 17:05:01 maverick kernel: [    9.357296] input: HDA Intel PCH HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:1f.3/sound/card0/input18
Mar 10 17:05:01 maverick kernel: [    9.357497] input: HDA Intel PCH HDMI/DP,pcm=10 as /devices/pci0000:00/0000:00:1f.3/sound/card0/input19
Mar 10 17:05:01 maverick kernel: [    9.432866] dw-apb-uart.2: ttyS4 at MMIO 0x8f802000 (irq = 20, base_baud = 115200) is a 16550A
Mar 10 17:05:01 maverick kernel: [    9.434397] iwlwifi 0000:00:14.3 wlp0s20f3: renamed from wlan0
Mar 10 17:05:01 maverick kernel: [    9.445610] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  430.50  Thu Sep  5 22:39:50 CDT 2019
Mar 10 17:05:01 maverick kernel: [    9.575171] nvidia-uvm: Loaded the UVM driver in 8 mode, major device number 234
Mar 10 17:05:01 maverick kernel: [    9.623512] Bluetooth: BNEP (Ethernet Emulation) ver 1.3
Mar 10 17:05:01 maverick kernel: [    9.623516] Bluetooth: BNEP filters: protocol multicast
Mar 10 17:05:01 maverick kernel: [    9.623525] Bluetooth: BNEP socket layer initialized
Mar 10 17:05:01 maverick kernel: [    9.664785] input: MSFT0001:01 06CB:CD5F Touchpad as /devices/pci0000:00/0000:00:15.1/i2c_designware.1/i2c-2/i2c-MSFT0001:01/0018:06CB:CD5F.0003/input/input24
Mar 10 17:05:01 maverick kernel: [    9.665154] hid-multitouch 0018:06CB:CD5F.0003: input,hidraw2: I2C HID v1.00 Mouse [MSFT0001:01 06CB:CD5F] on i2c-MSFT0001:01
Mar 10 17:05:01 maverick kernel: [    9.669632] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card1/input20
Mar 10 17:05:01 maverick kernel: [    9.669880] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card1/input21
Mar 10 17:05:01 maverick kernel: [    9.669932] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card1/input22
Mar 10 17:05:02 maverick kernel: [    9.767641] ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20190703/nsarguments-66)
Mar 10 17:05:02 maverick kernel: [   10.035982] Generic Realtek PHY r8169-700:00: attached PHY driver [Generic Realtek PHY] (mii_bus:phy_addr=r8169-700:00, irq=IGNORE)
Mar 10 17:05:02 maverick kernel: [   10.149333] r8169 0000:07:00.0 enp7s0: Link is Down
Mar 10 17:05:02 maverick kernel: [   10.179246] iwlwifi 0000:00:14.3: Applying debug destination EXTERNAL_DRAM
Mar 10 17:05:02 maverick kernel: [   10.296096] iwlwifi 0000:00:14.3: Applying debug destination EXTERNAL_DRAM
Mar 10 17:05:02 maverick kernel: [   10.361833] iwlwifi 0000:00:14.3: FW already configured (0) - re-configuring
Mar 10 17:05:02 maverick kernel: [   10.374304] iwlwifi 0000:00:14.3: BIOS contains WGDS but no WRDS
Mar 10 17:05:02 maverick kernel: [   10.378535] Bluetooth: hci0: Waiting for firmware download to complete
Mar 10 17:05:02 maverick kernel: [   10.379322] Bluetooth: hci0: Firmware loaded in 1598306 usecs
Mar 10 17:05:02 maverick kernel: [   10.379451] Bluetooth: hci0: Waiting for device to boot
Mar 10 17:05:02 maverick kernel: [   10.392359] Bluetooth: hci0: Device booted in 12671 usecs
Mar 10 17:05:02 maverick kernel: [   10.395240] Bluetooth: hci0: Found Intel DDC parameters: intel/ibt-17-16-1.ddc
Mar 10 17:05:02 maverick kernel: [   10.398388] Bluetooth: hci0: Applying Intel DDC parameters completed
Mar 10 17:05:03 maverick kernel: [   11.148057] nvidia-uvm: Unloaded the UVM driver in 8 mode
Mar 10 17:05:03 maverick kernel: [   11.171826] nvidia-modeset: Unloading
Mar 10 17:05:03 maverick kernel: [   11.219065] nvidia-nvlink: Unregistered the Nvlink Core, major device number 237
Mar 10 17:05:04 maverick kernel: [   12.125832] nvidia-nvlink: Nvlink Core is being initialized, major device number 237
Mar 10 17:05:04 maverick kernel: [   12.127484] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=none:owns=none
Mar 10 17:05:04 maverick kernel: [   12.175644] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  430.50  Thu Sep  5 22:36:31 CDT 2019
Mar 10 17:05:05 maverick kernel: [   13.205291] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  430.50  Thu Sep  5 22:39:50 CDT 2019
Mar 10 17:05:05 maverick kernel: [   13.250663] nvidia-uvm: Loaded the UVM driver in 8 mode, major device number 234
Mar 10 17:05:06 maverick kernel: [   13.986003] wlp0s20f3: authenticate with 58:c1:7a:1b:bd:d0
Mar 10 17:05:06 maverick kernel: [   13.994385] wlp0s20f3: send auth to 58:c1:7a:1b:bd:d0 (try 1/3)
Mar 10 17:05:06 maverick kernel: [   14.047103] iwlwifi 0000:00:14.3: Unhandled alg: 0x707
Mar 10 17:05:06 maverick kernel: [   14.063692] wlp0s20f3: authenticated
Mar 10 17:05:06 maverick kernel: [   14.068040] wlp0s20f3: associate with 58:c1:7a:1b:bd:d0 (try 1/3)
Mar 10 17:05:06 maverick kernel: [   14.097924] wlp0s20f3: RX AssocResp from 58:c1:7a:1b:bd:d0 (capab=0x431 status=0 aid=4)
Mar 10 17:05:06 maverick kernel: [   14.143288] iwlwifi 0000:00:14.3: Unhandled alg: 0x707
Mar 10 17:05:06 maverick kernel: [   14.177499] wlp0s20f3: associated
Mar 10 17:05:06 maverick kernel: [   14.296025] IPv6: ADDRCONF(NETDEV_CHANGE): wlp0s20f3: link becomes ready
Mar 10 17:05:08 maverick kernel: [   16.376337] bpfilter: Loaded bpfilter_umh pid 1511
Mar 10 17:05:18 maverick kernel: [   26.325876] Bluetooth: RFCOMM TTY layer initialized
Mar 10 17:05:18 maverick kernel: [   26.325884] Bluetooth: RFCOMM socket layer initialized
Mar 10 17:05:18 maverick kernel: [   26.325892] Bluetooth: RFCOMM ver 1.11
Mar 10 17:05:19 maverick kernel: [   27.169380] rfkill: input handler disabled
Mar 10 17:08:10 maverick kernel: [  198.039283] ucsi_ccg 0-0008: failed to reset PPM!
Mar 10 17:08:10 maverick kernel: [  198.039292] ucsi_ccg 0-0008: PPM init failed (-110)
Mar 10 17:10:11 maverick kernel: [  319.690728] mce: CPU11: Core temperature above threshold, cpu clock throttled (total events = 75)
Mar 10 17:10:11 maverick kernel: [  319.690729] mce: CPU5: Core temperature above threshold, cpu clock throttled (total events = 75)
Mar 10 17:10:11 maverick kernel: [  319.690730] mce: CPU11: Package temperature above threshold, cpu clock throttled (total events = 290)
Mar 10 17:10:11 maverick kernel: [  319.690730] mce: CPU5: Package temperature above threshold, cpu clock throttled (total events = 290)
Mar 10 17:10:11 maverick kernel: [  319.690772] mce: CPU6: Package temperature above threshold, cpu clock throttled (total events = 290)
Mar 10 17:10:11 maverick kernel: [  319.690773] mce: CPU1: Package temperature above threshold, cpu clock throttled (total events = 290)
Mar 10 17:10:11 maverick kernel: [  319.690774] mce: CPU0: Package temperature above threshold, cpu clock throttled (total events = 290)
Mar 10 17:10:11 maverick kernel: [  319.690775] mce: CPU3: Package temperature above threshold, cpu clock throttled (total events = 290)
Mar 10 17:10:11 maverick kernel: [  319.690776] mce: CPU7: Package temperature above threshold, cpu clock throttled (total events = 290)
Mar 10 17:10:11 maverick kernel: [  319.690777] mce: CPU9: Package temperature above threshold, cpu clock throttled (total events = 290)
Mar 10 17:10:11 maverick kernel: [  319.690778] mce: CPU4: Package temperature above threshold, cpu clock throttled (total events = 290)
Mar 10 17:10:11 maverick kernel: [  319.690779] mce: CPU2: Package temperature above threshold, cpu clock throttled (total events = 290)
Mar 10 17:10:11 maverick kernel: [  319.690780] mce: CPU10: Package temperature above threshold, cpu clock throttled (total events = 290)
Mar 10 17:10:11 maverick kernel: [  319.690781] mce: CPU8: Package temperature above threshold, cpu clock throttled (total events = 290)
Mar 10 17:10:11 maverick kernel: [  319.691710] mce: CPU5: Core temperature/speed normal
Mar 10 17:10:11 maverick kernel: [  319.691713] mce: CPU11: Core temperature/speed normal
Mar 10 17:10:11 maverick kernel: [  319.691716] mce: CPU11: Package temperature/speed normal
Mar 10 17:10:11 maverick kernel: [  319.691717] mce: CPU5: Package temperature/speed normal
Mar 10 17:10:11 maverick kernel: [  319.691777] mce: CPU0: Package temperature/speed normal
Mar 10 17:10:11 maverick kernel: [  319.691781] mce: CPU7: Package temperature/speed normal
Mar 10 17:10:11 maverick kernel: [  319.691783] mce: CPU6: Package temperature/speed normal
Mar 10 17:10:11 maverick kernel: [  319.691787] mce: CPU2: Package temperature/speed normal
Mar 10 17:10:11 maverick kernel: [  319.691790] mce: CPU1: Package temperature/speed normal
Mar 10 17:10:11 maverick kernel: [  319.691793] mce: CPU8: Package temperature/speed normal
Mar 10 17:10:11 maverick kernel: [  319.691798] mce: CPU10: Package temperature/speed normal
Mar 10 17:10:11 maverick kernel: [  319.691800] mce: CPU4: Package temperature/speed normal
Mar 10 17:10:11 maverick kernel: [  319.691804] mce: CPU3: Package temperature/speed normal
Mar 10 17:10:11 maverick kernel: [  319.691807] mce: CPU9: Package temperature/speed normal
Mar 10 17:13:35 maverick kernel: [  523.048575] wlp0s20f3: authenticate with 58:c1:7a:1b:bd:d0
Mar 10 17:13:35 maverick kernel: [  523.055288] wlp0s20f3: send auth to 58:c1:7a:1b:bd:d0 (try 1/3)
Mar 10 17:13:35 maverick kernel: [  523.097819] wlp0s20f3: authenticated
Mar 10 17:13:35 maverick kernel: [  523.099819] wlp0s20f3: associate with 58:c1:7a:1b:bd:d0 (try 1/3)
Mar 10 17:13:35 maverick kernel: [  523.107873] wlp0s20f3: RX AssocResp from 58:c1:7a:1b:bd:d0 (capab=0x431 status=0 aid=1)
Mar 10 17:13:35 maverick kernel: [  523.109523] iwlwifi 0000:00:14.3: Unhandled alg: 0x707
Mar 10 17:13:35 maverick kernel: [  523.110798] wlp0s20f3: associated
Mar 10 17:13:35 maverick kernel: [  523.119975] IPv6: ADDRCONF(NETDEV_CHANGE): wlp0s20f3: link becomes ready

How can I stop this from happening again, i.e. keep training with pytorch without crashing my system?

@sjdrc

sjdrc commented Mar 10, 2020

@theairbend3r I'm not sure if you're having the same issue as the one here. As I understand it, when starting training with torch, the GPUs and CPU(s) ramp up so quickly that they can exceed the normal power draw and trigger the overload protection on the PSU. I was always experiencing this before the first epoch ended.

Sorry I don't have any more useful suggestions for you.

@sdsy888

sdsy888 commented Mar 17, 2020

Several possible solutions (not sure if any one of them fixes the problem on its own):

  • BIOS version: I followed the discussion above and updated my BIOS from 3501 to 4001 (Asus X99-E WS/USB3.1); problem solved.
  • Setting the Nvidia GPU fan: I changed the GPU fan speed to reduce the risk of high temperatures that could cause an emergency shutdown/reboot.
  • Lowering num_workers from 12 to 4 (the max number of cores on my server is 12).
  • Insufficient power from the power supply. In my case, switching from HDD to SSD sped up the whole pipeline, which put too much extra pressure on the power supply.

@soupault

Faced this issue with 2x 2080 Ti on multiple PCs with platinum 1000W and 1200W PSUs. They worked fine when using only 1 GPU, but not 2. Solved by upgrading the PSU to 1600W.

@yueqiw

yueqiw commented Jun 27, 2020

Had the same issue with a 2080 Ti on a 750W G2 Gold PSU. Solved after changing the PSU to a 1600W P2.

@joe807191330

joe807191330 commented Aug 21, 2020

It worked when I used nvidia-persistenced, but the computer still rebooted after a while.

@escyes

escyes commented May 29, 2021

Deactivating Intel Turbo Boost worked for me, which seems to indicate that it is a CPU problem. When I monitored the temperature of the CPU cores before deactivating boost, it rose very quickly to 60 degrees; now it stays under 50.
It happens to me with an MNIST example at batch 7.

@escyes

escyes commented May 29, 2021

Only in pytorch; it's fine in tensorflow.

@ShoufaChen

I solved a similar problem (RTX 3090 GPU) by limiting the GPU power (from 350W to 250W):

# enable persistence mode
sudo nvidia-smi -pm 1

# limit power from 350W to 250W on GPUs 0 through 3
sudo nvidia-smi -i 0,1,2,3 -pl 250

@lfcnassif

lfcnassif commented Sep 16, 2022

Just to let those with the same issue know:

We had this frequent-reboot issue on a machine with 2 RTX 3090s + a 1000W PSU. Initially we suspected it was related to the PSU, as reported by many here, because running inferences with just 1 RTX worked. With both GPUs it used to reboot, and it only worked after decreasing the GPUs' power from 370W to 200W using the nvidia-smi -pl 200 command; a 250W limit wasn't enough.

But then we experimented with another machine with just one RTX 3090 + a 1400W PSU, and the reboot happened almost every time while running inferences. Because a previous post here suggested CPU turbo boost could be an issue (disabling it didn't work for us), we suspected the GPU overclocking/boost could be the culprit. We saw the RTX 3090 clock reaching ~1900MHz while running FurMark stress tests, so we limited it with nvidia-smi -lgc minGpuClock,maxGpuClock, using 1395,1740 according to the RTX 3090 specs. That fixed all reboots on both machines completely, including the one with two RTX 3090s, without replacing any PSUs or limiting GPU power consumption.

I hope this helps others.
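For reference, a sketch of that clock-locking approach (1395,1740 are the values used above; check your own card's supported clocks first):

# list the clock combinations the card supports
nvidia-smi -q -d SUPPORTED_CLOCKS

# lock the GPU clock to a min,max range in MHz (needs root)
sudo nvidia-smi -lgc 1395,1740

# undo the lock later
sudo nvidia-smi -rgc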

@javierbg

according to RTX 3090 specs

@lfcnassif Which specs are you following, exactly?

@legel

legel commented Dec 18, 2023

Been trying to run 2 RTX 3090's on a brand new Corsair 1500W PSU... with many sudden shutdowns. I'm thinking there's no way I don't have enough watts, right?

I was worried it could be a temperature issue, so was obsessively logging temps for all devices, and running fans at max speed, right up to shutdown. GPUs were a "cool" ~60 C -- not enough to warrant causing the shutdown.

Thanks to @lfcnassif -- indeed the trick was simply running:
sudo nvidia-smi -lgc 1395,1740

(where the -lgc sets min, max clock speeds)

It turns out that max clock speed is not much lower than the ~1900 MHz that the GPU would naturally run at, but basically I think we're just preventing sudden synchronous large bursts of computational hunger...

And after training a run for a long time now, I'm posting "confirmation" that this works. :)

@PhillipGH

I just had this issue on my Windows 11 PC. Weirdly, training worked fine, but about once in every ten predictions my model made, the PC would suddenly turn off. Like a lot of people here with this problem, I have an ASUS motherboard, so I tried going into the BIOS and turning off turbo mode, and that instantly fixed the issue for me.

@konrad0101

konrad0101 commented Mar 27, 2024

I have similar hardware to @legel, specifically 2 x 3090s and a Corsair HX1500i, and experienced my Epyc Rome machine shutting down on PyTorch workloads.

The solution was to replace the Y-splitter cable going from the PSU to the GPUs with separate cables (each 3090 has two 8-pin power connectors, and Corsair provides one cable that covers 16 pins via a Y-splitter). Change that 16-pin cable to two 8-pin cables. For me, it worked immediately, without needing to throttle the clock speeds of the GPUs.
