GCP tutorial suggests using T4 GPU to save costs, but fails when using T4 GPU #51

Open
Sohojoe opened this issue Feb 14, 2019 · 10 comments

@Sohojoe

Sohojoe commented Feb 14, 2019

Update: GCP tutorial suggests using T4 GPU to save costs, but fails when using T4 GPU (error below)


Hi, I am following the tutorial "Training an Obstacle Tower agent using Dopamine and the Google Cloud Platform".

I am getting the following error - I believe the problem is (EE) NVIDIA(GPU-0): UseDisplayDevice "None" is not supported with GRID - but I'm not sure of the root cause.

I was trying to use the T4 GPU to save $$ - I will try again with the default GPU

image

after typing

sudo /usr/bin/X :0 &
export DISPLAY=:0

I get this error

X.Org X Server 1.19.2
Release Date: 2017-03-02
X Protocol Version 11, Revision 0
Build Operating System: Linux 4.9.0-8-amd64 x86_64 Debian
Current Operating System: Linux tensorflow-1-vm 4.9.0-8-amd64 #1 SMP Debian 4.9.130-2 (2018-10-27) x86_64
Kernel command line: BOOT_IMAGE=/boot/vmlinuz-4.9.0-8-amd64 root=UUID=995b3d50-0ab0-4faa-8296-ab743ab0fde7 ro net.ifnames=0 biosdevname=0 console=ttyS0,38400n8 elevator=noop scsi_mod.use_blk_mq=Y
Build Date: 03 November 2018  03:09:11AM
xorg-server 2:1.19.2-1+deb9u5 (https://www.debian.org/support) 
Current version of pixman: 0.34.0
	Before reporting problems, check http://wiki.x.org
	to make sure that you have the latest version.
Markers: (--) probed, (**) from config file, (==) default setting,
	(++) from command line, (!!) notice, (II) informational,
	(WW) warning, (EE) error, (NI) not implemented, (??) unknown.
(==) Log file: "/var/log/Xorg.0.log", Time: Thu Feb 14 01:06:15 2019
(==) Using config file: "/etc/X11/xorg.conf"
(==) Using system config directory "/usr/share/X11/xorg.conf.d"
(EE) 
Fatal server error:
(EE) no screens found(EE) 

/var/log/Xorg.0.log

[   385.871] (II) Module "ramdac" already built-in
[   385.877] (**) NVIDIA(0): Depth 24, (--) framebuffer bpp 32
[   385.877] (==) NVIDIA(0): RGB weight 888
[   385.877] (==) NVIDIA(0): Default visual is TrueColor
[   385.877] (==) NVIDIA(0): Using gamma correction (1.0, 1.0, 1.0)
[   385.877] (**) NVIDIA(0): Option "UseDisplayDevice" "None"
[   385.877] (**) NVIDIA(0): Enabling 2D acceleration
[   385.877] (**) NVIDIA(0): Option "UseDisplayDevice" set to "none"; enabling NoScanout
[   385.877] (**) NVIDIA(0):     mode
[   385.877] (II) Loading sub module "glxserver_nvidia"
[   385.877] (II) LoadModule: "glxserver_nvidia"
[   385.877] (II) Loading /usr/lib/xorg/modules/extensions/libglxserver_nvidia.so
[   385.882] (II) Module glxserver_nvidia: vendor="NVIDIA Corporation"
[   385.882]    compiled for 4.0.2, module version = 1.0.0
[   385.882]    Module class: X.Org Server Extension
[   385.882] (II) NVIDIA GLX Module  410.72  Wed Oct 17 20:11:21 CDT 2018
[   386.482] (EE) NVIDIA(GPU-0): UseDisplayDevice "None" is not supported with GRID
[   386.482] (EE) NVIDIA(GPU-0):     displayless
[   386.482] (EE) NVIDIA(GPU-0): Failed to select a display subsystem.
[   386.563] (EE) NVIDIA(0): Failing initialization of X screen 0
[   386.563] (II) UnloadModule: "nvidia"
[   386.563] (II) UnloadSubModule: "glxserver_nvidia"
[   386.563] (II) Unloading glxserver_nvidia
[   386.563] (II) UnloadSubModule: "wfb"
[   386.563] (II) UnloadSubModule: "fb"
[   386.563] (EE) Screen(s) found, but none have a usable configuration.
[   386.563] (EE)
Fatal server error:
[   386.563] (EE) no screens found(EE)
[   386.563] (EE)
Please consult the The X.Org Foundation support
         at http://wiki.x.org
 for help.
[   386.563] (EE) Please also check the log file at "/var/log/Xorg.0.log" for additional information.
[   386.563] (EE)
[   386.564] (EE) Server terminated with error (1). Closing log file.
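
When reading a long Xorg log like the one above, it can help to filter on the (EE) marker that the server's own legend defines as "error". The following is a minimal sketch rather than part of the tutorial: the helper name `xorg_errors` is hypothetical, and the default path is simply the log file named in the output above.

```shell
#!/bin/sh
# Print only the error-level lines of an Xorg log, i.e. those carrying "(EE)".
# xorg_errors is a hypothetical helper name; the default path is the log
# file reported in this issue.
xorg_errors() {
    grep -F '(EE)' "${1:-/var/log/Xorg.0.log}"
}

# Demo on a few lines copied from the log in this issue.
sample=$(mktemp)
cat > "$sample" <<'EOF'
[   385.877] (**) NVIDIA(0): Option "UseDisplayDevice" "None"
[   386.482] (EE) NVIDIA(GPU-0): UseDisplayDevice "None" is not supported with GRID
[   386.482] (EE) NVIDIA(GPU-0):     displayless
[   386.563] (II) UnloadModule: "nvidia"
EOF
xorg_errors "$sample"
rm -f "$sample"
```

Called without an argument, the function reads /var/log/Xorg.0.log, which may need elevated permissions on some systems.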
@Sohojoe
Author

Sohojoe commented Feb 14, 2019

OK, the problem is with the T4 GPU: I've been able to get it running with the default GPU.

It would be good to figure this out, as the T4 is a third of the price.

@Sohojoe changed the title from "'no screens found' error when following GCP tutorial" to "GCP suggests using T4 GPU to save costs, but fails when using T4 GPU" on Feb 14, 2019
@Sohojoe changed the title from "GCP suggests using T4 GPU to save costs, but fails when using T4 GPU" to "GCP tutorial suggests using T4 GPU to save costs, but fails when using T4 GPU" on Feb 14, 2019
@awjuliani
Contributor

@ervteng Do you know about using different GPUs in this scenario?

@awjuliani awjuliani self-assigned this Feb 14, 2019
@ervteng
Contributor

ervteng commented Feb 19, 2019

I've been able to use both T4 and P4 GPUs for training Unity environments (including Obstacle Tower). @Sohojoe do you have the /etc/X11/xorg.conf for the problematic machine?

@Sohojoe
Author

Sohojoe commented Feb 20, 2019

here you go:

# nvidia-xconfig: X configuration file generated by nvidia-xconfig
# nvidia-xconfig:  version 410.72


Section "ServerLayout"
    Identifier     "Layout0"
    Screen      0  "Screen0"
    InputDevice    "Keyboard0" "CoreKeyboard"
    InputDevice    "Mouse0" "CorePointer"
EndSection

Section "Files"
EndSection

Section "InputDevice"

    # generated from default
    Identifier     "Mouse0"
    Driver         "mouse"
    Option         "Protocol" "auto"
    Option         "Device" "/dev/psaux"
    Option         "Emulate3Buttons" "no"
    Option         "ZAxisMapping" "4 5"
EndSection

Section "InputDevice"

    # generated from default
    Identifier     "Keyboard0"
    Driver         "kbd"
EndSection

Section "Monitor"
    Identifier     "Monitor0"
    VendorName     "Unknown"
    ModelName      "Unknown"
    HorizSync       28.0 - 33.0
    VertRefresh     43.0 - 72.0
    Option         "DPMS"
EndSection

Section "Device"
    Identifier     "Device0"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
    BoardName      "Tesla T4"
    BusID          "PCI:0:4:0"
EndSection

Section "Screen"
    Identifier     "Screen0"
    Device         "Device0"
    Monitor        "Monitor0"
    DefaultDepth    24
    Option         "UseDisplayDevice" "None"
    SubSection     "Display"
        Virtual     1280 1024
        Depth       24
    EndSubSection
EndSection

These are the options it gives me:

image

@Arishtanemi2

Arishtanemi2 commented Feb 20, 2019

I've been getting the same error too. I am using a T4 and have completed all the previous steps. Here is my xorg.conf file:

# nvidia-xconfig: X configuration file generated by nvidia-xconfig
# nvidia-xconfig:  version 410.72
Section "ServerLayout"
    Identifier     "Layout0"
    Screen      0  "Screen0"
    InputDevice    "Keyboard0" "CoreKeyboard"
    InputDevice    "Mouse0" "CorePointer"
EndSection
Section "Files"
EndSection
Section "InputDevice"
    # generated from default
    Identifier     "Mouse0"
    Driver         "mouse"
    Option         "Protocol" "auto"
    Option         "Device" "/dev/psaux"
    Option         "Emulate3Buttons" "no"
    Option         "ZAxisMapping" "4 5"
EndSection
Section "InputDevice"
    # generated from default
    Identifier     "Keyboard0"
    Driver         "kbd"
EndSection
Section "Monitor"
    Identifier     "Monitor0"
    VendorName     "Unknown"
    ModelName      "Unknown"
    HorizSync       28.0 - 33.0
    VertRefresh     43.0 - 72.0
    Option         "DPMS"
EndSection
Section "Device"
    Identifier     "Device0"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
    BoardName      "Tesla T4"
    BusID          "0:4:0"
    Option         "AllowEmptyInitialConfiguration"
EndSection
Section "Screen"
    Identifier     "Screen0"
    Device         "Device0"
    Monitor        "Monitor0"
    DefaultDepth    24
    Option         "UseDisplayDevice" "None"
    SubSection     "Display"
        Virtual     1280 1024
        Depth       24
    EndSubSection
EndSection

@MetaZhi

MetaZhi commented Feb 28, 2019

Any suggestions on this? I've run into this issue too.

@MetaZhi

MetaZhi commented Feb 28, 2019

I found a solution that works for me:

delete or comment out (with "#") the ServerLayout and Screen sections in the /etc/X11/xorg.conf file
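
The edit described above could be scripted roughly as follows. This is a sketch under assumptions, not a tested fix: `comment_sections` is a hypothetical helper, it matches the exact keyword casing that nvidia-xconfig generated in the files posted in this thread, and it writes the result to standard output instead of touching the original file.

```shell
#!/bin/sh
# Prefix every line of the "ServerLayout" and "Screen" sections with "#".
# comment_sections is a hypothetical helper name. It assumes the casing
# produced by nvidia-xconfig (Section / EndSection) and prints to stdout.
comment_sections() {
    awk '
        # A top-level Section line for one of the two target sections.
        /^[ \t]*Section[ \t]+"(ServerLayout|Screen)"/ { in_sec = 1 }
        {
            prefix = in_sec ? "#" : ""
            print prefix $0
        }
        # EndSubSection does not match here, so nested blocks stay commented.
        /^[ \t]*EndSection/ { in_sec = 0 }
    ' "$1"
}

# Demo on a shortened sample based on the configs posted in this issue.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Section "Device"
    Identifier     "Device0"
    Driver         "nvidia"
EndSection

Section "Screen"
    Identifier     "Screen0"
    Option         "UseDisplayDevice" "None"
    SubSection     "Display"
        Depth       24
    EndSubSection
EndSection
EOF
comment_sections "$sample"
rm -f "$sample"
```

For example, `comment_sections /etc/X11/xorg.conf > /tmp/xorg.conf.new` writes the edited copy to a scratch path (an arbitrary choice here), so the original stays untouched until the result has been reviewed.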

@htdt

htdt commented Apr 9, 2019

Same issue and same solution on a Tesla V100.

@juge2

juge2 commented Jul 16, 2019

For me, removing only Option "UseDisplayDevice" "none" from the "Screen" section also does the trick.
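
This narrower change could be sketched as a one-line filter. The helper name `drop_use_display_device` is hypothetical, and the pattern assumes the option is written with the same casing as in the xorg.conf files posted earlier in this thread. As written it only prints the filtered file and does not modify anything in place.

```shell
#!/bin/sh
# Print an xorg.conf with every 'Option "UseDisplayDevice" ...' line removed.
# drop_use_display_device is a hypothetical helper name. The match is
# case-sensitive and assumes the spelling generated by nvidia-xconfig.
drop_use_display_device() {
    sed '/Option[[:space:]]*"UseDisplayDevice"/d' "$1"
}

# Demo on the Screen section posted in this issue.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Section "Screen"
    Identifier     "Screen0"
    Device         "Device0"
    Monitor        "Monitor0"
    DefaultDepth    24
    Option         "UseDisplayDevice" "None"
    SubSection     "Display"
        Virtual     1280 1024
        Depth       24
    EndSubSection
EndSection
EOF
drop_use_display_device "$sample"
rm -f "$sample"
```

After reviewing the output and installing it as /etc/X11/xorg.conf, the X server would need to be started again with the commands from the original report.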

@zeromodule

@zhenghongzhi @juge2 guys you've helped us so much! thank you!
