Remote server install failure #23

Open
wboykinm opened this issue Jan 11, 2018 · 2 comments

Comments

@wboykinm

Following the remote-launch outline laid out in @albarji's blog post . . .

  1. Booting a remote p2.xlarge server with Ubuntu 16.04.2 LTS (GNU/Linux 4.4.0-1020-aws x86_64)
  2. Cloning the repo
  3. Running the install script

. . . I get this:

./scripts/install-nvidia.sh
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Note, selecting 'libc6-dev' instead of 'libc-dev'
gcc is already the newest version (4:5.3.1-1ubuntu1).
make is already the newest version (4.1-6).
libc6-dev is already the newest version (2.23-0ubuntu9).
0 upgraded, 0 newly installed, 0 to remove and 128 not upgraded.
--2018-01-11 15:31:19--  http://us.download.nvidia.com/XFree86/Linux-x86_64/361.42/NVIDIA-Linux-x86_64-361.42.run
Resolving us.download.nvidia.com (us.download.nvidia.com)... 192.229.211.70, 2606:2800:21f:3aa:dcf:37b:1ed6:1fb
Connecting to us.download.nvidia.com (us.download.nvidia.com)|192.229.211.70|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 86760004 (83M) [application/octet-stream]
Saving to: ‘/tmp/NVIDIA-Linux-x86_64-361.42.run.1’

NVIDIA-Linux-x86_64-361.42.run.1             100%[=============================================================================================>]  82.74M   140MB/s    in 0.6s    

2018-01-11 15:31:19 (140 MB/s) - ‘/tmp/NVIDIA-Linux-x86_64-361.42.run.1’ saved [86760004/86760004]

Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 361.42...................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

ERROR: Unable to load the kernel module 'nvidia.ko'.  This happens most frequently when this kernel module was built against the wrong or improperly configured kernel sources,
       with a version of gcc that differs from the one used to build the target kernel, or if a driver such as rivafb, nvidiafb, or nouveau is present and prevents the NVIDIA
       kernel module from obtaining ownership of the NVIDIA graphics device(s), or no NVIDIA GPU installed in this system is supported by this NVIDIA Linux graphics driver
       release.
       
       Please see the log entries 'Kernel module load error' and 'Kernel messages' at the end of the file '/var/log/nvidia-installer.log' for more information.


ERROR: Installation has failed.  Please see the file '/var/log/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README
       available on the Linux driver download page at www.nvidia.com.

--2018-01-11 15:31:53--  https://github.com/NVIDIA/nvidia-docker/releases/download/v1.0.1/nvidia-docker_1.0.1-1_amd64.deb
Resolving github.com (github.com)... 192.30.253.113, 192.30.253.112
Connecting to github.com (github.com)|192.30.253.113|:443... connected.
HTTP request sent, awaiting response... 502 Bad Gateway
2018-01-11 15:31:53 ERROR 502: Bad Gateway.

dpkg: error processing archive /tmp/nvidia-docker*.deb (--install):
 cannot access archive: No such file or directory
Errors were encountered while processing:
 /tmp/nvidia-docker*.deb
sudo: nvidia-docker: command not found

This seems like a driver mismatch. Unfortunately I can't test this locally (wrong GPU), so I'm left guessing whether the image needs rebuilding or whether I need to change my EC2 configuration. It looks like the driver version needs a bump.
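Separately from the driver error, the log above shows a second failure: the nvidia-docker download returned 502 Bad Gateway, and dpkg then ran against a file that was never saved. A minimal sketch of guarding that step, assuming the download can simply be retried; the `retry` helper and the `RETRY_DELAY` variable are hypothetical names and are not part of `install-nvidia.sh`:

```shell
# Hypothetical helper: run a command up to N times, stopping at the first success.
# RETRY_DELAY (seconds between attempts, default 5) is an illustrative knob.
retry() {
  retry_max=$1
  shift
  retry_n=1
  while :; do
    "$@" && return 0                          # success: stop retrying
    [ "$retry_n" -ge "$retry_max" ] && return 1   # out of attempts
    retry_n=$((retry_n + 1))
    sleep "${RETRY_DELAY:-5}"
  done
}

# Possible usage, with the URL taken from the log above.
# dpkg only runs if the download actually succeeded:
#   url=https://github.com/NVIDIA/nvidia-docker/releases/download/v1.0.1/nvidia-docker_1.0.1-1_amd64.deb
#   deb=/tmp/nvidia-docker_1.0.1-1_amd64.deb
#   retry 3 wget -O "$deb" "$url" && sudo dpkg -i "$deb"
```

This does not address the kernel module error; it only keeps a transient download failure from cascading into the `dpkg` and `nvidia-docker: command not found` errors.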

@wboykinm
Author

UPDATE: I bumped the driver to the [apparently] current version, and it threw the same error as above.
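Since a newer driver version produced the same error, the installer log the message points to is probably the next place to look. A rough sketch of pulling out the relevant part, assuming the log contains the 'Kernel module load error' entry named in the error text; the function names are hypothetical:

```shell
# Print the installer log from the 'Kernel module load error' entry to the end.
# The default path is the one named in the installer's error message.
show_module_errors() {
  log=${1:-/var/log/nvidia-installer.log}
  if [ ! -r "$log" ]; then
    echo "cannot read $log" >&2
    return 1
  fi
  sed -n '/Kernel module load error/,$p' "$log"
}

# Print the running kernel and any loaded module that the error text
# lists as a possible conflict (nouveau, nvidiafb, rivafb).
print_kernel_context() {
  echo "running kernel: $(uname -r)"
  if command -v lsmod >/dev/null 2>&1; then
    lsmod | grep -i -E 'nouveau|nvidiafb|rivafb' \
      || echo "no conflicting framebuffer/nouveau module loaded"
  fi
}

# Possible usage on the instance:
#   show_module_errors
#   print_kernel_context
```

Comparing the kernel version printed here with the kernel headers the installer built against is one way to check the "wrong kernel sources" cause listed in the error.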

@albarji
Owner

albarji commented Jan 16, 2018

Hey @wboykinm! It's been a while since I last used that script to deploy this container, so I'm afraid it's pretty much outdated. My recommendation right now would be to create an instance based on one of the AMIs provided by NVIDIA, which already come prepared with the appropriate drivers and nvidia-toolkit versions.

I use the AMI named "NVIDIA CUDA Toolkit 7.5 on Amazon Linux", and it works pretty well. The only things you need to install manually after creating the instance are docker and nvidia-docker. After that you should be ready to run the container!
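Before launching the container on such an instance, it may help to confirm that the pieces mentioned above are actually on the PATH. A small sketch; `missing_tools` is a hypothetical helper name, and the tool list in the usage comment is an assumption based on this thread:

```shell
# Print the name of every requested command that is not found on PATH.
# Prints nothing when all of them are available.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Possible usage on the new instance:
#   missing_tools docker nvidia-docker nvidia-smi
```

An empty result only shows the commands exist; it does not prove the driver loaded, which `nvidia-smi` itself reports when run.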
