
Cuda 3.0? #25

Closed
infojunkie opened this issue Nov 9, 2015 · 101 comments

Comments

@infojunkie

Are there plans to support Cuda compute capability 3.0?

@zheng-xq
Contributor

zheng-xq commented Nov 9, 2015

Officially, Cuda compute capability 3.5 and 5.2 are supported. You can try to enable other compute capabilities by modifying the build script:

https://github.com/tensorflow/tensorflow/blob/master/third_party/gpus/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc#L236
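For reference, a minimal sketch of the kind of edit involved, assuming the wrapper script (which is Python) hardcodes a list of capabilities and expands it into nvcc -gencode flags; the variable names here are illustrative, not necessarily the script's actual ones:

supported_cuda_compute_capabilities = ["3.0", "3.5", "5.2"]  # "3.0" added

nvcc_gencode_flags = []
for capability in supported_cuda_compute_capabilities:
    arch = capability.replace(".", "")  # e.g. "3.0" -> "30"
    # One entry for native machine code (sm_XX) and one for PTX
    # (compute_XX), so binaries run natively on sm_30 parts and stay
    # forward-compatible with newer GPUs.
    nvcc_gencode_flags.append(
        "-gencode=arch=compute_%s,code=sm_%s" % (arch, arch))
    nvcc_gencode_flags.append(
        "-gencode=arch=compute_%s,code=compute_%s" % (arch, arch))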

@infojunkie
Author

Thanks! Will try it and report here.

@zheng-xq
Contributor

zheng-xq commented Nov 9, 2015

This is not officially supported yet. But if you want to enable Cuda 3.0 locally, here are the additional places to change:

https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/common_runtime/gpu/gpu_device.cc#L610
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/common_runtime/gpu/gpu_device.cc#L629
These are the places where the smaller GPU device is ignored.

The official support will eventually come in a different form, where we make sure the fix works across all the different computational environments.

@keveman added the cuda label Nov 9, 2015
@infojunkie
Author

I made the changes to the lines above, and was able to compile and run the basic example on the Getting Started page: http://tensorflow.org/get_started/os_setup.md#try_your_first_tensorflow_program - it did not complain about the GPU, but it didn't report using the GPU either.

How can I help with next steps?

@zheng-xq
Contributor

infojunkie@, could you post your steps and upload the log?

If you were following this example:

bazel build -c opt --config=cuda //tensorflow/cc:tutorials_example_trainer
bazel-bin/tensorflow/cc/tutorials_example_trainer --use_gpu

If you see the following line, the GPU logical device is being created:

Creating TensorFlow device (/gpu:0) -> (device: ..., name: ..., pci bus id: ...)

If you want to be absolutely sure the GPU was used, set CUDA_PROFILE=1 to enable the Cuda profiler. If the Cuda profiler logs were generated, that is a sure sign the GPU was used.

http://docs.nvidia.com/cuda/profiler-users-guide/#command-line-profiler-control
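Another lightweight check from Python is TensorFlow's device placement logging, which prints each op's assigned device to stderr; a small sketch using the session API of this era:

import tensorflow as tf

# log_device_placement=True makes TensorFlow print the device chosen
# for each operation (e.g. /gpu:0) when the graph runs.
config = tf.ConfigProto(log_device_placement=True)
with tf.Session(config=config) as sess:
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    b = tf.constant([[1.0, 1.0], [0.0, 1.0]])
    c = tf.matmul(a, b)  # should land on /gpu:0 if the GPU is usable
    print(sess.run(c))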

@infojunkie
Author

I got the following log:

I tensorflow/core/common_runtime/local_device.cc:25] Local device intra op parallelism threads: 8
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:888] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:88] Found device 0 with properties: 
name: GeForce GT 750M
major: 3 minor: 0 memoryClockRate (GHz) 0.967
pciBusID 0000:02:00.0
Total memory: 2.00GiB
Free memory: 896.49MiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:112] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_init.cc:122] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:643] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GT 750M, pci bus id: 0000:02:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:643] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GT 750M, pci bus id: 0000:02:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:643] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GT 750M, pci bus id: 0000:02:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:643] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GT 750M, pci bus id: 0000:02:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:643] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GT 750M, pci bus id: 0000:02:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:643] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GT 750M, pci bus id: 0000:02:00.0)
I tensorflow/core/common_runtime/gpu/gpu_region_allocator.cc:47] Setting region size to 730324992
I tensorflow/core/common_runtime/gpu/gpu_device.cc:643] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GT 750M, pci bus id: 0000:02:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:643] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GT 750M, pci bus id: 0000:02:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:643] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GT 750M, pci bus id: 0000:02:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:643] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GT 750M, pci bus id: 0000:02:00.0)
I tensorflow/core/common_runtime/local_session.cc:45] Local session inter op parallelism threads: 8

I guess it means the GPU was found and used. I can try the CUDA profiler if you think it's useful.

@udibr

udibr commented Nov 10, 2015

Please prioritize this issue. It is blocking GPU usage on both OSX and AWS's K520, and for many people these are the only environments available.
Thanks!

@graphific

Not the nicest fix, but just comment out the Cuda compute version check at gpu_device.cc lines 610 to 616 and recompile, and Amazon g2 GPU acceleration seems to work fine:

example

@infojunkie
Author

For reference, here's my very primitive patch to work with Cuda 3.0: https://gist.github.com/infojunkie/cb6d1a4e8bf674c6e38e

@markusdr

@infojunkie I applied your fix, but I got lots of NaNs in the computation output:

$ bazel-bin/tensorflow/cc/tutorials_example_trainer --use_gpu
000006/000003 lambda =     -nan x = [0.000000 0.000000] y = [0.000000 0.000000]
000004/000003 lambda = 2.000027 x = [79795.101562 -39896.468750] y = [159592.375000 -79795.101562]
000005/000006 lambda = 2.000054 x = [39896.468750 -19947.152344] y = [79795.101562 -39896.468750]
000001/000007 lambda =     -nan x = [0.000000 0.000000] y = [0.000000 0.000000]
000002/000003 lambda =     -nan x = [0.000000 0.000000] y = [0.000000 0.000000]
000009/000008 lambda =     -nan x = [0.000000 0.000000] y = [0.000000 0.000000]
000004/000004 lambda =     -nan x = [0.000000 0.000000] y = [0.000000 0.000000]
000001/000005 lambda =     -nan x = [0.000000 0.000000] y = [0.000000 0.000000]
000006/000007 lambda =     -nan x = [0.000000 0.000000] y = [0.000000 0.000000]
000003/000006 lambda =     -nan x = [0.000000 0.000000] y = [0.000000 0.000000]
000006/000006 lambda =     -nan x = [0.000000 0.000000] y = [0.000000 0.000000]

@zheng-xq
Contributor

@markusdr, this is very strange. Could you post the complete steps you used to build the binary?

What GPU and OS are you running with? Are you using Cuda 7.0 and Cudnn 6.5 V2?

@avostryakov

Just +1 to fix this problem on AWS as soon as possible. We don't have any other GPU cards for our research.

@allanzelener

Hi, not sure if this is a separate issue, but I'm trying to build with a CUDA 3.0 GPU (GeForce 660 Ti) and am getting many errors with --config=cuda. See the attached file below. It seems unrelated to the recommended changes above. I've noticed that it tries to compile a temporary compute_52.cpp1.ii file, which would be the wrong version for my GPU.

I'm on Ubuntu 15.10. I modified the host_config.h in the Cuda includes to remove the version check on gcc. I'm using Cuda 7.0 and cuDNN 6.5 v2 as recommended, although I have newer versions installed as well.

cuda_build_fail.txt

@markusdr

Yes, I was using Cuda 7.0 and Cudnn 6.5 on an EC2 g2.2xlarge instance with this AMI:
cuda_7 - ami-12fd8178
ubuntu 14.04, gcc 4.8, cuda 7.0, atlas, and opencv.
To build, I followed the instructions on tensorflow.org.

@vsrikarunyan

It looks like we are seeing an API incompatibility between Compute Capability v3 and Compute Capability v3.5; after applying infojunkie's patch, I stumbled onto this issue:

I tensorflow/core/common_runtime/gpu/gpu_device.cc:643] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Quadro K2100M, pci bus id: 0000:01:00.0)
I tensorflow/core/common_runtime/local_session.cc:45] Local session inter op parallelism threads: 8
F tensorflow/stream_executor/cuda/cuda_blas.cc:229] Check failed: f != nullptr could not find cublasCreate_v2 in cuBLAS DSO; dlerror: bazel-bin/tensorflow/cc/tutorials_example_trainer: undefined symbol: cublasCreate_v2

I'm running Ubuntu 15.04, gcc 4.9.2, CUDA Toolkit 7.5, cuDNN 6.5.

+1 for having Compute Capability v3 Support

@graphific

Is cublas installed? And where does it link to?

ls -lah /usr/local/cuda/lib64/libcublas.so
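If the symlink looks right but the loader still resolves a mismatched library, a quick ctypes probe can confirm whether the libcublas that dlopen finds actually exports cublasCreate_v2 (a sketch, assuming the CUDA lib directory is on the loader path):

import ctypes

# dlopen whatever libcublas the dynamic loader resolves, then try to
# dlsym the v2 entry point that the error message says is missing.
cublas = ctypes.CDLL("libcublas.so")
print("cublasCreate_v2 found:", hasattr(cublas, "cublasCreate_v2"))

If this prints False, the loader is picking up a stale or mismatched cuBLAS, and LD_LIBRARY_PATH ordering is the usual suspect.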

@zheng-xq
Contributor

@allanzelener, what OS and GCC versions do you have? Your errors seem to come from incompatible C++ compilers.

It is recommended to use Ubuntu 14.04 and GCC 4.8 with TensorFlow.

@zheng-xq
Contributor

@vsrikarunyan, it is better to use CUDA Toolkit 7.0, as recommended. You can install an older CUDA Toolkit alongside your newer toolkit. Just point TensorFlow's "configure" (and maybe LD_LIBRARY_PATH) to CUDA 7.0 when you run TensorFlow.

@zheng-xq
Contributor

@avostryakov, @infojunkie's early patch should work on AWS.

https://gist.github.com/infojunkie/cb6d1a4e8bf674c6e38e

An official patch is working its way through the pipeline. It will expose a configuration option to let you choose your compute target, but underneath it makes similar changes. I've tried it on AWS g2, and found that things would only work after I completely uninstalled the NVIDIA driver and reinstalled the latest GPU driver from NVIDIA.

Once again, the recommended setup on AWS at this point is: Ubuntu 14.04, GCC 4.8, CUDA Toolkit 7.0, and CUDNN 6.5. The last two can be installed without affecting your existing installations of other versions. Also, the officially recommended versions for the last two might change soon.

@jbencook

I applied the same patch on a g2.2xlarge instance and got the same result as @markusdr... a bunch of NaNs.

@allanzelener

@zheng-xq Yes, I'm on Ubuntu 15.10 and I was using GCC 5.2.1. The issue was the compiler. I couldn't figure out how to change the compiler with bazel, but simply installing gcc-4.8 and using update-alternatives to change the symlinks in /usr/bin seems to have worked. (More info: http://askubuntu.com/questions/26498/choose-gcc-and-g-version). Thanks for the help; I'll report back if I experience any further issues.

@nbenhaim

I did get this to work on a g2.2xlarge instance: the training example ran, and I verified that the GPU was active using the nvidia-smi tool. But when running MNIST's convolutional.py, it ran out of memory. I suspect this is just the batch size and the fact that the AWS GPUs don't have a lot of memory, but I wanted to throw that out there to make sure it sounds correct. To clarify, I ran the following; it ran for about 15 minutes and then ran out of memory.

python tensorflow/models/image/mnist/convolutional.py
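The usual fixes for this are lowering BATCH_SIZE in convolutional.py or bounding how much device memory TensorFlow grabs up front. A sketch of the latter, assuming a build of this era that supports tf.GPUOptions; the fraction is illustrative:

import tensorflow as tf

# Cap TensorFlow's GPU memory arena at ~60% of the device instead of
# letting it reserve nearly all of a small GPU's memory at startup.
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.6)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))

Note that capping the arena only helps if the allocator was over-reserving; if the model genuinely needs more memory than the card has, reducing the batch size is the fix.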

@anjishnu

@nbenhaim, just what did you have to do to get it to work?

@zheng-xq
Contributor

@markusdr, @jbencook, the NaNs are quite troubling. I ran the same thing myself and didn't have any problem.

If you are using the recommended software setup (Ubuntu 14.04, GCC 4.8, Cuda 7.0 and Cudnn 6.5), then my next guess is the Cuda driver. Could you uninstall and reinstall the latest Cuda driver?

This is the sequence I tried on AWS, your mileage may vary:

sudo apt-get remove --purge "nvidia*"
wget http://us.download.nvidia.com/XFree86/Linux-x86_64/352.55/NVIDIA-Linux-x86_64-352.55.run
chmod +x NVIDIA-Linux-x86_64-352.55.run
sudo ./NVIDIA-Linux-x86_64-352.55.run --accept-license --no-x-check --no-recursion

@jbencook

Thanks for following up @zheng-xq - I'll give that a shot today.

@mjwillson

Another +1 for supporting pre-3.5 GPUs, as someone else whose only realistic option for training on real data is AWS GPU instances.

Even for local testing, turns out my (recent, developer) laptop's GPU doesn't support 3.5 :-(

tarasglek pushed commits to tarasglek/tensorflow that referenced this issue Jun 20, 2017
@wingdi

wingdi commented Aug 23, 2017

I have the same problem: "Ignoring gpu device (device: 0, name: GeForce GT 635M, pci bus id) with Cuda compute capability 2.1. The minimum required Cuda capability is 3.0." @smtabatabaie @martinwicke @alphaJatin, help!

@martinwicke
Member

Compute capability 2.1 is too low to run TensorFlow. You'll need a newer (or more powerful) graphics card to run TensorFlow on a GPU.
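For anyone unsure what TensorFlow detects, one quick check from Python is the device_lib helper (a sketch; in many versions the GPU entry's description includes the pci bus id and compute capability):

from tensorflow.python.client import device_lib

# Lists every device TensorFlow can see; GPU entries carry a
# physical_device_desc string with the device's details.
for device in device_lib.list_local_devices():
    print(device.name, device.physical_device_desc)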

@mengxingxinqing

The URL in the answer to this question is invalid. Can you update it?

@gunan
Contributor

gunan commented Aug 8, 2018

For nightly pip packages, the recommended way to install is to use the pip install tf-nightly command.
ci.tensorflow.org is deprecated.

eggonlea pushed a commit to eggonlea/tensorflow that referenced this issue Mar 12, 2019
cjolivier01 pushed a commit to Cerebras/tensorflow that referenced this issue Dec 6, 2019
keithm-xmos referenced this issue in xmos/tensorflow Feb 1, 2021