Is it possible to build tf1.15 with CUDA 11, to run tf1.x code on RTX 30XX? #167

Open
Fannhhyy opened this issue Jan 6, 2021 · 10 comments

Fannhhyy commented Jan 6, 2021

https://github.com/nvidia/tensorflow
This tf1.15 build can run on RTX 30XX cards, but only on Linux. I tried to build it on Windows, but the build failed.

Owner

fo40225 commented Jan 7, 2021

I don't have an RTX 3090, but I think you can try these steps.

  1. Download the latest CUDA toolkit and install only the driver from it.
  2. Download the CUDA toolkit version that was used to compile the tensorflow binary you are using. Run the installer, skip the driver install, and install the CUDA runtime. Check that PATH contains the CUDA bin folder.
  3. Download the cuDNN version that was used to compile the tensorflow binary you are using; it must be the build for the CUDA runtime you just installed. Place the .dll in the CUDA bin folder.

If you use this repo's whl, 1.15 is built with CUDA 10.1.243_426.00 / cuDNN 7.6.4.38 for CUDA 10.1.
If you use the official pip package, I guess the CUDA/cuDNN version is 10.0/7.6.x.
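
The PATH check in steps 2 and 3 can be scripted. Below is a minimal sketch, not an official tool: `find_on_path` is a hypothetical helper, and the DLL names in the usage part (`cudart64_101.dll`, `cudnn64_7.dll`) are my assumption for a CUDA 10.1 / cuDNN 7 install, so adjust them to the versions your wheel was built against.

```python
import os
from pathlib import Path


def find_on_path(filenames, path_value=None):
    """Map each file name to the first PATH directory containing it, or None."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    # Skip empty entries that appear when PATH has a trailing separator.
    dirs = [Path(d) for d in path_value.split(os.pathsep) if d]
    found = {}
    for name in filenames:
        found[name] = next((str(d) for d in dirs if (d / name).is_file()), None)
    return found


if __name__ == "__main__":
    # Assumed DLL names for CUDA 10.1 / cuDNN 7; change them for other versions.
    wanted = ["cudart64_101.dll", "cudnn64_7.dll"]
    for name, where in find_on_path(wanted).items():
        print(name, "->", where or "NOT FOUND on PATH")
```

If a DLL is reported as not found, tensorflow will most likely fail at import time or when the first GPU op runs.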

Author

Fannhhyy commented Jan 8, 2021

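I don't have an RTX 3090, but I think you can try these steps.

  1. Download the latest CUDA toolkit and install only the driver from it.
  2. Download the CUDA toolkit version that was used to compile the tensorflow binary you are using. Run the installer, skip the driver install, and install the CUDA runtime. Check that PATH contains the CUDA bin folder.
  3. Download the cuDNN version that was used to compile the tensorflow binary you are using; it must be the build for the CUDA runtime you just installed. Place the .dll in the CUDA bin folder.

If you use this repo's whl, 1.15 is built with CUDA 10.1.243_426.00 / cuDNN 7.6.4.38 for CUDA 10.1.
If you use the official pip package, I guess the CUDA/cuDNN version is 10.0/7.4.x.

Neither version works; the model inference results are wrong.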

Owner

fo40225 commented Jan 11, 2021

You should use the CPU version of tensorflow to confirm that your model and code work.

A misconfigured CUDA environment usually causes an exception and an exit, not silently wrong results.
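
One way to make that CPU check concrete is to save the flattened model outputs from a CPU run and from a GPU run, then compare them. The sketch below is only an illustration: `compare_outputs` and its tolerances are hypothetical choices of mine, not part of tensorflow. To get the CPU numbers without installing a second package, a common approach is to set `CUDA_VISIBLE_DEVICES=-1` before importing tensorflow.

```python
import math


def compare_outputs(reference, candidate, rel_tol=1e-4, abs_tol=1e-6):
    """Return (non_finite_count, mismatch_count) for two float sequences.

    reference: outputs from the trusted run (for example CPU).
    candidate: outputs from the run under suspicion (for example GPU).
    """
    if len(reference) != len(candidate):
        raise ValueError("outputs differ in length")
    # NaN or inf in the candidate is counted separately from plain mismatches.
    non_finite = sum(1 for x in candidate if not math.isfinite(x))
    mismatches = sum(
        1
        for r, c in zip(reference, candidate)
        if math.isfinite(c)
        and not math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol)
    )
    return non_finite, mismatches
```

A result of `(0, 0)` means the two runs agree within tolerance; anything else points at the GPU stack rather than at the model code.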

Author

Fannhhyy commented Jan 17, 2021

You should use the CPU version of tensorflow to confirm that your model and code work.

A misconfigured CUDA environment usually causes an exception and an exit, not silently wrong results.

I have a machine with three graphics cards: GTX 1080 Ti, RTX 2080 Ti, and RTX 3070. Only the RTX 3070 does not work.

Owner

fo40225 commented Jan 20, 2021

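So you have one machine with three generations of graphics cards installed, all using the same driver version, CUDA libraries, tf version, source code, and model,
but only the Ampere card produces wrong results.
You may really have hit a bug in the older CUDA/cuDNN on the new card.

First try emptying %APPDATA%\NVIDIA\ComputeCache and setting the environment variable CUDA_CACHE_MAXSIZE=4294967295 to see whether that solves the problem.

Building the original tf1.15 with CUDA 11/cuDNN 8 would probably require a lot of porting.
Fixing the Windows build problems in the NVIDIA version of the source code should be easier.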
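
The cache reset suggested above can be sketched in Python as follows. The helper names are hypothetical; only the `%APPDATA%\NVIDIA\ComputeCache` location and the `CUDA_CACHE_MAXSIZE` value come from this comment. I am assuming the variable has to be in the environment before the CUDA driver initialises, so it should be set before tensorflow is imported in the same process.

```python
import os
import shutil
from pathlib import Path


def default_cache_dir(env=os.environ):
    """Return %APPDATA%\\NVIDIA\\ComputeCache as a Path, or None if APPDATA is unset."""
    appdata = env.get("APPDATA")
    return Path(appdata) / "NVIDIA" / "ComputeCache" if appdata else None


def clear_compute_cache(cache_dir):
    """Delete everything inside cache_dir but keep the directory itself.

    Returns the number of top-level entries removed.
    """
    cache_dir = Path(cache_dir)
    if not cache_dir.is_dir():
        return 0
    removed = 0
    for entry in cache_dir.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            entry.unlink()
        removed += 1
    return removed


def reset_cache_and_raise_limit(env=os.environ):
    """Empty the JIT cache (if it can be located) and raise its size limit."""
    target = default_cache_dir(env)
    removed = clear_compute_cache(target) if target is not None else 0
    env["CUDA_CACHE_MAXSIZE"] = "4294967295"
    return removed
```

Call `reset_cache_and_raise_limit()` at the very top of the training script, before `import tensorflow`. Nothing is deleted until that function is called.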

@Fannhhyy
Author

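So you have one machine with three generations of graphics cards installed, all using the same driver version, CUDA libraries, tf version, source code, and model,
but only the Ampere card produces wrong results.
You may really have hit a bug in the older CUDA/cuDNN on the new card.

First try emptying %APPDATA%\NVIDIA\ComputeCache and setting the environment variable CUDA_CACHE_MAXSIZE=4294967295 to see whether that solves the problem.

Building the original tf1.15 with CUDA 11/cuDNN 8 would probably require a lot of porting.
Fixing the Windows build problems in the NVIDIA version of the source code should be easier.

Yes, the same machine has three generations of graphics cards, and I ran the keras example code on all of them.
I tried both clearing the cache and using the environment variable to disable the cache; neither made it work properly. Even tf2.4, which is based on CUDA 11, does not work properly. Only tf2.5 dev works, but our code is hard to migrate to it.
I am trying to compile the NVIDIA version, but it hard-codes some things, which prevents it from compiling on Windows.
Do you live in mainland China? If you need it, I can lend you the RTX 3070.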

Owner

fo40225 commented Jan 21, 2021

Could you describe the steps you used to reproduce the problem with the keras example?

I think I should be able to borrow a 3090 for testing.

@Fannhhyy
Author

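Could you describe the steps you used to reproduce the problem with the keras example?

I think I should be able to borrow a 3090 for testing.

With tf1.15 and keras 2.3, even an example such as keras\examples\cifar10_resnet.py cannot be trained; training results in NaN.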
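
To report where training goes wrong, it helps to know at which step the loss first becomes NaN. Below is a small sketch; `first_non_finite` is a hypothetical helper, and I am assuming the per-epoch losses are available as a plain list of floats, for example from the `history["loss"]` list of the object returned by `model.fit`.

```python
import math


def first_non_finite(values):
    """Return the index of the first NaN or inf in values, or None if all are finite."""
    for index, value in enumerate(values):
        if not math.isfinite(value):
            return index
    return None
```

For example, `first_non_finite([2.3, 1.9, float("nan")])` returns `2`. An index of `0` suggests the very first GPU computation is already broken.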

Owner

fo40225 commented Jan 26, 2021

Test result

Windows
AMD Ryzen 7 5800x
gigabyte x570 aorus elite F30
4x ADATA DDR4-3200 32GB
Crucial P5 1TB
GIGABYTE RTX 3090 TURBO 24GB
Windows 10 Pro 1903
NVIDIA Driver 460.89
Anaconda 2020.02
keras 2.3.1

tensorflow-gpu 1.15.5 from pip
CUDA 10.0.130
CUDNN 7.6.5.32 for cuda10.0
error: CUBLAS_STATUS_EXECUTION_FAILED

tensorflow from this repo 1.15.0\py37\CPU+GPU\cuda101cudnn76avx2
CUDA 10.1.243
CUDNN 7.6.5.32 for cuda10.1
loss: nan

Linux
2x Intel Xeon Gold 6248R
16x Samsung DDR4-2933 64GB ECC RDIMM
Samsung PM983 1.92TB
2x GIGABYTE RTX 3090 TURBO 24GB
ubuntu 20.04 5.4.0-62
NVIDIA Driver 460.32.03
keras 2.3.1

nvcr.io/nvidia/tensorflow:20.03-tf1-py3 slow JIT, slow execute
nvcr.io/nvidia/tensorflow:20.06-tf1-py3 JIT, slow execute
nvcr.io/nvidia/tensorflow:20.07-tf1-py3 JIT, slow execute
nvcr.io/nvidia/tensorflow:20.08-tf1-py3 JIT, slow execute
nvcr.io/nvidia/tensorflow:20.09-tf1-py3 JIT, slow execute
nvcr.io/nvidia/tensorflow:20.10-tf1-py3 OK
nvcr.io/nvidia/tensorflow:20.11-tf1-py3 OK
nvcr.io/nvidia/tensorflow:20.12-tf1-py3 OK

Owner

fo40225 commented Jan 26, 2021

I have fixed NVIDIA's code. The changes are in
NVIDIA/tensorflow#14

The whl built from this PR is at
https://github.com/fo40225/tensorflow-windows-wheel/tree/master/1.15.4+nv20.12/

Build environment
Visual Studio 2019 16.8
CUDA 11.1.1
cuDNN 8.0.5.39
