BatchReindexLayer fails GPU gradient tests under CUDA v9.1 #6164
Comments
Confirmed on a standard Ubuntu 16.04 build both by myself (with GCC 5.4.0 and NVCC 9.1.85) and others: first in #6140, but also on caffe-users (thread1, thread2, thread3, thread 4). Your workaround is to add a NVCCFLAGS += -G line.
Hi Noiredd,
Yes. Additionally, I found another runtest failure which appears to be unrelated to this one.
Confirmed, the following tests fail on CUDA 9.1 and cuDNN 7.
I was able to pass the tests by following @MrMYHuang's suggestion to add the NVCCFLAGS option.
@MrMYHuang's suggestion worked. You have to add NVCCFLAGS += -G to the Makefile.
I submitted a bug report to NVIDIA. An NVIDIA staff member replied that the CUDA development team has identified this CUDA 9.1 issue and is planning to fix it in the next release. For the time being, it is suggested to use CUDA 9.0.
I also had this problem. With CUDA 9.1 + cuDNN 7.0.5 + Caffe, 1 test [ FAILED ]: the mnist example was OK, but my own net failed. Setup: GTX 1080, Ubuntu 16.04.3, driver version 387.34, i7 980X 3.3 GHz, P6T-SE, 6 GB RAM. With CUDA 8.0 + cuDNN 7.0.5, "make runtest" passed, and Caffe's mnist training passed as well. My net is for computer Go; it predicts the next move.
Confirmed, the following tests fail on Ubuntu 17.10, CUDA 9.1 and cuDNN 7.
I am also using CUDA 9.1 and cuDNN 7.0.5 and can confirm this failure. I actually came here to post another test failure I had, but when I was about to, I noticed that disabling multi-GPU fixed that failure and exposed this one instead; I will post that in a separate issue. Edit: actually, after unsetting the CUDA_VISIBLE_DEVICES variable, the other issue I am referring to oddly no longer occurs. I won't open an issue for it until I can get the log to be generated again; I might not have re-enabled multi-GPU support properly.
Unfortunately, even with the latest nvcc patch 2 released, the problem is still present.
The issue still exists for me, on a physical machine with a single Pascal GPU.
The issue exists for me too; I got the same test failure. After adding "NVCCFLAGS += -G" as the OP suggested, there was no error and all tests passed. But what does it mean for us when this flag is added?
I too had this error with @evilmtv's setup (except Ubuntu 14.04). I wanted to try following @Noiredd's suggestion and see whether this problem could be fixed by only changing the optimization level instead of adding -G. Short answer: no. After changing only the optimization level, the test still failed.
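For context on the question above: nvcc's -G is shorthand for --device-debug, which embeds debug information for device code and disables device-code optimization, so all CUDA kernels built this way run slower. A minimal Makefile sketch of the two knobs, assuming the stock Caffe Makefile (the -Xptxas line is only an illustration of lowering the device optimizer level, not necessarily the exact flag tried above):

```makefile
# Workaround used in this thread: full device-debug build.
# -G (--device-debug) embeds device debug info and turns off device-code
# optimization, which is why the miscompiled kernel then behaves correctly.
NVCCFLAGS += -G

# Lowering only the device optimizer level would instead look like this;
# per the comment above, this alone reportedly did not fix the failing test.
# NVCCFLAGS += -Xptxas -O0
```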
The same problem occurred when compiling under Gentoo Linux; NVCCFLAGS += -G fixed it.
GPU: Nvidia GT 1030. 4 tests failed. I tried adding the "-G" option to the Makefile, but it does not fix it; in any case, 3 tests still failed.
I rebuilt with CUDA 8.0 + cuDNN 6.0.21, with OpenCV disabled, and all tests passed. OK, I got all tests to pass, and now training/running all networks works fine, the same on CPU and GPU. The remaining 2 failures were caused by the latest MKL version (2018.2.199); after replacing it with ATLAS everything works fine and those 2 tests passed.
Confirmed. |
Has anyone tested CUDA 9.2? |
@cdluminate All tests passed with latest commit + CUDA 9.2 + gcc 7.3.1 |
@xkszltl Thanks. That means I can remove the temporary fix from Debian/Ubuntu's pre-built binary package as long as CUDA 9.2 is available.
@cdluminate BTW I'm on CentOS.
Not working here with latest commit + libcudnn7 (7.1.4.18-1+cuda9.2) + cuda 9.2 + gcc (5.4) |
All tests passed with commit 8645207 + CentOS 7.5.1804 + CUDA 9.2 + CUDNN 7.1 + gcc 4.8.5! |
Not working for me. Details of the problem are in the following link:
@meriem87 Your issue looks unrelated to this one.
Are you root? |
No, you aren't.
You are a normal user with GPU access.
…On Tue, Feb 4, 2020, 23:39 Swjtu-only wrote:
$ make clean & make all & make test & make runtest
Are you root?
Thanks. If I am a normal user with GPU access, I will run into a permissions issue.
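A quick way to check whether a normal (non-root) account can actually use the GPU and run the build and tests; this is a sketch using standard commands, and device paths or group names may differ per distribution:

```sh
# Run these as the normal user, not as root.
nvidia-smi            # should list the GPU and the driver version
ls -l /dev/nvidia*    # the device nodes must be readable/writable by you
id                    # check group membership (e.g. the 'video' group on some distros)

# If the GPU is accessible, the build and tests work without sudo.
# Note '&&' rather than a single '&', so each step only runs if the previous one succeeded:
make clean && make all && make test && make runtest
```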
Your system configuration
Operating system: CentOS 7.4.1708
Compiler: x86_64-conda_cos6-linux-gnu-g++, gcc version 7.2.0 (crosstool-NG)
Graphics card: nVIDIA GeForce GTX 1070
CUDA version (if applicable): 9.1
CUDNN version (if applicable): 7.0.5
BLAS: openblas 0.2.20
Python or MATLAB version (for pycaffe and matcaffe respectively):
Anaconda 3 5.0.1 64-bit Python 3.6.4
Steps to reproduce
As shown in this issue, crosstool-NG-compiled libraries have a compatibility problem, so I use Anaconda's gxx_linux-64 7.2.0 compiler to compile Caffe (commit e93b5e2) on CentOS 7 with this Makefile.config and these Anaconda packages (including libopenblas, leveldb, lmdb, opencv, protobuf, glog, gflags, py-boost, libboost, ...) and the following commands:
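A typical sequence for a Makefile-based Caffe build (assumed here, since the exact command list is not preserved above):

```sh
# Typical Makefile-based Caffe build and test sequence (assumed):
make all -j"$(nproc)"     # build libcaffe and the tools
make test -j"$(nproc)"    # build the unit test binaries
make runtest              # run the test suite
```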
However, make runtest failed at ./build/test/test_batch_reindex_layer.testbin with gradient-check error messages. After countless Caffe compilations and tests, I finally found a workaround for this problem: I add the line

NVCCFLAGS += -G

to the Makefile. Then, after compiling Caffe again, make runtest passes without failure! I think the problem is likely unrelated to Python and the x86_64-conda_cos6-linux-gnu-g++ compiler, but related to nvcc (CUDA 9.1), so it can probably be reproduced with other g++ compilers as well. Actually, I see someone has the same problem with Ubuntu 16.04 + CUDA 9.1. I hope someone can fix this problem.
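The before/after Makefile excerpt did not come through above. A minimal sketch of the change, assuming the stock Caffe Makefile (the neighbouring NVCCFLAGS line may differ slightly between Caffe versions):

```makefile
# Stock Caffe Makefile, CUDA section (roughly):
#   NVCCFLAGS += -ccbin=$(CXX) -Xcompiler -fPIC $(COMMON_FLAGS)
#
# With the workaround added just above it, so every .cu file is compiled
# with device-code optimization disabled:
NVCCFLAGS += -G
NVCCFLAGS += -ccbin=$(CXX) -Xcompiler -fPIC $(COMMON_FLAGS)
```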