BatchReindexLayer fails GPU gradient tests under CUDA v9.1 #6164

Closed
MrMYHuang opened this issue Jan 11, 2018 · 27 comments

MrMYHuang commented Jan 11, 2018

Your system configuration

Operating system: CentOS 7.4.1708
Compiler: x86_64-conda_cos6-linux-gnu-g++, gcc version 7.2.0 (crosstool-NG)
Graphics card: nVIDIA GeForce GTX 1070
CUDA version (if applicable): 9.1
CUDNN version (if applicable): 7.0.5
BLAS: openblas 0.2.20
Python or MATLAB version (for pycaffe and matcaffe respectively):
Anaconda 3 5.0.1 64-bit Python 3.6.4

Steps to reproduce

As shown in the issue "crosstool-NG compiled libraries compatibility problem", I use Anaconda's gxx_linux-64 7.2.0 compiler to compile Caffe (commit e93b5e2) on CentOS 7 with this
Makefile.config and these Anaconda packages (including libopenblas, leveldb, lmdb, opencv, protobuf, glog, gflags, py-boost, libboost, ...), using the following commands:

PATH=/cad/anaconda3/bin:/usr/bin make -j8
PATH=/cad/anaconda3/bin:/usr/bin make -j8 test
LD_LIBRARY_PATH=/usr/local/cuda/lib64 make runtest

However, make runtest failed at ./build/test/test_batch_reindex_layer.testbin with these error messages:

[----------] 2 tests from BatchReindexLayerTest/3, where TypeParam = caffe::GPUDevice<double>
[ RUN      ] BatchReindexLayerTest/3.TestForward
[       OK ] BatchReindexLayerTest/3.TestForward (3 ms)
[ RUN      ] BatchReindexLayerTest/3.TestGradient
./include/caffe/test/test_gradient_check_util.hpp:175: Failure
The difference between computed_gradient and estimated_gradient is 0.68212591193169037, which exceeds threshold_ * scale, where
computed_gradient evaluates to -0.68212591193169037,
estimated_gradient evaluates to 0, and
threshold_ * scale evaluates to 0.01.
debug: (top_id, top_data_id, blob_id, feat_id)=0,0,0,0; feat = -0.18315754813835761; objective+ = 3.4152213335738013; objective- = 3.4152213335738013
...

After countless Caffe compilations and tests, I finally found a workaround for this problem: I add the line NVCCFLAGS += -G to the Makefile, changing it from

...
# Debugging
ifeq ($(DEBUG), 1)
        COMMON_FLAGS += -DDEBUG -g -O0
        NVCCFLAGS += -G
else
        COMMON_FLAGS += -DNDEBUG -O2
endif
...

to

...
# Debugging
ifeq ($(DEBUG), 1)
        COMMON_FLAGS += -DDEBUG -g -O0
        NVCCFLAGS += -G
else
        COMMON_FLAGS += -DNDEBUG -O2
        NVCCFLAGS += -G
endif
...

Then, after compiling Caffe again, make runtest passes without failure!
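
To verify the workaround without rerunning the whole suite, you can rebuild and then invoke just the affected test binary directly. A sketch reusing the paths from the commands above; --gtest_filter is gtest's standard test-selection flag:

make clean
PATH=/cad/anaconda3/bin:/usr/bin make -j8
PATH=/cad/anaconda3/bin:/usr/bin make -j8 test
LD_LIBRARY_PATH=/usr/local/cuda/lib64 ./build/test/test_batch_reindex_layer.testbin --gtest_filter='*TestGradient*'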

I think the problem is probably unrelated to Python and the x86_64-conda_cos6-linux-gnu-g++ compiler, but related to nvcc (CUDA 9.1), so it might be reproducible with other g++ compilers as well. In fact, I have seen someone hit the same problem on Ubuntu 16.04 + CUDA 9.1. I hope someone can fix this.

Noiredd changed the title from "Testing Caffe Fail with CUDA 9.1" to "BatchReindexLayer fails GPU gradient tests under CUDA v9.1" on Jan 11, 2018
Noiredd added the bug label on Jan 11, 2018
Noiredd (Member) commented Jan 11, 2018

Confirmed on a standard Ubuntu 16.04 build both by myself (with GCC 5.4.0 and NVCC 9.1.85) and others: first in #6140, but also on caffe-users (thread1, thread2, thread3, thread 4).

Your workaround is to add a -G flag to NVCC even for the standard build, correct? This flag causes generation of debug information for GPU code and disables all optimizations [ref] - the latter effect seems more relevant.
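
If you want to double-check that the flag actually reaches nvcc after editing the Makefile, a dry run is enough. A sketch relying only on GNU make's -n option, which prints recipes without running them; output details depend on the Makefile version:

make clean
make -n 2>/dev/null | grep -m 1 'nvcc '    # the first printed nvcc command should now include -G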

MrMYHuang (Author) commented Jan 12, 2018

Hi Noiredd,

Your workaround is to add a -G flag to NVCC even for the standard build, correct?

Yes.

Additionally, I found another runtest failure, which is unrelated to NVCCFLAGS += -G but related to OpenCV 3.3.1: if I enable OpenCV (by commenting out USE_OPENCV := 0), runtest fails at
./build/test/test_net.testbin with these error messages:

Cuda number of devices: 1
Current device id: 0
Current device name: GeForce GTX 1070
[==========] Running 124 tests from 5 test cases.
[----------] Global test environment set-up.
[----------] 26 tests from NetTest/0, where TypeParam = caffe::CPUDevice<float>
[ RUN      ] NetTest/0.TestHasBlob
[       OK ] NetTest/0.TestHasBlob (593 ms)
[ RUN      ] NetTest/0.TestGetBlob
[       OK ] NetTest/0.TestGetBlob (2 ms)
...
[ RUN      ] NetTest/0.TestSharedWeightsResume
[       OK ] NetTest/0.TestSharedWeightsResume (0 ms)
[ RUN      ] NetTest/0.TestParamPropagateDown
[       OK ] NetTest/0.TestParamPropagateDown (1 ms)
[ RUN      ] NetTest/0.TestFromTo
src/caffe/test/test_net.cpp:1446: Failure
Value of: *loss_ptr
  Actual: 6.95498
Expected: loss
Which is: 6.94028
src/caffe/test/test_net.cpp:1446: Failure
Value of: *loss_ptr
  Actual: 6.95498
Expected: loss
Which is: 6.94028
[  FAILED  ] NetTest/0.TestFromTo, where TypeParam = caffe::CPUDevice<float> (3 ms)
[ RUN      ] NetTest/0.TestReshape
...
[  FAILED  ] 2 tests, listed below:
[  FAILED  ] NetTest/0.TestFromTo, where TypeParam = caffe::CPUDevice<float>
[  FAILED  ] NetTest/1.TestFromTo, where TypeParam = caffe::CPUDevice<double>

neilpanchal commented Jan 14, 2018

Confirmed, the following tests fail on CUDA 9.1 and cuDNN 7.

[  FAILED  ] 2 tests, listed below:
[  FAILED  ] BatchReindexLayerTest/2.TestGradient, where TypeParam = N5caffe9GPUDeviceIfEE
[  FAILED  ] BatchReindexLayerTest/3.TestGradient, where TypeParam = N5caffe9GPUDeviceIdEE

I was able to pass the tests by following @MrMYHuang's suggestion to add NVCCFLAGS += -G.

@srivathsapv

@MrMYHuang's suggestion worked. You have to add NVCCFLAGS += -G to the Makefile and run

$ make clean && make all && make test && make runtest

MrMYHuang (Author) commented Jan 19, 2018

I submitted a bug report to NVIDIA. An NVIDIA staff member replied that the CUDA development team has identified this CUDA 9.1 issue and is planning to fix it in the next release. Until then, they suggest using CUDA 9.0.

yssaya commented Jan 22, 2018

I also had this problem.

CUDA 9.1 + cuDNN 7.0.5 + caffe [ FAILED ] 1 tests. mnist OK. my net failed
CUDA 9.0 + cuDNN 7.0.5 + caffe [ FAILED ] 2 tests. mnist OK. my net failed
CUDA 8.0 + cuDNN 7.0.5 + caffe [ PASSED ] 2123 tests. mnist OK. my net failed
CUDA 8.0 + cuDNN 6.0.21+ caffe [ PASSED ] 2123 tests. mnist OK. my net failed
CUDA 8.0 + cuDNN 5.0.5 + caffe-rc5 [ PASSED ] 2123 tests. mnist OK. my net OK.
CUDA 8.0 + cuDNN 6.0.21+ caffe-1.0 [ PASSED ] 2123 tests. mnist OK. my net OK.

GTX 1080, Ubuntu 16.04.3, Driver Version: 387.34, i7 980X 3.3GHz, P6T-SE, RAM 6GB

CUDA 8.0 + cuDNN 7.0.5 passed make runtest and passed Caffe's MNIST training,
but my net's training failed: its accuracy jumped to 100% around 300 training iterations.
Training on the CPU was OK. I gave up on the latest Caffe. Finally,
CUDA 8.0 + cuDNN 6.0.21 + caffe-1.0 was OK.

My net is for computer Go; it predicts the next move.
It has 12 conv layers, 128 channels, kernel_size 3, and no batch normalization.

@MacwinWin

Confirmed, the following tests fail on Ubuntu 17.10, CUDA 9.1 and cuDNN 7:

[  FAILED  ] 2 tests, listed below:
[  FAILED  ] BatchReindexLayerTest/2.TestGradient, where TypeParam = caffe::GPUDevice<float>
[  FAILED  ] BatchReindexLayerTest/3.TestGradient, where TypeParam = caffe::GPUDevice<double>

I was able to pass the tests by following @MrMYHuang's suggestion to add the NVCCFLAGS += -G line.

bdmccord commented Feb 4, 2018

I am also using CUDA 9.1 and cuDNN v7.0.5 and can confirm this failure. I actually came here to post another test failure I had, but as I was about to, I noticed that disabling multi-GPU fixed that failure and exposed this one. I will post that in a separate issue, though.

Edit: actually, after unsetting the CUDA_VISIBLE_DEVICES variable, the other issue I am referring to oddly no longer occurs. I won't open an issue for it until I can get the log to be generated again; I might not have re-enabled multi-GPU support properly.

ghost commented Mar 3, 2018

Unfortunately, even with the latest nvcc Patch 2 release, the problem with BatchReindexLayerTest/3.TestGradient, where TypeParam = caffe::GPUDevice<double>, still persists.

xkszltl commented Apr 4, 2018

The issue still exists with a physical machine + single Pascal GPU + CentOS 7 + nvcc 9.1.85 + cuDNN 7.0.5.

evilmtv commented Apr 8, 2018

The issue exists for me too:
Physical machine
Ubuntu 16.04
NVIDIA driver: 390.48
CUDA: 9.1.85 + Patches 1, 2, 3
cuDNN: v7.1.2

Got:

[  FAILED  ] 1 test, listed below:
[  FAILED  ] BatchReindexLayerTest/3.TestGradient, where TypeParam = caffe::GPUDevice<double>

After adding NVCCFLAGS += -G as the OP suggested, there are no errors and all tests pass.

But what does adding this flag mean for us, i.e. are the optimizations only disabled during make, or completely?

weinman (Contributor) commented Apr 20, 2018

I too had this error with @evilmtv's setup (except on Ubuntu 14.04). Following up on @Noiredd's observation, I wanted to see whether this problem could be fixed by only changing the optimization level with the --optimize flag (rather than the -G flag).

Short answer: no. -G is the needed workaround.

After changing Makefile.config so that NVCCFLAGS += --optimize 0 (or NVCCFLAGS += -O0) and removing the -O2 entry from COMMON_FLAGS in the Makefile (line 322) to avoid an error caused by repeating the flag, the same tests failed.
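
For reference, the experiment above amounts to roughly this variant of the Debugging block quoted earlier (a sketch; exact line numbers and placement differ between Makefile.config and the Makefile across commits, and this variant did not fix the failing tests):

...
# Debugging
ifeq ($(DEBUG), 1)
        COMMON_FLAGS += -DDEBUG -g -O0
        NVCCFLAGS += -G
else
        COMMON_FLAGS += -DNDEBUG          # -O2 dropped so nvcc does not receive the flag twice
        NVCCFLAGS += -O0                  # or --optimize 0; the gradient tests still fail
endif
...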

ghost commented Apr 20, 2018

The same problem occurred when compiling under Gentoo Linux with
gcc 6.4.0
CUDA 9.1.85
glibc 2.26-r6
and Caffe compiled without Python support.

NVCCFLAGS += -G fixed it.

lubagov commented May 3, 2018

GPU: Nvidia GT 1030
Ubuntu 16.04, kernel 4.10.0-28-generic
Driver: 387.34
caffe: commit 8645207
CUDNN: 7.0.5.15-1+cuda9.1
CuBLAS: 9.1.85.3-1
Cuda-NVCC: 9.1.85.2-1

4 tests failed:
[==========] 2199 tests from 285 test cases ran. (343419 ms total)
[  PASSED  ] 2195 tests.
[  FAILED  ] 4 tests, listed below:
[  FAILED  ] NetTest/0.TestFromTo, where TypeParam = caffe::CPUDevice<float>
[  FAILED  ] NetTest/1.TestFromTo, where TypeParam = caffe::CPUDevice<double>
[  FAILED  ] BatchReindexLayerTest/2.TestGradient, where TypeParam = caffe::GPUDevice<float>
[  FAILED  ] BatchReindexLayerTest/3.TestGradient, where TypeParam = caffe::GPUDevice<double>

I tried adding the -G option to the Makefile, but it does not fix it; in any case, 3 tests still failed.
I tried training ResNet-34 and ResNet-18 networks (on only 6 images, to make it faster) and then tried to run them on the CPU. They do not work after training. But MNIST works normally, and a simplified bottleneck ResNet-50 works too. I don't know whether this is related to these unit tests or not.

lubagov commented May 4, 2018

I rebuilt with CUDA 8.0 + cuDNN 6.0.21, with OpenCV disabled, and all tests passed.
Note that before, I was using OpenCV 2.4 from the Ubuntu repo, not 3.3.1.
Of course, without OpenCV I don't have the ImageData layer, which uses imread and cv::Mat to load image files, and that is not good for me.

OK, I have now resolved all the test failures, and training/running all my networks works fine, the same on CPU and on GPU.
[  FAILED  ] NetTest/0.TestFromTo, where TypeParam = caffe::CPUDevice<float>
[  FAILED  ] NetTest/1.TestFromTo, where TypeParam = caffe::CPUDevice<double>

These 2 failures were caused by the latest MKL, 2018.2.199. When I replaced it with ATLAS, everything works fine and these 2 tests pass.
CUDA 8.0 + cuDNN 6.0.21 + ATLAS 3.10.2-9 works for me.
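
The BLAS swap described above is a one-line change in Makefile.config (a sketch based on the stock Makefile.config options; the commented BLAS_INCLUDE/BLAS_LIB overrides are only needed if ATLAS is installed in a non-standard location):

# Makefile.config: pick the BLAS backend (atlas, mkl, or open)
# BLAS := mkl
BLAS := atlas
# BLAS_INCLUDE := /path/to/atlas/include
# BLAS_LIB := /path/to/atlas/lib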

@cdluminate (Contributor)

Confirmed.
Debian Sid,
GCC 6 / CUDA 9.1 / NVIDIA driver 390.48

@cdluminate (Contributor)

Has anyone tested CUDA 9.2?

xkszltl commented May 27, 2018

@cdluminate All tests passed with latest commit + CUDA 9.2 + gcc 7.3.1

@cdluminate (Contributor)

@xkszltl Thanks. That means I can remove the temporary fix from Debian/Ubuntu's pre-built binary package once CUDA 9.2 is available. With -G enabled for nvcc, the performance drop looks significant...

xkszltl commented May 28, 2018

@cdluminate
Don't....simply trust me....
Experience may vary by system and...luck...๑乛◡乛๑

BTW I'm on CentOS

jeiks commented Jun 15, 2018

Not working here with the latest commit + libcudnn7 (7.1.4.18-1+cuda9.2) + CUDA 9.2 + gcc 5.4.
=/

@MrMYHuang (Author)

All tests passed with commit 8645207 + CentOS 7.5.1804 + CUDA 9.2 + CUDNN 7.1 + gcc 4.8.5!

@meriem87

Not working for me. Details of the problem are in the following link:
#6686

xkszltl commented Jan 30, 2019

@meriem87 Your issue looks unrelated to this one.

@Swjtu-only

$ make clean && make all && make test && make runtest

Are you root?

jeiks commented Feb 5, 2020 via email

No, you don't need to be root; a normal user with GPU access can run them.

@Swjtu-only

Thanks. Running as a normal user with GPU access, I ran into permission issues, so I decided to run the commands one by one, and luckily everything is OK for me.
