BatchReindexLayer fails GPU gradient tests under CUDA v9.1 #6164

Closed
MrMYHuang opened this issue Jan 11, 2018 · 27 comments

MrMYHuang commented Jan 11, 2018

Your system configuration

Operating system: CentOS 7.4.1708
Compiler: x86_64-conda_cos6-linux-gnu-g++, gcc version 7.2.0 (crosstool-NG)
Graphics card: nVIDIA GeForce GTX 1070
CUDA version (if applicable): 9.1
CUDNN version (if applicable): 7.0.5
BLAS: openblas 0.2.20
Python or MATLAB version (for pycaffe and matcaffe respectively):
Anaconda 3 5.0.1 64-bit Python 3.6.4

Steps to reproduce

As shown in the issue "crosstool-NG compiled libraries compatibility problem", I use Anaconda's gxx_linux-64 7.2.0 compiler to compile Caffe (commit e93b5e2) on CentOS 7 with this
Makefile.config and these Anaconda packages (including libopenblas, leveldb, lmdb, opencv, protobuf, glog, gflags, py-boost, libboost, ...), using the following commands:

PATH=/cad/anaconda3/bin:/usr/bin make -j8
PATH=/cad/anaconda3/bin:/usr/bin make -j8 test
LD_LIBRARY_PATH=/usr/local/cuda/lib64 make runtest

However, make runtest failed at ./build/test/test_batch_reindex_layer.testbin with these error messages:

[----------] 2 tests from BatchReindexLayerTest/3, where TypeParam = caffe::GPUDevice<double>
[ RUN      ] BatchReindexLayerTest/3.TestForward
[       OK ] BatchReindexLayerTest/3.TestForward (3 ms)
[ RUN      ] BatchReindexLayerTest/3.TestGradient
./include/caffe/test/test_gradient_check_util.hpp:175: Failure
The difference between computed_gradient and estimated_gradient is 0.68212591193169037, which exceeds threshold_ * scale, where
computed_gradient evaluates to -0.68212591193169037,
estimated_gradient evaluates to 0, and
threshold_ * scale evaluates to 0.01.
debug: (top_id, top_data_id, blob_id, feat_id)=0,0,0,0; feat = -0.18315754813835761; objective+ = 3.4152213335738013; objective- = 3.4152213335738013
...

After countless Caffe compilations and tests, I finally found a workaround for this problem: I add the line NVCCFLAGS += -G to the Makefile, changing it from

...
# Debugging
ifeq ($(DEBUG), 1)
        COMMON_FLAGS += -DDEBUG -g -O0
        NVCCFLAGS += -G
else
        COMMON_FLAGS += -DNDEBUG -O2
endif
...

to

...
# Debugging
ifeq ($(DEBUG), 1)
        COMMON_FLAGS += -DDEBUG -g -O0
        NVCCFLAGS += -G
else
        COMMON_FLAGS += -DNDEBUG -O2
        NVCCFLAGS += -G
endif
...

Then, after compiling Caffe again, make runtest passes without failure!
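
To verify the workaround without rerunning the whole suite, you can rebuild and then invoke just the affected test binary directly. A sketch reusing the paths from the commands above; --gtest_filter is gtest's standard test-selection flag:

make clean
PATH=/cad/anaconda3/bin:/usr/bin make -j8
PATH=/cad/anaconda3/bin:/usr/bin make -j8 test
LD_LIBRARY_PATH=/usr/local/cuda/lib64 ./build/test/test_batch_reindex_layer.testbin --gtest_filter='*TestGradient*'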

I think the problem is probably unrelated to Python and the x86_64-conda_cos6-linux-gnu-g++ compiler, but related to nvcc (CUDA 9.1), so it might be reproducible with other g++ compilers as well. In fact, I have seen someone hit the same problem on Ubuntu 16.04 + CUDA 9.1. I hope someone can fix this.

Noiredd changed the title from "Testing Caffe Fail with CUDA 9.1" to "BatchReindexLayer fails GPU gradient tests under CUDA v9.1" on Jan 11, 2018
Noiredd added the bug label on Jan 11, 2018
Noiredd (Member) commented Jan 11, 2018

Confirmed on a standard Ubuntu 16.04 build both by myself (with GCC 5.4.0 and NVCC 9.1.85) and others: first in #6140, but also on caffe-users (thread1, thread2, thread3, thread 4).

Your workaround is to add a -G flag to NVCC even for the standard build, correct? This flag causes generation of debug information for GPU code and disables all optimizations [ref] - the latter effect seems more relevant.
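
If you want to double-check that the flag actually reaches nvcc after editing the Makefile, a dry run is enough. A sketch relying only on GNU make's -n option, which prints recipes without running them; output details depend on the Makefile version:

make clean
make -n 2>/dev/null | grep -m 1 'nvcc '    # the first printed nvcc command should now include -G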

MrMYHuang (Author) commented Jan 12, 2018

Hi Noiredd,

Your workaround is to add a -G flag to NVCC even for the standard build, correct?

Yes.

Additionally, I found another runtest failure, which is unrelated to NVCCFLAGS += -G but related to OpenCV 3.3.1: if I enable OpenCV (by commenting out USE_OPENCV := 0), runtest fails at
./build/test/test_net.testbin with these error messages:

Cuda number of devices: 1
Current device id: 0
Current device name: GeForce GTX 1070
[==========] Running 124 tests from 5 test cases.
[----------] Global test environment set-up.
[----------] 26 tests from NetTest/0, where TypeParam = caffe::CPUDevice<float>
[ RUN      ] NetTest/0.TestHasBlob
[       OK ] NetTest/0.TestHasBlob (593 ms)
[ RUN      ] NetTest/0.TestGetBlob
[       OK ] NetTest/0.TestGetBlob (2 ms)
...
[ RUN      ] NetTest/0.TestSharedWeightsResume
[       OK ] NetTest/0.TestSharedWeightsResume (0 ms)
[ RUN      ] NetTest/0.TestParamPropagateDown
[       OK ] NetTest/0.TestParamPropagateDown (1 ms)
[ RUN      ] NetTest/0.TestFromTo
src/caffe/test/test_net.cpp:1446: Failure
Value of: *loss_ptr
  Actual: 6.95498
Expected: loss
Which is: 6.94028
src/caffe/test/test_net.cpp:1446: Failure
Value of: *loss_ptr
  Actual: 6.95498
Expected: loss
Which is: 6.94028
[  FAILED  ] NetTest/0.TestFromTo, where TypeParam = caffe::CPUDevice<float> (3 ms)
[ RUN      ] NetTest/0.TestReshape
...
[  FAILED  ] 2 tests, listed below:
[  FAILED  ] NetTest/0.TestFromTo, where TypeParam = caffe::CPUDevice<float>
[  FAILED  ] NetTest/1.TestFromTo, where TypeParam = caffe::CPUDevice<double>

neilpanchal commented Jan 14, 2018

Confirmed, the following tests fail on CUDA 9.1 and cuDNN 7.

[  FAILED  ] 2 tests, listed below:
[  FAILED  ] BatchReindexLayerTest/2.TestGradient, where TypeParam = N5caffe9GPUDeviceIfEE
[  FAILED  ] BatchReindexLayerTest/3.TestGradient, where TypeParam = N5caffe9GPUDeviceIdEE

I was able to pass the tests by following @MrMYHuang's suggestion to add NVCCFLAGS += -G.

@srivathsapv

@MrMYHuang's suggestion worked. You have to add NVCCFLAGS += -G to the Makefile and run

$ make clean && make all && make test && make runtest

MrMYHuang (Author) commented Jan 19, 2018

I submitted a bug report to NVIDIA. An NVIDIA staff member replied that the CUDA development team has identified this CUDA 9.1 issue and is planning to fix it in the next release. Until then, they suggest using CUDA 9.0.

yssaya commented Jan 22, 2018

I also had this problem.

CUDA 9.1 + cuDNN 7.0.5 + caffe [ FAILED ] 1 tests. mnist OK. my net failed
CUDA 9.0 + cuDNN 7.0.5 + caffe [ FAILED ] 2 tests. mnist OK. my net failed
CUDA 8.0 + cuDNN 7.0.5 + caffe [ PASSED ] 2123 tests. mnist OK. my net failed
CUDA 8.0 + cuDNN 6.0.21+ caffe [ PASSED ] 2123 tests. mnist OK. my net failed
CUDA 8.0 + cuDNN 5.0.5 + caffe-rc5 [ PASSED ] 2123 tests. mnist OK. my net OK.
CUDA 8.0 + cuDNN 6.0.21+ caffe-1.0 [ PASSED ] 2123 tests. mnist OK. my net OK.

GTX 1080, Ubuntu 16.04.3, Driver Version: 387.34, i7 980X 3.3GHz, P6T-SE, RAM 6GB

CUDA 8.0 + cuDNN 7.0.5 passed make runtest and passed Caffe's MNIST training,
but my net's training failed: its accuracy jumped to 100% around 300 training iterations.
Training on the CPU was OK. I gave up on the latest Caffe. Finally,
CUDA 8.0 + cuDNN 6.0.21 + caffe-1.0 was OK.

My net is for computer Go; it predicts the next move.
It has 12 conv layers, 128 channels, kernel_size 3, and no batch normalization.

@MacwinWin

Confirmed, the following tests fail on Ubuntu 17.10, CUDA 9.1 and cuDNN 7:

[  FAILED  ] 2 tests, listed below:
[  FAILED  ] BatchReindexLayerTest/2.TestGradient, where TypeParam = caffe::GPUDevice<float>
[  FAILED  ] BatchReindexLayerTest/3.TestGradient, where TypeParam = caffe::GPUDevice<double>

I was able to pass the tests by following @MrMYHuang's suggestion to add the NVCCFLAGS += -G line.

bdmccord commented Feb 4, 2018

I am also using CUDA 9.1 and cuDNN v7.0.5 and can confirm this failure. I actually came here to post another test failure I had, but as I was about to, I noticed that disabling multi-GPU fixed that failure and exposed this one. I will post that in a separate issue, though.

Edit: actually, after unsetting the CUDA_VISIBLE_DEVICES variable, the other issue I am referring to oddly no longer occurs. I won't open an issue for it until I can get the log to be generated again; I might not have re-enabled multi-GPU support properly.

ghost commented Mar 3, 2018

Unfortunately, even with the latest nvcc Patch 2 release, the problem with BatchReindexLayerTest/3.TestGradient, where TypeParam = caffe::GPUDevice<double>, still persists.

xkszltl commented Apr 4, 2018

The issue still exists with a physical machine + single Pascal GPU + CentOS 7 + nvcc 9.1.85 + cuDNN 7.0.5.

evilmtv commented Apr 8, 2018

The issue exists for me too:
Physical machine
Ubuntu 16.04
NVIDIA driver: 390.48
CUDA: 9.1.85 + Patches 1, 2, 3
cuDNN: v7.1.2

Got:

[  FAILED  ] 1 test, listed below:
[  FAILED  ] BatchReindexLayerTest/3.TestGradient, where TypeParam = caffe::GPUDevice<double>

After adding NVCCFLAGS += -G as the OP suggested, there are no errors and all tests pass.

But what does adding this flag mean for us, i.e. are the optimizations only disabled during make, or completely?

weinman (Contributor) commented Apr 20, 2018

I too had this error with @evilmtv's setup (except on Ubuntu 14.04). Following up on @Noiredd's observation, I wanted to see whether this problem could be fixed by only changing the optimization level with the --optimize flag (rather than the -G flag).

Short answer: no. -G is the needed workaround.

After changing Makefile.config so that NVCCFLAGS += --optimize 0 (or NVCCFLAGS += -O0) and removing the -O2 entry from COMMON_FLAGS in the Makefile (line 322) to avoid an error caused by repeating the flag, the same tests failed.
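
For reference, the experiment above amounts to roughly this variant of the Debugging block quoted earlier (a sketch; exact line numbers and placement differ between Makefile.config and the Makefile across commits, and this variant did not fix the failing tests):

...
# Debugging
ifeq ($(DEBUG), 1)
        COMMON_FLAGS += -DDEBUG -g -O0
        NVCCFLAGS += -G
else
        COMMON_FLAGS += -DNDEBUG          # -O2 dropped so nvcc does not receive the flag twice
        NVCCFLAGS += -O0                  # or --optimize 0; the gradient tests still fail
endif
...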

ghost commented Apr 20, 2018

The same problem occurred when compiling under Gentoo Linux with
gcc 6.4.0
CUDA 9.1.85
glibc 2.26-r6
and Caffe compiled without Python support.

NVCCFLAGS += -G fixed it.

lubagov commented May 3, 2018

GPU: Nvidia GT 1030
Ubuntu 16.04, kernel 4.10.0-28-generic
Driver: 387.34
caffe: commit 8645207
CUDNN: 7.0.5.15-1+cuda9.1
CuBLAS: 9.1.85.3-1
Cuda-NVCC: 9.1.85.2-1

4 tests failed:
[==========] 2199 tests from 285 test cases ran. (343419 ms total)
[  PASSED  ] 2195 tests.
[  FAILED  ] 4 tests, listed below:
[  FAILED  ] NetTest/0.TestFromTo, where TypeParam = caffe::CPUDevice<float>
[  FAILED  ] NetTest/1.TestFromTo, where TypeParam = caffe::CPUDevice<double>
[  FAILED  ] BatchReindexLayerTest/2.TestGradient, where TypeParam = caffe::GPUDevice<float>
[  FAILED  ] BatchReindexLayerTest/3.TestGradient, where TypeParam = caffe::GPUDevice<double>

I tried adding the -G option to the Makefile, but it does not fix it; in any case, 3 tests still failed.
I tried training ResNet-34 and ResNet-18 networks (on only 6 images, to make it faster) and then tried to run them on the CPU. They do not work after training. But MNIST works normally, and a simplified bottleneck ResNet-50 works too. I don't know whether this is related to these unit tests or not.

lubagov commented May 4, 2018

I rebuilt with CUDA 8.0 + cuDNN 6.0.21, with OpenCV disabled, and all tests passed.
Note that before, I was using OpenCV 2.4 from the Ubuntu repo, not 3.3.1.
Of course, without OpenCV I don't have the ImageData layer, which uses imread and cv::Mat to load image files, and that is not good for me.

OK, I have now resolved all the test failures, and training/running all my networks works fine, the same on CPU and on GPU.
[  FAILED  ] NetTest/0.TestFromTo, where TypeParam = caffe::CPUDevice<float>
[  FAILED  ] NetTest/1.TestFromTo, where TypeParam = caffe::CPUDevice<double>

These 2 failures were caused by the latest MKL, 2018.2.199. When I replaced it with ATLAS, everything works fine and these 2 tests pass.
CUDA 8.0 + cuDNN 6.0.21 + ATLAS 3.10.2-9 works for me.
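
The BLAS swap described above is a one-line change in Makefile.config (a sketch based on the stock Makefile.config options; the commented BLAS_INCLUDE/BLAS_LIB overrides are only needed if ATLAS is installed in a non-standard location):

# Makefile.config: pick the BLAS backend (atlas, mkl, or open)
# BLAS := mkl
BLAS := atlas
# BLAS_INCLUDE := /path/to/atlas/include
# BLAS_LIB := /path/to/atlas/lib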

@cdluminate (Contributor)

Confirmed.
Debian Sid,
GCC 6 / CUDA 9.1 / NVIDIA driver 390.48

@cdluminate (Contributor)

Has anyone tested CUDA 9.2?

xkszltl commented May 27, 2018

@cdluminate All tests passed with latest commit + CUDA 9.2 + gcc 7.3.1

@cdluminate (Contributor)

@xkszltl Thanks. That means I can remove the temporary fix from Debian/Ubuntu's pre-built binary package once CUDA 9.2 is available. With -G enabled for nvcc, the performance drop looks significant...

xkszltl commented May 28, 2018

@cdluminate
Don't....simply trust me....
Experience may vary by system and...luck...๑乛◡乛๑

BTW I'm on CentOS

jeiks commented Jun 15, 2018

Not working here with the latest commit + libcudnn7 (7.1.4.18-1+cuda9.2) + CUDA 9.2 + gcc 5.4.
=/

@MrMYHuang (Author)

All tests passed with commit 8645207 + CentOS 7.5.1804 + CUDA 9.2 + CUDNN 7.1 + gcc 4.8.5!

@meriem87

Not working for me. Details of the problem are in the following link:
#6686

xkszltl commented Jan 30, 2019

@meriem87 Your issue looks unrelated to this one.

@Swjtu-only

$ make clean && make all && make test && make runtest

Are you root?

jeiks commented Feb 5, 2020 via email

No, you don't need to be root; a normal user with GPU access can run them.

@Swjtu-only

Thanks. Running as a normal user with GPU access, I ran into permission issues, so I decided to run the commands one by one, and luckily everything is OK for me.
