[blob.cpp:289] Check failed: error == cudaSuccess (700 vs. 0) an illegal memory access was encountered - Training with 4 of 8 GPUs fails #574

beanliao opened this issue Jul 17, 2019 · 4 comments

beanliao commented Jul 17, 2019

I found that if I use 4 out of the 8 GPUs, training fails:
#caffe train --solver=models/bvlc_googlenet/solver_fp16_4.prototxt -gpu=4,5,6,7
Error message:
F0717 00:12:37.478988 604 blob.cpp:289] Check failed: error == cudaSuccess (700 vs. 0) an illegal memory access was encountered (2)

A workaround is to add "CUDA_VISIBLE_DEVICES=4,5,6,7" in front of "caffe train ...".
Note: I have checked that this is not an out-of-memory problem, because training with "-gpu=0,1,2,3" works fine.
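For example, the workaround looks roughly like the line below (assuming the -gpu flag is then given the remapped IDs, since CUDA_VISIBLE_DEVICES renumbers the visible GPUs starting from 0):
#CUDA_VISIBLE_DEVICES=4,5,6,7 caffe train --solver=models/bvlc_googlenet/solver_fp16_4.prototxt -gpu=0,1,2,3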

I hope someone can look into this issue. Thanks in advance.

Info:
NVIDIA Docker image: Caffe 19.06
NVCaffe: 0.17.3
cuDNN: 7.6.0
NCCL: 2.4.7
Model: bvlc_googlenet
Batch size: 256

More logs:
I0717 00:12:36.297857 545 data_layer.cpp:107] [n0.d4.r0] Transformer threads: 4 (auto)
I0717 00:12:36.389331 609 internal_thread.cpp:78] Started internal thread 609 on device 4, rank 0
I0717 00:12:36.389572 609 db_lmdb.cpp:36] Opened lmdb examples/imagenet/ilsvrc12_train_lmdb
I0717 00:12:36.399473 600 internal_thread.cpp:78] Started internal thread 600 on device 4, rank 0
I0717 00:12:36.405875 599 internal_thread.cpp:78] Started internal thread 599 on device 4, rank 0
I0717 00:12:36.408145 598 internal_thread.cpp:78] Started internal thread 598 on device 4, rank 0
I0717 00:12:36.409735 601 internal_thread.cpp:78] Started internal thread 601 on device 4, rank 0
F0717 00:12:37.478988 604 blob.cpp:289] Check failed: error == cudaSuccess (700 vs. 0) an illegal memory access was encountered (2)
*** Check failure stack trace: ***
I0717 00:12:37.488199 597 blocking_queue.cpp:40] Waiting for datum
F0717 00:12:37.490514 589 syncedmem.cpp:18] Check failed: error == cudaSuccess (700 vs. 0) an illegal memory access was encountered
*** Check failure stack trace: ***
@ 0x7fa8cf9345cd google::LogMessage::Fail()
@ 0x7fa8cf9345cd google::LogMessage::Fail()
@ 0x7fa8cf936433 google::LogMessage::SendToLog()
@ 0x7fa8cf936433 google::LogMessage::SendToLog()
@ 0x7fa8cf93415b google::LogMessage::Flush()
@ 0x7fa8cf93415b google::LogMessage::Flush()
@ 0x7fa8cf936e1e google::LogMessageFatal::~LogMessageFatal()
@ 0x7fa8cf936e1e google::LogMessageFatal::~LogMessageFatal()
F0717 00:12:37.490514 589 syncedmem.cpp:18] Check failed: error == cudaSuccess (700 vs. 0) an illegal memory access was encounteredF0717 00:12:37.506527 593 syncedmem.cpp:18] Check failed: error == cudaSuccess (700 vs. 0) an illegal memory access was encountered
*** Check failure stack trace: ***
@ 0x7fa8cf9345cd google::LogMessage::Fail()
@ 0x7fa8d0359052 caffe::Blob::CopyFrom()
@ 0x7fa8cf936433 google::LogMessage::SendToLog()
@ 0x7fa8cf93415b google::LogMessage::Flush()
@ 0x7fa8d07d50d2 caffe::SyncedMemory::MallocHost()
@ 0x7fa8cf936e1e google::LogMessageFatal::~LogMessageFatal()
@ 0x7fa8d07dbbcb caffe::BatchTransformer<>::InternalThreadEntry()
@ 0x7fa8d07d50d2 caffe::SyncedMemory::MallocHost()
@ 0x7fa8d07d5140 caffe::SyncedMemory::to_cpu()
@ 0x7fa8d07d5140 caffe::SyncedMemory::to_cpu()
@ 0x7fa8d02bdbb2 caffe::InternalThread::entry()
@ 0x7fa8d07d64fd caffe::SyncedMemory::mutable_cpu_data()
@ 0x7fa8d07d64fd caffe::SyncedMemory::mutable_cpu_data()
@ 0x7fa8d02bfc2f boost::detail::thread_data<>::run()
@ 0x7fa8cdcaf5d5 (unknown)
@ 0x7fa8d071ff23 caffe::DataLayer<>::load_batch()
@ 0x7fa8d071ff23 caffe::DataLayer<>::load_batch()
@ 0x7fa8cd5686ba start_thread
@ 0x7fa8cdfcb41d clone
@ (nil) (unknown)

@drnikolaev

@beanliao please attach your prototxt files if possible.

@beanliao (Author)

Please refer to the files below. Thanks.

solver_fp16_4.prototxt.txt
train_val_fp16_4.prototxt.txt

@drnikolaev

@beanliao thank you. Could you please run
nvidia-smi topo -m
and
nvidia-smi topo -p2p n
and paste the outputs here?

@beanliao (Author)

@drnikolaev Thanks for checking this.
Here are the outputs:

nvidia-smi topo -m:

        GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7
GPU0     X    PIX   PXB   PXB   SYS   SYS   SYS   SYS
GPU1    PIX    X    PXB   PXB   SYS   SYS   SYS   SYS
GPU2    PXB   PXB    X    PIX   SYS   SYS   SYS   SYS
GPU3    PXB   PXB   PIX    X    SYS   SYS   SYS   SYS
GPU4    SYS   SYS   SYS   SYS    X    PIX   PXB   PXB
GPU5    SYS   SYS   SYS   SYS   PIX    X    PXB   PXB
GPU6    SYS   SYS   SYS   SYS   PXB   PXB    X    PIX
GPU7    SYS   SYS   SYS   SYS   PXB   PXB   PIX    X

nvidia-smi topo -p2p n:

        GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7
GPU0     X    NS    NS    NS    NS    NS    NS    NS
GPU1    NS     X    NS    NS    NS    NS    NS    NS
GPU2    NS    NS     X    NS    NS    NS    NS    NS
GPU3    NS    NS    NS     X    NS    NS    NS    NS
GPU4    NS    NS    NS    NS     X    NS    NS    NS
GPU5    NS    NS    NS    NS    NS     X    NS    NS
GPU6    NS    NS    NS    NS    NS    NS     X    NS
GPU7    NS    NS    NS    NS    NS    NS    NS     X
