Training stops after computing the result for only 1 epoch #121

Open
EvanHan09 opened this issue Jun 3, 2019 · 5 comments

Comments

@EvanHan09

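Has the author ever run into this: after starting training with python run_cnn.py train, only the result for 1 epoch is computed and then training stops?
I checked the GPU memory usage and found no memory leak. I then tried two GPU memory allocation modes: (1) allocating a 0.4 fraction of GPU memory, and (2) automatic on-demand allocation. The result was the same as above in both cases: training stopped after a single epoch.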
Configuring TensorBoard and Saver...
Loading training and validation data...
Time usage: 0:00:11
2019-06-03 11:40:30.224462: I c:\users\user\source\repos\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1405] Found device 0 with properties:
name: GeForce RTX 2060 major: 7 minor: 5 memoryClockRate(GHz): 1.71
pciBusID: 0000:01:00.0
totalMemory: 6.00GiB freeMemory: 4.89GiB
2019-06-03 11:40:30.237900: I c:\users\user\source\repos\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1484] Adding visible gpu devices: 0
2019-06-03 11:40:30.996786: I c:\users\user\source\repos\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-06-03 11:40:31.005045: I c:\users\user\source\repos\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:971] 0
2019-06-03 11:40:31.010727: I c:\users\user\source\repos\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:984] 0: N
2019-06-03 11:40:31.015885: I c:\users\user\source\repos\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2457 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2060, pci bus id: 0000:01:00.0, compute capability: 7.5)
Training and evaluating...
Epoch: 1
Iter: 0, Train Loss: 2.3, Train Acc: 10.94%, Val Loss: 2.3, Val Acc: 10.02%, Time: 0:00:02 *
Could you help explain this?
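For readers unsure what the two allocation modes mentioned above look like in code, here is a minimal sketch using the TensorFlow 1.x session API. The helper names `gpu_option_kwargs` and `make_session` are hypothetical and are not taken from run_cnn.py; the 0.4 fraction is the value quoted in the comment. TensorFlow is imported lazily so the option-building part can be checked without a GPU install.

```python
def gpu_option_kwargs(mode, fraction=0.4):
    """Return keyword arguments for tf.GPUOptions (hypothetical helper).

    mode="fraction": reserve a fixed share of GPU memory up front.
    mode="growth":   start small and let the allocation grow on demand.
    """
    if mode == "fraction":
        if not 0.0 < fraction <= 1.0:
            raise ValueError("fraction must be in (0, 1]")
        return {"per_process_gpu_memory_fraction": fraction}
    if mode == "growth":
        return {"allow_growth": True}
    raise ValueError("unknown mode: %r" % (mode,))


def make_session(mode, fraction=0.4):
    """Create a TF 1.x session with the chosen GPU memory policy."""
    # Assumption: TensorFlow 1.x is installed. Under TensorFlow 2.x the same
    # classes live under tf.compat.v1 instead.
    import tensorflow as tf

    gpu_options = tf.GPUOptions(**gpu_option_kwargs(mode, fraction))
    return tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
```

Calling `make_session("fraction")` would correspond to option (1) above and `make_session("growth")` to option (2). Neither setting changes how many epochs are run; they only affect how GPU memory is reserved.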

@EvanHan09
Author

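While debugging I found that execution does not continue past the first line of the code below. Does this "run optimization" step select the model's optimization method? I'm a beginner, so my understanding may be off.
`session.run(model.optim, feed_dict=feed_dict)  # run the optimization step
total_batch += 1

if total_batch - last_improved > require_improvement:
    # validation accuracy has not improved for a long time; stop training early
    print("No optimization for a long time, auto-stopping...")
    flag = True
    break  # exit the loop`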
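To make the quoted check easier to reason about, here is a small pure-Python simulation of the same early-stopping rule. It is only a sketch: `run_batches`, `val_acc_at` and `eval_every` are hypothetical names, the real loop would call `session.run(model.optim, ...)` where the comment indicates, and the default of 1000 for `require_improvement` is an assumption rather than a value confirmed in this thread.

```python
def run_batches(val_acc_at, num_batches, require_improvement=1000, eval_every=100):
    """Simulate the early-stopping rule from the excerpt.

    val_acc_at(batch) returns the validation accuracy measured at that batch.
    Returns (batches_run, stopped_early).
    """
    best_acc_val = 0.0
    last_improved = 0
    total_batch = 0
    while total_batch < num_batches:
        if total_batch % eval_every == 0:
            acc = val_acc_at(total_batch)
            if acc > best_acc_val:
                # Remember the batch at which validation accuracy last improved.
                best_acc_val = acc
                last_improved = total_batch
        # The real training loop would run one optimizer step here.
        total_batch += 1
        if total_batch - last_improved > require_improvement:
            # No improvement for more than require_improvement batches.
            return total_batch, True
    return total_batch, False


# Accuracy stuck at 10%: the rule fires only after more than 1000 stale batches.
print(run_batches(lambda b: 0.10, num_batches=5000))       # (1001, True)
# Accuracy keeps improving: the rule never fires.
print(run_batches(lambda b: b / 10000.0, num_batches=3000))  # (3000, False)
```

Under these assumptions the check cannot end training right after `Iter: 0`, because `total_batch - last_improved` is only 1 at that point. A run that halts immediately after the first log line is therefore more likely to be a crash in the optimization step itself than an early stop.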

@gaussic
Owner

gaussic commented Jun 4, 2019

If you comment out this block, it won't stop.

@EvanHan09
Author

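If you comment out this block, it won't stop.

Yes, I solved it later. The cause was a configuration problem: after I updated the CUDA driver to 10.0 with the matching tensorflow==1.12.0, training ran normally. One small problem remains: at runtime it often fails with an error saying initialization failed: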
UnknownError (see above for traceback): Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above. [[node Conv2D (defined at <ipython-input-1-1eec26e598ba>:22) = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](gradients/Conv2D_grad/Conv2DBackpropFilter-0-TransposeNHWCToNCHW-LayoutOptimizer, Variable_1/read)]]
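Since the fix reported above was a version mismatch, a quick environment report can help when comparing setups. The sketch below is an illustration only: `describe_tf_environment` is a hypothetical helper, and it reports just the TensorFlow version and whether the installed build has CUDA support. It does not diagnose the cuDNN initialization error itself.

```python
def describe_tf_environment():
    """Collect basic facts about the installed TensorFlow build."""
    info = {"tensorflow": None, "built_with_cuda": None}
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is not installed; return the empty report.
        return info
    info["tensorflow"] = tf.__version__
    # True if this TensorFlow binary was compiled with CUDA (GPU) support.
    info["built_with_cuda"] = tf.test.is_built_with_cuda()
    return info


if __name__ == "__main__":
    for key, value in describe_tf_environment().items():
        print("%s: %s" % (key, value))
```

Posting this output together with the installed CUDA and cuDNN versions makes it easier for others to check whether the combination matches what the TensorFlow build expects.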

@gaussic
Owner

gaussic commented Jun 6, 2019

I haven't run into that one.

@fanruifeng

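If you comment out this block, it won't stop.

Yes, I solved it later. The cause was a configuration problem: after I updated the CUDA driver to 10.0 with the matching tensorflow==1.12.0, training ran normally. One small problem remains: at runtime it often fails with an error saying initialization failed: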
UnknownError (see above for traceback): Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above. [[node Conv2D (defined at <ipython-input-1-1eec26e598ba>:22) = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](gradients/Conv2D_grad/Conv2DBackpropFilter-0-TransposeNHWCToNCHW-LayoutOptimizer, Variable_1/read)]]

I'm running into this problem now as well. Did you manage to solve it?


3 participants