tensorflow1.7 hangs at LocalMaster::RunStep with tf.train.MonitoredTrainingSession in sync mode #24338
Comments
Same issue on tf-1.12.0. In our case TensorFlow hangs at the last batch of the dataset when the dataset runs for only 1 epoch.
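One possible explanation for a hang at the last batch in sync mode, not confirmed anywhere in this thread: synchronous training needs one gradient contribution from every replica per step, so if one worker exhausts its single-epoch dataset one batch earlier than the others, the remaining workers wait for a contribution that never arrives. The sketch below is only an analogy in plain Python threads. `run_sync_workers` and its parameters are hypothetical names, and a `threading.Barrier` stands in for the aggregation step; it is not how `tf.train.SyncReplicasOptimizer` is implemented.

```python
import threading


def run_sync_workers(batches_per_worker, timeout_s=0.5):
    """Simulate sync training where every step needs all workers.

    batches_per_worker: number of batches each (hypothetical) worker owns.
    Returns one status string per worker.
    """
    num_workers = len(batches_per_worker)
    # The barrier stands in for "wait until all replicas have contributed".
    barrier = threading.Barrier(num_workers)
    status = [None] * num_workers

    def worker(idx, num_batches):
        for step in range(num_batches):
            try:
                # A real sync step would block here with no timeout,
                # which is what shows up as a hang.
                barrier.wait(timeout=timeout_s)
            except threading.BrokenBarrierError:
                status[idx] = "stalled at step %d" % step
                return
        status[idx] = "finished"

    threads = [
        threading.Thread(target=worker, args=(i, n))
        for i, n in enumerate(batches_per_worker)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return status
```

With equal shards, `run_sync_workers([2, 2])` returns `["finished", "finished"]`. With one extra batch on worker 0, `run_sync_workers([3, 2])` returns `["stalled at step 2", "finished"]`: the worker with the extra batch is the one left waiting. If this is what is happening, comparing the number of batches each worker sees would be a reasonable first check.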
@hgadig any advice?
@lxn179208 Could you fill in the issue template here? It will help us find the root cause of the issue. Is it possible for you to try a recent version and check whether the issue persists? Thanks!
Closing due to lack of recent activity. Please update the issue when new information becomes available, and we will open a new issue. Thanks!
I ran into the same problem. gdb backtrace info: (gdb) backtrace
I'm training a GAN on a WaveNet model with tf.train.MonitoredTrainingSession, using 1 ps and 2 workers. It hangs when using tf.train.SyncReplicasOptimizer, but works well in async mode. The vanilla GAN demo also works well in sync mode.
By debugging the chief worker, I found that the master is waiting for a worker to respond. However, I don't know what the worker is doing or which thread is hanging.
So, how can I debug this problem?
More info: TensorFlow 1.7, M40 GPUs. Sorry, I cannot paste the source code.
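One low-cost way to see which step each process is stuck on is to have every worker dump its Python thread stacks periodically with the standard-library `faulthandler` module. This is a minimal sketch, not something suggested by the maintainers in this thread; `start_hang_watchdog` and `stop_hang_watchdog` are hypothetical helper names, and the 60-second interval is an arbitrary assumption.

```python
import faulthandler
import sys


def start_hang_watchdog(interval_s=60.0, log_file=sys.stderr):
    """Dump the stack of every Python thread each interval_s seconds.

    log_file must be a real file object (it needs a file descriptor)
    and must stay open until stop_hang_watchdog() is called.
    """
    faulthandler.dump_traceback_later(interval_s, repeat=True, file=log_file)


def stop_hang_watchdog():
    """Cancel the periodic dump started by start_hang_watchdog()."""
    faulthandler.cancel_dump_traceback_later()
```

Calling `start_hang_watchdog()` near the top of each worker script would show, once the job stalls, which Python call each process is blocked in. The dump covers only Python frames, so it will not reveal what the native TensorFlow threads are doing; for that, attaching gdb to the stuck worker and running `thread apply all bt` is the usual complement.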
chief worker gdb info as follows:
worker 1 gdb info: