Distributed Training Randomly Stops During the Training Process #12667
Comments
Is there any chance you could have it generate a Python stack trace, or attach GDB to the process to get a backtrace of all the threads, so we can have a better idea of where the code is getting stuck?
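For reference, one way to obtain such a per-thread Python stack trace from a hung process is to register faulthandler on a signal at startup; this is only a sketch (not something from the original report) and assumes the trainer can be modified and relaunched:

```python
# Hypothetical debugging snippet: dump the Python stack of every thread when
# the process receives SIGUSR1, so a hung worker can be inspected from another
# shell with `kill -USR1 <pid>`.
import signal
import faulthandler  # stdlib in Python 3; a backport package exists for Python 2.7

faulthandler.register(signal.SIGUSR1, all_threads=True)
```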
@jart
Some more information: this problem exists across all TF versions.
@mrry, could you please take a look?
There's nothing obviously wrong with the code you've shown, but without a minimal and complete reproduction, there's almost no chance we'll be able to trigger the same bug, which might be due to transient network connectivity issues between your containers. In general, for long-running training, I would recommend adding a watchdog process that monitors whether progress is still being made (e.g. whether checkpoints are still being written), and that restarts the cluster from a checkpoint when no progress is detected for (e.g.) 2x the checkpoint interval.
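For illustration, a minimal watchdog along those lines might look like the following sketch; CKPT_DIR, CHECKPOINT_INTERVAL_SECS, and restart_cluster() are placeholders rather than anything from this issue:

```python
# Sketch of a checkpoint-progress watchdog: if no new checkpoint appears for
# 2x the expected save interval, assume the cluster hung and restart it.
import glob
import os
import time

CKPT_DIR = "/tmp/train_logs"      # assumed checkpoint directory
CHECKPOINT_INTERVAL_SECS = 600    # assumed save interval of the trainer

_start_time = time.time()

def newest_checkpoint_mtime(ckpt_dir):
    """Modification time of the most recent checkpoint file (or the watchdog start time)."""
    files = glob.glob(os.path.join(ckpt_dir, "model.ckpt-*"))
    if not files:
        return _start_time  # no checkpoint yet; don't trigger a restart immediately
    return max(os.path.getmtime(f) for f in files)

def restart_cluster():
    """Placeholder: kill and relaunch the PS/worker containers here."""
    raise NotImplementedError

while True:
    time.sleep(CHECKPOINT_INTERVAL_SECS)
    stalled_for = time.time() - newest_checkpoint_mtime(CKPT_DIR)
    if stalled_for > 2 * CHECKPOINT_INTERVAL_SECS:
        restart_cluster()
```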
@mrry Besides, when the cluster hangs, checkpoints are no longer generated. And if the cluster is restarted, it works correctly until the next sudden hang. As mentioned by @passerbyd, is there any possibility that this is gRPC's fault?
About the checkpoints.
I am experiencing something quite similar, with TensorFlow 1.4 and Python 2.7. I run 2 workers (master + worker), one on each GPU, and several parameter servers (I tried with 1 and 4). After a couple of hours, the master hangs while the worker keeps working. I ran the master and the worker with the Python debugger; it works for the worker but not for the master, so it seems the master is stopped somewhere in the C code. All processes are using the CPU, but the master doesn't use any GPU and there is no progress in the training (nothing written in the log), so the master may be stuck in a loop somewhere. I have no idea how to debug it further to give you more information.
This is the backtrace where it hangs for me:
There's no such problem in "grpc + verbs" mode. #5394
To anyone facing hangs in distributed mode: there was a bug in the version of gRPC used in TF 1.4 that would cause servers to stop serving after a (non-deterministic) period of time. This has been fixed at HEAD, and TensorFlow now uses a version of gRPC with the fix. I'd recommend trying to reproduce the problem with the latest nightly builds.
@angersson I have been using a nightly build for the last week and the hang has not happened anymore. I am not sure about TF 1.4.1.
@angersson I ran my experiment again after updating; so far it works fine and no hangs have occurred.
Great, thanks for confirming!
@mrry Does the same issue exist with the gRPC version used in TF 1.7.0?
@mrry I have encountered this hang issue 3 times since last week with TF 1.9.0 on an 8x8 GPU cluster.
@chenjiasheng, anything new? I see the same call stack ...
@zrss My colleague has avoided this hanging issue, or at least reduced its probability, by making an explicit call to an MPI barrier at the end of each batch. We don't know why it works, but it does.
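For what it's worth, a minimal sketch of that workaround, assuming mpi4py (the comment does not say which MPI binding was used) and placeholder names for the training loop:

```python
# Sketch: synchronize all ranks at the end of every batch so no process can
# race ahead of the others. session/train_op/num_batches come from the caller.
from mpi4py import MPI

comm = MPI.COMM_WORLD

def train(session, train_op, num_batches):
    for _ in range(num_batches):
        session.run(train_op)  # one training step
        comm.Barrier()         # block until every rank has finished this batch
```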
@chenjiasheng, oh, thanks a lot. We use TF 1.8 with Horovod/NCCL and ran on an 8 * 4 GPU cluster, and it hung in the middle of training with all GPUs at 100% utilization; we printed the backtraces of the threads, and they are very similar to your case ...
@zrss I met the same problem; have you solved it?
System information
Describe the problem
In my distributed training program, there is one server and two workers, each running in a separate nvidia-docker container. At the beginning, the cluster works just fine, but after running normally for several hours, the two workers just stop.
My training process: call the train_replica function below after defining all necessary parts such as cluster_spec, the inference function, the data batch, and so on.

Source code / logs
My trainer function:
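A typical TF 1.x between-graph train_replica of this shape might look like the following illustrative sketch; build_model and input_fn are hypothetical helpers, and this is not the reporter's actual code:

```python
import tensorflow as tf

def train_replica(cluster_def, job_name, task_index, ckpt_dir):
    # cluster_def: dict like {"ps": [...], "worker": [...]} built elsewhere.
    cluster = tf.train.ClusterSpec(cluster_def)
    server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

    if job_name == "ps":
        server.join()  # parameter servers only serve variables
        return

    # Between-graph replication: variables go to the PS, ops to this worker.
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % task_index, cluster=cluster)):
        features, labels = input_fn()          # hypothetical input pipeline
        loss = build_model(features, labels)   # hypothetical model function
        global_step = tf.train.get_or_create_global_step()
        train_op = tf.train.AdamOptimizer(1e-3).minimize(
            loss, global_step=global_step)

    with tf.train.MonitoredTrainingSession(
            master=server.target,
            is_chief=(task_index == 0),
            checkpoint_dir=ckpt_dir) as sess:
        while not sess.should_stop():
            sess.run(train_op)
```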