Distributed Training Randomly Stops During the Training Process #12667
Comments
Is there any chance you could have it generate a Python stack trace, or attach GDB to the process to get a backtrace of all the threads, so we can have a better idea of where the code is getting stuck?
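For reference, one way to obtain such a per-thread Python stack trace from a hung process is to register faulthandler on a signal at startup; this is only a sketch (not something from the original report) and assumes the trainer can be modified and relaunched:

```python
# Hypothetical debugging snippet: dump the Python stack of every thread when
# the process receives SIGUSR1, so a hung worker can be inspected from another
# shell with `kill -USR1 <pid>`.
import signal
import faulthandler  # stdlib in Python 3; a backport package exists for Python 2.7

faulthandler.register(signal.SIGUSR1, all_threads=True)
```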
@jart
Some more information: this problem exists across all TF versions.
@mrry, could you please take a look?
There's nothing obviously wrong with the code you've shown, but without a minimal and complete reproduction, there's almost no chance we'll be able to trigger the same bug, which might be due to transient network connectivity issues between your containers. In general, for long-running training, I would recommend adding a watchdog process that monitors whether progress is still being made (e.g. whether checkpoints are still being written), and that restarts the cluster from a checkpoint when no progress is detected for (e.g.) 2x the checkpoint interval.
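For illustration, a minimal watchdog along those lines might look like the following sketch; CKPT_DIR, CHECKPOINT_INTERVAL_SECS, and restart_cluster() are placeholders rather than anything from this issue:

```python
# Sketch of a checkpoint-progress watchdog: if no new checkpoint appears for
# 2x the expected save interval, assume the cluster hung and restart it.
import glob
import os
import time

CKPT_DIR = "/tmp/train_logs"      # assumed checkpoint directory
CHECKPOINT_INTERVAL_SECS = 600    # assumed save interval of the trainer

_start_time = time.time()

def newest_checkpoint_mtime(ckpt_dir):
    """Modification time of the most recent checkpoint file (or the watchdog start time)."""
    files = glob.glob(os.path.join(ckpt_dir, "model.ckpt-*"))
    if not files:
        return _start_time  # no checkpoint yet; don't trigger a restart immediately
    return max(os.path.getmtime(f) for f in files)

def restart_cluster():
    """Placeholder: kill and relaunch the PS/worker containers here."""
    raise NotImplementedError

while True:
    time.sleep(CHECKPOINT_INTERVAL_SECS)
    stalled_for = time.time() - newest_checkpoint_mtime(CKPT_DIR)
    if stalled_for > 2 * CHECKPOINT_INTERVAL_SECS:
        restart_cluster()
```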
@mrry Besides, when the cluster hangs, checkpoints are no longer generated. And if the cluster is restarted, it works correctly until the next sudden hang. As mentioned by @passerbyd, is there any possibility that this is gRPC's fault?
About the checkpoints.
I am experiencing something quite similar, with TensorFlow 1.4 and Python 2.7. I run 2 workers (master + worker), one on each GPU, and several parameter servers (I tried with 1 and 4). After a couple of hours, the master hangs while the worker keeps working. I ran the master and the worker with the Python debugger; it works for the worker but not for the master, so it seems the master is stopped somewhere in the C code. All processes are using the CPU, but the master doesn't use any GPU and there is no progress in the training (nothing written in the log), so the master may be stuck in a loop somewhere. I have no idea how to debug it further to give you more information.
This is the backtrace where it hangs for me:
There's no such problem in "grpc + verbs" mode. #5394
To anyone facing hangs in distributed mode: there was a bug in the version of gRPC used in TF 1.4 that would cause servers to stop serving after a (non-deterministic) period of time. This has been fixed at HEAD, and TensorFlow now uses a version of gRPC with the fix. I'd recommend trying to reproduce the problem with the latest nightly builds.
@angersson I have been using a nightly build for the last week and the hang has not happened anymore. I am not sure about TF 1.4.1.
@angersson I ran my experiment again after updating; so far it works fine and no hangs have occurred.
Great, thanks for confirming!
@mrry Does the same issue exist with the gRPC version used in TF 1.7.0?
@mrry I have encountered this hang issue 3 times since last week with TF 1.9.0 on an 8x8 GPU cluster.
@chenjiasheng, anything new? I see the same call stack ...
@zrss My colleague has avoided this hanging issue, or at least reduced its probability, by making an explicit call to an MPI barrier at the end of each batch. We don't know why it works, but it does.
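For what it's worth, a minimal sketch of that workaround, assuming mpi4py (the comment does not say which MPI binding was used) and placeholder names for the training loop:

```python
# Sketch: synchronize all ranks at the end of every batch so no process can
# race ahead of the others. session/train_op/num_batches come from the caller.
from mpi4py import MPI

comm = MPI.COMM_WORLD

def train(session, train_op, num_batches):
    for _ in range(num_batches):
        session.run(train_op)  # one training step
        comm.Barrier()         # block until every rank has finished this batch
```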
@chenjiasheng, oh, thanks a lot. We use TF 1.8 with Horovod/NCCL and ran on an 8 * 4 GPU cluster, and it hung in the middle of training with all GPUs at 100% utilization; we printed the backtraces of the threads, and they are very similar to your case ...
@zrss I met the same problem; have you solved it?
System information
Describe the problem
In my distributed training program, there is one server and two workers, each running in a separate nvidia-docker container. At the beginning, the cluster works just fine, but after running normally for several hours, the two workers just stop.
My training process: call the train_replica function below after defining all necessary parts such as cluster_spec, the inference function, the data batch, and so on.

Source code / logs
My trainer function:
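A typical TF 1.x between-graph train_replica of this shape might look like the following illustrative sketch; build_model and input_fn are hypothetical helpers, and this is not the reporter's actual code:

```python
import tensorflow as tf

def train_replica(cluster_def, job_name, task_index, ckpt_dir):
    # cluster_def: dict like {"ps": [...], "worker": [...]} built elsewhere.
    cluster = tf.train.ClusterSpec(cluster_def)
    server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

    if job_name == "ps":
        server.join()  # parameter servers only serve variables
        return

    # Between-graph replication: variables go to the PS, ops to this worker.
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % task_index, cluster=cluster)):
        features, labels = input_fn()          # hypothetical input pipeline
        loss = build_model(features, labels)   # hypothetical model function
        global_step = tf.train.get_or_create_global_step()
        train_op = tf.train.AdamOptimizer(1e-3).minimize(
            loss, global_step=global_step)

    with tf.train.MonitoredTrainingSession(
            master=server.target,
            is_chief=(task_index == 0),
            checkpoint_dir=ckpt_dir) as sess:
        while not sess.should_stop():
            sess.run(train_op)
```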