tf1 upgrade to tf2, tf.distribute.MirroredStrategy core dump #11154

Open
YeBin2018 opened this issue Feb 2, 2024 · 3 comments
Labels: models:official, type:support

Comments
@YeBin2018

When I use the default value of cross_device_ops, it core dumps in jemalloc as shown below. When I choose cross_device_ops=tf.distribute.ReductionToOneDevice(), it still doesn't work; it gets stuck. Does anyone know how to solve this?

```
02-01 21:57:21.377 E0201 21:57:21.377116 2689 log.cpp:10] *** Aborted at 1706795841 (unix time) try "date -d @1706795841" if you are using GNU date ***
02-01 21:57:21.411 E0201 21:57:21.411311 2689 log.cpp:10] PC: @ 0x0 (unknown)
02-01 21:57:21.411 E0201 21:57:21.411706 2689 log.cpp:10] *** SIGSEGV (@0x0) received by PID 117 (TID 0x7f8630c6c640) from PID 0; stack trace: ***
02-01 21:57:21.439 E0201 21:57:21.439801 2689 log.cpp:10] @ 0x7faa6f263520 (unknown)
02-01 21:57:21.496 E0201 21:57:21.496141 2689 log.cpp:10] @ 0x7faa6f33fdac (unknown)
02-01 21:57:21.530 E0201 21:57:21.530472 2689 log.cpp:10] @ 0x7faa6f62b3ac hpa_alloc
02-01 21:57:21.558 E0201 21:57:21.558209 2689 log.cpp:10] @ 0x7faa6f61cec1 je_edata_avail_remove_first
02-01 21:57:21.578 E0201 21:57:21.577888 2689 log.cpp:10] @ 0x7faa6f62a381 hpa_shard_maybe_do_deferred_work
02-01 21:57:21.600 E0201 21:57:21.600236 2689 log.cpp:10] @ 0x7faa6f62ab0b hpa_try_alloc_batch_no_grow
02-01 21:57:21.626 E0201 21:57:21.626035 2689 log.cpp:10] @ 0x7faa6f5c3b12 realloc
02-01 21:57:21.660 E0201 21:57:21.660003 2689 log.cpp:10] @ 0x7faa6f642703 prof_recent_alloc_restore_locked.isra.0
02-01 21:57:21.698 E0201 21:57:21.697870 2689 log.cpp:10] @ 0x7faa6f5b5768 do_rallocx
02-01 21:57:21.722 E0201 21:57:21.722803 2689 log.cpp:10] @ 0x7fa3d072f46c (unknown)
02-01 21:57:21.757 E0201 21:57:21.757550 2689 log.cpp:10] @ 0x7fa3d08098de _ZSt16__introsort_loopIN9__gnu_cxx17__normal_iteratorIPN4brpc10ServerNodeESt6vectorIS3_SaIS3_EEEElNS0_5__ops15_Iter_less_iterEEvT_SB_T0_T1_.isra.0.cold
02-01 21:57:21.770 E0201 21:57:21.769979 2689 log.cpp:10] @ 0x7fa3d0809c37 _ZN4brpc19NamingServiceThread7Actions12ResetServersERKSt6vectorINS_10ServerNodeESaIS3_EE.cold
02-01 21:57:21.779 E0201 21:57:21.779611 2689 log.cpp:10] @ 0x7fa3d0809ec6 _ZN4brpc5PrintERSoP6ssl_stPKc.cold
```
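
For context, a minimal sketch of the configurations being compared, assuming a standard single-host multi-GPU setup. The model and data here are hypothetical placeholders, not the reporter's actual code; only the strategy construction differs between the variants:

```python
import tensorflow as tf

# Option 1: the default. On a single host with multiple GPUs,
# MirroredStrategy uses an NCCL-based all-reduce for cross_device_ops.
# This is the configuration that core dumps in the trace above.
strategy = tf.distribute.MirroredStrategy()

# Option 2: reduce all gradients on one device, then broadcast the
# result back. This is the variant reported to hang instead of crash.
# strategy = tf.distribute.MirroredStrategy(
#     cross_device_ops=tf.distribute.ReductionToOneDevice())

# Option 3 (not tried in this issue): hierarchical copy, sometimes
# used as an alternative when NCCL-based reduction misbehaves.
# strategy = tf.distribute.MirroredStrategy(
#     cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())

with strategy.scope():
    # Hypothetical placeholder model; the real workload is not shown
    # in the issue.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer="sgd", loss="mse")
```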
@YeBin2018 added labels models:official, type:bug on Feb 2, 2024
@YeBin2018 (Author)

I tried using the default value of cross_device_ops, and now it gets stuck, repeatedly printing the log "Local rendezvous recv item cancelled. Key hash: 15504120126296904051". Does anyone know anything about this?
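
The message comes from TensorFlow's C++ runtime (tensorflow/core/framework/local_rendezvous.cc). A hedged debugging sketch for surfacing more context around it, using TensorFlow's standard TF_CPP_* logging variables; note the module name passed to TF_CPP_VMODULE is an assumption based on the file name in the log line:

```python
import os

# These variables are read when TensorFlow's C++ runtime loads, so they
# must be set before the first `import tensorflow`.
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "0"             # keep INFO messages visible
os.environ["TF_CPP_VMODULE"] = "local_rendezvous=2"  # extra VLOG detail from that file (assumed module name)

import tensorflow as tf  # noqa: E402
```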

@laxmareddyp (Collaborator) commented Feb 2, 2024

Hi @YeBin2018 ,

Could you please provide reproducible code or a Colab notebook, along with your environment details, so we can get a complete understanding of the issue you are facing? Meanwhile, for support-related issues, consider seeking assistance on the dedicated research models category of the TensorFlow Forum or on Stack Overflow. These forums benefit from a large user base, increasing the potential for a swift resolution to your technical inquiry.

Thanks

@laxmareddyp added labels stat:awaiting response, type:support and removed type:bug on Feb 2, 2024
@YeBin2018 (Author)

Sorry, it is not convenient to provide the source code because it may involve company secrets. Our environment: H800 machines with eight GPUs each, using an all-reduce architecture. The TensorFlow version is 2.14, running in the Docker image provided by NVIDIA. I want to know what it means when this log is printed repeatedly, because looking at the TensorFlow source code, it is difficult to trace the cause of this line: "I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash:"
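
Two hedged observations, not a confirmed diagnosis: the rendezvous message by itself typically just indicates that a pending Recv was cancelled when a step or session was torn down, so the interesting failure is likely earlier; and the jemalloc profiling frames in the original trace (do_rallocx, prof_recent_alloc_restore_locked) suggest the crash happened inside the preloaded allocator rather than in TensorFlow itself. On an 8-GPU all-reduce setup, NCCL's own logging is often the quickest way to see where a collective stalls. A sketch using NCCL's standard debug variables:

```python
import os

# NCCL reads these at initialization, so set them before importing
# TensorFlow (or export them in the `docker run` environment).
os.environ["NCCL_DEBUG"] = "INFO"        # per-rank init and collective logs
os.environ["NCCL_DEBUG_SUBSYS"] = "ALL"  # widen logging to all NCCL subsystems

import tensorflow as tf  # noqa: E402

# Sanity check: all eight H800 GPUs should be visible in the container.
print(tf.config.list_physical_devices("GPU"))
```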

@google-ml-butler bot removed the stat:awaiting response label on Feb 4, 2024