tf1 upgrade to tf2, tf.distribute.MirroredStrategy core dump #11154

Open
YeBin2018 opened this issue Feb 2, 2024 · 3 comments
Labels: models:official, type:support

Comments
@YeBin2018

When I use the default value of cross_device_ops, it core dumps in jemalloc as shown below. When I choose cross_device_ops=tf.distribute.ReductionToOneDevice(), it still doesn't work; it gets stuck. Does anyone know how to solve this?

```
02-01 21:57:21.377 E0201 21:57:21.377116 2689 log.cpp:10] *** Aborted at 1706795841 (unix time) try "date -d @1706795841" if you are using GNU date ***
02-01 21:57:21.411 E0201 21:57:21.411311 2689 log.cpp:10] PC: @ 0x0 (unknown)
02-01 21:57:21.411 E0201 21:57:21.411706 2689 log.cpp:10] *** SIGSEGV (@0x0) received by PID 117 (TID 0x7f8630c6c640) from PID 0; stack trace: ***
02-01 21:57:21.439 E0201 21:57:21.439801 2689 log.cpp:10] @ 0x7faa6f263520 (unknown)
02-01 21:57:21.496 E0201 21:57:21.496141 2689 log.cpp:10] @ 0x7faa6f33fdac (unknown)
02-01 21:57:21.530 E0201 21:57:21.530472 2689 log.cpp:10] @ 0x7faa6f62b3ac hpa_alloc
02-01 21:57:21.558 E0201 21:57:21.558209 2689 log.cpp:10] @ 0x7faa6f61cec1 je_edata_avail_remove_first
02-01 21:57:21.578 E0201 21:57:21.577888 2689 log.cpp:10] @ 0x7faa6f62a381 hpa_shard_maybe_do_deferred_work
02-01 21:57:21.600 E0201 21:57:21.600236 2689 log.cpp:10] @ 0x7faa6f62ab0b hpa_try_alloc_batch_no_grow
02-01 21:57:21.626 E0201 21:57:21.626035 2689 log.cpp:10] @ 0x7faa6f5c3b12 realloc
02-01 21:57:21.660 E0201 21:57:21.660003 2689 log.cpp:10] @ 0x7faa6f642703 prof_recent_alloc_restore_locked.isra.0
02-01 21:57:21.698 E0201 21:57:21.697870 2689 log.cpp:10] @ 0x7faa6f5b5768 do_rallocx
02-01 21:57:21.722 E0201 21:57:21.722803 2689 log.cpp:10] @ 0x7fa3d072f46c (unknown)
02-01 21:57:21.757 E0201 21:57:21.757550 2689 log.cpp:10] @ 0x7fa3d08098de _ZSt16__introsort_loopIN9__gnu_cxx17__normal_iteratorIPN4brpc10ServerNodeESt6vectorIS3_SaIS3_EEEElNS0_5__ops15_Iter_less_iterEEvT_SB_T0_T1_.isra.0.cold
02-01 21:57:21.770 E0201 21:57:21.769979 2689 log.cpp:10] @ 0x7fa3d0809c37 _ZN4brpc19NamingServiceThread7Actions12ResetServersERKSt6vectorINS_10ServerNodeESaIS3_EE.cold
02-01 21:57:21.779 E0201 21:57:21.779611 2689 log.cpp:10] @ 0x7fa3d0809ec6 _ZN4brpc5PrintERSoP6ssl_stPKc.cold
```
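
For context, a minimal sketch of the configurations being compared, assuming a standard single-host multi-GPU setup. The model and data here are hypothetical placeholders, not the reporter's actual code; only the strategy construction differs between the variants:

```python
import tensorflow as tf

# Option 1: the default. On a single host with multiple GPUs,
# MirroredStrategy uses an NCCL-based all-reduce for cross_device_ops.
# This is the configuration that core dumps in the trace above.
strategy = tf.distribute.MirroredStrategy()

# Option 2: reduce all gradients on one device, then broadcast the
# result back. This is the variant reported to hang instead of crash.
# strategy = tf.distribute.MirroredStrategy(
#     cross_device_ops=tf.distribute.ReductionToOneDevice())

# Option 3 (not tried in this issue): hierarchical copy, sometimes
# used as an alternative when NCCL-based reduction misbehaves.
# strategy = tf.distribute.MirroredStrategy(
#     cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())

with strategy.scope():
    # Hypothetical placeholder model; the real workload is not shown
    # in the issue.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer="sgd", loss="mse")
```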
@YeBin2018 added labels models:official, type:bug on Feb 2, 2024
@YeBin2018 (Author)

I tried using the default value of cross_device_ops, and now it gets stuck, repeatedly printing the log "Local rendezvous recv item cancelled. Key hash: 15504120126296904051". Does anyone know anything about this?
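
The message comes from TensorFlow's C++ runtime (tensorflow/core/framework/local_rendezvous.cc). A hedged debugging sketch for surfacing more context around it, using TensorFlow's standard TF_CPP_* logging variables; note the module name passed to TF_CPP_VMODULE is an assumption based on the file name in the log line:

```python
import os

# These variables are read when TensorFlow's C++ runtime loads, so they
# must be set before the first `import tensorflow`.
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "0"             # keep INFO messages visible
os.environ["TF_CPP_VMODULE"] = "local_rendezvous=2"  # extra VLOG detail from that file (assumed module name)

import tensorflow as tf  # noqa: E402
```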

@laxmareddyp (Collaborator) commented Feb 2, 2024

Hi @YeBin2018 ,

Could you please provide reproducible code or a Colab notebook, along with your environment details, so we can get a complete understanding of the issue you are facing? Meanwhile, for support-related issues, consider seeking assistance on the dedicated research models category of the TensorFlow Forum or on Stack Overflow. These forums benefit from a large user base, increasing the potential for a swift resolution to your technical inquiry.

Thanks

@laxmareddyp added labels stat:awaiting response, type:support and removed type:bug on Feb 2, 2024
@YeBin2018 (Author)

Sorry, it is not convenient to provide the source code because it may involve company secrets. Our environment: H800 machines with eight GPUs each, using an all-reduce architecture. The TensorFlow version is 2.14, running in the Docker image provided by NVIDIA. I want to know what it means when this log is printed repeatedly, because looking at the TensorFlow source code, it is difficult to trace the cause of this line: "I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash:"
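
Two hedged observations, not a confirmed diagnosis: the rendezvous message by itself typically just indicates that a pending Recv was cancelled when a step or session was torn down, so the interesting failure is likely earlier; and the jemalloc profiling frames in the original trace (do_rallocx, prof_recent_alloc_restore_locked) suggest the crash happened inside the preloaded allocator rather than in TensorFlow itself. On an 8-GPU all-reduce setup, NCCL's own logging is often the quickest way to see where a collective stalls. A sketch using NCCL's standard debug variables:

```python
import os

# NCCL reads these at initialization, so set them before importing
# TensorFlow (or export them in the `docker run` environment).
os.environ["NCCL_DEBUG"] = "INFO"        # per-rank init and collective logs
os.environ["NCCL_DEBUG_SUBSYS"] = "ALL"  # widen logging to all NCCL subsystems

import tensorflow as tf  # noqa: E402

# Sanity check: all eight H800 GPUs should be visible in the container.
print(tf.config.list_physical_devices("GPU"))
```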

@google-ml-butler bot removed the stat:awaiting response label on Feb 4, 2024