PS node core dumps during distributed training with DeepRec's estimator #911

Open
supercocoa7654 opened this issue Jul 10, 2023 · 1 comment

supercocoa7654 commented Jul 10, 2023

I built DeepRec and the estimator following the documentation. Starting the PS task for training ends in a core dump:

INFO:tensorflow:TF_CONFIG environment variable: {'cluster': {'chief': ['127.0.0.1:2222'], 'ps': ['127.0.0.1:2223'], 'worker': ['127.0.0.1:2224']}, 'task': {'index': 0, 'type': 'ps'}}
INFO:tensorflow:Using config: {'_model_dir': 'easyrec_deepfm', '_tf_random_seed': None, '_save_summary_steps': 1000, '_save_checkpoints_steps': 1000, '_save_checkpoints_secs': None, '_session_config': device_filters: "/job:ps"
gpu_options {
}
allow_soft_placement: true
, '_keep_checkpoint_max': 10, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fb0381fddd8>, '_task_type': 'ps', '_task_id': 0, '_evaluation_master': '', '_master': 'grpc://127.0.0.1:2223', '_num_ps_replicas': 1, '_num_worker_replicas': 2, '_global_id_in_cluster': 2, '_is_chief': False}
I0710 09:41:36.169108 140395967924032 input.py:45] check_mode: False
I0710 09:41:36.169946 140395967924032 input.py:45] check_mode: False
I0710 09:41:36.170253 140395967924032 main.py:173] will use BestExporter, metric is auc, the bigger the better: 1
I0710 09:41:36.171174 140395967924032 input.py:45] check_mode: False
INFO:tensorflow:Not using Distribute Coordinator.
INFO:tensorflow:Start Tensorflow server.
terminate called after throwing an instance of 'std::logic_error'
  what():  basic_string::_M_construct null not valid
Fatal Python error: Aborted

Thread 0x00007fb07bcbc740 (most recent call first):
  File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/training/server_lib.py", line 184 in join
  File "/home/pai/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 688 in run_ps
  File "/home/pai/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 640 in run
  File "/home/pai/lib/python3.6/site-packages/easy_rec/python/compat/estimator_train.py", line 84 in train_and_evaluate
  File "/home/pai/lib/python3.6/site-packages/easy_rec/python/main.py", line 333 in _train_and_evaluate_impl
  File "/home/pai/pyml/src/main.py", line 180 in easyrec_main
  File "/home/pai/lib/python3.6/site-packages/absl/app.py", line 258 in _run_main
  File "/home/pai/lib/python3.6/site-packages/absl/app.py", line 312 in run
  File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/platform/app.py", line 40 in run
  File "/home/pai/pyml/src/main.py", line 196 in <module>
  File "/home/pai/lib/python3.6/runpy.py", line 85 in _run_code
  File "/home/pai/lib/python3.6/runpy.py", line 193 in _run_module_as_main
./ps.sh: line 8:   308 Aborted                 (core dumped)
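
For anyone trying to reproduce this, the cluster layout in the TF_CONFIG line of the log above can be rebuilt roughly as follows. This is only a sketch: `make_tf_config` is a hypothetical helper, not part of DeepRec or EasyRec, and it assumes each process is started with its own TF_CONFIG. Note that the log prints the parsed value as a Python dict, while the environment variable itself has to be a JSON string.

```python
import json
import os

# Cluster layout copied from the TF_CONFIG line in the log above.
CLUSTER = {
    "chief": ["127.0.0.1:2222"],
    "ps": ["127.0.0.1:2223"],
    "worker": ["127.0.0.1:2224"],
}


def make_tf_config(task_type, task_index=0):
    """Hypothetical helper: return the TF_CONFIG JSON string for one task."""
    if task_type not in CLUSTER:
        raise ValueError("unknown task type: %r" % (task_type,))
    if not 0 <= task_index < len(CLUSTER[task_type]):
        raise ValueError("task index %d out of range for %r" % (task_index, task_type))
    return json.dumps({
        "cluster": CLUSTER,
        "task": {"type": task_type, "index": task_index},
    })


# Must be set before the training entry point starts; this is the PS
# process from the log (task type 'ps', index 0).
os.environ["TF_CONFIG"] = make_tf_config("ps", 0)
```

The chief and worker processes would use `make_tf_config("chief", 0)` and `make_tf_config("worker", 0)` in the same way. According to the log, only the PS process aborts, right after "Start Tensorflow server."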

gdb output:

gdb python core
GNU gdb (Ubuntu 8.1.1-0ubuntu1) 8.1.1
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python...done.

warning: core file may not match specified executable file.
[New LWP 338]
[New LWP 308]
[New LWP 315]
[New LWP 318]
[New LWP 319]
[New LWP 320]
[New LWP 321]
[New LWP 324]
[New LWP 326]
[New LWP 341]
[New LWP 342]
[New LWP 346]
[New LWP 347]
[New LWP 348]
[New LWP 349]
[New LWP 350]
[New LWP 356]
[New LWP 343]
[New LWP 339]
[New LWP 340]
[New LWP 333]
[New LWP 332]
[New LWP 331]
[New LWP 330]
[New LWP 329]
[New LWP 328]
[New LWP 327]
[New LWP 325]
[New LWP 323]
[New LWP 322]
[New LWP 317]
[New LWP 316]
[New LWP 314]
[New LWP 312]
[New LWP 334]
[New LWP 335]
[New LWP 336]
[New LWP 337]
[New LWP 344]
[New LWP 345]
[New LWP 353]
[New LWP 358]
[New LWP 361]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `python -m main --algo_lib easyrec --continue_train True --pipeline_config_path'.
Program terminated with signal SIGABRT, Aborted.
#0  raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:51
51	../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
[Current thread is 1 (Thread 0x7fafeeff5700 (LWP 338))]
(gdb) bt
#0  raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  <signal handler called>
#2  0x00007fb07b4d0fb7 in __GI___libc_sigaction (sig=2, act=0x7fafeeff3ff0, oact=0x0) at ../sysdeps/unix/sysv/linux/x86_64/sigaction.c:54
#3  0x0000000000000000 in ?? ()
(gdb)

supercocoa7654 (Author) commented

I tested this: the core dump occurs whenever _train_distribute is None, but stock TF 1.15 does not crash in that case. Could this be fixed for compatibility?
