Streaming training demo fails with an error #923

Open
arbitraryking opened this issue May 15, 2023 · 3 comments

Comments

@arbitraryking

I ran the command as described in doc/online_trainer.md:

(py37) E:\PaddleRec\PaddleRec\models\rank\slot_dnn>fleetrun --server_num=1 --worker_num=1 ../../../tools/static_ps_online_trainer.py -m config_online.yaml

Fatal error in launcher: Unable to create process using '"C:\ProgramData\Anaconda3\conda-bld\paddlepaddle-gpu_1676544693779\_h_env\python.exe"  "D:\Anacony37\Scripts\fleetrun.exe" --server_num=1 --worker_num=1 ../../../tools/static_ps_online_trainer.py -m config_online.yaml': ???????????

I checked, and there is no Anaconda3 directory under C:\ProgramData\. I could not find anywhere to configure this python path.
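
The error message shows fleetrun.exe trying to start a python.exe under a conda-bld build directory that does not exist on this machine. That suggests the interpreter path is stored in the launcher executable itself rather than in a configuration file. One possible workaround, untested here, is to skip fleetrun.exe and run the launcher module with the interpreter of the active environment. The sketch below only builds and prints such a command. It assumes that `paddle.distributed.launch` can be run with `python -m` in the installed Paddle version and that it accepts the same arguments as fleetrun. `build_launch_command` is a hypothetical helper, not part of PaddleRec or Paddle.

```python
import sys


def build_launch_command(script, config, server_num=1, worker_num=1):
    """Build a launch command that uses the currently running interpreter.

    Assumption: `python -m paddle.distributed.launch` takes the same
    arguments as the `fleetrun` console script in the installed version.
    """
    return [
        sys.executable,                      # interpreter of the active environment
        "-m", "paddle.distributed.launch",   # assumed module behind fleetrun
        "--server_num={}".format(server_num),
        "--worker_num={}".format(worker_num),
        script,
        "-m", config,
    ]


if __name__ == "__main__":
    cmd = build_launch_command(
        "../../../tools/static_ps_online_trainer.py", "config_online.yaml"
    )
    # Print the command so it can be checked and run manually.
    print(" ".join(cmd))
```

If the printed command starts with the expected interpreter (here, the one from the py37 environment), it can be pasted into the same prompt in place of the fleetrun call.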

@wangzhen38
Collaborator

wangzhen38 commented May 15, 2023

Do you have paddle installed locally? You could first test whether the single-machine version runs.

@arbitraryking
Author

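The single-machine version of the ranking model dnn runs successfully. I have paddlepaddle 2.1.0 and paddlepaddle-gpu 2.4.2.post116 installed.
The single-machine version of slot_dnn fails with an error: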

(py37) E:\PaddleRec\PaddleRec\models\rank\slot_dnn>python -u ../../../tools/static_trainer.py -m config_queuedataset.yaml
2023-05-15 16:32:44,707 - INFO - cpu_num: None
2023-05-15 16:32:44,708 - INFO - **************common.configs**********
2023-05-15 16:32:44,708 - INFO - use_gpu: False, use_xpu: False, use_visual: False, train_batch_size: 2, train_data_dir: data/, epochs: 3, print_interval: 10, model_save_path: output_model_benchdnn_queue
2023-05-15 16:32:44,708 - INFO - **************common.configs**********
2023-05-15 16:32:45,986 - INFO - File list: ['E:\\PaddleRec\\PaddleRec\\models\\rank\\slot_dnn\\data//demo_10']
train file_list: ['E:\\PaddleRec\\PaddleRec\\models\\rank\\slot_dnn\\data//demo_10']
parse ins id: None
utils_path: E:\PaddleRec\PaddleRec\tools\utils\static_ps
abs_train_reader is: E:\PaddleRec\PaddleRec\models\rank\slot_dnn\criteo_reader
pipe_command is: python3.7 queuedataset_reader.py config_queuedataset.yaml E:\PaddleRec\PaddleRec\tools\utils\static_ps
dataset init thread_num: 1
2023-05-15 16:32:45,989 - INFO - Get Train Dataset
dataset get_reader thread_num: 1
2023-05-15 16:32:45,996 - INFO - AUC Reset To Zero: _generated_var_0
2023-05-15 16:32:45,996 - INFO - AUC Reset To Zero: _generated_var_1
2023-05-15 16:32:45,997 - INFO - AUC Reset To Zero: _generated_var_2
2023-05-15 16:32:45,997 - INFO - AUC Reset To Zero: _generated_var_3
2023-05-15 16:32:45,997 - INFO - AUC Reset To Zero: _generated_var_4
device worker program id: 2348362127944
I0515 16:32:46.040287  4596 hogwild_worker.cc:270] worker 0 train cost 0 seconds, batch_num: 0
2023-05-15 16:32:46,048 - INFO - epoch: 0 done, epoch time: 0.05 s
Traceback (most recent call last):
  File "../../../tools/static_trainer.py", line 315, in <module>
    main(args)
  File "../../../tools/static_trainer.py", line 207, in main
    prefix='rec_static')
  File "E:\PaddleRec\PaddleRec\tools\utils\save_load.py", line 61, in save_static_model
    paddle.static.save(program, model_prefix)
  File "D:\Anaconda\envs\py37\lib\site-packages\decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "D:\Anaconda\envs\py37\lib\site-packages\paddle\fluid\wrapped_decorator.py", line 26, in __impl__
    return wrapped_func(*args, **kwargs)
  File "D:\Anaconda\envs\py37\lib\site-packages\paddle\fluid\framework.py", line 558, in __impl__
    return func(*args, **kwargs)
  File "D:\Anaconda\envs\py37\lib\site-packages\paddle\fluid\io.py", line 1876, in save
    param_dict = {p.name: get_tensor(p) for p in parameter_list}
  File "D:\Anaconda\envs\py37\lib\site-packages\paddle\fluid\io.py", line 1876, in <dictcomp>
    param_dict = {p.name: get_tensor(p) for p in parameter_list}
  File "D:\Anaconda\envs\py37\lib\site-packages\paddle\fluid\io.py", line 1872, in get_tensor
    t = global_scope().find_var(var.name).get_tensor()
ValueError: (InvalidArgument) The Variable type must be class phi::DenseTensor, but the type it holds is class phi::SelectedRows.
  [Hint: Expected holder_->Type() == VarTypeTrait<T>::kId, but received holder_->Type():8 != VarTypeTrait<T>::kId:7.] (at ..\paddle/fluid/framework/variable.h:58)
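
The comment above lists two Paddle distributions, paddlepaddle and paddlepaddle-gpu, at different versions. Both install the same `paddle` import package, so it may be worth confirming which ones are actually present in the py37 environment before debugging further. Whether this relates to the SelectedRows error is not established in this thread. The sketch below is a minimal check using the standard `importlib.metadata` API. On Python 3.7 it assumes the `importlib_metadata` backport is installed. The helper names are hypothetical.

```python
try:
    from importlib import metadata  # Python 3.8+
except ImportError:
    # Python 3.7: assumes the importlib_metadata backport is installed.
    import importlib_metadata as metadata

PADDLE_DISTS = ("paddlepaddle", "paddlepaddle-gpu")


def installed_versions(names=PADDLE_DISTS):
    """Return {distribution name: version} for the names that are installed."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass  # this distribution is not installed in the environment
    return found


def has_conflict(found):
    """True when more than one Paddle distribution is present."""
    return len(found) > 1


if __name__ == "__main__":
    found = installed_versions()
    for name, version in found.items():
        print("{}=={}".format(name, version))
    if has_conflict(found):
        print("Both distributions are installed; they share the `paddle` import name.")
```

Running this with the environment's own interpreter reports only what that environment contains, which avoids confusion with other Python installations on the machine.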

@wangzhen38
Collaborator

I'll try to reproduce this first and will fix it promptly once confirmed.
