Following the tutorial for distributed training with dist.yaml, workers report Unimplemented and Unavailable errors #270

Open
LucasTsui0725 opened this issue Jul 18, 2023 · 4 comments

@LucasTsui0725

企业微信截图_3cc11bd8-b82a-4c38-9547-6bc6cb892963

@LucasTsui0725
Author

企业微信截图_7442d23c-ea11-475f-92dc-b42fb7457457

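@LucasTsui0725 LucasTsui0725 changed the title from "Following the tutorial for distributed training with dist.yaml, workers report an Unimplemented error" to "Following the tutorial for distributed training with dist.yaml, workers report Unimplemented and Unavailable errors" Jul 18, 2023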
@Seventeen17
Collaborator

Could you please let me know which version of the code you are using?

@LucasTsui0725
Author

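I am using graphlearn v1.1.0, installed directly from PyPI, with the imports adjusted as described in #233. The environment is Ubuntu 20.04 + gcc 9.4.0 + Python 3.8.16 + tf 2.4.3. Single-machine training for ego_sage from the examples completes, but distributed training fails.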

@Seventeen17
Collaborator

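Quoting @LucasTsui0725: I am using graphlearn v1.1.0, installed directly from PyPI, with the imports adjusted as described in #233. The environment is Ubuntu 20.04 + gcc 9.4.0 + Python 3.8.16 + tf 2.4.3. Single-machine training for ego_sage from the examples completes, but distributed training fails.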

You can check whether the PS ran out of memory (OOM). You can also add the setting gl.set_retry_times(15).
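
For anyone applying the retry suggestion, here is a minimal sketch of where the call might go. It assumes the setting should be applied in each role's script before the graph is initialized, which this thread does not confirm. The helper name `configure_retries` is hypothetical and not part of graphlearn; only `gl.set_retry_times` comes from the comment above.

```python
# Sketch only: apply the suggested retry setting early in the training script.
# Assumption: this runs before the graph is initialized on each role.
try:
    import graphlearn as gl  # the PyPI package discussed in this issue
except ImportError:
    gl = None  # lets the sketch run where graphlearn is not installed


def configure_retries(gl_module, retry_times=15):
    """Hypothetical helper: call set_retry_times on the given module.

    Returns True if the setting was applied, False if no module was given.
    """
    if gl_module is None:
        return False
    gl_module.set_retry_times(retry_times)
    return True


applied = configure_retries(gl)
print("retry setting applied:", applied)
```

If the errors persist with a higher retry count, the PS memory check mentioned above is the other thing to look at.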
