[fix] fix save eval result failed with multi-node pretrain #678

Open
wants to merge 1 commit into
base: main

HoBeedzc

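When running LLaVA-InternLM2-20B pretraining on 4 nodes with 32 GPUs, every node except the master fails with a FileNotExist error each time the eval stage is reached. After reading the xtuner and mmengine code, I tracked down the problem: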

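In multi-node training, mmengine by default saves log/vis_data and related information only on the master node, so the worker nodes have no vis_data folder. xtuner, however, saves a copy of the eval results on every node, and it does not check that the parent folder exists before opening the file. As a result, every node except the master crashes because the folder does not exist.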
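
The failure mode can be reproduced without xtuner or mmengine. Below is a minimal sketch, not xtuner's actual code: `save_eval_output` and `try_save_on_worker` are hypothetical names, the file name is made up, and a temporary directory stands in for a worker node's work dir that never received a vis_data folder. In Python, opening a file for writing under a missing directory raises `FileNotFoundError`.

```python
import os
import tempfile


def save_eval_output(work_dir, text):
    """Hypothetical helper: writes into vis_data without creating it first."""
    path = os.path.join(work_dir, "vis_data", "eval_outputs.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return path


def try_save_on_worker():
    """Simulate a worker node whose work dir has no vis_data folder."""
    with tempfile.TemporaryDirectory() as work_dir:
        try:
            save_eval_output(work_dir, "sample output")
        except FileNotFoundError as err:
            # The parent folder is missing, so open() fails.
            return type(err).__name__
    return "ok"
```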

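The fix is simple: make sure the results are stored only on the master node (using the master_only decorator provided by mmengine), and before each save call mmengine's mkdir_or_exist interface so that the target folder is checked for and created if it is missing.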
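
The shape of the fix can be sketched with the standard library alone. This is an illustration under stated assumptions, not the patch itself: `master_only_sketch` is a hypothetical stand-in for mmengine's master_only that reads the rank from the `RANK` environment variable, `os.makedirs(..., exist_ok=True)` stands in for mkdir_or_exist, and `save_eval_output` with its parameters is a made-up name.

```python
import functools
import os
import tempfile


def master_only_sketch(func):
    """Stand-in for a master-only decorator: run func on rank 0 only.

    Assumption: the process rank is taken from the RANK environment
    variable and defaults to 0 when it is unset.
    """
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if int(os.environ.get("RANK", "0")) == 0:
            return func(*args, **kwargs)
        return None  # non-master ranks skip the save entirely
    return wrapper


@master_only_sketch
def save_eval_output(save_dir, filename, text):
    """Hypothetical save routine: create the folder, then write the file."""
    os.makedirs(save_dir, exist_ok=True)  # stand-in for mkdir_or_exist
    path = os.path.join(save_dir, filename)
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return path
```

With this structure, a non-master rank returns without touching the filesystem, and the master rank no longer depends on vis_data having been created elsewhere.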
