【Hackathon 6th No.34】support return micro batch loss for dygraph train_batch #64218
Conversation
Your PR has been submitted successfully. Thank you for your contribution to the open-source project!
@ForFishes CI is all clear now; could you please help review this PR?
LGTM
@AndSonder Please take a look at the coverage.
@luotao1 Requesting an exemption for PR-CI-Coverage. Covering PipelineParallelWithInterleave would require adding the unit test to test_parallel_dygraph_pipeline_parallel_with_virtual_stage, but that test cannot run in CI's 2-GPU environment.
…n_batch (PaddlePaddle#64218)
* support return micro batch loss
* fix codestyle
* recover some code
PR Category
Auto Parallel
PR Types
New features
Description
Support returning per-micro-batch loss from train_batch in dygraph pipeline-parallel training.
The main idea is to change the accumulation strategy of self.total_loss so that it stores the loss of every micro batch. When the switch is enabled, the stored losses are merged into a single tensor and returned; otherwise the losses are merged (averaged) following the original logic.
When the switch is enabled, the loss broadcast to the other ranks also contains all micro-batch losses. The loss that participates in the backward computation is still the original per-micro-batch loss, so that part of the logic is unchanged.
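Below is a minimal sketch of the idea, not the actual PR implementation. The class, method names, and the `return_micro_batch_loss` switch are hypothetical placeholders; only `self.total_loss` comes from the description above.

```python
# Sketch only: illustrates storing per-micro-batch losses and choosing between
# "concat all losses" and "average" when producing the final loss. Names such
# as `return_micro_batch_loss` and `_merge_and_broadcast_loss` are hypothetical.
import paddle
import paddle.distributed as dist


class PipelineParallelSketch:
    def __init__(self, return_micro_batch_loss=False):
        # Hypothetical switch; the real PR may expose a different option name.
        self.return_micro_batch_loss = return_micro_batch_loss
        # total_loss now stores every micro batch's loss instead of a running sum.
        self.total_loss = None

    def _record_micro_batch_loss(self, loss):
        # backward() is still driven by the original per-micro-batch `loss`,
        # so the gradient path is unchanged; we only record a detached copy.
        if self.total_loss is None:
            self.total_loss = []
        self.total_loss.append(loss.detach())

    def _merge_and_broadcast_loss(self, src_rank, group=None):
        if self.return_micro_batch_loss:
            # Switch on: return one entry per micro batch as a single tensor.
            final_loss = paddle.concat(
                [l.reshape([1]) for l in self.total_loss]
            )
        else:
            # Switch off: original behavior, average over all micro batches.
            final_loss = paddle.add_n(self.total_loss) / len(self.total_loss)
        # The same tensor (averaged scalar or per-micro-batch vector) is
        # broadcast to the other pipeline ranks.
        dist.broadcast(final_loss, src=src_rank, group=group)
        return final_loss
```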