Support GPU CI #389
Conversation
Thanks for your contribution!
- Comment each branch of the matcher with the case it handles.
- Keep both unit-test pipelines: leave the original CPU pipeline in place (its resource cost is small) and keep the new GPU pipeline until it is fully working.
- For tests that can only run on GPU, see whether there is a cleaner approach; the current `result = None` pattern is a hack.
paconvert/api_matcher.py
Outdated
kwargs["stream"] = "paddle.device.Stream()"
API_TEMPLATE = textwrap.dedent(
    """
    {}(stream = paddle.device.Stream(stream_base={}) if isinstance ({},(paddle.base.core.CUDAStream, paddle.base.core.CustomDeviceStream)) else {})
torch.cuda.Stream should first be converted to paddle.device.cuda.Stream. Is there a bug in the Matcher for torch.cuda.Stream itself that pushes the problem up to the enclosing API to solve?
torch.cuda.Stream is indeed converted to paddle.device.cuda.Stream; that is unrelated to these two. The reason for doing this here is that paddle.device.set_stream(stream=) requires the stream object to have a stream_base attribute. paddle.base.core.CUDAStream and paddle.base.core.CustomDeviceStream have no stream_base attribute, so the stream argument must be a paddle.device.Stream.
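The constraint described here can be sketched without paddle installed. The classes below are hypothetical stand-ins (CUDAStreamBase for paddle.base.core.CUDAStream, Stream for paddle.device.Stream, set_stream for paddle.device.set_stream), only illustrating why the template wraps the raw stream:

```python
class CUDAStreamBase:
    """Stand-in for paddle.base.core.CUDAStream: no stream_base attribute."""

class Stream:
    """Stand-in for paddle.device.Stream, which wraps a raw stream."""
    def __init__(self, stream_base=None):
        self.stream_base = stream_base

def set_stream(stream):
    """Stand-in for paddle.device.set_stream: it reads stream.stream_base."""
    return stream.stream_base

raw = CUDAStreamBase()

# Passing the raw stream directly fails, because it has no stream_base:
try:
    set_stream(raw)
    raised = False
except AttributeError:
    raised = True
assert raised

# The matcher's template therefore wraps it first, mirroring the diff above:
wrapped = Stream(stream_base=raw) if isinstance(raw, CUDAStreamBase) else raw
assert set_stream(wrapped) is raw
```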
torch.cuda.Stream is indeed converted to paddle.device.cuda.Stream; that is unrelated to these two. The reason for doing this here is that paddle.device.set_stream(stream=) requires the stream object to have a stream_base attribute. paddle.base.core.CUDAStream and paddle.base.core.CustomDeviceStream have no stream_base attribute, so the stream argument must be a paddle.device.Stream.
Why is a paddle.base.core.CUDAStream passed in here at all? That is a very low-level concept; the problem should be fixed at its source.
tests/test_device.py
Outdated
@@ -29,7 +29,17 @@ def compare(
    rtol=1.0e-6,
    atol=0.0,
):
    assert str(pytorch_result).replace("cuda", "gpu") == str(paddle_result)
    if isinstance(paddle_result, bool):
Are there really this many possible cases?
Yes. The root cause is that paddle has no API fully equivalent to torch.device. Depending on the situation it may correspond to paddle.CUDAPlace, an int, or a str, so the equality check needs many branches. The device issue has also been recorded in the tracking sheet.
Then just branch directly:
if isinstance(paddle_result, str)
if isinstance(paddle_result, paddle.CPUPlace)
if isinstance(paddle_result, CUDAPlace)
The current branching here is hard to follow.
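The suggested direct dispatch can be sketched with hypothetical stand-ins for paddle.CPUPlace and paddle.CUDAPlace (the comparison rules below are illustrative assumptions, not the suite's actual logic):

```python
class CPUPlace:
    """Stand-in for paddle.CPUPlace."""

class CUDAPlace:
    """Stand-in for paddle.CUDAPlace."""
    def __init__(self, device_id=0):
        self.device_id = device_id

def compare_device(pytorch_result, paddle_result):
    """Dispatch on the type of the paddle result, one branch per case."""
    if isinstance(paddle_result, str):
        # torch prints "cuda:0" where paddle prints "gpu:0"
        return str(pytorch_result).replace("cuda", "gpu") == paddle_result
    if isinstance(paddle_result, CPUPlace):
        return "cpu" in str(pytorch_result)
    if isinstance(paddle_result, CUDAPlace):
        return str(pytorch_result) == "cuda:%d" % paddle_result.device_id
    return str(pytorch_result) == str(paddle_result)

assert compare_device("cuda:0", "gpu:0")
assert compare_device("cuda:1", CUDAPlace(1))
assert compare_device("cpu", CPUPlace())
```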
Fixed.
@@ -31,7 +31,7 @@ def test_case_1():
    result = F.grid_sample(x, grid)
    """
)
obj.run(pytorch_code, ["result"], check_value=False)
obj.run(pytorch_code, ["result"], check_value=False, check_stop_gradient=False)
Is this caused by a bug in the API?
It is an implementation difference, not a bug. When paddle detects that the cudnn implementation is being used, it forcibly sets stop_gradient to False.
It is an implementation difference, not a bug. When paddle detects that the cudnn implementation is being used, it forcibly sets stop_gradient to False.
The question is whether setting it to False is a reasonable implementation in the first place. Wherever there is a difference, first reason about whether it makes sense from the API's functional semantics; the "implementation difference" claimed here needs a more rigorous definition.
Both paddle's and pytorch's GPU implementations of grid_sample call the cudnn interface. cudnn's internal backward computes the gradients of all inputs at once and cannot be masked to compute only some of them. Paddle therefore chose, when it detects that the cudnn implementation is in use ( PaddlePaddle/Paddle@7de2db4 ), to set stop_gradient of both inputs x and grid to False. pytorch's grid_sample backward also documents that it computes gradients for all inputs, but pytorch chose not to set anything manually in the forward pass:
https://github.com/pytorch/pytorch/blob/2ed17e0b1ec0ca2a5dea41f81a63f582f2792d22/aten/src/ATen/native/cudnn/GridSampler.cpp#L125 . Given this property of the cudnn implementation, paddle's choice to set stop_gradient to False is also reasonable; neither framework's choice here is more correct or better than the other's.
paconvert/api_matcher.py
Outdated
new_kwargs.update(kwargs)
return GenericMatcher.generate_code(self, new_kwargs)
if "device" in kwargs:
    if ":" in kwargs["device"] and "if" not in kwargs["device"]:
This mainly targets cases like: tensor.cuda(device="cuda:0" if cond else "cuda:1")
Shouldn't the check be `if "cuda:" in`?
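The distinction under discussion is purely string-based, since the matcher sees the argument's source text. A minimal sketch of the branch logic, following the reviewer's suggested `"cuda:" in` check (the function name is hypothetical):

```python
def classify_device_kwarg(src):
    """Classify the source text of a `device=` argument.

    `src` is the argument as source code, quotes included, e.g. '"cuda:0"'
    or '"cuda:0" if cond else "cuda:1"'.
    """
    if "cuda:" in src and "if" not in src:
        return "literal"       # plain literal: tensor.cuda(device="cuda:0")
    if "if" in src:
        return "conditional"   # conditional expression, handled separately
    return "other"             # e.g. an int or a variable name

assert classify_device_kwarg('"cuda:0"') == "literal"
assert classify_device_kwarg('"cuda:0" if cond else "cuda:1"') == "conditional"
assert classify_device_kwarg("0") == "other"
```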
Roll the @skipif style out everywhere, and delete the historical tricks such as `if paddle.is_compiled_with_cuda` and `result = None`.
Fully adopt the new style; don't keep the old tricks around alongside it.
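The skip-marker style being asked for can be sketched with the stdlib's unittest.skipIf; HAS_GPU below is a hypothetical flag standing in for a real check such as querying paddle's CUDA build:

```python
import io
import unittest

HAS_GPU = False  # stand-in flag; the real suite would query the framework

class TestGpuOnly(unittest.TestCase):
    @unittest.skipIf(not HAS_GPU, "requires a GPU build")
    def test_case_1(self):
        # GPU-only assertions would go here instead of a result=None trick
        self.assertTrue(True)

# Running the case on a CPU-only machine marks it as skipped rather than
# silently passing with result=None:
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestGpuOnly)
result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
assert len(result.skipped) == 1
```

The advantage over the `result = None` pattern is that the test report distinguishes "skipped on this hardware" from "ran and passed".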
if "device" in kwargs:
    if "cuda:" in kwargs["device"] and "if" not in kwargs["device"]:
        # case1: tensor.cuda(device="cuda:0")
        new_kwargs["device"] = int(
The actual argument here is the source text `"'1'"` (a quoted literal), so `int("'1'")` raises an error.
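The failure described above is easy to reproduce: the matcher receives the argument's source text, quote characters included, so a naive int() conversion fails. A minimal sketch:

```python
# Source text of a device argument written as a string literal, e.g.
# tensor.cuda(device="1"). The kwarg value still contains the quotes.
src = '"1"'

try:
    int(src)
    raised = False
except ValueError:
    raised = True
assert raised  # int('"1"') is not a valid integer literal

# Stripping quote characters first makes the conversion safe:
device_id = int(src.strip("'\""))
assert device_id == 1
```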
@@ -36,7 +36,8 @@ def compare(
obj = DownloadAPIBase("torch.hub.download_url_to_file")


def test_case_1():
    # NOTE: Due to network limits, only test case 3.
Is it just that the download is slow, or something else?
Yes, the download is very slow; the three cases can sometimes take more than 40 minutes to download.
Can this be improved via NO_PROXY / PROXY settings? Is it currently becoming a bottleneck?
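One way the NO_PROXY suggestion works, sketched with the stdlib's urllib proxy handling (the host name below is an assumption for illustration, not taken from the PR):

```python
import os
import urllib.request

# Hypothetical download host and proxy; the real values depend on the CI.
host = "download.pytorch.org"
os.environ["no_proxy"] = host
os.environ["http_proxy"] = "http://proxy.internal:3128"

# urllib consults no_proxy when deciding whether to bypass the proxy, so
# downloads to `host` go direct while other traffic still uses the proxy:
assert urllib.request.proxy_bypass_environment(host)
assert not urllib.request.proxy_bypass_environment("example.org")
```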
@@ -136,6 +142,9 @@ def test_case_7():
import torch
x = torch.tensor([[10, 2, 3], [3, 10, 5], [5, 6, 12.]])
y = torch.tensor([[4, 2, 9], [2, 0, 3], [2, 5, 3.]])
if torch.cuda.is_available():
What is the reason for copying to CUDA here? Is it a bug?
Per https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/linalg/lstsq_cn.html : in the return values, rank is the rank of the matrices in x, a Tensor of shape (*); it is computed only when driver is 'gelsy', 'gelsd', or 'gelss', and otherwise an empty Tensor is returned. For the driver input parameter, the legal values on CPU are 'gels', 'gelsy' (default), 'gelsd', 'gelss'; on CUDA the only legal value is 'gels' (default). So in GPU mode, when driver is not specified, the returned rank is empty. In torch, if no default device is set, the CPU device is used, so the computation runs in CPU mode and rank is computed, which causes a diff and makes the unit test fail.
paconvert/api_matcher.py
Outdated
# So kwargs["stream"] must be paddle.device.Stream, not paddle.base.core.CUDAStream.
API_TEMPLATE = textwrap.dedent(
    """
    {}(stream = paddle.device.Stream(stream_base={}) if isinstance ({},(paddle.base.core.CUDAStream, paddle.base.core.CustomDeviceStream)) else {})
The paddle API corresponding to torch.cuda.Stream has been changed to paddle.device.Stream. The previous mapping, paddle.device.cuda.Stream, is a soon-to-be-deprecated API that was not adapted for some usage scenarios, which caused the incompatibility in this conversion scenario.
CI doesn't seem to pass at the moment; also, keep the generated string as short as possible.
# case 5: device = 0 if cond else 1
kwargs[
    "device"
] = f'"gpu:"+str({kwargs["device"]}) if isinstance({kwargs["device"]}, int) else str({kwargs["device"]}).replace("cuda", "gpu")'
Try this shorter form instead:
f'"gpu:{kwargs["device"]}" if isinstance({kwargs["device"]}, int) else "{kwargs["device"]}".replace("cuda", "gpu")'
For the case `num = 2; torch.cuda.Stream(device=num)`, that form converts to `paddle.device.Stream(device='gpu:num' if isinstance(num, int) else 'num'.replace('cuda', 'gpu'), priority=-1 + 2)`, which is wrong. It should convert to `paddle.device.Stream(device='gpu:' + str(num) if isinstance(num, int) else str(num).replace('cuda', 'gpu'), priority=-1 + 2)`.
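The difference between the two templates can be reproduced without paddle: the kwarg value is the source text "num", and embedding it inside a string literal freezes it as text instead of evaluating it at runtime. A minimal sketch of the two string-building strategies from this thread:

```python
kw = "num"  # source text of the argument in torch.cuda.Stream(device=num)

# Suggested shorter template: embeds the source text inside a literal.
bad = f'"gpu:{kw}" if isinstance({kw}, int) else "{kw}".replace("cuda", "gpu")'

# Template from the PR: builds the string at runtime via str().
good = f'"gpu:" + str({kw}) if isinstance({kw}, int) else str({kw}).replace("cuda", "gpu")'

num = 2  # the runtime value the converted code would see
assert eval(bad) == "gpu:num"  # wrong: the literal text, not the value
assert eval(good) == "gpu:2"   # right: evaluated at runtime
```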
@xuxinyi389 All unit tests except test_median need to be fixed, covering both CPU and GPU.
All unit tests are fixed, the proxy has been optimized, and the tests in tests/test_hub_download_url_to_file.py have been re-enabled.
@xuxinyi389 Please also optimize the GPU CI time later; it is currently too long, e.g. installing Pytorch alone takes more than 10 minutes.
PR Docs
Support GPU CI and fix the CUDA-related unit tests.
PR APIs