
Support GPU CI #389

Merged — 23 commits, merged May 23, 2024

Conversation

xuxinyi389 (Contributor) commented May 6, 2024

PR Docs

Support GPU CI and fix the CUDA-related unit tests.

PR APIs

Notes that need to be added:

1. torch.device currently has no direct mapping in Paddle, but depending on the situation it can be mapped to CPUPlace or CUDAPlace. Paddle APIs that accept a device argument are inconsistent about its type, accepting one or several of str, int, CPUPlace, and CUDAPlace. Previously torch.device was uniformly mapped to str, a strategy that does not work for APIs such as current_stream. The device-related conversion strategy has now been strengthened; Paddle should also systematically unify its conventions for the device argument.
2. For the cuda_memory-related APIs, the frameworks' memory-allocation mechanisms differ, so comparing sizes is meaningless; compare was rewritten.
3. linalg_lstsq decides case by case whether to compute rank; the corresponding cases were fixed.
4. The precision tolerance for nn.Module needs to be further relaxed.
5. grid_sample forces stop_gradient to False when the GPU kernel is used; the unit test was fixed accordingly.
6. test_distributed_all_gather_object.py was moved into the distributed unit-test directory.
7. On the CI V100 environment torch.cuda.is_bf16_supported() returns True, but locally it returns False, so torch.cuda.get_device_properties is used instead.
8. See the PR for more fixed cases.
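Point 1 above can be illustrated with a minimal sketch of normalizing a torch.device-like value into the device-string form many Paddle APIs accept. The helper name and the exact accepted input forms are assumptions for illustration, not paconvert's actual implementation.

```python
# Hypothetical helper sketching the torch.device -> Paddle device mapping
# described in point 1. Names and accepted forms are illustrative only.

def map_device(device):
    """Map an int index, a "cuda[:N]" string, or "cpu" to a Paddle device string."""
    if isinstance(device, int):
        return f"gpu:{device}"                 # torch.device(0) -> "gpu:0"
    if isinstance(device, str):
        return device.replace("cuda", "gpu")   # "cuda:1" -> "gpu:1", "cpu" -> "cpu"
    raise TypeError(f"unsupported device spec: {device!r}")

print(map_device(0))         # gpu:0
print(map_device("cuda:1"))  # gpu:1
print(map_device("cpu"))     # cpu
```

A real matcher also has to handle APIs that want an int index or a Place object instead of a string, which is exactly why a single uniform mapping was not enough.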

paddle-bot commented May 6, 2024

Thanks for your contribution!

@xuxinyi389 xuxinyi389 marked this pull request as draft May 7, 2024 06:39
@xuxinyi389 xuxinyi389 marked this pull request as ready for review May 7, 2024 06:39
@xuxinyi389 xuxinyi389 marked this pull request as draft May 7, 2024 06:43
@xuxinyi389 xuxinyi389 marked this pull request as ready for review May 7, 2024 06:43
@xuxinyi389 xuxinyi389 marked this pull request as draft May 7, 2024 06:59
@xuxinyi389 xuxinyi389 marked this pull request as ready for review May 7, 2024 06:59
@xuxinyi389 xuxinyi389 changed the title from "poolish" to "Support GPU CI" May 7, 2024
zhwesky2010 (Collaborator) left a comment:

  1. Add a comment to each matcher branch noting which case it handles.
  2. Keep two test pipelines: leave the original CPU pipeline in place (it does not cost much), and iterate on the new GPU pipeline until it passes.
  3. For the tests that can only run on GPU, see whether there is a cleaner way to write them; the current result=None approach is rather hacky.

kwargs["stream"] = "paddle.device.Stream()"
API_TEMPLATE = textwrap.dedent(
"""
{}(stream = paddle.device.Stream(stream_base={}) if isinstance ({},(paddle.base.core.CUDAStream, paddle.base.core.CustomDeviceStream)) else {})
Collaborator:

torch.cuda.Stream should first be converted to paddle.device.cuda.Stream. Is there a bug in the Matcher for torch.cuda.Stream itself that pushes the problem up to the enclosing API to solve?

xuxinyi389 (Contributor, Author) commented May 9, 2024:

torch.cuda.Stream is indeed converted to paddle.device.cuda.Stream; this has nothing to do with those two. The reason for doing it here is that paddle.device.set_stream(stream=...) requires the stream object to have a stream_base attribute, which paddle.base.core.CUDAStream and paddle.base.core.CustomDeviceStream lack, so the stream argument must be a paddle.device.Stream.
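The re-wrapping the template performs can be sketched without any framework: if the incoming object is a low-level stream type, wrap it so the result has the attribute set_stream() expects. The class names below are stand-ins for paddle.base.core.CUDAStream and paddle.device.Stream, not the real types.

```python
# Framework-free sketch of the conditional wrap generated by the template:
# `paddle.device.Stream(stream_base=s) if isinstance(s, <raw types>) else s`

class RawStream:               # stand-in for paddle.base.core.CUDAStream
    pass

class Stream:                  # stand-in for paddle.device.Stream
    def __init__(self, stream_base=None):
        self.stream_base = stream_base

def ensure_wrapped(s):
    # Re-wrap only raw streams; already-wrapped streams pass through unchanged.
    return Stream(stream_base=s) if isinstance(s, RawStream) else s

raw = RawStream()
wrapped = ensure_wrapped(raw)
assert wrapped.stream_base is raw
assert ensure_wrapped(wrapped) is wrapped
```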

Collaborator:

> torch.cuda.Stream is indeed converted to paddle.device.cuda.Stream; this has nothing to do with those two. The reason for doing it here is that paddle.device.set_stream(stream=...) requires the stream object to have a stream_base attribute, which paddle.base.core.CUDAStream and paddle.base.core.CustomDeviceStream lack, so the stream argument must be a paddle.device.Stream.

Why would a paddle.base.core.CUDAStream be passed in here at all? That is a very low-level concept; the problem should be fixed at its source.

@@ -29,7 +29,17 @@ def compare(
rtol=1.0e-6,
atol=0.0,
):
assert str(pytorch_result).replace("cuda", "gpu") == str(paddle_result)
if isinstance(paddle_result, bool):
Collaborator:

Are there really this many possibilities?

xuxinyi389 (Contributor, Author) commented May 9, 2024:

Yes. The root cause is that Paddle has no API fully equivalent to torch.device; depending on the case it may correspond to paddle.CUDAPlace, an int, or a str, so the equality check has many branches. The device issue has also been recorded in the tracking sheet.
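The branching being discussed can be sketched in isolation: a converted device result may arrive as a str, an int index, or a Place-like object, and each shape needs its own comparison. The Place classes below are stand-ins for paddle.CPUPlace / paddle.CUDAPlace; the function is a hypothetical illustration, not the test suite's actual compare.

```python
# Hypothetical sketch of the multi-branch device comparison.

class CPUPlace:                      # stand-in for paddle.CPUPlace
    pass

class CUDAPlace:                     # stand-in for paddle.CUDAPlace
    def __init__(self, idx):
        self.idx = idx

def compare_device(pytorch_result, paddle_result):
    torch_str = str(pytorch_result)
    if isinstance(paddle_result, str):
        return torch_str.replace("cuda", "gpu") == paddle_result
    if isinstance(paddle_result, int):
        # e.g. torch.device("cuda:1") compared against a bare index
        return torch_str.rsplit(":", 1)[-1] == str(paddle_result)
    if isinstance(paddle_result, CUDAPlace):
        return torch_str.replace("cuda", "gpu") == f"gpu:{paddle_result.idx}"
    if isinstance(paddle_result, CPUPlace):
        return "cpu" in torch_str
    return False

assert compare_device("cuda:0", "gpu:0")
assert compare_device("cuda:1", 1)
assert compare_device("cuda:0", CUDAPlace(0))
```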

Collaborator:

Then just test directly:

if isinstance(paddle_result, str): ...
if isinstance(paddle_result, paddle.CPUPlace): ...
if isinstance(paddle_result, CUDAPlace): ...

The branching here is hard to follow.

xuxinyi389 (Contributor, Author):

Fixed.

@@ -31,7 +31,7 @@ def test_case_1():
result = F.grid_sample(x, grid)
"""
)
obj.run(pytorch_code, ["result"], check_value=False)
obj.run(pytorch_code, ["result"], check_value=False, check_stop_gradient=False)
Collaborator:

Is this an API bug?

xuxinyi389 (Contributor, Author):

It is an implementation difference, not a bug. When Paddle detects that the cuDNN implementation is used, it forcibly sets stop_gradient to False.

Collaborator:

> It is an implementation difference, not a bug. When Paddle detects that the cuDNN implementation is used, it forcibly sets stop_gradient to False.

The question is whether setting it to False is a reasonable implementation. Wherever there is a difference, first reason from the API's functional semantics about whether it is justified; the "implementation difference" claimed here needs a more rigorous definition.

xuxinyi389 (Contributor, Author):

Both Paddle's and PyTorch's GPU implementations of grid_sample call the cuDNN interface. cuDNN's backward computes the gradients of all inputs at once and cannot be masked to compute gradients for only some inputs. Paddle therefore chose, when it detects the cuDNN implementation (PaddlePaddle/Paddle@7de2db4), to set stop_gradient to False on both inputs x and grid. PyTorch's grid_sample backward likewise notes that it computes gradients for all inputs, but PyTorch does not make this explicit setting in the forward pass:
https://github.com/pytorch/pytorch/blob/2ed17e0b1ec0ca2a5dea41f81a63f582f2792d22/aten/src/ATen/native/cudnn/GridSampler.cpp#L125 . Given this property of the cuDNN implementation, Paddle's choice to set stop_gradient to False manually is also reasonable. The difference between Paddle and PyTorch on this point is not a matter of right or wrong, better or worse.


new_kwargs.update(kwargs)
return GenericMatcher.generate_code(self, new_kwargs)
if "device" in kwargs:
if ":" in kwargs["device"] and "if" not in kwargs["device"]:
Collaborator:

The main case is: tensor.cuda(device="cuda:0" if cond else "cuda:1")

Shouldn't this be if "cuda:" in ?
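The check being discussed operates on the argument's source text, so it must distinguish a literal "cuda:0" from a conditional expression like '"cuda:0" if cond else "cuda:1"'. A minimal sketch of that string-level case analysis (function and case names are illustrative, not paconvert's):

```python
# Sketch of the device-argument case analysis: checking for "cuda:"
# rather than a bare ":" avoids false positives, and the "if" check
# routes conditional expressions to a different rewrite template.

def classify_device_arg(src: str) -> str:
    if "cuda:" in src and "if" not in src:
        return "literal"        # e.g. '"cuda:0"' -> rewrite to an int index
    if "if" in src:
        return "conditional"    # e.g. '"cuda:0" if cond else "cuda:1"'
    return "other"              # e.g. a bare int or a variable name

assert classify_device_arg('"cuda:0"') == "literal"
assert classify_device_arg('"cuda:0" if cond else "cuda:1"') == "conditional"
assert classify_device_arg('0') == "other"
```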

zhwesky2010 (Collaborator) left a comment:

Roll the @skipif style out everywhere, and delete the historical tricks such as if paddle.is_compiled_with_cuda and result=None.

Adopt the new style fully; don't keep the old tricks around alongside it.
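The pattern the reviewer asks to generalize marks the whole test as skipped instead of returning result=None inside it. A stdlib analogue (the project's own @skipif decorator and its exact spelling are not shown in this thread; the condition below is a stand-in for paddle.is_compiled_with_cuda()):

```python
# Stdlib sketch of the GPU-only skip pattern: the test is reported as
# skipped, rather than silently passing with result=None.
import unittest

HAS_CUDA = False  # stand-in for paddle.is_compiled_with_cuda()

class TestCudaOnly(unittest.TestCase):
    @unittest.skipIf(not HAS_CUDA, "requires a CUDA build")
    def test_stream_roundtrip(self):
        self.fail("would only run on a CUDA build")

result = unittest.TextTestRunner(verbosity=0).run(
    unittest.defaultTestLoader.loadTestsFromTestCase(TestCudaOnly)
)
assert result.testsRun == 1 and len(result.skipped) == 1
```

The advantage over result=None is visible in the report: a skipped test is counted and labeled, so a GPU pipeline that should run it but doesn't is noticed.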

if "device" in kwargs:
if "cuda:" in kwargs["device"] and "if" not in kwargs["device"]:
# case1: tensor.cuda(device="cuda:0")
new_kwargs["device"] = int(
Collaborator:

It seems a direct int("""1""") would also work.

(screenshot: infoflow 2024-05-10 18-20-54)

xuxinyi389 (Contributor, Author):

The actual argument here is """ '1' """ — the inner quotes are part of the string, so int(""" '1' """) raises an error.
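The contributor's point can be reproduced directly: the matcher receives the argument's source text, so the inner quotes survive into the string, and int() accepts "1" but not "'1'".

```python
# int() parses digits (with optional surrounding whitespace), but a
# string that still contains quote characters is not a valid integer.
assert int("1") == 1
assert int(" 1 ") == 1          # surrounding whitespace is fine

quoted_fails = False
try:
    int(" '1' ")                # quotes survive into the string -> ValueError
except ValueError:
    quoted_fails = True
assert quoted_fails

# A triple-quoted source fragment first has to be evaluated down to the
# inner literal before int() can parse it:
src = '"""1"""'
assert int(eval(src)) == 1      # eval yields "1", which int() accepts
```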


@@ -36,7 +36,8 @@ def compare(
obj = DownloadAPIBase("torch.hub.download_url_to_file")


def test_case_1():
# NOTE: Due to network limits, only test case 3.
Collaborator:

Is it slow download speed, or something else?

xuxinyi389 (Contributor, Author):

Yes, the download speed is very slow; the three cases can sometimes take more than 40 minutes.

Collaborator:

Can this be improved via NO_PROXY/PROXY settings? Is it currently becoming a bottleneck?

@@ -136,6 +142,9 @@ def test_case_7():
import torch
x = torch.tensor([[10, 2, 3], [3, 10, 5], [5, 6, 12.]])
y = torch.tensor([[4, 2, 9], [2, 0, 3], [2, 5, 3.]])
if torch.cuda.is_available():
Collaborator:

Is there a bug that makes copying to CUDA necessary here?

xuxinyi389 (Contributor, Author):

See https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/linalg/lstsq_cn.html . Among the return values, rank is the rank of the matrices in x, a Tensor of shape (*); it is computed only when driver is 'gelsy', 'gelsd', or 'gelss', and otherwise an empty Tensor is returned. For the driver input parameter: on CPU the legal values are 'gels', 'gelsy' (default), 'gelsd', 'gelss'; on CUDA the only legal value is 'gels' (default). So in GPU mode, when no driver is specified, the returned rank is empty. If no default device is set, torch falls back to CPU and therefore computes in CPU mode, where rank is computed; that produces a diff and the unit test fails.
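The driver/rank rule quoted from the docs can be captured in a small pure-Python sketch, which shows why the fixed test must only check rank when it is actually computed (function names are illustrative):

```python
# Sketch of the paddle.linalg.lstsq driver rules cited above:
# CPU default driver is "gelsy" (rank computed); on GPU only "gels"
# is legal (and default), and "gels" returns an empty rank.

def default_driver(device: str) -> str:
    return "gels" if device == "gpu" else "gelsy"

def rank_is_computed(driver: str) -> bool:
    return driver in ("gelsy", "gelsd", "gelss")

assert rank_is_computed(default_driver("cpu"))      # CPU: rank is available
assert not rank_is_computed(default_driver("gpu"))  # GPU: rank is empty
```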

# So kwargs["stream"] must be paddle.device.Stream, not paddle.base.core.CUDAStream.
API_TEMPLATE = textwrap.dedent(
"""
{}(stream = paddle.device.Stream(stream_base={}) if isinstance ({},(paddle.base.core.CUDAStream, paddle.base.core.CustomDeviceStream)) else {})
Collaborator:

It looks like torch.cuda.Stream was simply not converted to paddle.device.Stream; a proper paddle.device.Stream can always be passed to set_stream. Fix it at the source.

(screenshot: infoflow 2024-05-10 19-11-23)

xuxinyi389 (Contributor, Author):

The Paddle API corresponding to torch.cuda.Stream has been changed to paddle.device.Stream. The previous mapping, paddle.device.cuda.Stream, is an API slated for deprecation and had not been adapted for some usage scenarios, which caused the incompatibility in this conversion scenario.

zhwesky2010 (Collaborator) left a comment:

The CI does not seem to pass at the moment; also keep the generated strings as short as possible.

# case 5: device = 0 if cond else 1
kwargs[
"device"
] = f'"gpu:"+str({kwargs["device"]}) if isinstance({kwargs["device"]}, int) else str({kwargs["device"]}).replace("cuda", "gpu")'
Collaborator:

Try this formulation instead:
f'"gpu:{kwargs["device"]}" if isinstance({kwargs["device"]}, int) else "{kwargs["device"]}".replace("cuda", "gpu")'

xuxinyi389 (Contributor, Author):

For the case num = 2; torch.cuda.Stream(device=num), that formulation converts to paddle.device.Stream(device='gpu:num' if isinstance(num, int) else 'num'.replace('cuda', 'gpu'), priority=-1 + 2), which is wrong. It should convert to paddle.device.Stream(device='gpu:' + str(num) if isinstance(num, int) else str(num).replace('cuda', 'gpu'), priority=-1 + 2).
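The counter-example can be checked directly: the proposed f-string bakes the variable *name* "num" into string literals, while the accepted form evaluates num at runtime.

```python
# Evaluating the two generated expressions with num = 2 shows the
# difference: the first leaks the literal name into the device string.
num = 2

wrong = '"gpu:num" if isinstance(num, int) else "num".replace("cuda", "gpu")'
right = '"gpu:" + str(num) if isinstance(num, int) else str(num).replace("cuda", "gpu")'

assert eval(wrong) == "gpu:num"  # the name "num" leaks into the string
assert eval(right) == "gpu:2"    # the variable's value is used
```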


zhwesky2010 (Collaborator) commented May 22, 2024:

@xuxinyi389 All unit tests except test_median need to be fixed, on both CPU and GPU.

xuxinyi389 (Contributor, Author):
All unit tests are fixed, the proxy has been optimized, and the tests in tests/test_hub_download_url_to_file.py have been re-enabled.

zhwesky2010 (Collaborator):
@xuxinyi389 Later, please optimize the GPU CI time; it is currently too long. For example, installing PyTorch alone took more than 10 minutes.

@zhwesky2010 zhwesky2010 merged commit 834a688 into PaddlePaddle:master May 23, 2024
6 of 9 checks passed