[Feature]: Support equivalent of cublasCherkEx() #1320

Open
torrance opened this issue May 30, 2023 · 6 comments

@torrance

torrance commented May 30, 2023

Is your feature request related to a problem? Please describe.

It's common to have large, low-precision input matrices that you'd like to multiply at full internal precision using rocblas<t>gemm(), possibly (but not necessarily) with output at full precision.

Describe the solution you'd like

Support the equivalent of cublas<t>gemmEx() as described here: https://docs.nvidia.com/cuda/cublas/#cublas-gemmEx

Describe alternatives you've considered

An alternative is to copy the input matrices to double precision first. If the output is not required at full precision, a further copy must be made and the precision truncated. This alternative at least doubles memory pressure on the GPU and requires extra memory copies.

@TorreZuk TorreZuk self-assigned this May 30, 2023
@TorreZuk
Contributor

Thanks for your report @torrance. rocBLAS supports the equivalent of cublas<t>gemmEx() through the function rocblas_gemm_ex, documented here: https://rocm.docs.amd.com/projects/rocBLAS/en/latest/API_Reference_Guide.html#rocblas-gemm-ex-batched-strided-batched

It implements numerous mixed-precision and high precision accumulation (HPA) combinations, so please review it. If a combination you need is missing, please list the specific data types you require for inputs, output and compute, in order of priority (a description of your use case also helps). Based on your feedback we can consider adding more, but the most common forms should already be implemented.
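To make "high precision accumulation" concrete for the int8 case, the computation can be pinned down with a small CPU reference. This is a minimal sketch, not rocBLAS code: `ref_gemm_i8_i32` is a hypothetical name, and it assumes column-major storage and no transposes. It mirrors the D = alpha*A*B + beta*C form that rocblas_gemm_ex uses, specialised to int8 inputs with int32 output and compute (check the API reference for the exact supported type combinations).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical CPU reference (not a rocBLAS API): D = alpha*A*B + beta*C
 * with int8 inputs and int32 accumulation, column-major, no transposes.
 * A is m x k (lda >= m), B is k x n (ldb >= k), C and D are m x n. */
void ref_gemm_i8_i32(int m, int n, int k, int32_t alpha,
                     const int8_t *A, int lda,
                     const int8_t *B, int ldb,
                     int32_t beta,
                     const int32_t *C, int ldc,
                     int32_t *D, int ldd)
{
    assert(m >= 0 && n >= 0 && k >= 0);
    assert(lda >= m && ldb >= k && ldc >= m && ldd >= m);
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            /* Each int8*int8 product fits in 16 bits; the running sum is int32 */
            int32_t acc = 0;
            for (int p = 0; p < k; ++p)
                acc += (int32_t)A[i + (size_t)p * lda] * (int32_t)B[p + (size_t)j * ldb];
            D[i + (size_t)j * ldd] = alpha * acc + beta * C[i + (size_t)j * ldc];
        }
    }
}
```

A product of two int8 values never exceeds 16384 in magnitude, so an int32 accumulator only risks overflow for very long inner dimensions k; that headroom is what makes int32 accumulation of int8 data attractive.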

@torrance
Author

torrance commented May 31, 2023

@TorreZuk Thank you! HIPIFY reported that there was no suitable equivalent, and I clearly didn't spend long enough verifying that.

If I can hijack my own issue (!), what about a hipBLAS/rocBLAS equivalent to cublasCherkEx()? My searching of the documentation (as well as HIPIFY) seems to suggest there isn't one, and it's a bit of a sticking point in the conversion of this codebase.
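For reference, the operation being requested can be sketched on the CPU. Assuming the CUDA_C_8I input / CUDA_C_32F output combination of cublasCherkEx, with the lower triangle and no transpose, it computes C = alpha*A*A^H + beta*C with real alpha and beta. The types `cint8`/`cfloat` and the function name are hypothetical; this is not a hipBLAS or rocBLAS API, only a way to pin down expected results (for example, when testing a port).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical interleaved complex types: (real, imag) pairs */
typedef struct { int8_t re, im; } cint8;
typedef struct { float re, im; } cfloat;

/* Hypothetical CPU reference: lower triangle of C = alpha*A*A^H + beta*C,
 * A is n x k complex int8, C is n x n complex float, both column-major.
 * Sums are kept in int32; for k beyond roughly 65,000 they could overflow. */
void ref_cherk_lower_c8i_c32f(int n, int k, float alpha,
                              const cint8 *A, int lda,
                              float beta, cfloat *C, int ldc)
{
    assert(n >= 0 && k >= 0 && lda >= n && ldc >= n);
    for (int j = 0; j < n; ++j) {
        for (int i = j; i < n; ++i) {
            int32_t sre = 0, sim = 0;
            for (int p = 0; p < k; ++p) {
                cint8 a = A[i + (size_t)p * lda];   /* A(i,p) */
                cint8 b = A[j + (size_t)p * lda];   /* A(j,p), conjugated below */
                /* a * conj(b) = (ar*br + ai*bi) + i*(ai*br - ar*bi) */
                sre += (int32_t)a.re * b.re + (int32_t)a.im * b.im;
                sim += (int32_t)a.im * b.re - (int32_t)a.re * b.im;
            }
            cfloat *c = &C[i + (size_t)j * ldc];
            /* With beta == 0 the old value of C is never read */
            float old_re = (beta == 0.0f) ? 0.0f : beta * c->re;
            float old_im = (beta == 0.0f) ? 0.0f : beta * c->im;
            c->re = alpha * (float)sre + old_re;
            /* The diagonal of a Hermitian matrix is real: force imag to zero */
            c->im = (i == j) ? 0.0f : alpha * (float)sim + old_im;
        }
    }
}
```

The strictly upper triangle is left untouched, following the usual HERK convention that only the triangle selected by the fill mode is referenced.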

@TorreZuk TorreZuk changed the title [Feature]: Support equivalent of cublas<t>gemmEx() [Feature]: Support equivalent of cublasCherkEx() May 31, 2023
@TorreZuk
Contributor

Sure, we can recycle this issue as a request for an equivalent to cublasCherkEx(), which is a new feature request. I can ask whether @emankov has any insight into HIPIFY mapping cublas<t>gemmEx() to rocblas_gemm_ex, although the argument datatype enums may have to be chosen manually.
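If the datatype enums do have to be chosen by hand, a small lookup table keeps the porting consistent. This is an illustrative sketch, not HIPIFY's own mapping: the table and `rocblas_type_for` are hypothetical, and the pairs should be checked against the cudaDataType and rocblas_datatype headers. Note that an enum value existing (such as rocblas_datatype_i8_c) does not mean rocblas_gemm_ex accepts it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical hand-written mapping from cudaDataType names to
 * rocblas_datatype names, for manual porting of *gemmEx-style calls. */
static const struct { const char *cuda; const char *rocblas; } type_map[] = {
    {"CUDA_R_16F", "rocblas_datatype_f16_r"},
    {"CUDA_R_32F", "rocblas_datatype_f32_r"},
    {"CUDA_R_64F", "rocblas_datatype_f64_r"},
    {"CUDA_C_32F", "rocblas_datatype_f32_c"},
    {"CUDA_C_64F", "rocblas_datatype_f64_c"},
    {"CUDA_R_8I",  "rocblas_datatype_i8_r"},
    {"CUDA_R_32I", "rocblas_datatype_i32_r"},
    /* Enum exists, but complex int8 is not a supported gemm_ex input */
    {"CUDA_C_8I",  "rocblas_datatype_i8_c"},
};

/* Returns the rocblas_datatype name for a cudaDataType name, or NULL if unmapped */
const char *rocblas_type_for(const char *cuda_name)
{
    assert(cuda_name != NULL);
    for (size_t i = 0; i < sizeof type_map / sizeof type_map[0]; ++i)
        if (strcmp(type_map[i].cuda, cuda_name) == 0)
            return type_map[i].rocblas;
    return NULL;
}
```

Returning NULL for unknown names makes an unmapped type fail loudly during porting instead of silently picking a default.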

@amcamd
Contributor

amcamd commented Jun 5, 2023

Hello @torrance,
cublasCherkEx() supports the CUDA_C_8I datatype for matrix A. This is a complex number whose real and imaginary parts are both 8-bit signed integers. I have some questions about this datatype:

  • Do you require support for CUDA_C_8I in cublasCgemmEx() as well as in cublasCherkEx()? We have a rocblas_gemm_ex function; it supports real 8-bit integers but not complex 8-bit integers.
  • Can you say what application is using this CUDA_C_8I datatype? Real 8-bit integers are used in machine learning; what is the use case for complex 8-bit integers?

Thanks, Andrew
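One way to relate complex int8 to the real int8 support mentioned above (a swapped-in technique, not something proposed in this thread) is to split A into planar real and imaginary parts. Then A*A^H needs only real products: Re = Ar*Ar^T + Ai*Ai^T and Im = Ai*Ar^T - Ar*Ai^T. In this sketch each `real_gemm_nt_i8` call stands in for a real int8 GEMM with int32 accumulation; both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: R = X * Y^T for real int8 X, Y (each n x k,
 * column-major, leading dimension n), int32 accumulation, R is n x n. */
static void real_gemm_nt_i8(int n, int k, const int8_t *X, const int8_t *Y,
                            int32_t *R)
{
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < n; ++i) {
            int32_t acc = 0;
            for (int p = 0; p < k; ++p)
                acc += (int32_t)X[i + (size_t)p * n] * Y[j + (size_t)p * n];
            R[i + (size_t)j * n] = acc;
        }
}

/* Hypothetical: full n x n A*A^H from planar parts Ar, Ai using only
 * real int8 products. tmp must hold n*n int32 values. */
void herk_via_real_products(int n, int k, const int8_t *Ar, const int8_t *Ai,
                            int32_t *re, int32_t *im, int32_t *tmp)
{
    assert(n >= 0 && k >= 0);
    size_t nn = (size_t)n * n;
    real_gemm_nt_i8(n, k, Ar, Ar, re);    /* Ar*Ar^T */
    real_gemm_nt_i8(n, k, Ai, Ai, tmp);   /* Ai*Ai^T */
    for (size_t i = 0; i < nn; ++i) re[i] += tmp[i];
    real_gemm_nt_i8(n, k, Ai, Ar, im);    /* Ai*Ar^T */
    real_gemm_nt_i8(n, k, Ar, Ai, tmp);   /* Ar*Ai^T */
    for (size_t i = 0; i < nn; ++i) im[i] -= tmp[i];
}
```

The trade-off is four real products plus a planar copy of A, whereas a native complex HERK could work on the interleaved data directly and compute only one triangle.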

@torrance
Author

torrance commented Jun 9, 2023

Hi @amcamd

Can you say what application is using this CUDA_C_8I datatype? Real 8 bit integers are used in machine learning, what is the use case for complex 8 bit integers?

Yes, they are needed. Many radio astronomy correlators record observations of the sky as simple 8-bit complex integers, which can later be normalised as part of calibration. The 8-bit integer representation has the advantage of constant spacing between representable values, unlike a floating-point representation. At the high end, we let the integer representation saturate and later flag those values. The data are also necessarily complex, since radio astronomy works in the Fourier domain.

We want to avoid converting these to higher precision values because these values make up the raw data of our observations and are absolutely massive in size.

Hope this helps give some context.
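For scale, the storage cost of conversion is simple arithmetic: a complex int8 element takes 2 bytes, a complex float 8 and a complex double 16, so converting raw complex int8 data to complex float quadruples its footprint, and to complex double multiplies it by eight. The helper below is a hypothetical one-liner that makes the numbers explicit.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: bytes for n complex elements whose real and
 * imaginary parts each take component_bytes bytes (1 = int8, 4 = float,
 * 8 = double). */
uint64_t complex_array_bytes(uint64_t n, unsigned component_bytes)
{
    assert(component_bytes == 1 || component_bytes == 2 ||
           component_bytes == 4 || component_bytes == 8);
    return n * 2u * component_bytes;
}
```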

@amcamd
Contributor

amcamd commented Jun 9, 2023

Hi @torrance,
Thank you for the context and the use case. I was guessing this is related to radio astronomy and the installations you have in Western Australia.
