Add new batch_gemm types #466

AidanBeltonS · 2024-04-03T14:17:56Z

Description

This adds new data types for the gemm_batch operation to bring it more in line with the oneMKL spec. The types added are <half, half, float, float>, <int8, int8, float, float>, and <int8, int8, int32, float>.

New testing is added for these data types. Tests where the scalar type does not match the input type require a higher tolerance as the reference calculation is being performed at a much higher precision.

Test logs:
rocblas_test_log.txt
cublas_test_log.txt

I have been unable to test the mkl backends as I was running into some problems regarding duplicate definitions between the mkl headers and the openBlas/CBlas headers.

Fixes #446

Checklist

All Submissions

Do all unit tests pass locally? Attach a log.
[x] Have you formatted the code using clang-format?

New interfaces

Have you provided motivation for adding a new feature as part of RFC and
it was accepted? # (RFC)

New features

Have you provided motivation for adding a new feature?
Have you added relevant tests?

AidanBeltonS · 2024-04-03T14:18:34Z

@Rbiessy, cc

mmeterel · 2024-04-03T16:24:13Z

@AidanBeltonS Thanks for the PR. Before going through the review in more detail, what is your plan for this issue? Why does openBLAS come into the picture here? I would prefer to have all applicable backends working before adding these new APIs.

"I have been unable to test the mkl backends as I was running into some problems regarding duplicate definitions between the mkl headers and the openBlas/CBlas headers."

hjabird

Since the reference cblas implementation doesn't support some of the operations that are being added (as I understand it), is the new functionality actually tested?

src/blas/backends/cublas/cublas_helper.hpp

src/blas/backends/portblas/portblas_batch.cxx

src/blas/backends/rocblas/rocblas_helper.hpp

tests/unit_tests/blas/batch/gemm_batch_stride.cpp

Rbiessy · 2024-04-04T12:37:05Z

@AidanBeltonS Thanks for the PR. Before going through the review in more detail, what is your plan for this issue? Why does openBLAS come into the picture here? I would prefer to have all applicable backends working before adding these new APIs.

"I have been unable to test the mkl backends as I was running into some problems regarding duplicate definitions between the mkl headers and the openBlas/CBlas headers."

Hey @mmeterel, I checked with Aidan about the issue with the MKL backends. The duplicate definitions seemed to be an issue with the setup or build commands used. We then ran into another issue: undefined references to the buffer versions of the iamax and iamin functions when using the 2024.1 oneAPI Base Toolkit. Just a few examples:

/usr/bin/ld: lib/libonemkl_blas_mklcpu.so.0: undefined reference to `oneapi::mkl::blas::row_major::iamax(sycl::_V1::queue&, long, sycl::_V1::buffer<std::complex<double>, 1, sycl::_V1::detail::aligned_allocator<std::complex<double> >, void>&, long, sycl::_V1::buffer<long, 1, sycl::_V1::detail::aligned_allocator<long>, void>&)'
/usr/bin/ld: lib/libonemkl_blas_mklcpu.so.0: undefined reference to `oneapi::mkl::blas::column_major::iamin(sycl::_V1::queue&, long, sycl::_V1::buffer<double, 1, sycl::_V1::detail::aligned_allocator<double>, void>&, long, sycl::_V1::buffer<long, 1, sycl::_V1::detail::aligned_allocator<long>, void>&)'
/usr/bin/ld: lib/libonemkl_blas_mklcpu.so.0: undefined reference to `oneapi::mkl::blas::column_major::iamin(sycl::_V1::queue&, long, sycl::_V1::buffer<std::complex<double>, 1, sycl::_V1::detail::aligned_allocator<std::complex<double> >, void>&, long, sycl::_V1::buffer<long, 1, sycl::_V1::detail::aligned_allocator<long>, void>&)'
/usr/bin/ld: lib/libonemkl_blas_mklcpu.so.0: undefined reference to `oneapi::mkl::blas::column_major::iamax(sycl::_V1::queue&, long, sycl::_V1::buffer<float, 1, sycl::_V1::detail::aligned_allocator<float>, void>&, long, sycl::_V1::buffer<long, 1, sycl::_V1::detail::aligned_allocator<long>, void>&)'

Looking at libmkl_sycl_blas.so.4 in 2024.1, these functions expect an index_base as the last argument, which is not there in oneMKL:

$ readelf -Wa /path/to/mkl/latest/lib/libmkl_sycl_blas.so.4 | c++filt -t | grep "row_major::iamax(sycl::_V1::queue&, long, sycl::_V1::buffer<std::complex<double>"
  1302: 0000000002b573e0     9 FUNC    GLOBAL DEFAULT   11 oneapi::mkl::blas::row_major::iamax(sycl::_V1::queue&, long, sycl::_V1::buffer<std::complex<double>, 1, sycl::_V1::detail::aligned_allocator<std::complex<double> >, void>&, long, sycl::_V1::buffer<int, 1, sycl::_V1::detail::aligned_allocator<int>, void>&, oneapi::mkl::index_base)
  8510: 0000000002b573d0     9 FUNC    GLOBAL DEFAULT   11 oneapi::mkl::blas::row_major::iamax(sycl::_V1::queue&, long, sycl::_V1::buffer<std::complex<double>, 1, sycl::_V1::detail::aligned_allocator<std::complex<double> >, void>&, long, sycl::_V1::buffer<long, 1, sycl::_V1::detail::aligned_allocator<long>, void>&, oneapi::mkl::index_base)

We can use 2024.0 for the tests for now. Aidan is running more tests.

src/blas/backends/portblas/portblas_batch.cxx

mmeterel · 2024-04-04T15:41:48Z


@Rbiessy @AidanBeltonS AFAIK, there should not be any issues with missing symbols with 2024.1. This version has been in CI for a while now. I would suspect it can be a rebase issue on your branch. We should make it functional with 2024.1.

mmeterel · 2024-04-04T16:09:12Z

@andrewtbarker Will you be able to help with this review?

andrewtbarker · 2024-04-04T16:33:11Z

@andrewtbarker Will you be able to help with this review?

Sure, I will take a look.

andrewtbarker

Thanks for the PR, there is a lot of good work here. Most of my comments are just about style and naming consistency.

src/blas/backends/cublas/cublas_batch.cpp

src/blas/backends/rocblas/rocblas_batch.cpp

src/blas/function_table.hpp

tests/unit_tests/blas/batch/gemm_batch_stride.cpp

tests/unit_tests/blas/batch/gemm_batch_stride_usm.cpp

tests/unit_tests/blas/batch/gemm_batch_usm.cpp

andrewtbarker · 2024-04-04T22:33:16Z

@Rbiessy @AidanBeltonS AFAIK, there should not be any issues with missing symbols with 2024.1. This version has been in CI for a while now. I would suspect it can be a rebase issue on your branch. We should make it functional with 2024.1.

Yes, this should have been fixed in #445 . If not we should fix it.

mmeterel · 2024-04-04T22:55:13Z

Have you tested the PR with hipSYCL/AdaptiveCpp? Can you please add the logs?

AidanBeltonS · 2024-04-08T15:54:09Z

No, I have not tested hipSYCL. I have attached the other backend tests below. Netlib and portblas are passing fine. MKL has some failing tests due to tolerances, which I am investigating further. It seems it deviates more from the reference implementation in some cases.
mkl_test_log.txt
netlib_test_log.txt
port_blas_test_logs.txt

MKL tests error:
mkl_test_log.txt

andrewtbarker · 2024-04-10T17:12:11Z

MKL has some failing tests due to tolerances, which I am investigating further.

It looks like dotc and dotu have segfaults in your tests. Initially I think this is unlikely to be due to your PR but have you looked at this at all?

AidanBeltonS · 2024-04-12T13:48:33Z

MKL has some failing tests due to tolerances, which I am investigating further.

It looks like dotc and dotu have segfaults in your tests. Initially I think this is unlikely to be due to your PR but have you looked at this at all?

The failures in Dot are due to numerical error:

[ RUN      ] DotTestSuite/DotTests.RealDoubleSinglePrecision/Row_Major_Intel_R__Data_Center_GPU_Max_1100
relative error = 1.83849e-08 absolute error = 1.24863e-07 limit = 3.01315e-13
Difference in result: DPC++ 6.79159 vs. Reference 6.79159
/home/aidanbelton/source/oneMKL/tests/unit_tests/blas/level1/dot.cpp:157: Failure
Expected equality of these values:
  res
    Which is: 0
  1
[  FAILED  ] DotTestSuite/DotTests.RealDoubleSinglePrecision/Row_Major_Intel_R__Data_Center_GPU_Max_1100, where GetParam() = (0x560f5e0, 1-byte object <00>) (1 ms)

DotU is an odd one; however, it does not appear to be related to my changes:

[ RUN      ] DotuTestSuite/DotuTests.ComplexSinglePrecision/Row_Major_Intel_R__Data_Center_GPU_Max_1100
Caught synchronous SYCL exception during DOTU:
The program was built for 1 devices
Build program log for 'Intel(R) Data Center GPU Max 1100':
 -11 (PI_ERROR_BUILD_PROGRAM_FAILURE) -11 (PI_ERROR_BUILD_PROGRAM_FAILURE)
OpenCL status: sycl:7
unknown file: Failure
C++ exception with description "Enqueue process failed. -59 (PI_ERROR_INVALID_OPERATION)" thrown in the test body.
[  FAILED  ] DotuTestSuite/DotuTests.ComplexSinglePrecision/Row_Major_Intel_R__Data_Center_GPU_Max_1100, where GetParam() = (0x560f5e0, 1-byte object <00>) (0 ms)

AidanBeltonS · 2024-04-12T13:49:11Z

I have resolved all but one issue with GemmBatch's tests. The CPU MKL implementation has significant amounts of error compared to the GPU. I believe there may be a fundamental difference in the precision of the calculation for the CPU. One possible fix would be to increase the tolerance significantly just for the CPU. I'm not a fan of this approach as it is a bit of a brute-force solution. Does anyone have any recommendations on how they would like to see this handled?

[ RUN      ] GemmBatchUsmTestSuite/GemmBatchUsmTests.RealIntRealScalarPrecision/Column_Major_Intel_R__Xeon_R__Gold_5418Y
relative error = 0.000911658 absolute error = 0.00168478 limit = 0.000333786
Difference in entry (58,119): DPC++ 1.84973 vs. Reference 1.84804
relative error = 0.000812301 absolute error = 0.00121021 limit = 0.000333786
Difference in entry (0,124): DPC++ 1.49107 vs. Reference 1.48986
relative error = 0.000534697 absolute error = 0.000857353 limit = 0.000333786
Difference in entry (17,144): DPC++ 1.60258 vs. Reference 1.60344
relative error = 0.000527185 absolute error = 0.00049144 limit = 0.000333786
Difference in entry (52,186): DPC++ -0.932689 vs. Reference -0.932197
/home/aidanbelton/source/oneMKL/tests/unit_tests/blas/batch/gemm_batch_usm.cpp:408: Failure
Expected equality of these values:
  res
    Which is: 0
  1
[  FAILED  ] GemmBatchUsmTestSuite/GemmBatchUsmTests.RealIntRealScalarPrecision/Column_Major_Intel_R__Xeon_R__Gold_5418Y, where GetParam() = (0x56845d0, 1-byte object <01>) (331 ms)

mmeterel · 2024-04-12T16:02:25Z

No, I have not tested hipSYCL. I have attached the other backend tests below. Netlib and portblas are passing fine. MKL has some failing tests due to tolerances, which I am investigating further. It seems it deviates more from the reference implementation in some cases. mkl_test_log.txt netlib_test_log.txt port_blas_test_logs.txt

MKL tests error: mkl_test_log.txt

Can you please test hipSYCL backend as well?

andrewtbarker

Some suggestions to make interpreting failed test results easier - I flagged a few places but there are similar issues in most of the new tests.

tests/unit_tests/blas/batch/gemm_batch_stride.cpp

andrewtbarker · 2024-04-12T16:46:24Z

Does anyone have any recommendations on how they would like to see this handled?

If, as we suspect, the CPU backend is doing accumulation in double while the GPU backend does it in float, one option would be changing what reference gemm from tests/unit_tests/blas/include/reference_blas_templates.hpp we call (might need to add a reference gemm in that file).
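One way to probe the accumulation-precision hypothesis is a reference GEMM whose accumulator type is a template parameter. The following is only a minimal sketch, not the code in reference_blas_templates.hpp: the name `ref_gemm` and its signature are hypothetical, and it assumes column-major, non-transposed inputs.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical reference GEMM: C = alpha * A * B + beta * C.
// Column-major storage, no transposes. Tacc selects the precision
// used for the inner-product accumulation (e.g. float or double).
template <typename Tacc, typename Ta, typename Tb, typename Tc, typename Ts>
void ref_gemm(std::size_t m, std::size_t n, std::size_t k, Ts alpha,
              const std::vector<Ta>& a, std::size_t lda,
              const std::vector<Tb>& b, std::size_t ldb,
              Ts beta, std::vector<Tc>& c, std::size_t ldc) {
    for (std::size_t j = 0; j < n; ++j) {
        for (std::size_t i = 0; i < m; ++i) {
            Tacc acc = 0;
            for (std::size_t p = 0; p < k; ++p) {
                // Convert inputs to the accumulator type before multiplying.
                acc += static_cast<Tacc>(a[i + p * lda]) *
                       static_cast<Tacc>(b[p + j * ldb]);
            }
            const Tacc prev = static_cast<Tacc>(c[i + j * ldc]);
            c[i + j * ldc] = static_cast<Tc>(static_cast<Tacc>(alpha) * acc +
                                             static_cast<Tacc>(beta) * prev);
        }
    }
}
```

Running `ref_gemm<float>(...)` and `ref_gemm<double>(...)` on the same inputs and comparing each result against a backend's output would give a rough indication of which accumulation precision that backend is closer to.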

andrewtbarker · 2024-05-01T23:39:12Z

What is the status here? As I see it we have three outstanding items:

AdaptiveCpp testing
Test names (my most recent review, minor)
Failure in RealIntRealScalarPrecision

(1) may be a larger issue with CI that in my opinion can be dealt with separately in another PR. (2) is minor and should be easy to fix. I hope (3) is also minor but I'm not sure, is there any progress understanding it?

Rbiessy · 2024-05-02T10:01:53Z

Hi @andrewtbarker, I have updated the status by email as it was easier to discuss issues with testing AdaptiveCpp on the CI. In short, there are a few issues @AidanBeltonS will need to look at once he is back from holiday next week!

AidanBeltonS · 2024-05-13T15:48:41Z

What is the status here? As I see it we have three outstanding items:
1. AdaptiveCpp testing

2. Test names (my most recent review, minor)

3. Failure in `RealIntRealScalarPrecision`
(1) may be a larger issue with CI that in my opinion can be dealt with separately in another PR. (2) is minor and should be easy to fix. I hope (3) is also minor but I'm not sure, is there any progress understanding it?

I have addressed items 2 and 3.
To resolve 3, I am scaling the tolerance by the possible input range of the int8 matrices, i.e. 256.
I have yet to test this with AdaptiveCpp; I'll start looking at that shortly.
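As an illustration of the scaling described above, here is a minimal sketch. It is not the code from the PR: `input_range_scale` and `scaled_tolerance` are hypothetical names, and the base tolerance is assumed to be supplied by the existing test utilities.

```cpp
#include <cstdint>
#include <limits>
#include <type_traits>

// Hypothetical helper: number of distinct values an int8 input can take
// (127 - (-128) + 1 = 256); other input types are left unscaled.
template <typename Tin>
constexpr double input_range_scale() {
    return std::is_same<Tin, std::int8_t>::value
               ? static_cast<double>(std::numeric_limits<std::int8_t>::max()) -
                     static_cast<double>(std::numeric_limits<std::int8_t>::min()) + 1.0
               : 1.0;
}

// Hypothetical helper: widen a base tolerance for int8 input matrices.
template <typename Tin>
constexpr double scaled_tolerance(double base_tolerance) {
    return base_tolerance * input_range_scale<Tin>();
}
```

With this sketch, `scaled_tolerance<std::int8_t>(tol)` returns `256 * tol`, while `scaled_tolerance<float>(tol)` returns `tol` unchanged.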

Rbiessy · 2024-05-13T16:25:48Z

src/include/error_helper.hpp

+#if __has_include(<sycl/sycl.hpp>)
+#include <sycl/sycl.hpp>
+#else
+#include <CL/sycl.hpp>
+#endif


This doesn't need SYCL, including <string> is enough.
Also I think dtype_string.hpp would be a better name for the file.
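To illustrate the point that such a helper can be written without SYCL, here is a minimal sketch that depends only on standard headers. The function name `dtype_string` and the returned strings are assumptions for illustration and are not taken from the PR.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: map a data type to a printable name for use in
// error messages. The primary template covers types without a mapping.
template <typename T>
inline std::string dtype_string() {
    return "unknown";
}

template <>
inline std::string dtype_string<float>() {
    return "float";
}

template <>
inline std::string dtype_string<double>() {
    return "double";
}

template <>
inline std::string dtype_string<std::int8_t>() {
    return "int8";
}

template <>
inline std::string dtype_string<std::int32_t>() {
    return "int32";
}
```

A backend that does not support a given type combination could then build its exception message from `dtype_string<Ta>()` and `dtype_string<Tc>()`.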

andrewtbarker

I think we're good to go. Thanks for sticking with this one!

Rbiessy self-assigned this Apr 3, 2024

Rbiessy requested a review from mmeterel April 3, 2024 15:23

hjabird self-assigned this Apr 3, 2024

hjabird reviewed Apr 4, 2024


Rbiessy reviewed Apr 4, 2024


andrewtbarker reviewed Apr 4, 2024


andrewtbarker mentioned this pull request Apr 4, 2024

What is the supported precision for gemm in oneMKL open source project ? #310

Open

AidanBeltonS added 13 commits April 5, 2024 10:28

Add new interface

8259fec

Add new dtype testing for gemm_batch

a860a3f

Add new gemm_batch dtypes to cuBlas

4bd07ab

Add new gemm_batch dtypes to rocBlas

d6ea4fd

Add new gemm_batch dtypes to mklcpu/gpu

bd97015

Add gemm_batch dtypes (unimplemented)

ef18cba

Add gemm_batch dtypes to netlib (unimplemented)

78368ca

Fix typo

f11e36c

Fix spelling

9ba1a51

Change naming convention

1629a36

Update tests

a8a26ba

Add more descriptive throw

f7d5ae8

Clang-foramt

20a7057

AidanBeltonS force-pushed the add_new_batch_gemm_types branch from e436f1c to 20a7057 Compare April 8, 2024 15:38

Undo mistake

798dfe5

Fix allocator msitake

e2c72e0

Add check matrix instantiation

f00a46a

andrewtbarker reviewed Apr 12, 2024


AidanBeltonS added 3 commits May 13, 2024 16:32

Add src/include to rocBlas include path

e4a2549

Increase tolerancing for int8_t inputs

65969a2

Change test names

7e915c3

Rbiessy reviewed May 13, 2024


Rbiessy mentioned this pull request May 14, 2024

[BLAS] Netlib backend fails to compile with AdaptiveCpp #485

Open

andrewtbarker approved these changes May 20, 2024
