`gemm` throws exception on PVC #308

BenBrock · 2023-04-26T23:41:39Z

Summary

I'm trying to use gemm on PVC, but it keeps throwing an exception. Please let me know where I'm going wrong.

I am attempting to use gemm and execute on a 4oam PVC system on ORTCE. I am getting an exception thrown with both production icpx and with the most recent version of intel/llvm, both compiled with production oneMKL.

A minimal reproducer is attached below.

  sycl::queue q(sycl::default_selector_v);

  T* a_d = sycl::malloc_device<T>(m * k, q);
  T* b_d = sycl::malloc_device<T>(k * n, q);
  T* c_d = sycl::malloc_device<T>(m * n, q);
  
  std::vector<T> a_l(m*k);
  std::vector<T> b_l(k*n);
  std::vector<T> c_l(m*n, 0);

  for (std::size_t i = 0; i < m*k; i++) {
    a_l[i] = drand48();
  }

  for (std::size_t i = 0; i < k*n; i++) {
    b_l[i] = drand48();
  }

  q.memcpy(a_d, a_l.data(), m*k*sizeof(T)).wait();
  q.memcpy(b_d, b_l.data(), k*n*sizeof(T)).wait();
  q.memcpy(c_d, c_l.data(), m*n*sizeof(T)).wait();

  std::cout << "Running MKL gemm..." << std::endl;

  auto event = oneapi::mkl::blas::row_major::gemm(q,
    oneapi::mkl::transpose::nontrans,
    oneapi::mkl::transpose::nontrans,
    m, n, k,
    T(1),
    a_d, k,
    b_d, n,
    T(1),
    c_d, n);
  event.wait();

This throws the following exception:

(base) bbrock@sdp4452:~/src/issues/oneMKL_gemm$ ./gemm
Running MKL gemm...
terminate called after throwing an instance of 'sycl::_V1::exception'
  what():  Level-Zero error:700000041879048196
On device: 'Intel(R) Graphics [0x0bd5]'
in kernel: oneapi::mkl::blas::sgemm_itcopy
Aborted (core dumped)

As far as I can tell, I am allocating enough memory, and all of the pointers I'm passing in are USM device pointers, which should be accessible on the device associated with the queue passed to oneMKL.

Version

I am using production oneMKL 2023.1.0.

Environment

I am running this on a machine with four PVC GPUs.

(base) bbrock@sdp125071:~/src/distributed-ranges/examples/shp$ sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.15.3.0.20_160000]
[opencl:cpu:1] Intel(R) OpenCL, Intel (R) Xeon (R) CPU Max 9480 OpenCL 3.0 (Build 0) [2023.15.3.0.20_160000]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Graphics [0x0bd5] 1.3 [1.3.24595]
[ext_oneapi_level_zero:gpu:1] Intel(R) Level-Zero, Intel(R) Graphics [0x0bd5] 1.3 [1.3.24595]
[ext_oneapi_level_zero:gpu:2] Intel(R) Level-Zero, Intel(R) Graphics [0x0bd5] 1.3 [1.3.24595]
[ext_oneapi_level_zero:gpu:3] Intel(R) Level-Zero, Intel(R) Graphics [0x0bd5] 1.3 [1.3.24595]

I am using production oneMKL 2023.1.0.

I am getting this error with both the most recent commit of intel/llvm and with production icpx.

(base) bbrock@sdp125071:~/src/distributed-ranges/examples/shp$ icpx --version
Intel(R) oneAPI DPC++/C++ Compiler 2023.1.0 (2023.1.0.20230320)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/intel/oneapi/compiler/2023.1.0/linux/bin-llvm
Configuration file: /opt/intel/oneapi/compiler/2023.1.0/linux/bin-llvm/../bin/icpx.cfg

Steps to reproduce

(base) bbrock@sdp125071:~/src/issues/oneMKL_gemm$ ./gemm
MESA: warning: Driver does not support the 0xbd5 PCI ID.
MESA: warning: Driver does not support the 0xbd5 PCI ID.
MESA: warning: Driver does not support the 0xbd5 PCI ID.
MESA: warning: Driver does not support the 0xbd5 PCI ID.
Running MKL gemm...
terminate called after throwing an instance of 'sycl::_V1::exception'
  what():  Level-Zero error:700000041879048196
On device: 'Intel(R) Graphics [0x0bd5]'
in kernel: oneapi::mkl::blas::sgemm_itcopy
Aborted (core dumped)

Observed behavior

Throws an exception as above.

Expected behavior

I expect the kernel to execute successfully.

oneMKL_gemm.tar.gz

The text was updated successfully, but these errors were encountered:

BenBrock · 2023-04-27T17:43:34Z

The gemm_usm example included with production oneMKL also throws the same error on PVC.

(base) bbrock@sdp125071:~/src/issues/oneMKL_gemm$ ./gemm_usm

########################################################################
# General Matrix-Matrix Multiplication using Unified Shared Memory Example: 
# 
# C = alpha * A * B + beta * C
# 
# where A, B and C are general dense matrices and alpha, beta are
# floating point type precision scalars.
# 
# Using apis:
#   gemm
# 
# Supported floating point type precisions:
#   float
#   double
# 
########################################################################

Running tests on GPU.
	Running with single precision real data type:
		Caught synchronous SYCL exception during GEMM:
Level-Zero error:700000041879048196
On device: 'Intel(R) Graphics [0x0bd5]'
in kernel: oneapi::mkl::blas::sgemm_incopy
OpenCL status: 1

		GEMM parameters:
			transA = trans, transB = nontrans
			m = 45, n = 98, k = 67
			lda = 103, ldB = 105, ldC = 106
			alpha = 2, beta = 3

		Outputting 2x2 block of A,B,C matrices:

			A = [ 0.340188, 0.260249, ...
			    [ -0.105617, 0.0125354, ...
			    [ ...


			B = [ -0.326421, -0.192968, ...
			    [ 0.363891, 0.251295, ...
			    [ ...


			C = [ 0.400017, 0.310497, ...
			    [ 0.00257462, -0.0560381, ...
			    [ ...

# Identical errors are thrown for double and complex as well

I've added the example to my minimal reproducer tarball here: oneMKL_gemm_example.tar.gz

BenBrock · 2023-04-27T18:17:19Z

This is actually running fine on Borealis, so I think this might be a configuration issue with ORTCE. I will get in touch with the people who run the cluster.

mmeterel · 2023-05-04T15:50:15Z

@BenBrock Thanks for the logs and update on Borealis. The error you see typically occurs when oneMKL can not detect the GPU architecture (PVC) and uses an alternative code path - which is not functional on PVC. So, that explains why you see the issue on specific machine. As you mentioned, this is probably a configuration issue on ORTCE. Please let us know what you find.

maleadt · 2024-03-29T16:11:37Z

@mmeterel Could you elaborate on what kind of misconfiguration causes this? I'm working on making oneAPI.jl support PVC hardware, however we're seeing a similar issue:

terminate called after throwing an instance of 'sycl::_V1::exception'
what():  Level-Zero error:700000041879048196
On device: 'Intel(R) Data Center GPU Max 1550'
in kernel: oneapi::mkl::blas::sgemm_itcopy
      From worker 16:
[85716] signal (6.-6): Aborted
in expression starting at none:1
pthread_kill at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
raise at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
abort at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
__verbose_terminate_handler at /workspace/srcdir/gcc-13.2.0/libstdc++-v3/libsupc++/vterminate.cc:95
__terminate at /workspace/srcdir/gcc-13.2.0/libstdc++-v3/libsupc++/eh_terminate.cc:48
terminate at /workspace/srcdir/gcc-13.2.0/libstdc++-v3/libsupc++/eh_terminate.cc:58
__cxa_throw at /workspace/srcdir/gcc-13.2.0/libstdc++-v3/libsupc++/eh_throw.cc:98
_ZN6oneapi3mkl3gpu13build_programEPiPN4sycl3_V15queueEPvS7_iPKcS9_mcS9_Pb at /home/sdp/.julia/scratchspaces/8f75cd03-7ff8-4ecb-9b8f-daf728133b1b/conda/lib/libmkl_sycl_blas.so.4 (unknown line)
_ZN6oneapi3mkl3gpuL22mkl_gpu_get_kernel_extEPiPN4sycl3_V15queueEiPKcS8_mcS8_S8_S8_mPKvmbb at /home/sdp/.julia/scratchspaces/8f75cd03-7ff8-4ecb-9b8f-daf728133b1b/conda/lib/libmkl_sycl_blas.so.4 (unknown line)
_ZN6oneapi3mkl3gpu24mkl_gpu_get_spirv_kernelEPiPN4sycl3_V15queueEiPK22mkl_gpu_spirv_kernel_tPKcSB_ at /home/sdp/.julia/scratchspaces/8f75cd03-7ff8-4ecb-9b8f-daf728133b1b/conda/lib/libmkl_sycl_blas.so.4 (unknown line)
_ZN6oneapi3mkl3gpu40mkl_blas_gpu_sgemm_copybased_driver_syclEPiPN4sycl3_V15queueEPNS1_14blas_arg_usm_tEP20mkl_gpu_event_list_t at /home/sdp/.julia/scratchspaces/8f75cd03-7ff8-4ecb-9b8f-daf728133b1b/conda/lib/libmkl_sycl_blas.so.4 (unknown line)
_ZN6oneapi3mkl3gpu30mkl_blas_gpu_sgemm_driver_syclEPiPN4sycl3_V15queueEPNS1_14blas_arg_usm_tEP20mkl_gpu_event_list_t at /home/sdp/.julia/scratchspaces/8f75cd03-7ff8-4ecb-9b8f-daf728133b1b/conda/lib/libmkl_sycl_blas.so.4 (unknown line)
_ZN6oneapi3mkl3gpu19sgemm_sycl_internalEPN4sycl3_V15queueE10MKL_LAYOUT13MKL_TRANSPOSES7_lllNS0_16value_or_pointerIfEEPKflSB_lS9_PflNS0_4blas12compute_modeERKSt6vectorINS3_5eventESaISG_EElll at /home/sdp/.julia/scratchspaces/8f75cd03-7ff8-4ecb-9b8f-daf728133b1b/conda/lib/libmkl_sycl_blas.so.4 (unknown line)
_ZN6oneapi3mkl3gpu10sgemm_syclEPN4sycl3_V15queueE10MKL_LAYOUT13MKL_TRANSPOSES7_lllNS0_16value_or_pointerIfEEPKflSB_lS9_PflNS0_4blas12compute_modeERKSt6vectorINS3_5eventESaISG_EElll at /home/sdp/.julia/scratchspaces/8f75cd03-7ff8-4ecb-9b8f-daf728133b1b/conda/lib/libmkl_sycl_blas.so.4 (unknown line)
_ZN6oneapi3mkl4blas5sgemmERN4sycl3_V15queueE10MKL_LAYOUTNS0_9transposeES7_lllNS0_16value_or_pointerIfEEPKflSB_lS9_PflNS1_12compute_modeERKSt6vectorINS3_5eventESaISF_EE at /home/sdp/.julia/scratchspaces/8f75cd03-7ff8-4ecb-9b8f-daf728133b1b/conda/lib/libmkl_sycl_blas.so.4 (unknown line)
_ZN6oneapi3mkl4blas12column_major4gemmERN4sycl3_V15queueENS0_9transposeES7_lllNS0_16value_or_pointerIfEEPKflSB_lS9_PflNS1_12compute_modeERKSt6vectorINS4_5eventESaISF_EE at /home/sdp/.julia/scratchspaces/8f75cd03-7ff8-4ecb-9b8f-daf728133b1b/conda/lib/libmkl_sycl_blas.so.4 (unknown line)
onemklSgemm at /home/sdp/.julia/scratchspaces/8f75cd03-7ff8-4ecb-9b8f-daf728133b1b/deps/lib/liboneapi_support.so (unknown line)

As you can see, this code is being called from a oneMKL wrapper library (liboneapi_support) to work around the lack of C API in oneMKL. We build and distribute this library ourselves, along with the required MKL and SYCL libraries, downloaded from Conda:

We're probably doing something wrong here, because the MWE provided above works fine when using the system MKL (from oneAPI 2024.0, same as what we use for building liboneapi_support). I'm doing this on IDC, using a Max 1550.

mmeterel · 2024-04-01T15:49:48Z

@maleadt It is hard to tell what is going wrong from the logs you sent. Can you please clarify your last paragraph? In your working configuration, are you using DPCPP compiler and oneMKL bits from the same 2024.0 base tool kit release? If yes, what is different in your non-working version? (Compiler? oneMKL?)

Also, what is the driver version you are using? (You can share the results of sycl-ls)

sknepper · 2024-04-01T16:40:32Z

Hi @maleadt - thanks for your work on oneAPI.jl! Intel oneMKL product currently requires the OpenCL GPU runtime even when the Level-Zero backend is used. Could you please install it and see if that resolves the issue?

maleadt · 2024-04-03T07:28:07Z

In your working configuration, are you using DPCPP compiler and oneMKL bits from the same 2024.0 base tool kit release?

I'm using the tools and libraries that are provisioned by the image on IDC, which according to the website seems to be: Ubuntu 22.04 LTS (Jammy Jellyfish) v20240129, oneAPI base kit 2024.0.1, oneAPI HPC kit 2024.0.1 and oneAPI render kit 2024.0.0

If yes, what is different in your non-working version? (Compiler? oneMKL?)

I'm using 2024.0.0 from Conda for my wrapper library. That library however isn't built on-device, it's built on a buildbot, and redistributed together with the necessary MKL/SYCL/OpenCL dependencies.

Intel oneMKL product currently requires the OpenCL GPU runtime even when the Level-Zero backend is used.

We already redistribute the things that our MKL wrapper library depends on, including libopencl, see https://github.com/JuliaPackaging/Yggdrasil/blob/77c11e9e797db54e68a8cfd83eb9b0d38830e80f/O/oneAPI_Support/build_tarballs.jl#L116-L119. This has been working perfectly on other architectures, except PVC. We aim for the redistributable wrapper library to be fully stand-alone, so that users don't have to install anything to get oneAPI.jl to work.

mmeterel · 2024-04-03T16:33:29Z

Adding @mkrainiuk to this discussion as she is more familiar with the distribution of oneMKL (interfaces)

pengtu · 2024-04-10T17:43:15Z

@maleadt : Can you run the program with LD_DEBUG=libs with both the failing and working versions, as the problem is likely due to different OpenCL library is invoked at runtime?

Also add @kballeda to the thread.

maleadt · 2024-04-10T18:37:11Z

Here you are: https://gist.github.com/maleadt/55d9069b5c63e381858dbe64d9f690d3. At first sight, everything looks OK there, and all oneMKL-related resources are loaded from the artifacts directory (i.e. there's no pollution by system libraries).

pengtu · 2024-04-10T20:34:25Z

There is 'calling init' on the C++ side of the following library that doesn't exist on the Julia side: 128030: calling init: /lib/x86_64-linux-gnu/libze_intel_gpu.so.1. Could it be the problem?

maleadt · 2024-04-11T06:54:16Z

libze_intel_gpu is there on the Julia side too, but it's loaded earlier (when oneAPI.jl loads):

calling init: /home/sdp/.julia/artifacts/2a52b1197a324e3df923175b2035c42899f069f2/lib/libze_intel_gpu.so.1

maleadt · 2024-04-15T08:31:08Z

Turns out the issue was with my libOpenCL.so, which I took from intel-opencl-rt on Conda, somehow did not support or detect my PVC hardware, even resulting in clinfo returning in 0 platforms. After switching to Khronos' ICD loader, MKL works fine.

That said, this error as reported before is inscrutable and should be improved to something actionable.

mmeterel · 2024-04-15T15:49:08Z

@maleadt Thanks for the update and glad to see you found the problem. I should have thought about suggesting clinfo check! Sorry about that.

IMHO, when the right openCL library is not used from user side, oneMKL-GEMM could still give correct functionality but issue a warning about low performance. Does it sound reasonable?

maleadt · 2024-04-15T17:46:26Z

oneMKL-GEMM could still give correct functionality but issue a warning about low performance. Does it sound reasonable?

Yes, that sounds great. Even a fatal error would be a good option, as long as it comes with an error message that would help diagnose the issue (No OpenCL device detected, or whatever).

mmeterel · 2024-04-15T20:37:22Z

I would vote for correct functionality + warning. :)
Is it ok to close this issue?

maleadt mentioned this issue Mar 29, 2024

PVC support JuliaGPU/oneAPI.jl#406

Closed

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`gemm` throws exception on PVC #308

`gemm` throws exception on PVC #308

BenBrock commented Apr 26, 2023 •

edited

BenBrock commented Apr 27, 2023

BenBrock commented Apr 27, 2023

mmeterel commented May 4, 2023

maleadt commented Mar 29, 2024

mmeterel commented Apr 1, 2024

sknepper commented Apr 1, 2024

maleadt commented Apr 3, 2024

mmeterel commented Apr 3, 2024

pengtu commented Apr 10, 2024

maleadt commented Apr 10, 2024

pengtu commented Apr 10, 2024

maleadt commented Apr 11, 2024

maleadt commented Apr 15, 2024

mmeterel commented Apr 15, 2024

maleadt commented Apr 15, 2024

mmeterel commented Apr 15, 2024

gemm throws exception on PVC #308

gemm throws exception on PVC #308

Comments

BenBrock commented Apr 26, 2023 • edited

Summary

Version

Environment

Steps to reproduce

Observed behavior

Expected behavior

BenBrock commented Apr 27, 2023

BenBrock commented Apr 27, 2023

mmeterel commented May 4, 2023

maleadt commented Mar 29, 2024

mmeterel commented Apr 1, 2024

sknepper commented Apr 1, 2024

maleadt commented Apr 3, 2024

mmeterel commented Apr 3, 2024

pengtu commented Apr 10, 2024

maleadt commented Apr 10, 2024

pengtu commented Apr 10, 2024

maleadt commented Apr 11, 2024

maleadt commented Apr 15, 2024

mmeterel commented Apr 15, 2024

maleadt commented Apr 15, 2024

mmeterel commented Apr 15, 2024

`gemm` throws exception on PVC #308

`gemm` throws exception on PVC #308

BenBrock commented Apr 26, 2023 •

edited