gemm throws exception on PVC #308
Comments
The (base) bbrock@sdp125071:~/src/issues/oneMKL_gemm$ ./gemm_usm
########################################################################
# General Matrix-Matrix Multiplication using Unified Shared Memory Example:
#
# C = alpha * A * B + beta * C
#
# where A, B and C are general dense matrices and alpha, beta are
# floating point type precision scalars.
#
# Using apis:
# gemm
#
# Supported floating point type precisions:
# float
# double
#
########################################################################
Running tests on GPU.
Running with single precision real data type:
Caught synchronous SYCL exception during GEMM:
Level-Zero error:700000041879048196
On device: 'Intel(R) Graphics [0x0bd5]'
in kernel: oneapi::mkl::blas::sgemm_incopy
OpenCL status: 1
GEMM parameters:
transA = trans, transB = nontrans
m = 45, n = 98, k = 67
lda = 103, ldB = 105, ldC = 106
alpha = 2, beta = 3
Outputting 2x2 block of A,B,C matrices:
A = [ 0.340188, 0.260249, ...
[ -0.105617, 0.0125354, ...
[ ...
B = [ -0.326421, -0.192968, ...
[ 0.363891, 0.251295, ...
[ ...
C = [ 0.400017, 0.310497, ...
[ 0.00257462, -0.0560381, ...
[ ...
# Identical errors are thrown for double and complex as well
I've added the example to my minimal reproducer tarball here: oneMKL_gemm_example.tar.gz |
This is actually running fine on Borealis, so I think this might be a configuration issue with ORTCE. I will get in touch with the people who run the cluster. |
@BenBrock Thanks for the logs and the update on Borealis. The error you see typically occurs when oneMKL cannot detect the GPU architecture (PVC) and falls back to an alternative code path, which is not functional on PVC. That explains why you see the issue only on a specific machine. As you mentioned, this is probably a configuration issue on ORTCE. Please let us know what you find. |
@mmeterel Could you elaborate on what kind of misconfiguration causes this? I'm working on making oneAPI.jl support PVC hardware, however we're seeing a similar issue:
As you can see, this code is being called from a oneMKL wrapper library (
We're probably doing something wrong here, because the MWE provided above works fine when using the system MKL (from oneAPI 2024.0, same as what we use for building |
@maleadt It is hard to tell what is going wrong from the logs you sent. Can you please clarify your last paragraph? In your working configuration, are you using DPCPP compiler and oneMKL bits from the same 2024.0 base tool kit release? If yes, what is different in your non-working version? (Compiler? oneMKL?) Also, what is the driver version you are using? (You can share the results of |
Hi @maleadt - thanks for your work on oneAPI.jl! The Intel oneMKL product currently requires the OpenCL GPU runtime even when the Level-Zero backend is used. Could you please install it and see if that resolves the issue? |
I'm using the tools and libraries that are provisioned by the image on IDC, which according to the website seems to be: Ubuntu 22.04 LTS (Jammy Jellyfish) v20240129, oneAPI base kit 2024.0.1, oneAPI HPC kit 2024.0.1 and oneAPI render kit 2024.0.0
I'm using 2024.0.0 from Conda for my wrapper library. That library however isn't built on-device, it's built on a buildbot, and redistributed together with the necessary MKL/SYCL/OpenCL dependencies.
We already redistribute the things that our MKL wrapper library depends on, including libopencl, see https://github.com/JuliaPackaging/Yggdrasil/blob/77c11e9e797db54e68a8cfd83eb9b0d38830e80f/O/oneAPI_Support/build_tarballs.jl#L116-L119. This has been working perfectly on other architectures, except PVC. We aim for the redistributable wrapper library to be fully stand-alone, so that users don't have to install anything to get oneAPI.jl to work. |
Adding @mkrainiuk to this discussion as she is more familiar with the distribution of oneMKL (interfaces) |
Here you are: https://gist.github.com/maleadt/55d9069b5c63e381858dbe64d9f690d3. At first sight, everything looks OK there, and all oneMKL-related resources are loaded from the artifacts directory (i.e. there's no pollution by system libraries). |
There is 'calling init' on the C++ side of the following library that doesn't exist on the Julia side: 128030: calling init: /lib/x86_64-linux-gnu/libze_intel_gpu.so.1. Could it be the problem? |
Turns out the issue was with my That said, this error as reported before is inscrutable and should be improved to something actionable. |
@maleadt Thanks for the update, and glad to see you found the problem. I should have thought about suggesting IMHO, when the right OpenCL library is not used on the user side, oneMKL-GEMM could still give correct functionality but issue a warning about low performance. Does that sound reasonable? |
Yes, that sounds great. Even a fatal error would be a good option, as long as it comes with an error message that would help diagnose the issue ( |
I would vote for correct functionality + warning. :) |
Summary
I'm trying to use gemm on PVC, but it keeps throwing an exception. Please let me know where I'm going wrong.
I am attempting to use gemm and execute on a 4oam PVC system on ORTCE. I am getting an exception thrown with both production icpx and with the most recent version of intel/llvm, both compiled with production oneMKL.
A minimal reproducer is attached below.
This throws the following exception:
As far as I can tell, I am allocating enough memory, and all of the pointers I'm passing in are USM device pointers, which should be accessible on the device associated with the queue passed to oneMKL.
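For readers who want to double-check the "enough memory" part, here is a minimal host-side sketch of the usual allocation sizes for a GEMM call. It assumes column-major storage (an assumption; the thread does not state the layout), and `gemm_alloc_sizes`, `GemmSizes` and `alloc_elems` are hypothetical helper names for illustration, not part of oneMKL.

```cpp
#include <cstddef>

// Hypothetical helper (not a oneMKL API): element count conventionally
// allocated for a column-major matrix stored as rows x cols with leading
// dimension ld. Returns 0 when ld is smaller than the stored row count,
// which would make the layout invalid.
std::size_t alloc_elems(std::size_t rows, std::size_t cols, std::size_t ld) {
    if (rows == 0 || cols == 0 || ld < rows) return 0;
    return ld * cols;
}

struct GemmSizes {
    std::size_t a;  // elements to allocate for A
    std::size_t b;  // elements to allocate for B
    std::size_t c;  // elements to allocate for C
};

// C (m x n) = alpha * op(A) (m x k) * op(B) (k x n) + beta * C.
// Assumes column-major storage: a transposed operand is stored with its
// dimensions swapped, so the leading dimension bounds the other extent.
GemmSizes gemm_alloc_sizes(bool transA, bool transB,
                           std::size_t m, std::size_t n, std::size_t k,
                           std::size_t lda, std::size_t ldb, std::size_t ldc) {
    GemmSizes s;
    s.a = transA ? alloc_elems(k, m, lda) : alloc_elems(m, k, lda);
    s.b = transB ? alloc_elems(n, k, ldb) : alloc_elems(k, n, ldb);
    s.c = alloc_elems(m, n, ldc);
    return s;
}
```

With the parameters from the log in this thread (transA = trans, transB = nontrans, m = 45, n = 98, k = 67, lda = 103, ldB = 105, ldC = 106), calling `gemm_alloc_sizes(true, false, 45, 98, 67, 103, 105, 106)` gives 4635, 10290 and 10388 elements for A, B and C. Under the column-major assumption those leading dimensions are all valid, which is consistent with the failure being an environment problem rather than an allocation bug.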
Version
I am using production oneMKL 2023.1.0.
Environment
I am running this on a machine with four PVC GPUs.
I am using production oneMKL 2023.1.0.
I am getting this error with both the most recent commit of intel/llvm and with production icpx.
(base) bbrock@sdp125071:~/src/distributed-ranges/examples/shp$ icpx --version
Intel(R) oneAPI DPC++/C++ Compiler 2023.1.0 (2023.1.0.20230320)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/intel/oneapi/compiler/2023.1.0/linux/bin-llvm
Configuration file: /opt/intel/oneapi/compiler/2023.1.0/linux/bin-llvm/../bin/icpx.cfg
Steps to reproduce
Observed behavior
Throws an exception as above.
Expected behavior
I expect the kernel to execute successfully.
oneMKL_gemm.tar.gz