
MatrixMultiplication example ERROR : clBuildProgram -> Returned: -11 #397

Open
SirYwell opened this issue Apr 29, 2024 · 14 comments
Assignees
Labels
bug Something isn't working

Comments

@SirYwell
Contributor

Describe the bug


I'm getting the following error:

[TornadoVM-OCL-JNI] ERROR : clBuildProgram -> Returned: -11
uk.ac.manchester.tornado.api.exceptions.TornadoRuntimeException: [ERROR] Generated Kernel is NULL. 

when trying to run the MatrixMultiplication example using Docker.

My current assumption is that this is caused by my GPU not supporting FP16 (from my understanding of the clinfo output here: https://gist.github.com/SirYwell/bdc347db5c4b5f66e2c664666fb0313f), while TornadoVM unconditionally emits #pragma OPENCL EXTENSION cl_khr_fp16 : enable in the generated kernel.

As I haven't yet figured out how I can build and run TornadoVM on NixOS directly, I can't test if changing that already fixes the problem.

How To Reproduce

After cloning the docker-tornado repository, I'm running ./run_nvidia_openjdk.sh tornado -cp example/target/example-1.0-SNAPSHOT.jar example.MatrixMultiplication --fullDebug -pk

The output can be found here: https://gist.github.com/SirYwell/d9ae4b5393de135ec15429c54d031820

Note: I made a few adjustments to make things work:
pom.xml: I changed the version from tornado-api and tornado-matrices from 1.0.3-dev to 1.0.3.
MatrixMultiplication.java: I changed WARMING_UP_ITERATIONS from 100 to 1 to reduce output.
run_nvidia_openjdk.sh: I changed #!/bin/bash to #!/usr/bin/env bash to make it work on NixOS. I also added the --gpus all flag to the docker command.

Expected behavior

Expecting the example to run.

Computing system setup (please complete the following information):

Output of ./run_nvidia_openjdk.sh tornado --devices:

./run_nvidia_openjdk.sh tornado --devices
WARNING: Using incubator modules: jdk.incubator.vector

Number of Tornado drivers: 1
Driver: OpenCL
  Total number of OpenCL devices  : 1
  Tornado device=0:0  (DEFAULT)
        OPENCL --  [NVIDIA CUDA] -- NVIDIA GeForce RTX 2070 SUPER
                Global Memory Size: 7.8 GB
                Local Memory Size: 48.0 KB
                Workgroup Dimensions: 3
                Total Number of Block Threads: [1024]
                Max WorkGroup Configuration: [1024, 1024, 64]
                Device OpenCL C version: OpenCL C 1.2

Output of ./run_nvidia_openjdk.sh tornado --version:

version=1.0.3
branch=master
commit=02cb3f7

Backends installed: 
         - opencl
OS: NixOS 23.11, but running via Docker as provided

Additional context

I'm using the nvidia-container-toolkit as described in beehive-lab/docker-tornadovm#8, but I don't think it is related to this issue.


@jjfumero
Member

Ok, sorry I was part of it, since I jumped first to the docker-tornado repo. I will take a look and let you know.

@jjfumero jjfumero self-assigned this Apr 29, 2024
@jjfumero jjfumero added the bug Something isn't working label Apr 29, 2024
@jjfumero
Member

jjfumero commented Apr 29, 2024

My take here is that it is related to the thread-block size being used.

Task info: s0.t0
	Backend           : OPENCL
	Device            : NVIDIA GeForce RTX 3070 CL_DEVICE_TYPE_GPU (available)
	Dims              : 2
	Global work offset: [0, 0]
	Global work size  : [512, 512]
	Local  work size  : [32, 32, 1]
	Number of workgroups  : [16, 16]

[TornadoVM-OCL-JNI] ERROR : [JNI] uk.ac.manchester.tornado.drivers.opencl> notify error:
clEnqueueNDRangeKernel -> Returned: [JNI] uk.ac.manchester.tornado.drivers.opencl> CL_OUT_OF_RESOURCES error executing CL_COMMAND_NDRANGE_KERNEL on NVIDIA GeForce RTX 3070 (Device 0).

Tornado does not fail during compilation, but during the kernel dispatch.
See related issue (thread conversation): #356

If I select blocks of 16x16, then I get correct executions:

Task info: s0.t0
	Backend           : OPENCL
	Device            : NVIDIA GeForce RTX 3070 CL_DEVICE_TYPE_GPU (available)
	Dims              : 2
	Global work offset: [0, 0, 0]
	Global work size  : [512, 512, 1]
	Local  work size  : [16, 16, 1]
	Number of workgroups  : [32, 32, 1]

The diff:

diff --git a/example/pom.xml b/example/pom.xml
index 30c2487..785dc2d 100644
--- a/example/pom.xml
+++ b/example/pom.xml
@@ -29,12 +29,12 @@
     <dependency>
       <groupId>tornado</groupId>
       <artifactId>tornado-api</artifactId>
-      <version>1.0.3-dev</version>
+      <version>1.0.4-dev</version>
     </dependency>
     <dependency>
       <groupId>tornado</groupId>
       <artifactId>tornado-matrices</artifactId>
-      <version>1.0.3-dev</version>
+      <version>1.0.4-dev</version>
     </dependency>
     <dependency>
       <groupId>junit</groupId>
diff --git a/example/src/main/java/example/MatrixMultiplication.java b/example/src/main/java/example/MatrixMultiplication.java
index a37b0d6..d97439e 100644
--- a/example/src/main/java/example/MatrixMultiplication.java
+++ b/example/src/main/java/example/MatrixMultiplication.java
@@ -19,6 +19,7 @@ package example;
 import java.util.Random;
 import java.util.stream.IntStream;
 
+import uk.ac.manchester.tornado.api.*;
 import uk.ac.manchester.tornado.api.ImmutableTaskGraph;
 import uk.ac.manchester.tornado.api.TaskGraph;
 import uk.ac.manchester.tornado.api.TornadoExecutionPlan;
@@ -88,7 +89,12 @@ public class MatrixMultiplication {
 
         ImmutableTaskGraph immutableTaskGraph = taskGraph.snapshot();
         TornadoExecutionPlan executor = new TornadoExecutionPlan(immutableTaskGraph);
-        executor.withWarmUp();
+
+        WorkerGrid workerGrid = new WorkerGrid2D(matrixA.getNumRows(), matrixA.getNumColumns());
+        GridScheduler gridScheduler = new GridScheduler("s0.t0", workerGrid);
+        workerGrid.setLocalWork(16, 16, 1);
+
+        executor.withGridScheduler(gridScheduler).withWarmUp();
 
         // 1. Warm up Tornado
         for (int i = 0; i < WARMING_UP_ITERATIONS; i++) {
diff --git a/example/target/example-1.0-SNAPSHOT.jar b/example/target/example-1.0-SNAPSHOT.jar
index df83b80..9fff7be 100644
Binary files a/example/target/example-1.0-SNAPSHOT.jar and b/example/target/example-1.0-SNAPSHOT.jar differ

Funny, because on my native system I can run with blocks of 32x32, so this is a driver issue within the Docker image.

@jjfumero
Member

I pushed the "fix" in the tornado-docker repo: beehive-lab/docker-tornadovm@cb5f48c

@jjfumero
Member

For reference:

./run_nvidia_openjdk.sh tornado -cp example/target/example-1.0-SNAPSHOT.jar example.MatrixMultiplication 1024
WARNING: Using incubator modules: jdk.incubator.vector
Computing MxM of 1024x1024
[TornadoVM] Warning: The loop bounds will be configured by the GridScheduler. Check the grid by using the flag --threadInfo.
	Single Threaded CPU Execution: 0.92 GFlops, Total time = 2326 ms
	Streams Execution: 10.28 GFlops, Total time = 209 ms
	TornadoVM Execution on GPU (Accelerated): 1073.74 GFlops, Total Time = 2 ms
	Speedup: 1163.0x
	Verification true

I am using a new docker image for a new TornadoVM version coming up tomorrow ;-)

@SirYwell
Contributor Author

Thanks for looking into it so quickly. I'll try it out tomorrow with the new version and report back!

@SirYwell
Contributor Author

Hi @jjfumero, I tried with the new docker image, but I'm still getting the same error.
Running ./run_nvidia_openjdk.sh tornado --version now outputs

version=1.0.4
branch=master
commit=585574e

Backends installed: 
         - opencl

so I assume it's running the correct version.

Is there any way I can get more information out of OpenCL? Or is there any other information I can provide that would help you? It's also not completely unrealistic that something is wrong due to NixOS, but my understanding of how these things interact isn't deep enough here. (I'm currently trying to build TornadoVM from source manually, but it fails due to cmake not finding the jni/jawt header files, which I also don't understand, as they are present.)

@jjfumero
Member

The TornadoVM automatic installer should bring in all necessary dependencies (Java, cmake, maven, etc.).

I do not know if there is anything specific to NixOS, but to install TornadoVM from source, can you try the following:

Assuming you want the OpenCL backend and you have the NVIDIA Driver installed for your NVIDIA GPU:

$ ./bin/tornadovm-installer  --jdk jdk21 --backend=opencl

@stratika
Collaborator

On my system (Ubuntu 23.10), I tested the script that failed for you, and it works:

./run_nvidia_openjdk.sh tornado -cp example/target/example-1.0-SNAPSHOT.jar example.MatrixMultiplication --fullDebug -pk
WARNING: Using incubator modules: jdk.incubator.vector
Computing MxM of 512x512
[INFO] Loading Backend: uk.ac.manchester.tornado.drivers.opencl.OCLTornadoDriverProvider@52aa2946
TornadoGraph dependency matrix...
+----+---------------+
|  5 [data]| <none>
|----+---------------+
|  6 [data]| <none>
|----+---------------+
|  7 [data]| <none>
|----+---------------+
|  8 [data]| <none>
|----+---------------+
|  9 [data]| 10
|----+---------------+
| 10 [task]| 6 7 8
|----+---------------+
| 11 [data]| 10
|----+---------------+
| 12 [data]| 11
|----+---------------+
| 13 [data]| 11
|----+---------------+
| 14 [data]| 11
|----+---------------+

-----------------------------------
Device Table:
[0]:  [NVIDIA CUDA] -- NVIDIA RTX A2000 8GB Laptop GPU
Constant Table:
[0]: 512
Object Table:
[0]: 0x16f7c8c1 MatrixFloat <512 x 512>
[1]: 0x573906eb MatrixFloat <512 x 512>
[2]: 0x4ebff610 MatrixFloat <512 x 512>
Task Table:
[0]: task s0.t0 - matrixMultiplication
-----------------------------------
-----------------------------------
TaskGraph:
[0]: constant 0
[1]: object 0
[2]: object 1
[3]: object 2
[4]: context device=0, [ 5 6 7 8 10 11 12 13 14 ]
[5]: persist node
[6]: copy in object 0
[7]: copy in object 1
[8]: copy in object 2
[9]: dependent write on object 2 by task 10
[10]: task=0, args=[ 6 7 8 0 ]
[11]: copy out object 2 after task 10
[12]: deallocate object 1 after 11
[13]: deallocate object 2 after 11
[14]: deallocate object 3 after 11
-----------------------------------
[TornadoVM] Warning: The loop bounds will be configured by the GridScheduler. Check the grid by using the flag --threadInfo.
#pragma OPENCL EXTENSION cl_khr_fp64 : enable  
#pragma OPENCL EXTENSION cl_khr_fp16 : enable  
#pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable  
__kernel void matrixMultiplication(__global long *_kernel_context, __constant uchar *_constant_region, __local uchar *_local_region, __global int *_atomics, __global uchar *A, __global uchar *B, __global uchar *C, __private int size)
{
  int i_3, i_34, i_28, i_9, i_41, i_8, i_11, i_10, i_42, i_37, i_4, i_17, i_13, i_12, i_15, i_14, i_27, i_20; 
  long l_39, l_29, l_30, l_21, l_22, l_38; 
  float f_24, f_16, f_32, f_33; 
  ulong ul_19, ul_18, ul_23, ul_25, ul_26, ul_31, ul_1, ul_0, ul_35, ul_2, ul_5, ul_36, ul_7, ul_6, ul_40; 

  // BLOCK 0
  ul_0  =  (ulong) A;
  ul_1  =  (ulong) B;
  ul_2  =  (ulong) C;
  i_3  =  get_global_size(0);
  i_4  =  get_global_size(1);
  ul_5  =  ul_2 + 32L;
  ul_6  =  ul_1 + 32L;
  ul_7  =  ul_0 + 32L;
  i_8  =  get_global_id(0);
  i_9  =  get_global_id(1);
  i_10  =  _kernel_context[0];
  // BLOCK 1 MERGES [0 8 ]
  i_11  =  i_9;
  for(;i_11 < i_10;)
  {
    // BLOCK 2
    i_12  =  i_11 << 9;
    i_13  =  i_12 + 6;
    // BLOCK 3 MERGES [2 7 ]
    i_14  =  i_8;
    for(;i_14 < i_10;)
    {
      // BLOCK 4
      i_15  =  i_14 + 6;
      // BLOCK 5 MERGES [4 6 ]
      f_16  =  0.0F;
      i_17  =  0;
      for(;i_17 < i_10;)
      {
        // BLOCK 6
        ul_18  =  *((__global ulong *) ul_7);
        ul_19  =  ul_0 + ul_18;
        i_20  =  i_13 + i_17;
        l_21  =  (long) i_20;
        l_22  =  l_21 << 2;
        ul_23  =  ul_19 + l_22;
        f_24  =  *((__global float *) ul_23);
        ul_25  =  *((__global ulong *) ul_6);
        ul_26  =  ul_1 + ul_25;
        i_27  =  i_17 << 9;
        i_28  =  i_27 + i_15;
        l_29  =  (long) i_28;
        l_30  =  l_29 << 2;
        ul_31  =  ul_26 + l_30;
        f_32  =  *((__global float *) ul_31);
        f_33  =  fma(f_24, f_32, f_16);
        i_34  =  i_17 + 1;
        f_16  =  f_33;
        i_17  =  i_34;
      }  // B6
      // BLOCK 7
      ul_35  =  *((__global ulong *) ul_5);
      ul_36  =  ul_2 + ul_35;
      i_37  =  i_14 + i_13;
      l_38  =  (long) i_37;
      l_39  =  l_38 << 2;
      ul_40  =  ul_36 + l_39;
      *((__global float *) ul_40)  =  f_16;
      i_41  =  i_3 + i_14;
      i_14  =  i_41;
    }  // B7
    // BLOCK 8
    i_42  =  i_4 + i_11;
    i_11  =  i_42;
  }  // B8
  // BLOCK 9
  return;
}  //  kernel

	Single Threaded CPU Execution: 1.64 GFlops, Total time = 164 ms
	Streams Execution: 14.91 GFlops, Total time = 18 ms
	TornadoVM Execution on GPU (Accelerated): 268.44 GFlops, Total Time = 1 ms
	Speedup: 164.0x
	Verification true
cleanup: programs  ..........0.000256351 s
cleanup: context   ..........0.000006982 s
cleanup: total     ..........0.000263333 s
./run_nvidia_openjdk.sh tornado --version
version=1.0.4
branch=master
commit=585574e

Backends installed: 
	 - opencl

I share @jjfumero's opinion: it will be easier to understand the problem on your platform if you can install TornadoVM locally and run the example without Docker. Do you have any modified source code?

@SirYwell
Contributor Author

I do not know if there is something specific for NixOS

Sadly there is: dynamically linked binaries typically cannot be executed without patching. That's why I tried the manual installation; I guess I have to figure out what's up with it not finding the JNI headers... I'll let you know when I get it to run.

I'm just confused that it doesn't work in the docker container, especially as it finds the device.

@stratika
Collaborator

I do not know if there is something specific for NixOS

Sadly there is: dynamically linked binaries typically cannot be executed without patching. That's why I tried the manual installation; I guess I have to figure out what's up with it not finding the JNI headers... I'll let you know when I get it to run.

I'm just confused that it doesn't work in the docker container, especially as it finds the device.

It seems that the problem is in the driver when it compiles the kernel generated by TornadoVM. clBuildProgram returns CL_BUILD_PROGRAM_FAILURE (-11) when it fails to build the program executable from the kernel source. My guess would be that the driver inside the container does not communicate correctly with the underlying driver on your local system. This container is built on the nvidia/opencl image.

I would suggest you try one of the polyglot images, if possible. Those images are bigger, but I installed the NVIDIA OpenCL driver in the container image manually; for example, the tornadovm-polyglot-graalpy image. Also, this container is built from a commit prior to the latest release, but it may help to see whether the driver in the container fails.

@SirYwell
Contributor Author

I would suggest you try one of the polyglot images, if possible. Those images are bigger, but I installed the NVIDIA OpenCL driver in the container image manually; for example, the tornadovm-polyglot-graalpy image. Also, this container is built from a commit prior to the latest release, but it may help to see whether the driver in the container fails.

Thanks, I gave that a try and ran ./polyglotImages/polyglot-graalpy/tornadovm-polyglot-nvidia.sh tornado --printKernel --truffle python example/polyglot-examples/mxmWithTornadoVM.py. Sadly, this results in the same error as before. (PS: it looks like the docs are outdated; they reference a tornadovm-polyglot.sh file instead.)

@jjfumero
Member

jjfumero commented May 1, 2024

Going back to the initial problem:

[TornadoVM-OCL-JNI] ERROR : clBuildProgram -> Returned: -11

This is different from the error I encountered. To debug your error, I suggest two things:

  1. Run with --debug in TornadoVM to obtain the clBuildProgram error messages.
  2. Take the kernel that TornadoVM generates and compile it on your own system. You can build a C++ program that just builds the OpenCL kernel and obtains the build info. You can take this program as inspiration: https://github.com/jjfumero/scripts/tree/master/opencl/compileKernel
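A minimal sketch of suggestion 2 in plain C (this is not the linked program; the kernel file name and the minimal error handling are illustrative). The part that matters here: when clBuildProgram returns -11 (CL_BUILD_PROGRAM_FAILURE), clGetProgramBuildInfo with CL_PROGRAM_BUILD_LOG retrieves the compiler's actual error messages:

```c
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

int main(void) {
    /* Pick the first platform/GPU; a real tool would let you choose. */
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);

    /* Read the kernel dumped by TornadoVM with -pk (file name illustrative). */
    FILE *f = fopen("kernel.cl", "rb");
    if (!f) { perror("kernel.cl"); return 1; }
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    fseek(f, 0, SEEK_SET);
    char *src = malloc(size + 1);
    fread(src, 1, size, f);
    src[size] = '\0';
    fclose(f);

    cl_program prog =
        clCreateProgramWithSource(ctx, 1, (const char **)&src, NULL, NULL);
    cl_int status = clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    printf("clBuildProgram returned: %d\n", status);

    if (status != CL_SUCCESS) {
        /* This is the information hidden behind the bare -11: the build log. */
        size_t logSize;
        clGetProgramBuildInfo(prog, device, CL_PROGRAM_BUILD_LOG,
                              0, NULL, &logSize);
        char *log = malloc(logSize + 1);
        clGetProgramBuildInfo(prog, device, CL_PROGRAM_BUILD_LOG,
                              logSize, log, NULL);
        log[logSize] = '\0';
        printf("Build log:\n%s\n", log);
        free(log);
    }
    free(src);
    return status == CL_SUCCESS ? 0 : 2;
}
```

Compile with something like gcc checkBuild.c -lOpenCL and run it inside the container against the dumped kernel to see the NVIDIA compiler's diagnostics rather than just the -11 return code.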

@SirYwell
Contributor Author

SirYwell commented May 1, 2024

Thanks for the pointers. https://gist.github.com/SirYwell/d9ae4b5393de135ec15429c54d031820 already contains the output from running with --fullDebug (but only 1 iteration instead of 100).

Suggestion 2 seems to work: I can compile both the example kernel and the one from the MatrixMultiplication output. The compileKernel program crashes, but from my debugging, that only happens after kernelBin.bin has already been written (after the output of https://github.com/jjfumero/scripts/blob/c8e52c3e83bb7db529ab11f9ae1d61e738792d8d/opencl/compileKernel/compileKernel.cpp#L302 ).
