Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support for Cuda 12.1 #395

Open
nyck33 opened this issue Apr 3, 2024 · 11 comments
Open

support for Cuda 12.1 #395

nyck33 opened this issue Apr 3, 2024 · 11 comments

Comments

@nyck33
Copy link

nyck33 commented Apr 3, 2024

I named the example simple_matmul.cu and tried to compile like

nyck33@lenovo-gtx1650:/mnt/d/LLVM/NewPolygeistDir/nobu-polygeist-demos$ cgeist simple_matmul.cu -function=matmul -S
warning: CUDA version 12.1 is only partially supported
error: GPU arch sm_35 is supported by CUDA versions between 7.0 and 11.8 (inclusive), but installation at /usr/local/cuda-12.1 is 12.1; use '--cuda-path' to specify a different CUDA install, pass a different GPU arch with '--cuda-gpu-arch', or pass '--no-cuda-version-check'
<built-in>:1:10: fatal error: '__clang_cuda_runtime_wrapper.h' file not found
    1 | #include "__clang_cuda_runtime_wrapper.h"
      |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
nyck33@lenovo-gtx1650:/mnt/d/LLVM/NewPolygeistDir/nobu-polygeist-demos$ cgeist simple_matmul.cu -function=matmul -S --cuda-gpu-a
rch sm_75
warning: CUDA version 12.1 is only partially supported
<built-in>:1:10: fatal error: '__clang_cuda_runtime_wrapper.h' file not found
    1 | #include "__clang_cuda_runtime_wrapper.h"
      |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
nyck33@lenovo-gtx1650:/mnt/d/LLVM/NewPolygeistDir/nobu-polygeist-demos$ 

will Cuda 12.1 be supported soon? What are some workarounds?
My nvidia-smi and nvcc --version are:

nyck33@lenovo-gtx1650:/mnt/d/LLVM/NewPolygeistDir/build$ nvidia-smi
Wed Apr  3 23:18:44 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.54                 Driver Version: 531.97       CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1650         On | 00000000:01:00.0 Off |                  N/A |
| N/A   46C    P0               17W /  N/A|      0MiB /  4096MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+


+---------------------------------------------------------------------------------------+
| Processes:
               |
|  GPU   GI   CI        PID   Type   Process name
    GPU Memory |
|        ID   ID
    Usage      |
|=======================================================================================|
|  No running processes found
               |
+---------------------------------------------------------------------------------------+
nyck33@lenovo-gtx1650:/mnt/d/LLVM/NewPolygeistDir/build$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Feb__7_19:32:13_PST_2023
Cuda compilation tools, release 12.1, V12.1.66
@wsmoses
Copy link
Member

wsmoses commented Apr 3, 2024

cc @ivanradanov

So the support being partial comes from LLVM upstream.

It looks like here the issue is a header is missing, can you run find on your build dir to see if it was built anywhere?

@ivanradanov
Copy link
Collaborator

The partial support works for my machine.

Your issue here is that you need to point Polygeist to the correct clang resource dir, which would be $LLVM_BUILD_DIR/lib/clang/18 using --resource-dir=$LLVM_BUILD_DIR/lib/clang/18. I agree that this is under documented and should probably be handled automatically.

We need better support for installing polygeist and having it properly set up include paths link paths etc, need to look into how clang does it... But for the meantime better documentation would be nice...

@ivanradanov
Copy link
Collaborator

ivanradanov commented Apr 4, 2024

You are getting an empty module, can you try removing -function=matmul from the arguments? (that would limit polygeist to only compiling the functions called matmul.)

@nyck33
Copy link
Author

nyck33 commented Apr 4, 2024

@ivanradanov I'm sorry I deleted that as I realized some errors but here is fuller output incorporating your advice:

nyck33@lenovo-gtx1650:/mnt/d/LLVM/NewPolygeistDir/nobu-polygeist-demos$ cgeist gemm.cu -O3 -function=* --resource-dir=$LLVM_BUILD_DIR/lib/clang/18 --cuda-gpu-arch=sm_75 --show-ast --emit-llvm
warning: CUDA version 12.1 is only partially supported
Emitting fn: _Z28__device_stub__matmul_kernelPfS_S_
matmul_kernel
Emitting fn: _Z6matmulPfS_S_
matmul
CompoundStmt 0x560c78fb69b8
|-DeclStmt 0x560c78fb4d40
| |-VarDecl 0x560c78fb4bb8  used d_A 'float *'
| |-VarDecl 0x560c78fb4c38  used d_B 'float *'
| `-VarDecl 0x560c78fb4cb8  used d_C 'float *'
|-CallExpr 0x560c78fb5050 'cudaError_t':'enum cudaError'
| |-ImplicitCastExpr 0x560c78fb5038 'cudaError_t (*)(void **, size_t)' <FunctionToPointerDecay>
| | `-DeclRefExpr 0x560c78fb5018 'cudaError_t (void **, size_t)' lvalue Function 0x560c789919d8 'cudaMalloc' 'cudaError_t (void **, size_t)'
| |-CStyleCastExpr 0x560c78fb4e60 'void **' <BitCast>
| | `-UnaryOperator 0x560c78fb4e08 'float **' prefix '&' cannot overflow
| |   `-DeclRefExpr 0x560c78fb4db8 'float *' lvalue Var 0x560c78fb4bb8 'd_A' 'float *'
| `-BinaryOperator 0x560c78fb4f38 'unsigned long' '*'
|   |-ImplicitCastExpr 0x560c78fb4f20 'unsigned long' <IntegralCast>
|   | `-BinaryOperator 0x560c78fb4ec8 'int' '*'
|   |   |-IntegerLiteral 0x560c78fb4e88 'int' 200
|   |   `-IntegerLiteral 0x560c78fb4ea8 'int' 400
|   `-UnaryExprOrTypeTraitExpr 0x560c78fb4f00 'unsigned long' sizeof 'float'
|-CallExpr 0x560c78fb5320 'cudaError_t':'enum cudaError'
| |-ImplicitCastExpr 0x560c78fb5308 'cudaError_t (*)(void **, size_t)' <FunctionToPointerDecay>
| | `-DeclRefExpr 0x560c78fb52e8 'cudaError_t (void **, size_t)' lvalue Function 0x560c789919d8 'cudaMalloc' 'cudaError_t (void **, size_t)'
| |-CStyleCastExpr 0x560c78fb5130 'void **' <BitCast>
| | `-UnaryOperator 0x560c78fb5100 'float **' prefix '&' cannot overflow
| |   `-DeclRefExpr 0x560c78fb50e0 'float *' lvalue Var 0x560c78fb4c38 'd_B' 'float *'
| `-BinaryOperator 0x560c78fb5208 'unsigned long' '*'
|   |-ImplicitCastExpr 0x560c78fb51f0 'unsigned long' <IntegralCast>
|   | `-BinaryOperator 0x560c78fb5198 'int' '*'
|   |   |-IntegerLiteral 0x560c78fb5158 'int' 400
|   |   `-IntegerLiteral 0x560c78fb5178 'int' 300
|   `-UnaryExprOrTypeTraitExpr 0x560c78fb51d0 'unsigned long' sizeof 'float'
|-CallExpr 0x560c78fb55f0 'cudaError_t':'enum cudaError'
| |-ImplicitCastExpr 0x560c78fb55d8 'cudaError_t (*)(void **, size_t)' <FunctionToPointerDecay>
| | `-DeclRefExpr 0x560c78fb55b8 'cudaError_t (void **, size_t)' lvalue Function 0x560c789919d8 'cudaMalloc' 'cudaError_t (void **, size_t)'
| |-CStyleCastExpr 0x560c78fb5400 'void **' <BitCast>
| | `-UnaryOperator 0x560c78fb53d0 'float **' prefix '&' cannot overflow
| |   `-DeclRefExpr 0x560c78fb53b0 'float *' lvalue Var 0x560c78fb4cb8 'd_C' 'float *'
| `-BinaryOperator 0x560c78fb54d8 'unsigned long' '*'
|   |-ImplicitCastExpr 0x560c78fb54c0 'unsigned long' <IntegralCast>
|   | `-BinaryOperator 0x560c78fb5468 'int' '*'
|   |   |-IntegerLiteral 0x560c78fb5428 'int' 200
|   |   `-IntegerLiteral 0x560c78fb5448 'int' 300
|   `-UnaryExprOrTypeTraitExpr 0x560c78fb54a0 'unsigned long' sizeof 'float'
|-CallExpr 0x560c78fb5830 'cudaError_t':'enum cudaError'
| |-ImplicitCastExpr 0x560c78fb5818 'cudaError_t (*)(void *, const void *, size_t, enum cudaMemcpyKind)' <FunctionToPointerDecay>
| | `-DeclRefExpr 0x560c78fb5798 'cudaError_t (void *, const void *, size_t, enum cudaMemcpyKind)' lvalue Function 0x560c789abf48 'cudaMemcpy' 'cudaError_t (void *, const void *, size_t, enum cudaMemcpyKind)'
| |-ImplicitCastExpr 0x560c78fb5888 'void *' <BitCast>
| | `-ImplicitCastExpr 0x560c78fb5870 'float *' <LValueToRValue>
| |   `-DeclRefExpr 0x560c78fb5668 'float *' lvalue Var 0x560c78fb4bb8 'd_A' 'float *'
| |-ImplicitCastExpr 0x560c78fb58b8 'const void *' <BitCast>
| | `-ImplicitCastExpr 0x560c78fb58a0 'float *' <LValueToRValue>
| |   `-DeclRefExpr 0x560c78fb5688 'float *' lvalue ParmVar 0x560c78fb4940 'A' 'float *'
| |-BinaryOperator 0x560c78fb5758 'unsigned long' '*'
| | |-ImplicitCastExpr 0x560c78fb5740 'unsigned long' <IntegralCast>
| | | `-BinaryOperator 0x560c78fb56e8 'int' '*'
| | |   |-IntegerLiteral 0x560c78fb56a8 'int' 200
| | |   `-IntegerLiteral 0x560c78fb56c8 'int' 400
| | `-UnaryExprOrTypeTraitExpr 0x560c78fb5720 'unsigned long' sizeof 'float'
| `-DeclRefExpr 0x560c78fb5778 'enum cudaMemcpyKind' EnumConstant 0x560c78910120 'cudaMemcpyHostToDevice' 'enum cudaMemcpyKind'
|-CallExpr 0x560c78fb5a80 'cudaError_t':'enum cudaError'
| |-ImplicitCastExpr 0x560c78fb5a68 'cudaError_t (*)(void *, const void *, size_t, enum cudaMemcpyKind)' <FunctionToPointerDecay>
| | `-DeclRefExpr 0x560c78fb5a48 'cudaError_t (void *, const void *, size_t, enum cudaMemcpyKind)' lvalue Function 0x560c789abf48 'cudaMemcpy' 'cudaError_t (void *, const void *, size_t, enum cudaMemcpyKind)'
| |-ImplicitCastExpr 0x560c78fb5ad8 'void *' <BitCast>
| | `-ImplicitCastExpr 0x560c78fb5ac0 'float *' <LValueToRValue>
| |   `-DeclRefExpr 0x560c78fb5918 'float *' lvalue Var 0x560c78fb4c38 'd_B' 'float *'
| |-ImplicitCastExpr 0x560c78fb5b08 'const void *' <BitCast>
| | `-ImplicitCastExpr 0x560c78fb5af0 'float *' <LValueToRValue>
| |   `-DeclRefExpr 0x560c78fb5938 'float *' lvalue ParmVar 0x560c78fb49c0 'B' 'float *'
| |-BinaryOperator 0x560c78fb5a08 'unsigned long' '*'
| | |-ImplicitCastExpr 0x560c78fb59f0 'unsigned long' <IntegralCast>
| | | `-BinaryOperator 0x560c78fb5998 'int' '*'
| | |   |-IntegerLiteral 0x560c78fb5958 'int' 400
| | |   `-IntegerLiteral 0x560c78fb5978 'int' 300
| | `-UnaryExprOrTypeTraitExpr 0x560c78fb59d0 'unsigned long' sizeof 'float'
| `-DeclRefExpr 0x560c78fb5a28 'enum cudaMemcpyKind' EnumConstant 0x560c78910120 'cudaMemcpyHostToDevice' 'enum cudaMemcpyKind'
|-DeclStmt 0x560c78fb5c90
| `-VarDecl 0x560c78fb5b30  used blockDim 'dim3':'struct dim3' callinit
|   `-CXXConstructExpr 0x560c78fb5c50 'dim3':'struct dim3' 'void (unsigned int, unsigned int, unsigned int)'
|     |-ImplicitCastExpr 0x560c78fb5c00 'unsigned int' <IntegralCast>
|     | `-IntegerLiteral 0x560c78fb5b98 'int' 16
|     |-ImplicitCastExpr 0x560c78fb5c18 'unsigned int' <IntegralCast>
|     | `-IntegerLiteral 0x560c78fb5bb8 'int' 16
|     `-CXXDefaultArgExpr 0x560c78fb5c30 'unsigned int'
|-DeclStmt 0x560c78fb6128
| `-VarDecl 0x560c78fb5cb8  used gridDim 'dim3':'struct dim3' callinit
|   `-CXXConstructExpr 0x560c78fb60e8 'dim3':'struct dim3' 'void (unsigned int, unsigned int, unsigned int)'
|     |-BinaryOperator 0x560c78fb5ec0 'unsigned int' '/'
|     | |-ParenExpr 0x560c78fb5e38 'unsigned int'
|     | | `-BinaryOperator 0x560c78fb5e18 'unsigned int' '-'
|     | |   |-BinaryOperator 0x560c78fb5dc0 'unsigned int' '+'
|     | |   | |-ImplicitCastExpr 0x560c78fb5da8 'unsigned int' <IntegralCast>
|     | |   | | `-IntegerLiteral 0x560c78fb5d20 'int' 300
|     | |   | `-ImplicitCastExpr 0x560c78fb5d90 'unsigned int' <LValueToRValue>
|     | |   |   `-MemberExpr 0x560c78fb5d60 'unsigned int' lvalue .x 0x560c788dabb8
|     | |   |     `-DeclRefExpr 0x560c78fb5d40 'dim3':'struct dim3' lvalue Var 0x560c78fb5b30 'blockDim' 'dim3':'struct dim3'
|     | |   `-ImplicitCastExpr 0x560c78fb5e00 'unsigned int' <IntegralCast>
|     | |     `-IntegerLiteral 0x560c78fb5de0 'int' 1
|     | `-ImplicitCastExpr 0x560c78fb5ea8 'unsigned int' <LValueToRValue>
|     |   `-MemberExpr 0x560c78fb5e78 'unsigned int' lvalue .x 0x560c788dabb8
|     |     `-DeclRefExpr 0x560c78fb5e58 'dim3':'struct dim3' lvalue Var 0x560c78fb5b30 'blockDim' 'dim3':'struct dim3'
|     |-BinaryOperator 0x560c78fb6080 'unsigned int' '/'
|     | |-ParenExpr 0x560c78fb5ff8 'unsigned int'
|     | | `-BinaryOperator 0x560c78fb5fd8 'unsigned int' '-'
|     | |   |-BinaryOperator 0x560c78fb5f80 'unsigned int' '+'
|     | |   | |-ImplicitCastExpr 0x560c78fb5f68 'unsigned int' <IntegralCast>
|     | |   | | `-IntegerLiteral 0x560c78fb5ee0 'int' 200
|     | |   | `-ImplicitCastExpr 0x560c78fb5f50 'unsigned int' <LValueToRValue>
|     | |   |   `-MemberExpr 0x560c78fb5f20 'unsigned int' lvalue .y 0x560c788dac20
|     | |   |     `-DeclRefExpr 0x560c78fb5f00 'dim3':'struct dim3' lvalue Var 0x560c78fb5b30 'blockDim' 'dim3':'struct dim3'
|     | |   `-ImplicitCastExpr 0x560c78fb5fc0 'unsigned int' <IntegralCast>
|     | |     `-IntegerLiteral 0x560c78fb5fa0 'int' 1
|     | `-ImplicitCastExpr 0x560c78fb6068 'unsigned int' <LValueToRValue>
|     |   `-MemberExpr 0x560c78fb6038 'unsigned int' lvalue .y 0x560c788dac20
|     |     `-DeclRefExpr 0x560c78fb6018 'dim3':'struct dim3' lvalue Var 0x560c78fb5b30 'blockDim' 'dim3':'struct dim3'
|     `-CXXDefaultArgExpr 0x560c78fb60c8 'unsigned int'
|-CUDAKernelCallExpr 0x560c78fb63e0 'void'
| |-ImplicitCastExpr 0x560c78fb63c8 'void (*)(float *, float *, float *)' <FunctionToPointerDecay>
| | `-DeclRefExpr 0x560c78fb6140 'void (float *, float *, float *)' lvalue Function 0x560c78fb3668 'matmul_kernel' 'void (float *, float *, float *)'
| |-CallExpr 0x560c78fb6230 'unsigned int'
| | |-ImplicitCastExpr 0x560c78fb6218 'unsigned int (*)(dim3, dim3, size_t, void *)' <FunctionToPointerDecay>
| | | `-DeclRefExpr 0x560c78fb61a0 'unsigned int (dim3, dim3, size_t, void *)' lvalue Function 0x560c78fb33a0 '__cudaPushCallConfiguration' 'unsigned int (dim3, dim3, size_t, void *)'
| | |-CXXConstructExpr 0x560c78fb6288 'dim3':'struct dim3' 'void (const dim3 &) noexcept'
| | | `-ImplicitCastExpr 0x560c78fb6270 'const dim3':'const struct dim3' lvalue <NoOp>
| | |   `-DeclRefExpr 0x560c78fb6160 'dim3':'struct dim3' lvalue Var 0x560c78fb5cb8 'gridDim' 'dim3':'struct dim3'    
| | |-CXXConstructExpr 0x560c78fb62d0 'dim3':'struct dim3' 'void (const dim3 &) noexcept'
| | | `-ImplicitCastExpr 0x560c78fb62b8 'const dim3':'const struct dim3' lvalue <NoOp>
| | |   `-DeclRefExpr 0x560c78fb6180 'dim3':'struct dim3' lvalue Var 0x560c78fb5b30 'blockDim' 'dim3':'struct dim3'   
| | |-CXXDefaultArgExpr 0x560c78fb6300 'size_t':'unsigned long'
| | `-CXXDefaultArgExpr 0x560c78fb6320 'void *'
| |-ImplicitCastExpr 0x560c78fb6420 'float *' <LValueToRValue>
| | `-DeclRefExpr 0x560c78fb6340 'float *' lvalue Var 0x560c78fb4bb8 'd_A' 'float *'
| |-ImplicitCastExpr 0x560c78fb6438 'float *' <LValueToRValue>
| | `-DeclRefExpr 0x560c78fb6360 'float *' lvalue Var 0x560c78fb4c38 'd_B' 'float *'
| `-ImplicitCastExpr 0x560c78fb6450 'float *' <LValueToRValue>
|   `-DeclRefExpr 0x560c78fb6380 'float *' lvalue Var 0x560c78fb4cb8 'd_C' 'float *'
|-CallExpr 0x560c78fb6618 'cudaError_t':'enum cudaError'
| |-ImplicitCastExpr 0x560c78fb6600 'cudaError_t (*)(void *, const void *, size_t, enum cudaMemcpyKind)' <FunctionToPointerDecay>
| | `-DeclRefExpr 0x560c78fb65e0 'cudaError_t (void *, const void *, size_t, enum cudaMemcpyKind)' lvalue Function 0x560c789abf48 'cudaMemcpy' 'cudaError_t (void *, const void *, size_t, enum cudaMemcpyKind)'
| |-ImplicitCastExpr 0x560c78fb6670 'void *' <BitCast>
| | `-ImplicitCastExpr 0x560c78fb6658 'float *' <LValueToRValue>
| |   `-DeclRefExpr 0x560c78fb64b0 'float *' lvalue ParmVar 0x560c78fb4a40 'C' 'float *'
| |-ImplicitCastExpr 0x560c78fb66a0 'const void *' <BitCast>
| | `-ImplicitCastExpr 0x560c78fb6688 'float *' <LValueToRValue>
| |   `-DeclRefExpr 0x560c78fb64d0 'float *' lvalue Var 0x560c78fb4cb8 'd_C' 'float *'
| |-BinaryOperator 0x560c78fb65a0 'unsigned long' '*'
| | |-ImplicitCastExpr 0x560c78fb6588 'unsigned long' <IntegralCast>
| | | `-BinaryOperator 0x560c78fb6530 'int' '*'
| | |   |-IntegerLiteral 0x560c78fb64f0 'int' 200
| | |   `-IntegerLiteral 0x560c78fb6510 'int' 300
| | `-UnaryExprOrTypeTraitExpr 0x560c78fb6568 'unsigned long' sizeof 'float'
| `-DeclRefExpr 0x560c78fb65c0 'enum cudaMemcpyKind' EnumConstant 0x560c789101b0 'cudaMemcpyDeviceToHost' 'enum cudaMemcpyKind'
|-CallExpr 0x560c78fb6760 'cudaError_t':'enum cudaError'
| |-ImplicitCastExpr 0x560c78fb6748 'cudaError_t (*)(void *)' <FunctionToPointerDecay>
| | `-DeclRefExpr 0x560c78fb6728 'cudaError_t (void *)' lvalue Function 0x560c789a75e8 'cudaFree' 'cudaError_t (void *)'
| `-ImplicitCastExpr 0x560c78fb67a0 'void *' <BitCast>
|   `-ImplicitCastExpr 0x560c78fb6788 'float *' <LValueToRValue>
|     `-DeclRefExpr 0x560c78fb6708 'float *' lvalue Var 0x560c78fb4bb8 'd_A' 'float *'
|-CallExpr 0x560c78fb6860 'cudaError_t':'enum cudaError'
| |-ImplicitCastExpr 0x560c78fb6848 'cudaError_t (*)(void *)' <FunctionToPointerDecay>
| | `-DeclRefExpr 0x560c78fb6828 'cudaError_t (void *)' lvalue Function 0x560c789a75e8 'cudaFree' 'cudaError_t (void *)'
| `-ImplicitCastExpr 0x560c78fb68a0 'void *' <BitCast>
|   `-ImplicitCastExpr 0x560c78fb6888 'float *' <LValueToRValue>
|     `-DeclRefExpr 0x560c78fb6808 'float *' lvalue Var 0x560c78fb4c38 'd_B' 'float *'
`-CallExpr 0x560c78fb6960 'cudaError_t':'enum cudaError'
  |-ImplicitCastExpr 0x560c78fb6948 'cudaError_t (*)(void *)' <FunctionToPointerDecay>
  | `-DeclRefExpr 0x560c78fb6928 'cudaError_t (void *)' lvalue Function 0x560c789a75e8 'cudaFree' 'cudaError_t (void *)'
  `-ImplicitCastExpr 0x560c78fb69a0 'void *' <BitCast>
    `-ImplicitCastExpr 0x560c78fb6988 'float *' <LValueToRValue>
      `-DeclRefExpr 0x560c78fb6908 'float *' lvalue Var 0x560c78fb4cb8 'd_C' 'float *'
Emitting fn: _ZN4dim3C1Ejjj
dim3
 init: - baseInit:0 memberInit:1 anyMember:1 indirectMember:0 isinClass:0 delegating:0 isPack:0
FieldDecl 0x560c788dabb8 </usr/local/cuda-12.1/include/vector_types.h:420:5, col:18> col:18 referenced x 'unsigned int'
ImplicitCastExpr 0x560c788db670 'unsigned int' <LValueToRValue>
`-DeclRefExpr 0x560c788db630 'unsigned int' lvalue ParmVar 0x560c788dad10 'vx' 'unsigned int'
 init: - baseInit:0 memberInit:1 anyMember:1 indirectMember:0 isinClass:0 delegating:0 isPack:0
FieldDecl 0x560c788dac20 </usr/local/cuda-12.1/include/vector_types.h:420:5, col:21> col:21 referenced y 'unsigned int'
ImplicitCastExpr 0x560c788db6e8 'unsigned int' <LValueToRValue>
`-DeclRefExpr 0x560c788db6a8 'unsigned int' lvalue ParmVar 0x560c788dad90 'vy' 'unsigned int'
 init: - baseInit:0 memberInit:1 anyMember:1 indirectMember:0 isinClass:0 delegating:0 isPack:0
FieldDecl 0x560c788dac88 </usr/local/cuda-12.1/include/vector_types.h:420:5, col:24> col:24 referenced z 'unsigned int'
ImplicitCastExpr 0x560c788db760 'unsigned int' <LValueToRValue>
`-DeclRefExpr 0x560c788db720 'unsigned int' lvalue ParmVar 0x560c788dae10 'vz' 'unsigned int'
CompoundStmt 0x560c788db7b0
Emitting fn: _ZN4dim3C1ERKS_
dim3
 init: - baseInit:0 memberInit:1 anyMember:1 indirectMember:0 isinClass:0 delegating:0 isPack:0
FieldDecl 0x560c788dabb8 </usr/local/cuda-12.1/include/vector_types.h:420:5, col:18> col:18 referenced x 'unsigned int'
ImplicitCastExpr 0x560c78977010 'unsigned int' <LValueToRValue>
`-MemberExpr 0x560c78976fe0 'const unsigned int' lvalue .x 0x560c788dabb8
  `-DeclRefExpr 0x560c78976fc0 'const dim3':'const struct dim3' lvalue ParmVar 0x560c78976548 '' 'const dim3 &'       
 init: - baseInit:0 memberInit:1 anyMember:1 indirectMember:0 isinClass:0 delegating:0 isPack:0
FieldDecl 0x560c788dac20 </usr/local/cuda-12.1/include/vector_types.h:420:5, col:21> col:21 referenced y 'unsigned int'
ImplicitCastExpr 0x560c78977098 'unsigned int' <LValueToRValue>
`-MemberExpr 0x560c78977068 'const unsigned int' lvalue .y 0x560c788dac20
  `-DeclRefExpr 0x560c78977048 'const dim3':'const struct dim3' lvalue ParmVar 0x560c78976548 '' 'const dim3 &'       
 init: - baseInit:0 memberInit:1 anyMember:1 indirectMember:0 isinClass:0 delegating:0 isPack:0
FieldDecl 0x560c788dac88 </usr/local/cuda-12.1/include/vector_types.h:420:5, col:24> col:24 referenced z 'unsigned int'
ImplicitCastExpr 0x560c78977120 'unsigned int' <LValueToRValue>
`-MemberExpr 0x560c789770f0 'const unsigned int' lvalue .z 0x560c788dac88
  `-DeclRefExpr 0x560c789770d0 'const dim3':'const struct dim3' lvalue ParmVar 0x560c78976548 '' 'const dim3 &'       
CompoundStmt 0x560c78977170
warning: CUDA version 12.1 is only partially supported
Emitting fn: _Z13matmul_kernelPfS_S_
matmul_kernel
CompoundStmt 0x560c78f9e570
|-DeclStmt 0x560c78f9d870
| `-VarDecl 0x560c78f9d408  used row 'int' cinit
|   `-ImplicitCastExpr 0x560c78f9d858 'int' <IntegralCast>
|     `-BinaryOperator 0x560c78f9d838 'unsigned int' '+'
|       |-BinaryOperator 0x560c78f9d6e0 'unsigned int' '*'
|       | |-PseudoObjectExpr 0x560c78f9d5d8 'unsigned int'
|       | | |-MSPropertyRefExpr 0x560c78f9d538 '<pseudo-object type>' lvalue
|       | | | `-OpaqueValueExpr 0x560c78f9d520 'const __cuda_builtin_blockIdx_t':'const struct __cuda_builtin_blockIdx_t' lvalue
|       | | |   `-DeclRefExpr 0x560c78f9d470 'const __cuda_builtin_blockIdx_t':'const struct __cuda_builtin_blockIdx_t' lvalue Var 0x560c78e25ce0 'blockIdx' 'const __cuda_builtin_blockIdx_t':'const struct __cuda_builtin_blockIdx_t'     
|       | | |-OpaqueValueExpr 0x560c78f9d520 'const __cuda_builtin_blockIdx_t':'const struct __cuda_builtin_blockIdx_t' lvalue
|       | | | `-DeclRefExpr 0x560c78f9d470 'const __cuda_builtin_blockIdx_t':'const struct __cuda_builtin_blockIdx_t' lvalue Var 0x560c78e25ce0 'blockIdx' 'const __cuda_builtin_blockIdx_t':'const struct __cuda_builtin_blockIdx_t'       
|       | | `-CallExpr 0x560c78f9d5b8 'unsigned int'
|       | |   `-ImplicitCastExpr 0x560c78f9d5a0 'unsigned int (*)(void)' <FunctionToPointerDecay>
|       | |     `-MemberExpr 0x560c78f9d570 'unsigned int (void)' lvalue .__fetch_builtin_y 0x560c78e1ca78
|       | |       `-OpaqueValueExpr 0x560c78f9d520 'const __cuda_builtin_blockIdx_t':'const struct __cuda_builtin_blockIdx_t' lvalue
|       | |         `-DeclRefExpr 0x560c78f9d470 'const __cuda_builtin_blockIdx_t':'const struct __cuda_builtin_blockIdx_t' lvalue Var 0x560c78e25ce0 'blockIdx' 'const __cuda_builtin_blockIdx_t':'const struct __cuda_builtin_blockIdx_t' 
|       | `-PseudoObjectExpr 0x560c78f9d6b8 'unsigned int'
|       |   |-MSPropertyRefExpr 0x560c78f9d618 '<pseudo-object type>' lvalue
|       |   | `-OpaqueValueExpr 0x560c78f9d600 'const __cuda_builtin_blockDim_t':'const struct __cuda_builtin_blockDim_t' lvalue
|       |   |   `-DeclRefExpr 0x560c78f9d4c8 'const __cuda_builtin_blockDim_t':'const struct __cuda_builtin_blockDim_t' lvalue Var 0x560c78e25dd8 'blockDim' 'const __cuda_builtin_blockDim_t':'const struct __cuda_builtin_blockDim_t'     
|       |   |-OpaqueValueExpr 0x560c78f9d600 'const __cuda_builtin_blockDim_t':'const struct __cuda_builtin_blockDim_t' lvalue
|       |   | `-DeclRefExpr 0x560c78f9d4c8 'const __cuda_builtin_blockDim_t':'const struct __cuda_builtin_blockDim_t' lvalue Var 0x560c78e25dd8 'blockDim' 'const __cuda_builtin_blockDim_t':'const struct __cuda_builtin_blockDim_t'       
|       |   `-CallExpr 0x560c78f9d698 'unsigned int'
|       |     `-ImplicitCastExpr 0x560c78f9d680 'unsigned int (*)(void)' <FunctionToPointerDecay>
|       |       `-MemberExpr 0x560c78f9d650 'unsigned int (void)' lvalue .__fetch_builtin_y 0x560c78e1e4d8
|       |         `-OpaqueValueExpr 0x560c78f9d600 'const __cuda_builtin_blockDim_t':'const struct __cuda_builtin_blockDim_t' lvalue
|       |           `-DeclRefExpr 0x560c78f9d4c8 'const __cuda_builtin_blockDim_t':'const struct __cuda_builtin_blockDim_t' lvalue Var 0x560c78e25dd8 'blockDim' 'const __cuda_builtin_blockDim_t':'const struct __cuda_builtin_blockDim_t' 
|       `-PseudoObjectExpr 0x560c78f9d810 'unsigned int'
|         |-MSPropertyRefExpr 0x560c78f9d770 '<pseudo-object type>' lvalue
|         | `-OpaqueValueExpr 0x560c78f9d758 'const __cuda_builtin_threadIdx_t':'const struct __cuda_builtin_threadIdx_t' lvalue
|         |   `-DeclRefExpr 0x560c78f9d700 'const __cuda_builtin_threadIdx_t':'const struct __cuda_builtin_threadIdx_t' lvalue Var 0x560c78e25be8 'threadIdx' 'const __cuda_builtin_threadIdx_t':'const struct __cuda_builtin_threadIdx_t'  
|         |-OpaqueValueExpr 0x560c78f9d758 'const __cuda_builtin_threadIdx_t':'const struct __cuda_builtin_threadIdx_t' lvalue
|         | `-DeclRefExpr 0x560c78f9d700 'const __cuda_builtin_threadIdx_t':'const struct __cuda_builtin_threadIdx_t' lvalue Var 0x560c78e25be8 'threadIdx' 'const __cuda_builtin_threadIdx_t':'const struct __cuda_builtin_threadIdx_t'    
|         `-CallExpr 0x560c78f9d7f0 'unsigned int'
|           `-ImplicitCastExpr 0x560c78f9d7d8 'unsigned int (*)(void)' <FunctionToPointerDecay>
|             `-MemberExpr 0x560c78f9d7a8 'unsigned int (void)' lvalue .__fetch_builtin_y 0x560c78e1aec8
|               `-OpaqueValueExpr 0x560c78f9d758 'const __cuda_builtin_threadIdx_t':'const struct __cuda_builtin_threadIdx_t' lvalue
|                 `-DeclRefExpr 0x560c78f9d700 'const __cuda_builtin_threadIdx_t':'const struct __cuda_builtin_threadIdx_t' lvalue Var 0x560c78e25be8 'threadIdx' 'const __cuda_builtin_threadIdx_t':'const struct __cuda_builtin_threadIdx_t'
|-DeclStmt 0x560c78f9dd08
| `-VarDecl 0x560c78f9d8a0  used col 'int' cinit
|   `-ImplicitCastExpr 0x560c78f9dcf0 'int' <IntegralCast>
|     `-BinaryOperator 0x560c78f9dcd0 'unsigned int' '+'
|       |-BinaryOperator 0x560c78f9db78 'unsigned int' '*'
|       | |-PseudoObjectExpr 0x560c78f9da70 'unsigned int'
|       | | |-MSPropertyRefExpr 0x560c78f9d9d0 '<pseudo-object type>' lvalue
|       | | | `-OpaqueValueExpr 0x560c78f9d9b8 'const __cuda_builtin_blockIdx_t':'const struct __cuda_builtin_blockIdx_t' lvalue
|       | | |   `-DeclRefExpr 0x560c78f9d908 'const __cuda_builtin_blockIdx_t':'const struct __cuda_builtin_blockIdx_t' lvalue Var 0x560c78e25ce0 'blockIdx' 'const __cuda_builtin_blockIdx_t':'const struct __cuda_builtin_blockIdx_t'     
|       | | |-OpaqueValueExpr 0x560c78f9d9b8 'const __cuda_builtin_blockIdx_t':'const struct __cuda_builtin_blockIdx_t' lvalue
|       | | | `-DeclRefExpr 0x560c78f9d908 'const __cuda_builtin_blockIdx_t':'const struct __cuda_builtin_blockIdx_t' lvalue Var 0x560c78e25ce0 'blockIdx' 'const __cuda_builtin_blockIdx_t':'const struct __cuda_builtin_blockIdx_t'       
|       | | `-CallExpr 0x560c78f9da50 'unsigned int'
|       | |   `-ImplicitCastExpr 0x560c78f9da38 'unsigned int (*)(void)' <FunctionToPointerDecay>
|       | |     `-MemberExpr 0x560c78f9da08 'unsigned int (void)' lvalue .__fetch_builtin_x 0x560c78e1c840
|       | |       `-OpaqueValueExpr 0x560c78f9d9b8 'const __cuda_builtin_blockIdx_t':'const struct __cuda_builtin_blockIdx_t' lvalue
|       | |         `-DeclRefExpr 0x560c78f9d908 'const __cuda_builtin_blockIdx_t':'const struct __cuda_builtin_blockIdx_t' lvalue Var 0x560c78e25ce0 'blockIdx' 'const __cuda_builtin_blockIdx_t':'const struct __cuda_builtin_blockIdx_t' 
|       | `-PseudoObjectExpr 0x560c78f9db50 'unsigned int'
|       |   |-MSPropertyRefExpr 0x560c78f9dab0 '<pseudo-object type>' lvalue
|       |   | `-OpaqueValueExpr 0x560c78f9da98 'const __cuda_builtin_blockDim_t':'const struct __cuda_builtin_blockDim_t' lvalue
|       |   |   `-DeclRefExpr 0x560c78f9d960 'const __cuda_builtin_blockDim_t':'const struct __cuda_builtin_blockDim_t' lvalue Var 0x560c78e25dd8 'blockDim' 'const __cuda_builtin_blockDim_t':'const struct __cuda_builtin_blockDim_t'     
|       |   |-OpaqueValueExpr 0x560c78f9da98 'const __cuda_builtin_blockDim_t':'const struct __cuda_builtin_blockDim_t' lvalue
|       |   | `-DeclRefExpr 0x560c78f9d960 'const __cuda_builtin_blockDim_t':'const struct __cuda_builtin_blockDim_t' lvalue Var 0x560c78e25dd8 'blockDim' 'const __cuda_builtin_blockDim_t':'const struct __cuda_builtin_blockDim_t'       
|       |   `-CallExpr 0x560c78f9db30 'unsigned int'
|       |     `-ImplicitCastExpr 0x560c78f9db18 'unsigned int (*)(void)' <FunctionToPointerDecay>
|       |       `-MemberExpr 0x560c78f9dae8 'unsigned int (void)' lvalue .__fetch_builtin_x 0x560c78e1e2a0
|       |         `-OpaqueValueExpr 0x560c78f9da98 'const __cuda_builtin_blockDim_t':'const struct __cuda_builtin_blockDim_t' lvalue
|       |           `-DeclRefExpr 0x560c78f9d960 'const __cuda_builtin_blockDim_t':'const struct __cuda_builtin_blockDim_t' lvalue Var 0x560c78e25dd8 'blockDim' 'const __cuda_builtin_blockDim_t':'const struct __cuda_builtin_blockDim_t' 
|       `-PseudoObjectExpr 0x560c78f9dca8 'unsigned int'
|         |-MSPropertyRefExpr 0x560c78f9dc08 '<pseudo-object type>' lvalue
|         | `-OpaqueValueExpr 0x560c78f9dbf0 'const __cuda_builtin_threadIdx_t':'const struct __cuda_builtin_threadIdx_t' lvalue
|         |   `-DeclRefExpr 0x560c78f9db98 'const __cuda_builtin_threadIdx_t':'const struct __cuda_builtin_threadIdx_t' lvalue Var 0x560c78e25be8 'threadIdx' 'const __cuda_builtin_threadIdx_t':'const struct __cuda_builtin_threadIdx_t'  
|         |-OpaqueValueExpr 0x560c78f9dbf0 'const __cuda_builtin_threadIdx_t':'const struct __cuda_builtin_threadIdx_t' lvalue
|         | `-DeclRefExpr 0x560c78f9db98 'const __cuda_builtin_threadIdx_t':'const struct __cuda_builtin_threadIdx_t' lvalue Var 0x560c78e25be8 'threadIdx' 'const __cuda_builtin_threadIdx_t':'const struct __cuda_builtin_threadIdx_t'    
|         `-CallExpr 0x560c78f9dc88 'unsigned int'
|           `-ImplicitCastExpr 0x560c78f9dc70 'unsigned int (*)(void)' <FunctionToPointerDecay>
|             `-MemberExpr 0x560c78f9dc40 'unsigned int (void)' lvalue .__fetch_builtin_x 0x560c78e1ac90
|               `-OpaqueValueExpr 0x560c78f9dbf0 'const __cuda_builtin_threadIdx_t':'const struct __cuda_builtin_threadIdx_t' lvalue
|                 `-DeclRefExpr 0x560c78f9db98 'const __cuda_builtin_threadIdx_t':'const struct __cuda_builtin_threadIdx_t' lvalue Var 0x560c78e25be8 'threadIdx' 'const __cuda_builtin_threadIdx_t':'const struct __cuda_builtin_threadIdx_t'
`-IfStmt 0x560c78f9e550
  |-BinaryOperator 0x560c78f9de10 '_Bool' '&&'
  | |-BinaryOperator 0x560c78f9dd78 '_Bool' '<'
  | | |-ImplicitCastExpr 0x560c78f9dd60 'int' <LValueToRValue>
  | | | `-DeclRefExpr 0x560c78f9dd20 'int' lvalue Var 0x560c78f9d408 'row' 'int'
  | | `-IntegerLiteral 0x560c78f9dd40 'int' 200
  | `-BinaryOperator 0x560c78f9ddf0 '_Bool' '<'
  |   |-ImplicitCastExpr 0x560c78f9ddd8 'int' <LValueToRValue>
  |   | `-DeclRefExpr 0x560c78f9dd98 'int' lvalue Var 0x560c78f9d8a0 'col' 'int'
  |   `-IntegerLiteral 0x560c78f9ddb8 'int' 300
  `-CompoundStmt 0x560c78f9e528
    |-DeclStmt 0x560c78f9dee8
    | `-VarDecl 0x560c78f9de48  used sum 'float' cinit
    |   `-ImplicitCastExpr 0x560c78f9ded0 'float' <IntegralToFloating>
    |     `-IntegerLiteral 0x560c78f9deb0 'int' 0
    |-ForStmt 0x560c78f9e370
    | |-DeclStmt 0x560c78f9dfa0
    | | `-VarDecl 0x560c78f9df18  used k 'int' cinit
    | |   `-IntegerLiteral 0x560c78f9df80 'int' 0
    | |-<<<NULL>>>
    | |-BinaryOperator 0x560c78f9e010 '_Bool' '<'
    | | |-ImplicitCastExpr 0x560c78f9dff8 'int' <LValueToRValue>
    | | | `-DeclRefExpr 0x560c78f9dfb8 'int' lvalue Var 0x560c78f9df18 'k' 'int'
    | | `-IntegerLiteral 0x560c78f9dfd8 'int' 400
    | |-UnaryOperator 0x560c78f9e050 'int' postfix '++'
    | | `-DeclRefExpr 0x560c78f9e030 'int' lvalue Var 0x560c78f9df18 'k' 'int'
    | `-CompoundStmt 0x560c78f9e358
    |   `-CompoundAssignOperator 0x560c78f9e328 'float' lvalue '+=' ComputeLHSTy='float' ComputeResultTy='float'      
    |     |-DeclRefExpr 0x560c78f9e068 'float' lvalue Var 0x560c78f9de48 'sum' 'float'
    |     `-BinaryOperator 0x560c78f9e308 'float' '*'
    |       |-ImplicitCastExpr 0x560c78f9e2d8 'float' <LValueToRValue>
    |       | `-ArraySubscriptExpr 0x560c78f9e190 'float' lvalue
    |       |   |-ImplicitCastExpr 0x560c78f9e178 'float *' <LValueToRValue>
    |       |   | `-DeclRefExpr 0x560c78f9e088 'float *' lvalue ParmVar 0x560c78f9d0f0 'A' 'float *'
    |       |   `-BinaryOperator 0x560c78f9e158 'int' '+'
    |       |     |-BinaryOperator 0x560c78f9e100 'int' '*'
    |       |     | |-ImplicitCastExpr 0x560c78f9e0e8 'int' <LValueToRValue>
    |       |     | | `-DeclRefExpr 0x560c78f9e0a8 'int' lvalue Var 0x560c78f9d408 'row' 'int'
    |       |     | `-IntegerLiteral 0x560c78f9e0c8 'int' 400
    |       |     `-ImplicitCastExpr 0x560c78f9e140 'int' <LValueToRValue>
    |       |       `-DeclRefExpr 0x560c78f9e120 'int' lvalue Var 0x560c78f9df18 'k' 'int'
    |       `-ImplicitCastExpr 0x560c78f9e2f0 'float' <LValueToRValue>
    |         `-ArraySubscriptExpr 0x560c78f9e2b8 'float' lvalue
    |           |-ImplicitCastExpr 0x560c78f9e2a0 'float *' <LValueToRValue>
    |           | `-DeclRefExpr 0x560c78f9e1b0 'float *' lvalue ParmVar 0x560c78f9d170 'B' 'float *'
    |           `-BinaryOperator 0x560c78f9e280 'int' '+'
    |             |-BinaryOperator 0x560c78f9e228 'int' '*'
    |             | |-ImplicitCastExpr 0x560c78f9e210 'int' <LValueToRValue>
    |             | | `-DeclRefExpr 0x560c78f9e1d0 'int' lvalue Var 0x560c78f9df18 'k' 'int'
    |             | `-IntegerLiteral 0x560c78f9e1f0 'int' 300
    |             `-ImplicitCastExpr 0x560c78f9e268 'int' <LValueToRValue>
    |               `-DeclRefExpr 0x560c78f9e248 'int' lvalue Var 0x560c78f9d8a0 'col' 'int'
    `-BinaryOperator 0x560c78f9e508 'float' lvalue '='
      |-ArraySubscriptExpr 0x560c78f9e4b0 'float' lvalue
      | |-ImplicitCastExpr 0x560c78f9e498 'float *' <LValueToRValue>
      | | `-DeclRefExpr 0x560c78f9e3a8 'float *' lvalue ParmVar 0x560c78f9d1f0 'C' 'float *'
      | `-BinaryOperator 0x560c78f9e478 'int' '+'
      |   |-BinaryOperator 0x560c78f9e420 'int' '*'
      |   | |-ImplicitCastExpr 0x560c78f9e408 'int' <LValueToRValue>
      |   | | `-DeclRefExpr 0x560c78f9e3c8 'int' lvalue Var 0x560c78f9d408 'row' 'int'
      |   | `-IntegerLiteral 0x560c78f9e3e8 'int' 300
      |   `-ImplicitCastExpr 0x560c78f9e460 'int' <LValueToRValue>
      |     `-DeclRefExpr 0x560c78f9e440 'int' lvalue Var 0x560c78f9d8a0 'col' 'int'
      `-ImplicitCastExpr 0x560c78f9e4f0 'float' <LValueToRValue>
        `-DeclRefExpr 0x560c78f9e4d0 'float' lvalue Var 0x560c78f9de48 'sum' 'float'
loc(callsite("gemm.cu":10:15 at "/usr/local/cuda-12.1/include/crt/host_defines.h":96:9)): error: cannot be converted to LLVM IR: missing `LLVMTranslationDialectInterface` registration for dialect for op: gpu.block_id
module attributes {dlti.dl_spec = #dlti.dl_spec<#dlti.dl_entry<i1, dense<8> : vector<2xi32>>, #dlti.dl_entry<!llvm.ptr, dense<64> : vector<4xi32>>, #dlti.dl_entry<i16, dense<16> : vector<2xi32>>, #dlti.dl_entry<i8, dense<8> : vector<2xi32>>, #dlti.dl_entry<i32, dense<32> : vector<2xi32>>, #dlti.dl_entry<f64, dense<64> : vector<2xi32>>, #dlti.dl_entry<f128, dense<128> : vector<2xi32>>, #dlti.dl_entry<f16, dense<16> : vector<2xi32>>, #dlti.dl_entry<!llvm.ptr<270>, dense<32> : vector<4xi32>>, #dlti.dl_entry<!llvm.ptr<271>, dense<32> : vector<4xi32>>, #dlti.dl_entry<!llvm.ptr<272>, dense<64> : vector<4xi32>>, #dlti.dl_entry<i64, dense<64> : vector<2xi32>>, #dlti.dl_entry<f80, dense<128> : vector<2xi32>>, #dlti.dl_entry<"dlti.endianness", "little">, #dlti.dl_entry<"dlti.stack_alignment", 128 : i32>>, llvm.data_layout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128", llvm.target_triple = "x86_64-unknown-linux-gnu", polygeist.gpu_module.llvm.data_layout = "e-i64:64-i128:128-v16:16-v32:32-n16:32:64", polygeist.gpu_module.llvm.target_triple = "nvptx64-nvidia-cuda", "polygeist.target-cpu" = "x86-64", "polygeist.target-features" = "+cmov,+cx8,+fxsr,+mmx,+sse,+sse2,+x87", "polygeist.tune-cpu" = "generic"} {
  llvm.func @_Z28__device_stub__matmul_kernelPfS_S_(%arg0: !llvm.ptr, %arg1: !llvm.ptr, %arg2: !llvm.ptr) attributes {polygeist.device_only_func = "1", sym_visibility = "private"} {
    %0 = llvm.mlir.constant(300 : index) : i64
    %1 = llvm.mlir.constant(400 : index) : i64
    %2 = llvm.mlir.constant(0 : index) : i64
    %3 = llvm.mlir.constant(1 : index) : i64
    %4 = llvm.mlir.constant(400 : i32) : i32
    %5 = llvm.mlir.constant(0.000000e+00 : f32) : f32
    %6 = llvm.mlir.constant(300 : i32) : i32
    %7 = llvm.mlir.constant(200 : i32) : i32
    %8 = gpu.block_id  y
    %9 = builtin.unrealized_conversion_cast %8 : index to i64
    %10 = llvm.trunc %9 : i64 to i32
    %11 = gpu.block_dim  y
    %12 = builtin.unrealized_conversion_cast %11 : index to i64
    %13 = llvm.trunc %12 : i64 to i32
    %14 = llvm.mul %10, %13  : i32
    %15 = gpu.thread_id  y
    %16 = builtin.unrealized_conversion_cast %15 : index to i64
    %17 = llvm.trunc %16 : i64 to i32
    %18 = llvm.add %14, %17  : i32
    %19 = llvm.sext %18 : i32 to i64
    %20 = gpu.block_id  x
    %21 = builtin.unrealized_conversion_cast %20 : index to i64
    %22 = llvm.trunc %21 : i64 to i32
    %23 = gpu.block_dim  x
    %24 = builtin.unrealized_conversion_cast %23 : index to i64
    %25 = llvm.trunc %24 : i64 to i32
    %26 = llvm.mul %22, %25  : i32
    %27 = gpu.thread_id  x
    %28 = builtin.unrealized_conversion_cast %27 : index to i64
    %29 = llvm.trunc %28 : i64 to i32
    %30 = llvm.add %26, %29  : i32
    %31 = llvm.sext %30 : i32 to i64
    %32 = llvm.icmp "slt" %18, %7 : i32
    %33 = llvm.icmp "slt" %30, %6 : i32
    %34 = llvm.and %32, %33  : i1
    llvm.cond_br %34, ^bb1, ^bb5
  ^bb1:  // pred: ^bb0
    %35 = llvm.mul %18, %4  : i32
    llvm.br ^bb2(%2, %5 : i64, f32)
  ^bb2(%36: i64, %37: f32):  // 2 preds: ^bb1, ^bb3
    %38 = llvm.icmp "slt" %36, %1 : i64
    llvm.cond_br %38, ^bb3, ^bb4
  ^bb3:  // pred: ^bb2
    %39 = llvm.trunc %36 : i64 to i32
    %40 = llvm.add %35, %39  : i32
    %41 = llvm.sext %40 : i32 to i64
    %42 = llvm.getelementptr %arg0[%41] : (!llvm.ptr, i64) -> !llvm.ptr, f32
    %43 = llvm.load %42 : !llvm.ptr -> f32
    %44 = llvm.mul %39, %6  : i32
    %45 = llvm.add %44, %30  : i32
    %46 = llvm.sext %45 : i32 to i64
    %47 = llvm.getelementptr %arg1[%46] : (!llvm.ptr, i64) -> !llvm.ptr, f32
    %48 = llvm.load %47 : !llvm.ptr -> f32
    %49 = llvm.fmul %43, %48  : f32
    %50 = llvm.fadd %37, %49  : f32
    %51 = llvm.add %36, %3  : i64
    llvm.br ^bb2(%51, %50 : i64, f32)
  ^bb4:  // pred: ^bb2
    %52 = llvm.mul %19, %0  : i64
    %53 = llvm.add %52, %31  : i64
    %54 = llvm.getelementptr %arg2[%53] : (!llvm.ptr, i64) -> !llvm.ptr, f32
    llvm.store %37, %54 : f32, !llvm.ptr
    llvm.br ^bb5
  ^bb5:  // 2 preds: ^bb0, ^bb4
    llvm.return
  }
  llvm.func @_Z6matmulPfS_S_(%arg0: !llvm.ptr, %arg1: !llvm.ptr, %arg2: !llvm.ptr) {
    %0 = llvm.mlir.constant(1 : i32) : i32
    %1 = llvm.mlir.constant(2 : i32) : i32
    %2 = llvm.mlir.constant(240000 : i64) : i64
    %3 = llvm.mlir.constant(480000 : i64) : i64
    %4 = llvm.mlir.constant(320000 : i64) : i64
    %5 = llvm.mlir.constant(19 : index) : i64
    %6 = llvm.mlir.constant(13 : index) : i64
    %7 = llvm.mlir.constant(16 : index) : i64
    %8 = llvm.mlir.constant(1 : index) : i64
    %9 = builtin.unrealized_conversion_cast %8 : i64 to index
    %10 = builtin.unrealized_conversion_cast %7 : i64 to index
    %11 = builtin.unrealized_conversion_cast %6 : i64 to index
    %12 = builtin.unrealized_conversion_cast %5 : i64 to index
    %13 = llvm.alloca %8 x !llvm.ptr : (i64) -> !llvm.ptr
    %14 = llvm.alloca %8 x !llvm.ptr : (i64) -> !llvm.ptr
    %15 = llvm.alloca %8 x !llvm.ptr : (i64) -> !llvm.ptr
    %16 = llvm.call @cudaMalloc(%15, %4) : (!llvm.ptr, i64) -> i32
    %17 = llvm.call @cudaMalloc(%14, %3) : (!llvm.ptr, i64) -> i32
    %18 = llvm.call @cudaMalloc(%13, %2) : (!llvm.ptr, i64) -> i32
    %19 = llvm.load %15 : !llvm.ptr -> !llvm.ptr
    %20 = llvm.call @cudaMemcpy(%19, %arg0, %4, %0) : (!llvm.ptr, !llvm.ptr, i64, i32) -> i32
    %21 = llvm.load %14 : !llvm.ptr -> !llvm.ptr
    %22 = llvm.call @cudaMemcpy(%21, %arg1, %3, %0) : (!llvm.ptr, !llvm.ptr, i64, i32) -> i32
    %23 = llvm.load %15 : !llvm.ptr -> !llvm.ptr
    %24 = llvm.load %14 : !llvm.ptr -> !llvm.ptr
    %25 = llvm.load %13 : !llvm.ptr -> !llvm.ptr
    gpu.launch blocks(%arg3, %arg4, %arg5) in (%arg9 = %12, %arg10 = %11, %arg11 = %9) threads(%arg6, %arg7, %arg8) in (%arg12 = %10, %arg13 = %10, %arg14 = %9) {
      llvm.call @_Z28__device_stub__matmul_kernelPfS_S_(%23, %24, %25) : (!llvm.ptr, !llvm.ptr, !llvm.ptr) -> ()      
      gpu.terminator
    }
    %26 = llvm.load %13 : !llvm.ptr -> !llvm.ptr
    %27 = llvm.call @cudaMemcpy(%arg2, %26, %2, %1) : (!llvm.ptr, !llvm.ptr, i64, i32) -> i32
    %28 = llvm.load %15 : !llvm.ptr -> !llvm.ptr
    %29 = llvm.call @cudaFree(%28) : (!llvm.ptr) -> i32
    %30 = llvm.load %14 : !llvm.ptr -> !llvm.ptr
    %31 = llvm.call @cudaFree(%30) : (!llvm.ptr) -> i32
    %32 = llvm.load %13 : !llvm.ptr -> !llvm.ptr
    %33 = llvm.call @cudaFree(%32) : (!llvm.ptr) -> i32
    llvm.return
  }
  llvm.func @cudaMalloc(!llvm.ptr, i64) -> i32 attributes {sym_visibility = "private"}
  llvm.func @cudaMemcpy(!llvm.ptr, !llvm.ptr, i64, i32) -> i32 attributes {sym_visibility = "private"}
  llvm.func @cudaFree(!llvm.ptr) -> i32 attributes {sym_visibility = "private"}
}
Failed to emit LLVM IR
nyck33@lenovo-gtx1650:/mnt/d/LLVM/NewPolygeistDir/nobu-polygeist-demos$ 

@ivanradanov
Copy link
Collaborator

If you would like to see MLIR (raised to affine) you need -S -raise-scf-to-affine

@nyck33
Copy link
Author

nyck33 commented Apr 4, 2024

I'll have to study more to understand why I need to output assembly in order to raise to affine. Thanks though.
Regarding that error "Failed to emit LLVM IR" is "LVM IR: missing 'LLVMTranslationDialectInterface' registration for dialect for op: gpu.block_id, suggests that the operation gpu.block_id cannot be translated into LLVM IR due to a missing interface registration. This typically occurs when the MLIR to LLVM IR conversion infrastructure doesn't have the necessary mappings or handlers to translate MLIR GPU dialect operations to their corresponding LLVM IR representations." the gist of it and I need to go back and check what happened during step 1 of the build which was "build LLVM, MLIR and Clang"?

@ivanradanov
Copy link
Collaborator

So, this is Polygeist specific, but in order to emit GPU executables we require the programmer to explicitly specify the target on the command line as such: -emit-cuda. This is because we can also do -emit-rocm, and the compiler needs to know whether to translate gpu.block_id to the CUDA or AMD intrinsic. (we can also translate to CPU too)

@ivanradanov
Copy link
Collaborator

As for -S, you only need that if you want to see the MLIR at the point after it was raised to affine. Otherwise, it will be raised to affine, some optimizations done on that level and then lowered down to llvm and so on.

@nyck33
Copy link
Author

nyck33 commented Apr 4, 2024

Thanks @ivanradanov for all the tips. This is an amazing library bridging the gap for people like me with some Cuda background learning mlir.
For any other beginners like me, I just discovered this in the MLIR discord channel:
https://github.com/j2kun/mlir-tutorial so that's next for me.

@pal-stdr
Copy link

pal-stdr commented Apr 8, 2024

Hello @ivanradanov

Your issue here is that you need to point Polygeist to the correct clang resource dir, which would be $LLVM_BUILD_DIR/lib/clang/18 using --resource-dir=$LLVM_BUILD_DIR/lib/clang/18.

Sorry to interrupt! I have just two questions which I think related to this issue thread.

  1. Now Polygeist supports Clang 18.1.x? When I am writing this, i can see the latest llvm release is 18.1.3.
  2. If the Fix compilation error #392 fixed the compilation issue for CUDA 12.1, that means it should also work for CUDA 12.2. Am I correct?

Thanks in advance!

@ivanradanov
Copy link
Collaborator

We do not support llvm release 18. The 18 in that string indicates that release 18 is in development (or released) on that specific commit.

We only support the specific llvm commit the git submodule points to.

I believe it may work for cuda 12, but that specific PR you linked is unrelated and on our ROCm end.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants