
OpenCL errors for larger matrices w/ NVIDIA implementation #183

Open
jeffhammond opened this issue Jun 16, 2017 · 2 comments

@jeffhammond (Member) commented:

The OpenCL transpose breaks with matrices of order 1296 or greater when using the NVIDIA OpenCL implementation. The problem is NVIDIA-specific: the Intel OpenCL implementation is fine for much larger matrices.

It is possible that there is something I can query to know in advance whether this problem will appear. CL_DEVICE_ADDRESS_BITS exists, but if the problem were 32-bit indexing, it should not manifest at order 1296 (a matrix of that order is only 12.8 MiB).
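For reference, a minimal standalone probe of those limits against the first GPU device might look like the sketch below (hypothetical, not part of the PRK sources). It only confirms that neither CL_DEVICE_ADDRESS_BITS nor CL_DEVICE_MAX_MEM_ALLOC_SIZE is anywhere near exhausted at order 1296, so neither query obviously predicts the failure.

```cpp
// Hypothetical probe, not part of the PRK sources: report the device limits
// that could, in principle, explain a size-dependent failure.
#include <cstdio>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platform;
    cl_device_id device;
    if (clGetPlatformIDs(1, &platform, NULL) != CL_SUCCESS) return 1;
    if (clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL) != CL_SUCCESS) return 1;

    cl_uint address_bits = 0;
    cl_ulong max_alloc = 0, global_mem = 0;
    clGetDeviceInfo(device, CL_DEVICE_ADDRESS_BITS, sizeof(address_bits), &address_bits, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_MEM_ALLOC_SIZE, sizeof(max_alloc), &max_alloc, NULL);
    clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(global_mem), &global_mem, NULL);

    const size_t order = 1296;                           // matrix order where the failure starts
    const size_t bytes = order * order * sizeof(double); // one matrix, ~12.8 MiB

    std::printf("CL_DEVICE_ADDRESS_BITS       = %u\n", address_bits);
    std::printf("CL_DEVICE_MAX_MEM_ALLOC_SIZE = %lu\n", (unsigned long)max_alloc);
    std::printf("CL_DEVICE_GLOBAL_MEM_SIZE    = %lu\n", (unsigned long)global_mem);
    std::printf("bytes per matrix at order %zu = %zu\n", order, bytes);
    // On the GTX 960 below, the per-allocation limit is ~498 MiB and the
    // address width is well above 32 bits, so order 1296 is nowhere near either bound.
    return 0;
}
```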

jrhammon@klondike:~/Work/PRK/github-official/Cxx11$ ./transpose-opencl 10 1295
Parallel Research Kernels version 2.16
C++11/OpenCL Matrix transpose: B = A^T
Available OpenCL platform: NVIDIA CUDA
Available OpenCL platform: Intel(R) OpenCL
Matrix order          = 1295
Number of iterations  = 10
Solution validates
Rate (MB/s): 12611.9 Avg time (s): 0.00106378
jrhammon@klondike:~/Work/PRK/github-official/Cxx11$ ./transpose-opencl 10 1296
Parallel Research Kernels version 2.16
C++11/OpenCL Matrix transpose: B = A^T
Available OpenCL platform: NVIDIA CUDA
Available OpenCL platform: Intel(R) OpenCL
Matrix order          = 1296
Number of iterations  = 10
ERROR: Aggregate squared error 1896 exceeds threshold 1e-08
jeffhammond self-assigned this Jun 16, 2017
@jeffhammond (Member, Author) commented:

I see the same thing in https://github.com/jeffhammond/PRK/blob/9fdcc953e8a962a9d13508e3a3a092c07c05fd45/Cxx11/transpose-cuda.cu, so it is presumably a problem with the low-level implementation.

@jeffhammond (Member, Author) commented:

With CUDA 8.0, I no longer see these issues, at least with OpenCL.

jrhammon@klondike:~/Work/PRK/github-official/Cxx11$ ./transpose-opencl 10 1296
./transpose-opencl: /usr/local/cuda-8.0/targets/x86_64-linux/lib/libOpenCL.so.1: no version information available (required by ./transpose-opencl)
./transpose-opencl: /usr/local/cuda-8.0/targets/x86_64-linux/lib/libOpenCL.so.1: no version information available (required by ./transpose-opencl)
Parallel Research Kernels version 2.16
C++11/OpenCL Matrix transpose: B = A^T
Available OpenCL platforms: 
CL_PLATFORM_NAME=NVIDIA CUDA, CL_PLATFORM_VENDOR=NVIDIA Corporation (DEFAULT)
   CL_DEVICE_NAME=GeForce GTX 960
   CL_DEVICE_VENDOR=NVIDIA Corporation
   CL_DEVICE_AVAILABLE=1
   CL_DEVICE_TYPE=GPU
   CL_DEVICE_MAX_COMPUTE_UNITS=8
   CL_DEVICE_GLOBAL_MEM_SIZE=2090270720
   CL_DEVICE_MAX_CLOCK_FREQUENCY=1228
   CL_DEVICE_MAX_MEM_ALLOC_SIZE=522567680
   CL_DEVICE_LOCAL_MEM_SIZE=49152
   CL_DEVICE_EXTENSIONS contains cl_khr_fp64

CL_PLATFORM_NAME=Intel(R) OpenCL, CL_PLATFORM_VENDOR=Intel(R) Corporation
   CL_DEVICE_NAME=Intel(R) Core(TM) i7-5960X CPU @ 3.00GHz
   CL_DEVICE_VENDOR=Intel(R) Corporation
   CL_DEVICE_AVAILABLE=1
   CL_DEVICE_TYPE=CPU
   CL_DEVICE_MAX_COMPUTE_UNITS=16
   CL_DEVICE_GLOBAL_MEM_SIZE=16645246976
   CL_DEVICE_MAX_CLOCK_FREQUENCY=3000
   CL_DEVICE_MAX_MEM_ALLOC_SIZE=4161311744
   CL_DEVICE_LOCAL_MEM_SIZE=32768
   CL_DEVICE_EXTENSIONS contains cl_khr_fp64

Matrix order          = 1296
Number of iterations  = 10
CPU Precision         = 64-bit
Solution validates
Rate (MB/s): 15035.8 Avg time (s): 0.00178733
GPU Precision         = 64-bit
Solution validates
Rate (MB/s): 20127.7 Avg time (s): 0.00133517

jeffhammond added and removed the OpenCL label Jan 20, 2018