
Bug in CCMpred (CUDA) #6

Open
fsimkovic opened this issue Oct 13, 2016 · 4 comments

Comments

@fsimkovic

fsimkovic commented Oct 13, 2016

Running a CUDA-compiled build of CCMpred on a sequence alignment sometimes crashes. The error given is:

adenine: felix > ccmpred alignments/1bdo.jones 1bdo.mat
Found 1 CUDA devices, using device #0: Quadro K4000
Total GPU RAM:      3,217,752,064
Free GPU RAM:       2,617,708,544
Needed GPU RAM:       792,606,940 ✓
CUDA error No. 0 in /opt/CCMpred/src/evaluate_cuda_kernels.cu at line 819

Running the same command with the flag -t 2 works fine.

@sseemayer sseemayer added the bug label Oct 13, 2016
@sseemayer
Contributor

Hi Felix, I don't have access to a suitable GPU/computer combination to debug this at the moment so I'm afraid that I will not be able to help 😞

@fsimkovic
Author

No worries, the CPU version works fine so there's no rush. Just thought I'd report it ...

@tianmingzhou

I encountered a similar error. The cause seems to be that I fed CCMpred too many sequences (~70k). (The error code I got was 6.)
In addition, the macro CHECK_ERR(err), defined in include/evaluate_cuda_kernels.h and lib/libconjugrad/include/conjugrad_kernels.h (and possibly other files), uses its argument more than once after expansion, so invocations such as those in src/evaluate_cuda_kernels.cu can call cudaGetLastError() multiple times. According to http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__ERROR.html#group__CUDART__ERROR_1g3529f94cb530a83a76613616782bd233, cudaGetLastError() resets the error state to cudaSuccess, so by the time the code is printed it has already been cleared. That is why we always get "CUDA error No. 0". An approach like https://codeyarns.com/2011/03/02/how-to-do-error-checking-in-cuda/ may be a solution.
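A minimal sketch of the function-based idea, in plain C so it compiles without the CUDA toolkit. The names fake_get_last_error(), check_error_at() and CHECK_ERR_FN are hypothetical; fake_get_last_error() only imitates the documented reset-on-read behaviour of cudaGetLastError(), and the helper returns the code instead of calling exit() so the behaviour can be checked.

```c
#include <stdio.h>

/* Hypothetical stand-in for cudaGetLastError() so the sketch runs without a
   GPU: reading the pending error returns it and resets it to 0 (cudaSuccess). */
static int pending_error = 0;

static int fake_get_last_error(void) {
    int e = pending_error;
    pending_error = 0;   /* reading the error clears it */
    return e;
}

/* Function-based check: the error expression is evaluated once, when it is
   passed as an argument, so the value that is tested is also the value that
   is printed. A real version would call exit(EXIT_FAILURE) on error; this one
   returns the reported code so the behaviour can be verified. */
static int check_error_at(int err, const char *file, int line) {
    if (err != 0) {
        printf("CUDA error No. %d in %s at line %d\n", err, file, line);
    }
    return err;
}

/* Hypothetical macro name: the preprocessor is only needed for __FILE__ and
   __LINE__; the error value itself goes through a function parameter. */
#define CHECK_ERR_FN(err) check_error_at((err), __FILE__, __LINE__)
```

Because C evaluates each function argument exactly once before the call, a real CHECK_ERR_FN(cudaGetLastError()) would report the actual code (for example the 6 mentioned above) rather than 0; a real helper could also print cudaGetErrorString(err).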

@croth1 croth1 added the cuda label Jan 1, 2019
@kWeissenow

This issue is still present: the real error codes are hidden and the message always shows No. 0.
The cause is the error check via
CHECK_ERR(cudaGetLastError());
which is not a function but a preprocessor macro, defined as
#define CHECK_ERR(err) {if (cudaSuccess != (err)) { printf("CUDA error No. %d in %s at line %d\n", (err), __FILE__, __LINE__); exit(EXIT_FAILURE); } }
in evaluate_cuda_kernels.h, line 9. It therefore expands to two calls to cudaGetLastError(): the first, in the condition, returns the actual error and resets it, and the second, in the printf arguments, then returns cudaSuccess (0), which is what gets displayed.

I suggest changing the macro to
#define CHECK_ERR(err) { int e = (err); if (cudaSuccess != e) { printf("CUDA error No. %d in %s at line %d\n", e, __FILE__, __LINE__); exit(EXIT_FAILURE); } }
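To see why the original macro always prints 0 while the suggested one does not, the plain-C sketch below reproduces both shapes side by side. fake_get_last_error(), CHECK_ERR_ORIG, CHECK_ERR_FIXED, run_orig() and run_fixed() are hypothetical names; the fake function only mimics the reset-on-read behaviour of cudaGetLastError(), and both macros record the code instead of calling exit() so the demo keeps running.

```c
#include <stdio.h>

/* Hypothetical stand-in for cudaGetLastError(): returns the pending error and
   resets it to 0 (cudaSuccess), so both macros can be compared without a GPU. */
static int pending_error = 0;
static int get_calls = 0;

static int fake_get_last_error(void) {
    int e = pending_error;
    pending_error = 0;   /* reading the error clears it */
    get_calls++;
    return e;
}

/* The code each macro would have printed (-1 means no error was seen). */
static int reported = -1;

/* Shape of the original CHECK_ERR: `err` is pasted into the condition AND the
   body, so the argument expression is evaluated twice. */
#define CHECK_ERR_ORIG(err) { if (0 != (err)) { reported = (err); printf("CUDA error No. %d\n", reported); } }

/* Shape of the suggested fix: evaluate `err` once into a local variable. */
#define CHECK_ERR_FIXED(err) { int e = (err); if (0 != e) { reported = e; printf("CUDA error No. %d\n", e); } }

/* Simulate a failed launch with `code`, run one macro, return what it reported. */
static int run_orig(int code) {
    pending_error = code; get_calls = 0; reported = -1;
    CHECK_ERR_ORIG(fake_get_last_error());
    return reported;
}

static int run_fixed(int code) {
    pending_error = code; get_calls = 0; reported = -1;
    CHECK_ERR_FIXED(fake_get_last_error());
    return reported;
}
```

With a simulated error of 6, run_orig() reads the error twice and reports 0, while run_fixed() reads it once and reports 6. An optional refinement, not part of the suggestion above, is to wrap the macro body in do { ... } while (0) so that a call like `if (x) CHECK_ERR(...); else ...` still parses.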
