
Intermittent SIGSEGV errors crashing HeavyDB #819

Open
anirudh-here-com opened this issue Dec 8, 2023 · 6 comments

@anirudh-here-com

Version: 6.4.0
While running some queries against HeavyDB, SIGSEGV errors occur randomly, causing the DB to crash and creating outages.
Is there any way to debug or fix this?
HeavyDB.cpp:332 Interrupt signal (11) received.

@cdessanti
Contributor

Could you please share the product logs? The logs can be found in the storage directory, typically located at /var/log/heavyai/storage/log. They are named heavydb.INFO.*

It is essential to check the logs to identify the problem. Is there a specific reason why you're using version 6.4 when versions 7.0 and 7.1 are available?

@anirudh-here-com
Author

I have done a detailed analysis and found the cause.
This happens when I do a select_ipc_gpu on the database and it returns 0 records.

@anirudh-here-com
Author

anirudh-here-com commented Dec 11, 2023

This can be easily replicated by using the heavyai library's select_ipc_gpu function:

import heavyai
conn = heavyai.connect(user=<user>, password=<pass>, dbname=<dbname>)
conn.select_ipc_gpu(<any select query which returns 0 rows>)
# raises thrift.transport.TTransport.TTransportException: TSocket read 0 bytes

The reason we're using 6.4 is that we have some custom patches for our use cases.

Is this fixed in the latest version, 7.1?
If so, we might migrate to the latest version.

Thanks,

@cdessanti
Contributor

Hi,

Thanks for reporting the issue. I will try to reproduce it on our end. If I am successful, I will create an internal case for our engineering team to fix the problem.

Can you try running your application without using GPU shared memory as a temporary solution?
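A minimal sketch of that workaround, assuming the heavyai connection also exposes the CPU-based select_ipc path and a standard execute cursor; the credentials, table, query, and fallback logic below are illustrative, not part of this report:

import heavyai

# Hypothetical connection parameters; substitute your own.
conn = heavyai.connect(user="admin", password="HyperInteractive",
                       host="localhost", dbname="heavyai")

# Hypothetical query that returns 0 rows, mirroring the reported crash case.
query = "SELECT * FROM my_table WHERE 1 = 0"

# Check the row count first and only take the GPU shared-memory path when
# there is something to transfer; empty results fall back to the CPU path.
count = list(conn.execute(f"SELECT COUNT(*) FROM ({query}) AS t"))[0][0]
if count > 0:
    df = conn.select_ipc_gpu(query)   # GPU shared memory (cudf DataFrame)
else:
    df = conn.select_ipc(query)       # CPU shared memory (pandas DataFrame)

The idea is simply to avoid calling select_ipc_gpu for an empty result set, which is the case that triggers the reported SIGSEGV.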

Also, I am interested in your modifications to the database to support your application. Could you please share what they do?

Best regards,
Candido

@anirudh-here-com
Author

Thanks for your reply.
Unfortunately, using GPU shared memory is required and cannot be dropped.
Regarding the modifications, I plan to raise a pull request for them.

Please let me know if you're able to replicate the issue on your end.
Thanks,
Anirudh

@cdessanti
Contributor

Hi,

Using CUDA 11.8 and the latest GA version (7.2.1), I was able to reproduce the issue on my end. I have created an internal ticket to get it resolved.

I'll come back here when the problem is fixed.
