Unhandled Hardware Encoder/Decoder errors should cause OME to exit #1584
Comments
Thank you for reporting the issue. I think improving this will take a long time.
We used to use NVIDIA Tesla P4s for encoding and they would crash almost daily; no amount of driver updates ever fixed it.
Would an OME restart then work, or would the NVIDIA drivers be permanently broken until a reboot is done? If restarting OME also works for you, I'd vote to try adding kill_flag or, if that is not enough, exit(1) to the error messages. Do you have logs of the moment this happens with NVIDIA? Is it also "Unhandled error"?
Sorry, I should have stated that this wasn't with OME. These were two custom systems, which ran first on Proxmox and then on Windows, and both had issues. On Linux you could recover without a reboot, but on Windows it often caused a blue screen. It's been a while, and I no longer have the hardware or software, so I cannot test.
It looks like the Intel Quick Sync silicon and/or the Intel GPU drivers have some bugs, where sometimes (after a couple of days) the GPU can get stuck.
When this happens, the following line in decoder_avc_qsv.cpp is triggered:
logte("An error occurred while sending a packet for decoding: Unhandled error (%d:%s) ", ret, err_msg);
However, once this happens, it happens forever, so you get an endless loop of this unhandled error. The only way to recover is to restart OME.
So, in reality, this is a fatal error.
I "fixed" this simply by adding an exit(1) line after that log call. This way systemd handles restarting OME.
The way it is right now is bad: a fatal error is completely ignored, and there is no way to monitor it, because the monitoring API claims that all is fine. IMHO unhandled errors in an encoder/decoder should at least cause kill_flag to be set or, even safer, cause OME to terminate.
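
To make the proposal concrete, here is a minimal sketch of one way to escalate repeated unhandled codec errors. It is not OME code: `CodecErrorGuard`, its methods, and the threshold are hypothetical names chosen for illustration. It is also a slightly softer variant of the one-line exit(1) fix described above, since it only escalates after several consecutive failures rather than on the first one.

```cpp
#include <atomic>

// Hypothetical helper (not part of OME): counts consecutive unhandled
// codec errors and latches a "fatal" flag once a threshold is reached.
class CodecErrorGuard
{
public:
    explicit CodecErrorGuard(int max_consecutive_errors)
        : _max(max_consecutive_errors)
    {
    }

    // Call after every successful send/receive; resets the error streak.
    void OnSuccess()
    {
        _consecutive = 0;
    }

    // Call when the codec reports an unhandled error.
    // Returns true once the error should be treated as fatal.
    bool OnUnhandledError()
    {
        ++_consecutive;
        if (_consecutive >= _max)
        {
            _fatal.store(true);
        }
        return _fatal.load();
    }

    // Can be polled from another thread, e.g. by a monitoring endpoint.
    bool IsFatal() const
    {
        return _fatal.load();
    }

private:
    int _max;
    int _consecutive = 0;
    std::atomic<bool> _fatal{false};
};
```

In the decoder loop, the branch that logs the unhandled error could call `OnUnhandledError()` and, when it returns true, either set the thread's kill flag or call `std::exit(1)` so that a supervisor such as systemd restarts the process. Exposing `IsFatal()` to monitoring would also address the complaint that the API reports everything as fine.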