Unhandled Hardware Encoder/Decoder errors should cause OME to exit #1584

SceneCityDev · 2024-04-03T20:32:43Z

It looks like the Intel Quick Sync silicon and/or the Intel GPU drivers have some bugs that can sometimes (after a couple of days) leave the GPU stuck.

When this happens, the following log call in decoder_avc_qsv.cpp is triggered:

logte("An error occurred while sending a packet for decoding: Unhandled error (%d:%s) ", ret, err_msg);

However, once this happens, it happens forever, so you get an endless loop of this unhandled error. The only way to recover is to restart OME.

So, in reality, this is a fatal error.

I "fixed" this simply by adding an exit(1) line after that. This way systemd will handle restarting OME.

The current behavior is bad: a fatal error is completely ignored, and there is no way to monitor it, because the monitoring API claims that all is fine. IMHO unhandled errors in an encoder/decoder should at least cause kill_flag to be set or, even safer, cause OME to terminate.
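The kill_flag idea above could look roughly like the sketch below. This is not OME's code: `SendResult`, `DecoderWorker`, `max_unhandled`, and the `send_packet` callback are hypothetical names made up for illustration, and only the `kill_flag` name is borrowed from the discussion. The assumption is that the decoder loop can tell an unhandled error apart from a transient "try again" result.

```cpp
#include <atomic>
#include <cstdio>
#include <functional>

// Hypothetical classification of a "send packet" result; not OME's actual types.
enum class SendResult { Ok, Again, Unhandled };

// Hypothetical worker that stops itself instead of looping on a fatal error.
struct DecoderWorker {
    std::atomic<bool> kill_flag{false};
    int max_unhandled = 1;  // assumption: the first unhandled error is already fatal

    // Calls send_packet up to `packets` times.
    // Returns true on a clean finish, false if the worker stopped on a fatal error.
    bool Run(const std::function<SendResult()>& send_packet, int packets) {
        int unhandled = 0;
        for (int i = 0; i < packets && !kill_flag.load(); ++i) {
            switch (send_packet()) {
                case SendResult::Ok:
                    unhandled = 0;  // a successful send resets the error streak
                    break;
                case SendResult::Again:
                    break;          // transient condition, keep going
                case SendResult::Unhandled:
                    if (++unhandled >= max_unhandled) {
                        std::fprintf(stderr, "fatal decoder error, stopping worker\n");
                        kill_flag.store(true);  // ends the loop on the next check
                    }
                    break;
            }
        }
        return !kill_flag.load();
    }
};
```

A caller could turn a `false` return into a non-zero process exit code, so that a supervisor such as systemd (with a restart policy configured) brings the process back up, which is the same effect as the exit(1) workaround but with a chance to shut down cleanly first.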

getroot · 2024-04-04T06:08:31Z

Thank you for reporting the issue.
In these cases, it is recommended to regenerate the encoder and decoder, or, if that is not enough, to regenerate the stream. However, we have not yet implemented any handling for hardware encoder crashes, because we believed that Nvidia and Xilinx hardware would not crash.

I think improving this will take a long time.
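The escalation described in this comment (recreate the encoder/decoder first, then fall back to recreating the stream) can be sketched as follows. This is only an illustration of the ordering, not how OME implements or plans to implement it: `Recovery`, `Recover`, and both callbacks are hypothetical names, and each callback is assumed to return true when its recovery step succeeded.

```cpp
#include <functional>

// Hypothetical outcome of a recovery attempt; not part of OME.
enum class Recovery { CodecRecreated, StreamRecreated, Failed };

// Tries the cheaper recovery step first and escalates only if it fails.
// Both callbacks are assumed hooks supplied by the caller.
Recovery Recover(const std::function<bool()>& recreate_codec,
                 const std::function<bool()>& recreate_stream) {
    if (recreate_codec()) {
        return Recovery::CodecRecreated;   // the stream is left untouched
    }
    if (recreate_stream()) {
        return Recovery::StreamRecreated;  // the codec alone was not enough
    }
    return Recovery::Failed;               // the caller may set kill_flag or exit here
}
```

When the result is `Recovery::Failed`, the process-level fallback discussed in this thread (setting kill_flag or exiting so that the service is restarted) would still apply.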

irlkitcom · 2024-04-05T15:26:34Z

> Thank you for reporting the issue.
> In these cases, it is recommended to regenerate the encoder and decoder, or, if that is not enough, to regenerate the stream. However, we have not yet taken action after the hardware encoder crashes. This is because we believed that Nvidia and Xilinx would not crash.
> I think improving this will take a long time.

We used to use NVIDIA Tesla P4s for encoding and they would crash almost daily; no amount of driver updates ever fixed it.

SceneCityDev · 2024-04-05T18:35:46Z

Would an OME restart then work, or would the NVIDIA drivers be permanently broken until a reboot is done?

If restarting OME also works for you, I'd vote to try adding kill_flag, or if that is not enough, exit(1), after the error messages.

Do you have logs of the moment this is happening with NVIDIA? Is it also "Unhandled error"?

irlkitcom · 2024-04-06T00:20:03Z

> Would an OME restart then work, or would the NVidia drivers be permanently broken until a reboot is done?
>
> If restarting OME also works for you, I'd vote to try add kill_flag, or if that is not enough, exit(1) to the error messages.
>
> Do you have logs of the moment this is happening with NVidia? Is it also "Unhandled error"?

Sorry, I should have stated that this wasn't with OME. It was two custom systems, which ran first on Proxmox and then on Windows, and both had issues. On Linux you could recover without a reboot, but on Windows it often caused a Blue Screen. It's been a while and I no longer have the hardware or software, so I cannot test.

getroot assigned Keukhan Apr 4, 2024

getroot added the improvement label (Suggested change or modification to enhance user experience) Apr 4, 2024
