Unhandled Hardware Encoder/Decoder errors should cause OME to exit #1584

Open
SceneCityDev opened this issue Apr 3, 2024 · 4 comments

@SceneCityDev

It looks like the Intel Quick Sync silicon and/or the Intel GPU drivers have some bugs that sometimes (after a couple of days) cause the GPU to get stuck.

When that happens, the following log call in decoder_avc_qsv.cpp is triggered:

logte("An error occurred while sending a packet for decoding: Unhandled error (%d:%s) ", ret, err_msg);

However, once this happens, it happens forever, so you get an endless loop of this unhandled error. The only way to recover is to restart OME.

So, in reality, this is a fatal error.

I "fixed" this simply by adding an exit(1) line after that. This way systemd will handle re-starting OME.

The current behavior is bad: a fatal error is completely ignored, and there is no way to monitor it, because the monitoring API claims that all is fine. IMHO, unhandled errors in an encoder/decoder should at least cause kill_flag to be set or, even safer, cause OME to terminate.
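To make the proposal concrete, here is a minimal sketch of what "treat the unhandled error as fatal" could look like. It is not OME's actual code: `DecoderHealth`, `ClassifySendResult`, and `HandleSendResult` are hypothetical names, the `kill_flag` member is only a stand-in for whatever flag OME uses internally, and the sketch assumes error codes are negated errno values (as FFmpeg's `AVERROR()` produces on POSIX systems).

```cpp
#include <atomic>
#include <cerrno>
#include <cstdio>
#include <cstdlib>

// Hypothetical outcome of one "send packet to the decoder" call.
enum class SendAction { Continue, Retry, Fatal };

// Hypothetical stand-in for the decoder's health state.
struct DecoderHealth {
    std::atomic<bool> kill_flag{false};
};

// Map a return code to an action. Assumption: codes are negated errno
// values; anything that is not explicitly recognized counts as fatal.
inline SendAction ClassifySendResult(int ret) {
    if (ret >= 0) return SendAction::Continue;  // packet accepted
    if (ret == -EAGAIN) return SendAction::Retry;  // decoder is busy, try again
    return SendAction::Fatal;  // the "Unhandled error" branch
}

// On a fatal result, record the failure so it becomes visible to monitoring,
// and optionally terminate so a supervisor such as systemd can restart us.
inline SendAction HandleSendResult(int ret, DecoderHealth &health, bool exit_on_fatal) {
    const SendAction action = ClassifySendResult(ret);
    if (action == SendAction::Fatal) {
        std::fprintf(stderr, "Unhandled decoder error (%d), marking decoder as failed\n", ret);
        health.kill_flag.store(true);
        if (exit_on_fatal) {
            std::exit(1);  // non-zero exit status signals failure to the supervisor
        }
    }
    return action;
}
```

With `exit_on_fatal` set to false this only raises the flag, which corresponds to the milder variant; setting it to true corresponds to the exit(1) workaround described above.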

@getroot
Sponsor Member

getroot commented Apr 4, 2024

Thank you for reporting the issue.
In these cases, it is recommended to regenerate the encoder and decoder, or, if that is not enough, to regenerate the stream. However, we have not yet taken action after the hardware encoder crashes. This is because we believed that Nvidia and Xilinx would not crash.

I think improving this will take a long time.
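As an illustration only, the escalation described above (regenerate the encoder/decoder first, then the stream) could be tracked with a small counter like the one below. `RecoveryLadder` and `RecoveryStep` are hypothetical names, the attempt limits are arbitrary, and the final `ExitProcess` step reflects the issue author's suggestion rather than anything OME currently does.

```cpp
// Hypothetical recovery steps, ordered from least to most disruptive.
enum class RecoveryStep { RecreateCodec, RecreateStream, ExitProcess };

// Tracks consecutive hardware codec failures and picks the next step.
class RecoveryLadder {
public:
    RecoveryLadder(int codec_attempts, int stream_attempts)
        : codec_attempts_(codec_attempts), stream_attempts_(stream_attempts) {}

    // Call after each failure; returns the step to try next.
    RecoveryStep OnFailure() {
        ++failures_;
        if (failures_ <= codec_attempts_) {
            return RecoveryStep::RecreateCodec;
        }
        if (failures_ <= codec_attempts_ + stream_attempts_) {
            return RecoveryStep::RecreateStream;
        }
        return RecoveryStep::ExitProcess;  // nothing in-process helped
    }

    // Call after a successful encode/decode to reset the escalation.
    void OnSuccess() { failures_ = 0; }

private:
    int codec_attempts_;
    int stream_attempts_;
    int failures_ = 0;
};
```

For example, `RecoveryLadder ladder(2, 1)` would ask for two codec recreations, then one stream recreation, and only then give up and request process exit.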

@getroot getroot added the improvement Suggested change or modification to enhance user experience label Apr 4, 2024
@irlkitcom
Contributor

irlkitcom commented Apr 5, 2024

> Thank you for reporting the issue.
> In these cases, it is recommended to regenerate the encoder and decoder, or, if that is not enough, to regenerate the stream. However, we have not yet taken action after the hardware encoder crashes. This is because we believed that Nvidia and Xilinx would not crash.
> I think improving this will take a long time.

We used to use NVIDIA Tesla P4s for encoding, and they would crash almost daily. No amount of driver updates ever fixed it.

@SceneCityDev
Author

Would an OME restart then work, or would the NVidia drivers be permanently broken until a reboot is done?

If restarting OME also works for you, I'd vote to try adding kill_flag, or if that is not enough, exit(1) to the error messages.

Do you have logs of the moment this is happening with NVidia? Is it also "Unhandled error"?

@irlkitcom
Contributor

irlkitcom commented Apr 6, 2024

> Would an OME restart then work, or would the NVidia drivers be permanently broken until a reboot is done?
>
> If restarting OME also works for you, I'd vote to try adding kill_flag, or if that is not enough, exit(1) to the error messages.
>
> Do you have logs of the moment this is happening with NVidia? Is it also "Unhandled error"?

Sorry, I should have stated that this wasn't with OME. This was on two custom systems, which ran on Proxmox first and then on Windows, and both had issues. On Linux, you could recover without a reboot, but on Windows it often caused a Blue Screen. It's been a while, and I no longer have the hardware or software, so I cannot test.
