Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent critical failure of audio services #2988

Open
wants to merge 1 commit into
base: dev
Choose a base branch
from

Conversation

krisgesling
Copy link
Contributor

Description

If an audioservice throws an exception when starting playback, the service_lock is never released. Playback cannot be initiated again until the Audio Service is restarted.

I haven't yet found exactly what's failing, and hence how to replicate the issue. Supposedly VLC was not entirely crashing, instead hanging and blocking forever in one of three methods in self.play():

        selected_service.clear_list()
        selected_service.add_list(tracks)
        selected_service.play(repeat)

Thanks to @JarbasAI @HelloChatterbox for catching this.

How to test

Need to find a way to reliably reproduce it first...

Contributor license agreement signed?

  • CLA

@pep8speaks
Copy link

pep8speaks commented Aug 31, 2021

Hello @krisgesling! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-08-31 20:09:49 UTC

@devops-mycroft devops-mycroft added the CLA: Yes Contributor License Agreement exists (see https://github.com/MycroftAI/contributors) label Aug 31, 2021
@forslund
Copy link
Collaborator

forslund commented Aug 31, 2021

The context manager for the lock handles the exception automatically so the explicit exception handling is not really necessary. The change looks good though, handling broken service objects 👍

Buuut if there is a hang somewhere neither context manager or explicit exceptions would help...

LOG.debug(service.name + ' audioservice would be preferred')
break
except Exception:
LOG.error(f'Failed to parse audio service name: {service}')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: Maybe the wording can be improved? This sounds a bit like service is expected to be a string that's parsed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed to:
"Failed to read audio service name: {service}"

@devops-mycroft
Copy link

devops-mycroft commented Aug 31, 2021

Voight Kampff Integration Test Succeeded (Results)

@forslund
Copy link
Collaborator

@JarbasAI mentioned the VLC version in chat. It may be useful to update the requirement, it's been a long while.

If a plugin throws an exception when starting playback the
service_lock is never released. Playback can not be initiated
again until the audio service is restarted.

Thanks to @JarbasAI @HelloChatterbox for catching this.
@krisgesling
Copy link
Contributor Author

@JarbasAI mentioned the VLC version in chat. It may be useful to update the requirement, it's been a long while.

Yeah, thought I'd do that in a separate PR. Potentially pull the service out to a plugin at the same time.

Handler for mycroft.audio.service.play

Args:
message: message bus message
"""
with self.service_lock:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As @forslund mentioned, using this context manager means that exceptions thrown within this block will correctly release the lock. Better exception handling and logging good, but it's not actually addressing the reported issue.

More likely, there needs to be a timeout parameter (with a sane default) for the play methods that are calling out to separate services/processes. If the code blocks indefinitely on an external call, there's nothing to make this lock release short of a service restart.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the original issue was indeed vlc itself hanging forever and needing a restart, i wish it threw an exception 😅

@mcdonc
Copy link

mcdonc commented Sep 13, 2021

@JarbasAI mentioned the VLC version in chat. It may be useful to update the requirement, it's been a long while.

Yeah, thought I'd do that in a separate PR. Potentially pull the service out to a plugin at the same time.

It might be the place to start. 1.1.2 was released in 2015 (https://pypi.org/project/python-vlc/1.1.2/ ) and the most current release was April 2021.

@mcdonc
Copy link

mcdonc commented Sep 14, 2021

FWIW, I locally updated python-vlc to the latest (python-vlc==3.0.12118), changed the default audio backend to vlc, and it indeed played the news and passed audiobackend unit tests successfully at least using the newer lib. I know this doesn't mean much for testing purposes, as the unit tests use a totally mocked vlc; I'll see if I can get the VK tests running in a config that prefers to use vlc.

Another potential workaround for hangs would be to run the service processes using Supervisor (http://supervisord.org). It has an event system that can be hooked to control processes that run under it. If, say, each service emitted some sort of "I'm-still-alive" event every so often, a supervisor event listener could be written that, were it to not receive such an event within some timeframe, it could kill and restart the service. A example of such an event listener is "superlance", which attempts to get a 200 OK from an http server every so often, and if it doesn't, will restart one or more processes that are run under supervisor. https://github.com/Supervisor/superlance/blob/master/superlance/httpok.py .

Anyway, introduction of supervisor would be a bit invasive, but its introduction might be practical in the face of buggy libs, tough-to-test edge cases and whatnot.

@mcdonc
Copy link

mcdonc commented Sep 15, 2021

I created a branch in a fork of mycroft-core to document attempting to run the vk tests using the newer python-vlc lib version here: https://github.com/MycroftAI/mycroft-core/compare/dev...mcdonc:feature/newpythonvlc?expand=1

These tests passed, FWIW.

@mcdonc
Copy link

mcdonc commented Sep 15, 2021

And here's another branch as a proof of concept to start "all" services under supervisor, just in case it interested you.

https://github.com/MycroftAI/mycroft-core/compare/dev...mcdonc:feature/supervisorservices?expand=1

It does not yet implement an eventlistener to restart hung services, but it would restart services when they crash. It also rotates service log files as necessary.

@krisgesling
Copy link
Contributor Author

Thanks @mcdonc, I hadn't seen supervisord before, and haven't had a good look through the package yet, but we had been thinking about whether to go down a similar path:
https://docs.google.com/document/d/1REXWaUt1HUrbyaunB09cfQ435krdNwhjVbMVVG3iNKs/edit#

We haven't started any work on it yet, and a big question for me is how much we can leverage what systemd already does vs what we need to control in a different way etc.

I'd be interested in any thoughts you have on it.

@mcdonc
Copy link

mcdonc commented Sep 29, 2021

systemd user services would work just as well to restart crashed services, I think. supervisord was written before systemd existed, and has been maintained since then pretty well; I have used it on many projects but haven't really kept up with features added to systemd that it didn't have during that period. In particular, I don't know if systemd has the equivalent of supervisor's event listener system (http://supervisord.org/events.html) and log rotation. If it does, I'd go with systemd, unless it's of some maintenance benefit for the supervisor to be wrirtten in Python. Full disclosure: I am the original author of supervisord.

In any case, having some parent process restart crashed services would be a good start. I'm also not sure if the current configuration rotates log files; that can be a pain to configure in Python logging.

But having read your spec document, I definitely would not write another bespoke process manager.

EDIT: I looked at Jinx's comment to your process spec document and it seems that he has created a set of systemd units for Mycroft at https://github.com/j1nx/mycroft-systemd that looks promising by its description. I don't quite understand the "sd_notify" call nor the "software watchdog" he mentions in the document, but it likely covers most of what supervisor's event listener system might for you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA: Yes Contributor License Agreement exists (see https://github.com/MycroftAI/contributors)
Projects
No open projects
Status: High priority
Development

Successfully merging this pull request may close these issues.

None yet

7 participants