
Use Home-Assistant pipeline for TTS and STT #195

Open · VladFlorinIlie opened this issue Jun 21, 2023 · 33 comments

@VladFlorinIlie

Currently Willow uses only the intent part of the HA Assist pipeline.
It would be nice if users could choose whether to use the entire pipeline (i.e. also the TTS and STT it provides).
Would this be a feature you might take into consideration?

@nikito (Contributor) commented Jun 22, 2023

I of course defer to the Willow devs on this, but I think one of the big problems is that the HA TTS/STT systems are just vastly slower than the WIS implementation using a GPU (a cheap GTX 1070 in my case). I have both in my lab, and for a real-world comparison I can ask HA "What's the humidity outside?" (a custom intent I wrote), and it takes nearly 7 seconds to do the STT and TTS using just the tiny-int8 model:
[screenshot: HA pipeline timing]

By comparison, the same request on WIS using the Medium model only took 1.27 seconds:
[screenshot: WIS timing]
I've also tried the Large v2 model with a beam size of 5, and even that can do it in around 1.5-2 seconds. For my needs I have yet to find anything the Medium model hasn't handled accurately, including random requests like "Watch MoonMoonOW on Twitch" or "Tune to Disney Junior".

So even if it were possible to do this, I don't think it'd really be beneficial from a user experience standpoint.

Just my two cents - figured I'd offer some real-world comparisons between the two systems. 😄

EDIT: had my math wrong on the WIS part 😄
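
For anyone who wants to reproduce this kind of comparison, here is a minimal timing sketch. It assumes the WIS endpoint accepts a POSTed WAV body and returns the transcript as text (check the WIS API documentation for the exact contract); `sample.wav` is a short clip you supply.

```python
# Minimal latency check against a WIS endpoint (illustrative only).
# Assumes the endpoint accepts a POSTed WAV body and returns the transcript;
# verify the exact request format in the WIS API documentation.
import time

import requests

WIS_URL = "https://wisng.tovera.io/api/willow"  # endpoint mentioned in this thread

with open("sample.wav", "rb") as f:  # your own short speech clip
    audio = f.read()

start = time.perf_counter()
resp = requests.post(WIS_URL, data=audio, timeout=30)
elapsed = time.perf_counter() - start

print(f"STT round trip: {elapsed:.2f}s")
print("Transcript:", resp.text)
```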

@kristiankielhofner (Contributor)

Thanks for the benchmark!

Use of the HA approaches for STT/TTS is likely not something we're going to support (for the reasons you highlight and more). That said, WIS in CPU mode on most x86_64 systems also dramatically outperforms the HA implementations.

We keep running into this issue, and my general position (as harsh/absolutist as it may be) is that the current HA approaches to voice, STT, TTS, etc. just aren't practical. In seven seconds you can find your phone in your house, unlock it, open an app, and just do it there - assuming the tiny model even gets the transcription right in the first place (it often doesn't). At that point you're looking at 14 seconds or more (total) to repeat yourself, and there are many speech segments it will never transcribe accurately.

The wisng README has Raspberry Pi benchmarks (as one example) for the implementation HA uses, and long story short, apples-to-apples (tiny to tiny) a GTX 1070 is 90x faster just for STT. When you get to models that can provide commercially competitive quality (medium), it's at least 112x faster. The Raspberry Pi takes a whopping 51 seconds to run STT with medium on 3.8 seconds of speech!

It's not at all a fair fight, but it's a dramatic demonstration of just how impractical the Raspberry Pi approach is. At the risk of sounding overly critical: HA is fantastic, and Willow wouldn't do anything without it. But speech and ML tasks are just a completely different animal, one that Willow and WIS are highly targeted for.

If anything we'd go the other direction - a WIS component for HA so HA can use WIS as a STT/TTS endpoint elsewhere within HA.

@kevdliu commented Jun 22, 2023

I totally see your concerns regarding the performance of on-device STT and TTS with Home Assistant. The beauty of HA, though, is that it allows you to customize the STT and TTS engines of the Assist pipeline. I'm currently using Nabu Casa for both and the performance is very satisfactory. I realize that by using HA's full pipeline we would basically be bypassing the majority of Willow's functionality, so it's totally fair if you think that's not the goal of the project. Since Willow + ESP S3 Box is a readily available setup with wake-word capability, I think a lot of people are looking to fully integrate it with the Assist pipeline.

@kristiankielhofner (Contributor)

My last sentence from the prior reply:

"If anything we'd go the other direction - a WIS component for HA so HA can use WIS as a STT/TTS endpoint elsewhere within HA."

That is: WIS used elsewhere in the HA pipeline (Willow-related or not) as STT/TTS within HA.

In terms of what you're describing, none of this is impossible (or even close to it). That said, we cannot be solely responsible not only for Willow, WIS, and WAS, but also for Home Assistant, openHAB, Hubitat, and the countless other native platforms and integrations people have asked for. That is better left to those communities. We brought the best voice interface at the best price point, with the highest performance and accuracy available in open source. This is our focus, and if a community wants functionality like you describe, I don't think it is too much to ask for them to put in a little effort to bring our base functionality to their platform of choice - in the way they want it.

I would love to see a Willow component with everything you describe and more in HA, but it's not going to come from us.

@kevdliu commented Jun 22, 2023

That's totally fair considering the goal of this project. Thanks for taking the time to explain.

@kevdliu commented Jun 22, 2023

Looking at the code for Willow and WIS, I imagine I can stand up some sort of proxy server on HA that pipes audio from Willow to the Assist pipeline. The TTS part would be a lot harder to integrate since it looks like Willow is doing it on-device for now. When/if the TTS output feature in the README is implemented, that part shouldn't be too difficult either. I guess there goes all of my free time 😆. Anyway, thank you and everyone else involved in making this project possible. Excited to see the future of local voice assistance.

@kristiankielhofner (Contributor)

This is generally the approach we are thinking of.

The idea (essentially) is:

  • Willow Application Server for management of devices, configuration, etc. (devices maintain a persistent websocket connection to WAS - see the sketch after this list).
  • We will add a WAS protocol endpoint to Willow.
  • All communication (from management to audio) will move to the WAS endpoint (when configured) and be proxied by WAS (or be provisioned from WAS for direct connection in higher-scale scenarios - likely doesn't apply to community users).
  • WAS will also have HA, openHAB, REST, etc. components (and more) to handle communication between Willow, WIS, the command endpoint (HA), other applications, APIs, NLP processing for grammar matching to various integrations, configuration of STT/TTS engines, etc.
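
To make the first bullet concrete, here is a minimal sketch of the device side of such a persistent connection. The URL, port, and message shapes are invented for illustration; the actual WAS protocol was still being designed at this point.

```python
# Illustrative sketch of a device holding a persistent websocket to WAS.
# The URL and message format are hypothetical, not the actual WAS protocol.
import asyncio
import json

import websockets  # pip install websockets

async def run_device(was_url: str = "ws://was.local:8502/ws") -> None:
    async with websockets.connect(was_url) as ws:
        # Announce the device so WAS can push config, OTA, and commands later.
        await ws.send(json.dumps({"type": "hello", "hostname": "willow-kitchen"}))
        async for raw in ws:  # stay connected and react to pushed messages
            print("WAS says:", json.loads(raw))

asyncio.run(run_device())
```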

Willow Application Server is under development, and for the 1.0 release it will only support management and update functionality. We will then work on implementing command-endpoint and audio-proxy support in WAS.

To maintain our goal of being platform agnostic, WAS itself can run standalone. However, it's relatively simple for just management, configuration, and audio proxying, so an HA component, add-on, or similar could essentially emulate WAS for use by Willow devices and integrate natively within HA. It would be very light and straightforward: HA already has intents, Assist, TTS, STT, etc. frameworks, so a native HA WAS component would be dramatically simpler to implement than WAS itself, where we need to more-or-less duplicate and then expand the functionality HA has today.

As you can probably tell, this is all pretty early, and while it may seem convoluted for this discussion, it boils down to a Willow component in HA that not only emulates WAS for Willow management, audio, etc., but does so by exposing WIS as TTS and STT within HA for full Assist pipeline support, so it becomes native and seamless. My continuing concern is that abstracting WIS TTS and STT via HA could reduce the user experience due to response time or other unforeseen issues.

I think this will be perfectly acceptable to many HA users, but as @nikito has illustrated, the grammar, accuracy, speed, etc. of Willow enable a lot of speech and command flows that are very convoluted, difficult, or outright impossible with HA intents (and HA generally). There are basic voice-assistant tasks (setting timers, reminders, asking the time, setting an alarm, checking a calendar, etc.) that don't really make sense for HA to be involved in. It can actually get fairly difficult; consider the timer scenario. You could probably do something with HA scripts, but with WAS there will just be a timer app you include in the grammar plan to activate. When the timer is up (or an alarm, whatever), WAS sends an event to the Willow device, which prints it on the display, plays audio, etc. Same for Google Calendar or any of the other things people are doing with Alexa today.

We plan on integrating Rasa into WAS so we can build extremely flexible grammar with excellent NLU/NLP capabilities, to avoid the awkward syntax you have with Alexa Skills today ("Alexa, ask My Ford to turn the car on") and some of the limitations of HA. Rasa also adds significant capabilities in terms of session awareness, context, turn-by-turn dialogue, etc. that enable all kinds of interesting agent possibilities, especially considering WIS also supports hosting LLMs.

@A6blpka commented Jun 24, 2023

I think @nikito's comparison is wrong.
I created a small STT component for HA that accesses /api/willow.
As an alternative to WIS, I use my own solution (C#, Vosk) that implements STT and TTS for Willow. It is the one that HA refers to for STT.
After a number of tries, I have come to the conclusion that the STT time shown by HA includes the duration of the speech itself!
[screenshot]
The picture shows my best result, where I stop recording with a click. The worst result with end-of-speech detection from HA is ~1.8 s, the average 1.4 s.
And live recording: https://www.youtube.com/watch?v=WrEhKChrYu4
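
For the curious, here's a heavily simplified sketch of what such an HA STT provider can look like. HA's stt API has changed across versions, so treat the class and attribute names as illustrative, and the assumption that WIS returns the transcript as plain text from a POSTed body as exactly that - an assumption.

```python
# Simplified sketch of an HA STT provider that forwards audio to WIS.
# Based on HA's legacy stt Provider interface; not the actual published component.
from collections.abc import AsyncIterable

import aiohttp
from homeassistant.components.stt import (
    AudioBitRates, AudioChannels, AudioCodecs, AudioFormats, AudioSampleRates,
    Provider, SpeechMetadata, SpeechResult, SpeechResultState,
)

WIS_URL = "https://wisng.tovera.io/api/willow"  # or your local WIS install

class WisSttProvider(Provider):
    @property
    def supported_languages(self) -> list[str]:
        return ["en", "ru"]

    @property
    def supported_formats(self) -> list[AudioFormats]:
        return [AudioFormats.WAV]

    @property
    def supported_codecs(self) -> list[AudioCodecs]:
        return [AudioCodecs.PCM]

    @property
    def supported_bit_rates(self) -> list[AudioBitRates]:
        return [AudioBitRates.BITRATE_16]

    @property
    def supported_sample_rates(self) -> list[AudioSampleRates]:
        return [AudioSampleRates.SAMPLERATE_16000]

    @property
    def supported_channels(self) -> list[AudioChannels]:
        return [AudioChannels.CHANNEL_MONO]

    async def async_process_audio_stream(
        self, metadata: SpeechMetadata, stream: AsyncIterable[bytes]
    ) -> SpeechResult:
        # Collect the streamed audio and hand it to WIS in one request.
        audio = b"".join([chunk async for chunk in stream])
        async with aiohttp.ClientSession() as session:
            async with session.post(WIS_URL, data=audio) as resp:
                text = await resp.text()
        return SpeechResult(text, SpeechResultState.SUCCESS)
```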

@A6blpka commented Jun 24, 2023

@kristiankielhofner
What about alarm clocks? I, for example, would like to open the curtains 5 minutes before the alarm.
And reminders: I can go to another room, or out of the house, and then the reminder should come to my phone.
There are lots of ways to use this; I think everyone who wants to will find their own, given the chance.

I almost forgot the shopping list! Right now I just tell my voice assistant to add something to my shopping list, and in the store I check the boxes on my phone. Wonderful.

@faduchesne

> I think @nikito comparison is wrong. I created a small STT component for HA that accesses /api/willow. […]

Can you share your component?

@nikito (Contributor) commented Jun 24, 2023

> I think @nikito comparison is wrong. I created a small STT component for HA that accesses /api/willow. […]

I don't think anything is wrong with what I posted. The time HA shows does indeed include TTS (I believe the HA devs mentioned that when they went over this in the launch party). The total time was around 6 seconds; the same example on WIS, including TTS, was 1.27 seconds. I also cleared the cache on both sides to eliminate that factor and make it a true test of both the STT and TTS. Note the HA time is with the small model, while the WIS time is with the medium model, which is more accurate on complex sentences such as my examples. I'm also using the stock Whisper component in HA, as that is what the OP was referring to when mentioning the Assist pipeline's STT and TTS. 🙂

@A6blpka commented Jun 24, 2023

> Can you share your component?

Yeah, I'll publish it tomorrow.

@A6blpka commented Jun 24, 2023

> Don't think anything is wrong with what I put? The time HA shows indeed includes TTS […]

I wasn't comparing TTS. And the comparison between different models seems strange.

My point is that the complete STT cycle with the same model seems to take about the same time, both for the variant 'Willow -> WIS -> Willow -> HA' and for the variant 'Willow -> HA -> SOME_STT -> HA'.

@kristiankielhofner (Contributor) commented Jun 24, 2023

@A6blpka - FYI on your YouTube comparison video - remember voice activity detection. When you're using the HA interface, you end recording of speech with an instant mouse click. With Willow we have to wait a reasonable amount of time to ensure the speaker has finished speaking before we stop recording. Check Advanced settings -> VAD Timeout. The default is 300 ms, which is pretty conservative. If you lower it (I use 100 ms personally), it will detect the end of speech faster, and the ultimate response time drops by the difference in value.

Note that VAD is tricky - in your example video you are smoothly and clearly firing off a command you've probably repeated many times recently. In practice most voice commands aren't that smooth: users speak slowly, hesitate between words, etc., so really low VAD times are usually only good for benchmarking.
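
As a toy illustration of that trade-off (not Willow's actual implementation), here's an end-of-speech check in the spirit of a VAD timeout: recording stops once audio energy stays below a threshold for the configured number of trailing milliseconds.

```python
# Toy VAD-timeout check, illustrative only (not Willow's implementation).
# frames: successive 30 ms chunks of 16-bit mono PCM audio.
import audioop  # stdlib; deprecated in 3.11, removed in 3.13

def should_stop(frames: list[bytes], timeout_ms: int = 300,
                frame_ms: int = 30, threshold: int = 500) -> bool:
    needed = max(1, timeout_ms // frame_ms)   # trailing quiet frames required
    if len(frames) < needed:
        return False
    # Stop only if every one of the last `needed` frames is below the threshold.
    return all(audioop.rms(f, 2) < threshold for f in frames[-needed:])
```

With timeout_ms=100 a brief pause ends the recording; at 300 ms the same pause is ridden out - exactly the responsiveness-versus-cutoff trade-off described above.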

In terms of alarm clocks, reminders, shopping lists, etc. - from the perspective of Willow, WIS, and even WAS, the sky is the limit! WAS will have modules/apps/integrations (Python) to chain together any number of supported WIS features, APIs, WAS apps/integrations, and command endpoints (HA, openHAB, etc.).

Sorry, one edit - would you be able to test our new WIS implementation? You can use it with this WIS URL:

https://wisng.tovera.io/api/willow?force_language=ru

You also strike me as the kind of person that likes tweaking things. Make sure to check our WIS API documentation.

@kristiankielhofner (Contributor)

@nikito - Would you be interested in testing our first release? We have WAS, dynamic configuration, and OTA working but we would like to see more testing - especially with WAS deployment.

If so I'll add you to the repos and we can start issues and discussions across WIS/WAS/Willow.

@nikito (Contributor) commented Jun 25, 2023

> @nikito - Would you be interested in testing our first release? […]

@kristiankielhofner absolutely! I have another unit on the way as well, so I can test out multi-device OTA 🙂

@A6blpka commented Jun 25, 2023

> @A6blpka - FYI on your Youtube comparison video - remember voice activity detection. […]

In the video, I only stop recording by clicking on the last attempt. Recording on the first attempt is stopped by HA's VAD.

I have published a component that implements WIS as an HA STT provider. You can install it via HACS.
@faduchesne, you asked about it.

@A6blpka commented Jun 25, 2023

> I have published a component that implements WIS as an HA STT provider. You can install it via HACS.

I tested it on a local WIS installation and on https://wisng.tovera.io

@faduchesne

> @A6blpka - FYI on your Youtube comparison video - remember voice activity detection. […]

> In the video, I only stop recording by clicking on the last attempt. Recording on the first attempt is stopped by HA's VAD.

> I have published a component that implements WIS as an HA STT provider. You can install it via HACS. @faduchesne, you asked about it.

How can I configure it to get the reply on my ESP box?

@A6blpka commented Jun 25, 2023

> How can I configure it to get the reply on my ESP box?

This component doesn't solve that.

@A6blpka commented Jun 25, 2023

If I understand Willow's operation correctly, I see a flow like this:
[flow diagram]
If I'm thinking about HA integration, I see a flow like this:
[flow diagram]
@kristiankielhofner - Is this flow possible given your view of where Willow is going? Is it possible in the future without WAS?

@kristiankielhofner (Contributor)

This is great work!

In terms of flow, my sense is that the cleanest and most robust implementation can only come from an HA integration done via WAS and/or the WAS protocol (in concept stage), or abstracted via WAS itself. This would allow abstracting all of the components and flow to the point where Willow doesn't know the difference. One example:

Audio in after wake -> STT -> HA pipeline -> potential TTS output -> response to Willow; show the command output and play the TTS audio if the response includes audio. If there's no audio, play a success/failure tone (or custom tones, if provided).
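
Expressed as plain control logic, that flow might look like the sketch below. Every function is a hypothetical stub (none of these names exist in Willow/WAS/HA); only the ordering and the audio-versus-tone branch at the end matter.

```python
# Sketch of the flow above as plain control logic (illustrative only).
import asyncio

async def wis_stt(audio: bytes) -> str:
    # Hypothetical stub: would POST the audio to WIS and return the transcript.
    return "turn on the kitchen lights"

async def ha_pipeline(text: str) -> dict:
    # Hypothetical stub: would run HA's Assist pipeline (intent -> optional TTS).
    return {"speech": "Turned on the lights", "tts_audio": b"", "success": True}

async def handle_utterance(audio: bytes) -> None:
    text = await wis_stt(audio)           # audio in after wake -> STT
    result = await ha_pipeline(text)      # -> HA pipeline
    print("display:", result["speech"])   # show command output on the device
    if result["tts_audio"]:               # response has TTS audio?
        print("play TTS audio")           # -> play it
    else:                                 # no audio: play success/failure tone
        print("play", "success" if result["success"] else "failure", "tone")

asyncio.run(handle_utterance(b"\x00" * 320))  # 10 ms of silence as dummy input
```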

@A6blpka commented Jun 26, 2023

> WAS protocol (in concept stage)

Where can we discuss this?

@kristiankielhofner (Contributor)

I just sent you an invite to the (currently private) willow-application-server repo.

@oywino commented Feb 6, 2024

> I have published a component that implements WIS as an HA STT provider.

Question: Is this component the ONLY way to make Willow send commands to HA, or are there alternative ways to do this?
(I'm just asking in order to understand the landscape.)

@nikito (Contributor) commented Feb 6, 2024

> I have published a component that implements WIS as an HA STT provider.

> Question: Is this component the ONLY way to make Willow send commands to HA, or are there alternative ways to do this? (I'm just asking in order to understand the landscape.)

Willow natively supports sending commands to HA; the above has nothing to do with that. That component is for using WIS as an STT provider natively inside HA, which would really only be useful if you are trying to use HA's ESPHome implementation of assistants instead of Willow (which may have mixed results and isn't the optimal way to do it from a Willow perspective ;) )

@A6blpka commented Feb 6, 2024

> Question: Is this component the ONLY way to make Willow send commands to HA, or are there alternative ways to do this? […]

No, that's not what you need.
It only sends the voice from HA to WIS and receives the recognized text back.

@oywino commented Feb 6, 2024

Ah, so I was totally mistaken.
Thanks for clearing that up.
But why is it that when I try using Assist, I only get this error message after installing WIS:
[screenshot of error message]

@nikito (Contributor) commented Feb 6, 2024

WIS is an isolated ecosystem; it only works with Willow. Take a look at heywillow.io to get an idea of how things work :)

@VladFlorinIlie (Author)

I think this discussion has drifted from the question I asked when I created this issue. The question was: can Willow (the software running on the S3 BOX) support the HA Assist pipelines (effectively bypassing WIS) or not? From what I understand, Willow will not support this, so I think this issue should be closed.

@oywino commented Feb 7, 2024

With @A6blpka's help I found my error: during installation of the HA WIS integration from HACS, it automatically suggested a localhost URL on port 9000. I didn't understand that I had to replace this with https://infer.tovera.io/api/willow, but after doing so it works perfectly! So yes, the integration allows HA to use WIS.
Your question is quite the opposite: use Willow to talk to the HA voice pipeline, right? But I don't understand why you would want to do that. Isn't it effectively reducing Willow to nothing more than a microphone with a silly screen? If that's what you need, you can use the ESP32 Voice FW; it does exactly that, and more. Or are you seeking some additional feature that I fail to understand?

@nikito (Contributor) commented Feb 7, 2024

Just so you know, the HA WIS integration is a fork and not the official WIS. The fork has modifications to make it work this way, but it is not the "officially" supported way to use WIS; my answer was in the context of WIS in our repo. 😉
That said, it's open source, so you can of course use it how you want, but the fork may not always have the latest features. 🙂

@oywino commented Feb 7, 2024

My understanding is that the official WIS STT is still used: the integration only pipes the voice stream from Assist to WIS instead of piping it to Whisper, right?
