
How to send TTS reply sentence by sentence for longer text #326

Open
zmarty opened this issue Nov 5, 2023 · 1 comment

Comments


zmarty commented Nov 5, 2023

I have some custom REST API server code that currently replies with a string, and with WIS TTS the string gets read out on the ESP32 box. This works well for short replies, but not so well when the LLM produces a long text. Would it be possible to send responses sentence by sentence as they are streamed from the LLM I use, similar to how the voice version of ChatGPT gradually streams back its reply?


zmarty commented Nov 5, 2023

Possibly related observation: by default my C# ASP.NET API uses Transfer-Encoding: chunked and does not return a Content-Length header. In that case Willow just reads aloud "Success" instead of the body I send, because it fails to determine the length. If I change my code to force a Content-Length header, it reads the body correctly.

This got me thinking... could my request above be implemented using chunked transfer encoding?

Something like this proposal from GPT-4:

[HttpGet("stream")]
public async Task StreamResponse()
{
    // Note: in ASP.NET Core there is no need to add the Transfer-Encoding
    // header by hand; Kestrel applies chunked encoding automatically when
    // no Content-Length is set (setting it manually can cause errors).
    Response.ContentType = "text/plain";
    foreach (var part in GetDataParts())
    {
        await Response.WriteAsync(part);
        await Response.Body.FlushAsync(); // flush so each part is sent immediately
        await Task.Delay(1000); // simulate real-time delay or processing
    }
}

private IEnumerable<string> GetDataParts()
{
    yield return "Part 1 ";
    yield return "Part 2 ";
    yield return "Part 3 ";
}

The difficulty is that the ESP box would then need to keep contacting the inference server to get audio for each separate sentence as it comes in.
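On the client side of such a chunked stream, one way to decide when to request TTS audio is to buffer incoming text and emit a sentence whenever a terminator appears. This is only a hypothetical sketch, not Willow's actual API; the SentenceBuffer class and its members are illustrative names I made up, and real sentence splitting would need to handle abbreviations, decimals, etc.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical helper: accumulate streamed text chunks and yield complete
// sentences as soon as they are closed, so each one can be sent to TTS
// while the rest of the reply is still streaming.
public class SentenceBuffer
{
    private readonly StringBuilder _buffer = new StringBuilder();

    // Feed one streamed chunk; yields any sentences this chunk completed.
    public IEnumerable<string> Feed(string chunk)
    {
        foreach (var ch in chunk)
        {
            _buffer.Append(ch);
            if (ch == '.' || ch == '!' || ch == '?')
            {
                var sentence = _buffer.ToString().Trim();
                _buffer.Clear();
                if (sentence.Length > 0)
                    yield return sentence;
            }
        }
    }

    // After the stream ends, return any trailing text without a terminator.
    public string Flush()
    {
        var rest = _buffer.ToString().Trim();
        _buffer.Clear();
        return rest;
    }
}
```

Feeding the chunks "Hello wor", "ld. Second sen", "tence! Tail" would yield "Hello world." and "Second sentence!", with "Tail" returned by Flush() at end of stream, regardless of where the chunk boundaries fall.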
