Combining multiple mp3 files to be returned as a single MediaStreamTrack #1076

pushkarprasad007 · 2024-03-24T11:38:32Z

I will be using an LLM (like GPT) to generate an answer, which will then be converted to speech and sent to the browser using aiortc. However, since LLMs take time to produce their complete output, instead of waiting for it to finish we can read the partial answer as soon as it appears and, every few words, generate an mp3 file for those words and stream it. So not all the mp3 files will be available immediately; instead, I need to keep adding them as they appear (say, every 4-5 words) from the LLM.
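The "every few words" step described above could be sketched roughly as follows. This is only an illustration under assumptions: `chunk_words` and `words_per_chunk` are hypothetical names, and the input is taken to be an iterable of text fragments that may split words at arbitrary points, as streamed LLM output often does. The text-to-speech call itself is left out.

```python
from typing import Iterable, Iterator, List


def chunk_words(fragments: Iterable[str], words_per_chunk: int = 5) -> Iterator[str]:
    """Group streamed text fragments into chunks of `words_per_chunk` words."""
    words: List[str] = []  # complete words not yet emitted
    partial = ""           # trailing text that may still be mid-word
    for fragment in fragments:
        partial += fragment
        pieces = partial.split()
        if partial and not partial[-1].isspace() and pieces:
            # The last piece may be continued by the next fragment; hold it back.
            partial = pieces.pop()
        else:
            partial = ""
        words.extend(pieces)
        while len(words) >= words_per_chunk:
            yield " ".join(words[:words_per_chunk])
            del words[:words_per_chunk]
    # Flush whatever is left once the stream ends.
    if partial:
        words.append(partial)
    if words:
        yield " ".join(words)


stream = ["The qui", "ck brown ", "fox jumps over", " the lazy dog"]
print(list(chunk_words(stream, words_per_chunk=3)))
# ['The quick brown', 'fox jumps over', 'the lazy dog']
```

Each yielded string would then be handed to the text-to-speech step, and the resulting audio added to the track.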

I wrote a custom MediaStreamTrack to achieve this. I have tried it with 2 files, a.mp3 and b.mp3.

I ran into 2 issues:

The last few hundred milliseconds of a.mp3 sound stretched.
The first 2 seconds (or so) of b.mp3 are silent, and only after that does it play.

Clearly, the frames need to be added more carefully for this to work. I am definitely missing something here; it would be great if someone could point me in the right direction.

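import asyncio
import os
from typing import Optional

from aiortc import MediaStreamTrack
from aiortc.contrib.media import MediaPlayer
from aiortc.mediastreams import MediaStreamError
from av.frame import Frame

ROOT = os.path.dirname(__file__)  # assumed: the mp3 files live next to this script


class CombinedAudioTrack(MediaStreamTrack):
    """
    An audio track which reads from multiple mp3 files, one after another.
    """
    kind = "audio"

    def __init__(self) -> None:
        super().__init__()
        # Per-instance state; class-level attributes would be shared by every track.
        self.currentMediaPlayer: Optional[MediaStreamTrack] = None
        self.queue: asyncio.Queue = asyncio.Queue()
        self._stop: bool = False

    def addNewMP3File(self, mp3File: str, last: bool = False) -> None:
        self.queue.put_nowait(mp3File)
        if last:
            self._stop = True

    async def getNextMediaStreamTrack(self) -> None:
        # Waits until the next file name has been queued, then opens it.
        mp3File = await self.queue.get()
        self.currentMediaPlayer = MediaPlayer(os.path.join(ROOT, mp3File)).audio

    async def recv(self) -> Frame:
        print("Came in recv")
        try:
            # Should only happen the first time
            if self.currentMediaPlayer is None:
                await self.getNextMediaStreamTrack()
            frame = await self.currentMediaPlayer.recv()
            print(frame)
            return frame
        except MediaStreamError:
            # The current file is exhausted. End the track only when the last
            # file has been queued and nothing is left to play.
            if self._stop and self.queue.empty():
                # self.stop()
                raise MediaStreamError()
            await self.getNextMediaStreamTrack()
            return await self.currentMediaPlayer.recv()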

pushkarp-sharp · 2024-04-01T11:43:45Z

Any update?

Antonyesk601 · 2024-04-18T16:26:46Z

Hey, I have done something similar, and I suggest getting your TTS output in some stream-friendly format. I went the easy route of getting all my audio as pcm16 at 16000 Hz, which makes life a lot easier. The downside is that you will probably need to manage some things yourself, such as timestamps and frame chunking.
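As a rough illustration of the "timestamps and frame chunking" bookkeeping mentioned above, the sketch below splits raw 16-bit mono PCM at 16000 Hz into fixed-size frames and assigns each one a pts counted in samples. The 20 ms frame duration, the mono layout, and the name `chunk_pcm` are assumptions made for the example, not details from the post.

```python
from typing import Iterator, Tuple

SAMPLE_RATE = 16000   # Hz, as in the post
BYTES_PER_SAMPLE = 2  # 16-bit PCM, assumed mono
FRAME_MS = 20         # assumed frame duration
SAMPLES_PER_FRAME = SAMPLE_RATE * FRAME_MS // 1000  # 320 samples


def chunk_pcm(pcm: bytes, start_pts: int = 0) -> Iterator[Tuple[int, bytes]]:
    """Yield (pts, frame_bytes) pairs; pts is counted in samples."""
    frame_bytes = SAMPLES_PER_FRAME * BYTES_PER_SAMPLE
    pts = start_pts
    for offset in range(0, len(pcm), frame_bytes):
        chunk = pcm[offset:offset + frame_bytes]
        # Pad a short final chunk with silence so every frame has the same size.
        chunk = chunk.ljust(frame_bytes, b"\x00")
        yield pts, chunk
        pts += SAMPLES_PER_FRAME
```

Passing the pts that follows the last frame of one clip as `start_pts` for the next clip keeps the timeline continuous across clips. Each pair would then be copied into an audio frame object before being returned from the track.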

As for your MediaStreamTrack, I think you might be running into an issue with the frame timestamps: it looks like they reset with each new audio file you read, which might confuse the recipient.

1 reply

pushkarp-sharp Apr 18, 2024

Can you suggest some code for this, please?

lalanikarim · 2024-05-07T03:15:18Z

Hi @pushkarprasad007
I am working on a similar project involving LLMs and ran into a similar issue.
Take a look at this implementation of a custom MediaStreamTrack: https://github.com/lalanikarim/webrtc-ai-voice-chat/blob/main/playback_stream_track.py
I start with a one-second silence wav file, which I re-queue until I have audio from my text-to-audio model. At that point I queue the generated audio. I then go back to re-queuing the silence track until the next piece of audio has been generated and is ready to play.
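The silence-until-ready scheduling described above might look roughly like this. It is a minimal sketch and not the code from the linked repository: `ClipScheduler`, `add_clip`, `next_clip`, and the file name `silence.wav` are hypothetical.

```python
from collections import deque
from typing import Deque


class ClipScheduler:
    """Hands out generated clips in order, falling back to silence when none is ready."""

    def __init__(self, silence: str = "silence.wav") -> None:
        self._silence = silence            # assumed path to a one-second silent clip
        self._pending: Deque[str] = deque()

    def add_clip(self, clip: str) -> None:
        # Called whenever the text-to-audio step finishes a new clip.
        self._pending.append(clip)

    def next_clip(self) -> str:
        # Called by the track each time the current clip runs out of frames.
        if self._pending:
            return self._pending.popleft()
        return self._silence
```

Because the track always has something to play, it never has to block waiting for the next generated clip.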
As @Antonyesk601 suggested, AudioFrames from new tracks always start at time 0. For instance, if I queue my audio file to play right after the one-second silence wav file, I will not hear the first second of audio from my generated file, since the MediaPlayer has already moved one second past the start of the audio.
You will need to increase AudioFrame.pts on every new frame you append, by enough to bring its timestamp up to the current time step.
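One way to read this advice is to keep a running sample counter in the combined track and overwrite each frame's pts with it before returning the frame. The sketch below is only an illustration: `PtsRebaser` and `FakeFrame` are hypothetical names, `FakeFrame` stands in for av.AudioFrame (only its `pts` and `samples` attributes are used), and pts is assumed to be counted in samples, i.e. a time_base of 1/sample_rate.

```python
from dataclasses import dataclass


@dataclass
class FakeFrame:
    """Stand-in for av.AudioFrame with only the fields this sketch touches."""
    pts: int
    samples: int


class PtsRebaser:
    """Rewrites pts so frames from consecutive files form one continuous timeline."""

    def __init__(self) -> None:
        self._next_pts = 0  # position on the combined timeline, in samples

    def rebase(self, frame: FakeFrame) -> FakeFrame:
        # Ignore the per-file pts (which restarts at 0) and use the running total.
        frame.pts = self._next_pts
        self._next_pts += frame.samples
        return frame
```

In a track like CombinedAudioTrack, recv() would pass every frame through rebase() before returning it, so the recipient never sees the timestamps jump back to 0 when a new file starts.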
I hope this helps.
