
Implement voicefixer for audio enhancement #221

Open
thieugiactu opened this issue Nov 24, 2023 · 6 comments
Labels: feature (New feature or request)

Comments

@thieugiactu

Is there any way to integrate voicefixer into the speaker diarization pipeline?
The package takes a wav file as input and produces an upsampled 44.1 kHz wav file as output, but it could easily be modified to take and return audio numpy arrays.
Since speaker embeddings depend greatly on the quality of the input audio, and in real-world environments there are many factors that can affect that quality (the recording device, the speaker's voice changing over time, ...), I think having some form of audio quality enhancement is a must.
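
In case it helps, here is a minimal sketch of how voicefixer's documented file-based `restore()` could be wrapped to take and return numpy arrays. The temp-file round trip is just one way to do it, and the use of soundfile is my own assumption:

```python
import tempfile

import numpy as np
import soundfile as sf
from voicefixer import VoiceFixer

vf = VoiceFixer()

def enhance(waveform: np.ndarray, sample_rate: int) -> np.ndarray:
    """Round-trip a numpy waveform through voicefixer's file-based API."""
    with tempfile.NamedTemporaryFile(suffix=".wav") as src, \
         tempfile.NamedTemporaryFile(suffix=".wav") as dst:
        sf.write(src.name, waveform, sample_rate)
        # mode=0 is voicefixer's default restoration mode; output is 44.1 kHz
        vf.restore(input=src.name, output=dst.name, cuda=False, mode=0)
        enhanced, _ = sf.read(dst.name)
    return enhanced
```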

@juanmc2005
Owner

Hi @thieugiactu, that's an interesting idea.

To do this in a streaming way, we would need access to a pre-trained model for the enhancement task, then implement a SpeechEnhancementModel and a SpeechEnhancement block. This would allow you to build a pipeline where you call SpeechEnhancement on the audio before sending it to SpeakerSegmentation and SpeakerEmbedding.

In order to make this compatible with SpeakerDiarization (or any pipeline, for that matter), we could implement a method like add_audio_preprocessors() to prepend arbitrary audio transformations (e.g. enhancement, resampling, volume changes, etc.).
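
For illustration only, a rough sketch of what that could look like. SpeechEnhancementModel, SpeechEnhancement, and add_audio_preprocessors() are the hypothetical names from the comment above, not part of diart's current API, and the internals here are simplified:

```python
import numpy as np

class SpeechEnhancementModel:
    """Hypothetical wrapper for any pre-trained enhancement model
    (e.g. voicefixer), mirroring how diart wraps other models."""

    def __init__(self, enhance_fn):
        # enhance_fn: (waveform, sample_rate) -> enhanced waveform
        self.enhance_fn = enhance_fn

    def __call__(self, waveform: np.ndarray, sample_rate: int) -> np.ndarray:
        return self.enhance_fn(waveform, sample_rate)


class SpeechEnhancement:
    """Hypothetical pipeline block that enhances each audio chunk
    before it reaches SpeakerSegmentation and SpeakerEmbedding."""

    def __init__(self, model: SpeechEnhancementModel, sample_rate: int):
        self.model = model
        self.sample_rate = sample_rate

    def __call__(self, waveform: np.ndarray) -> np.ndarray:
        return self.model(waveform, self.sample_rate)


# Hypothetical usage with the proposed add_audio_preprocessors():
# pipeline.add_audio_preprocessors([SpeechEnhancement(model, 44100)])
```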

juanmc2005 added the feature (New feature or request) label on Nov 24, 2023
@thieugiactu
Author

I will give it a try. If I have any questions regarding diart, can I ask them directly under this issue?

@juanmc2005
Owner

@thieugiactu sure! Feel free to open a PR too, I'd be glad to discuss possible solutions to this

@thieugiactu
Author

thieugiactu commented Dec 1, 2023

This is what I've been doing so far. I reused your code but replaced the whisper model with a wav2vec2 model for speech recognition, since my PC couldn't handle whisper.

(attached diagram: Untitled Diagram)

The code works, but there are some adjustments that could be made:

  • The process takes a really long time, since the voicefixer model also has to process silent segments with no speaker, as well as batches where there is little to no difference between samples.
  • The silent parts at the start and end of each speech segment should be trimmed (see the sketch below for one way to address both points).

Attachment: project.zip
voicefixer probably needs librosa==0.9.2 to run.
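
A minimal sketch of both points, assuming librosa 0.9.2's `effects.trim` and a simple energy gate (the 1e-4 threshold is an arbitrary placeholder, not a tuned value):

```python
import numpy as np
import librosa

def trim_and_gate(waveform: np.ndarray, top_db: float = 30.0):
    """Trim leading/trailing silence; return None if the chunk is
    essentially silent, so the enhancement step can be skipped."""
    trimmed, _ = librosa.effects.trim(waveform, top_db=top_db)
    # Skip near-silent chunks entirely instead of sending them to voicefixer
    if np.max(np.abs(trimmed)) < 1e-4:
        return None
    return trimmed
```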

@juanmc2005
Owner

@thieugiactu something you could also do to reduce the inference time is to record the audio directly at 44.1 kHz. This way you avoid having to upsample in the first place.
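
For illustration, a minimal capture sketch with sounddevice (just one way to record at 44.1 kHz; the library choice is my assumption, not part of diart):

```python
import sounddevice as sd

SAMPLE_RATE = 44100  # record at voicefixer's output rate to avoid upsampling
DURATION = 5         # seconds

# Blocking capture of a mono chunk at 44.1 kHz
audio = sd.rec(int(DURATION * SAMPLE_RATE), samplerate=SAMPLE_RATE,
               channels=1, dtype="float32")
sd.wait()
```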

@thaokimctu

thaokimctu commented Dec 26, 2023

@juanmc2005 thank you for your reply. Unfortunately, voicefixer is so unstable that I couldn't make it work properly. More often than not it would degrade the audio quality even further.
