Whispering (in FRENCH) keeps missing a sentence, despite separating music from vocals. What can I do? #2170

NormalFall0 · 2024-05-09T16:13:56Z

The video has a lot of long silences. Whisper keeps hallucinating, and even worse, it skipped 10 seconds of speech: it was still hallucinating from the previous silence and continued through the second silence.
The video is here (https://www.youtube.com/watch?v=i1ZbRFvfjI8)

I tried the "medium" model, because I have always had better results with medium than with large.
Language: FR.

The biggest hallucinations happen from 4:14 (starting from 4:31) to 5:02, and then from 5:57 to 8:14, where the pattern is "silence - talk - silence" but the silence is MISSED and replaced with hallucination.

Can someone try or tell me how I can improve the results?
If you want to compare, try to see if you get this part:
The sentence missed is at 6:36!

This sentence was skipped with both the medium model and the large model.

Can someone do better or show me how to? Thanks. (You must find a way to make Whisper generate the sentence at 6:36 at least.)
I would also love to keep the SRT timings as they are AS MUCH as possible.

glangford · 2024-05-10T11:48:27Z

I used the following with the large-v2 model to get an improvement (including the sentence at 6:36, and no hallucination at 4:31) but the resulting .srt still needs manual editing. The .srt is included below.

model.transcribe(vfile, language="fr", fp16=False, patience=2, beam_size=5, task="transcribe",
                 best_of=5, word_timestamps=True, verbose=False, suppress_tokens="")

Viande.srt.txt
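
For anyone who wants to reproduce this end to end, here is a minimal sketch that wraps the call above. It assumes the openai-whisper package and that the model was loaded with `whisper.load_model("large-v2")`; the helper names (`srt_timestamp`, `segments_to_srt`, `transcribe_to_srt`) are hypothetical and not part of Whisper. The transcription options are copied from the call above.

```python
from typing import Iterable, Mapping


def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments: Iterable[Mapping]) -> str:
    """Render Whisper-style segments (dicts with start, end, text) as SRT text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)


def transcribe_to_srt(media_path: str, srt_path: str) -> None:
    """Transcribe French audio with large-v2 and write the result as an .srt file."""
    # Imported lazily so the formatting helpers above work without Whisper installed.
    import whisper

    model = whisper.load_model("large-v2")  # assumption: how the model was loaded
    result = model.transcribe(
        media_path, language="fr", fp16=False, patience=2, beam_size=5,
        task="transcribe", best_of=5, word_timestamps=True, verbose=False,
        suppress_tokens="",
    )
    with open(srt_path, "w", encoding="utf-8") as f:
        f.write(segments_to_srt(result["segments"]))
```

A call such as `transcribe_to_srt("video.mp4", "video.srt")` (file names are placeholders) would then write the subtitles; the output will probably still need the manual editing mentioned above.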

3 replies

glangford May 10, 2024

large-v3 seems to hallucinate more than large-v2, and that may be why you were seeing better success with the medium model (specifying just large implies large-v3). Note the use of the suppress_tokens option as well.

NormalFall0 May 11, 2024
Author

Hi Glan (this is a new reply; I gave you another reply below, if you check).

What is this suppress tokens thing, should I use it?
I just tried large-v2 (without the pyannote code etc.) and it indeed did not hallucinate and included the sentence at 6:36, thank you very much! I was kind of worried that I would have to cut the video into multiple videos, run Whisper on the ones without silence, then reassemble them and try to keep the SRT timestamps close to those of the original uncut video, etc. You saved me :)! Thank you!
I wonder why large-v2 is slower?
Finally, do you think I should use pyannote? Or is it only for when you have multiple speakers?

glangford May 11, 2024

What is this suppress tokens thing, should I use it?

See this discussion; in addition to quotes it identifies background sounds [music] and seems to reduce hallucinations
#844

I wonder why large-v2 is slower?

I haven't tested the performance of v2 vs v3, I just use large-v2 on CUDA (Colab) and quantized medium on CPU (locally)

Finally, do you think I should use pyannote? Or is it only for when you have multiple speakers?

The pyannote example I posted was for transcribing multiple languages within the same audio. If you have a single language, you can just use whisper.

glangford · 2024-05-10T21:01:01Z

@NormalFall0 Here is one more attempt using a quantized medium model as described in #2009 and speaker diarization. It is more resistant to hallucination, just FYI.

#2009 (comment)

Viande-diarized.srt.txt

1 reply

NormalFall0 May 10, 2024
Author

Is there something more powerful than reacting and saying thank you? Because I am so grateful, and I appreciate so much that you came here to write yet another answer! Love it!
I will try the code you provided and hopefully I get it right. Right now I have my own code, which is a simple command that runs when I press a simple tkinter button. I am not used to pyannote (I have only tried it once or twice), but I hope I get it easily. Otherwise I will leave a comment.
Thanks again! Can't wait to try it later.

EtienneAb3d · 2024-05-12T06:50:04Z

I just tried large-v2 (without the pyannote code etc.) and it indeed did not hallucinate and included the sentence at 6:36, thank you very much! I was kind of worried that I would have to cut the video into multiple videos, run Whisper on the ones without silence, then reassemble them and try to keep the SRT timestamps close to those of the original uncut video, etc. You saved me :)! Thank you!

This kind of process can be obtained in an easier way like this:

Use WhisperHallu to get both a clean text (combining voice extraction, silence removal, voice compression, etc) and clean timestamps (possibly with hallucinations).
https://github.com/EtienneAb3d/WhisperHallu
Use WhisperTimeSync to sync the good text with the good timestamps.
https://github.com/EtienneAb3d/WhisperTimeSync
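
For comparison, the manual approach quoted above (cut the video, transcribe the pieces, reassemble) mostly comes down to shifting each piece's timestamps by where that piece started in the original video. The sketch below only illustrates that bookkeeping; it is not how WhisperHallu or WhisperTimeSync work internally, and `merge_chunks` and the chunk layout are hypothetical.

```python
from typing import Any, Dict, List, Tuple

Segment = Dict[str, Any]  # Whisper-style segment: start, end, text


def merge_chunks(chunks: List[Tuple[float, List[Segment]]]) -> List[Segment]:
    """Merge per-chunk transcripts back onto the original video's timeline.

    Each chunk is (offset_seconds, segments), where offset_seconds is the
    position of the chunk's first sample in the uncut video.
    """
    merged: List[Segment] = []
    for offset, segments in sorted(chunks, key=lambda chunk: chunk[0]):
        for seg in segments:
            merged.append({
                "start": seg["start"] + offset,
                "end": seg["end"] + offset,
                "text": seg["text"],
            })
    return merged
```

The merged list has the same shape as the `segments` returned by `model.transcribe`, so it can be written out as SRT in the usual way.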

2 replies

NormalFall0 May 12, 2024
Author

Is it possible to make it have no hallucinations?

EtienneAb3d May 12, 2024

Hallucinations are mostly produced on silent or noisy parts. To remove almost all hallucinations, you need to use voice extraction (Demucs), removal of non-voice parts (Silero VAD), and removal of silent parts (ffmpeg).
You may still get hallucinations on short audio. The Marker processing of WhisperHallu will do the job in most cases.
Finally, loudness normalization and speech compression (ffmpeg) will bring you higher recognition quality.
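
As a rough illustration of the ffmpeg part only (voice extraction and Silero VAD are not covered), the sketch below builds a single ffmpeg command that chains the `silenceremove`, `loudnorm`, and `acompressor` filters. The thresholds and loudness targets are assumptions picked for the example, not the values WhisperHallu uses, and the function names are hypothetical.

```python
import subprocess
from typing import List


def build_ffmpeg_cleanup_cmd(src: str, dst: str) -> List[str]:
    """Build an ffmpeg command that trims silence, normalizes loudness, and compresses."""
    filters = ",".join([
        # Remove every silent stretch longer than 1 s below -50 dB (assumed thresholds).
        "silenceremove=stop_periods=-1:stop_duration=1:stop_threshold=-50dB",
        # EBU R128 loudness normalization (assumed targets).
        "loudnorm=I=-16:TP=-1.5:LRA=11",
        # Dynamic range compression with ffmpeg's default settings.
        "acompressor",
    ])
    # -vn drops the video stream; 16 kHz mono matches what Whisper resamples to anyway.
    return ["ffmpeg", "-y", "-i", src, "-vn", "-af", filters,
            "-ar", "16000", "-ac", "1", dst]


def run_cleanup(src: str, dst: str) -> None:
    """Run the command; requires ffmpeg on PATH."""
    subprocess.run(build_ffmpeg_cleanup_cmd(src, dst), check=True)
```

Keep in mind that removing silence shortens the audio, so timestamps from the cleaned file no longer line up with the original video. That is the gap the text/timestamp sync step is meant to close.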
