Whispering (in FRENCH) keeps missing a sentence, despite separating music from vocals. What can I do? #2170
Unanswered
NormalFall0
asked this question in
Q&A
Replies: 3 comments 6 replies
-
I used the following call:
model.transcribe(vfile, language="fr", fp16=False, patience=2, beam_size=5, task='transcribe', best_of=5, word_timestamps=True, verbose=False, suppress_tokens="")
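For comparison, here is a minimal sketch of the same call with the `transcribe()` options that are usually adjusted when Whisper hallucinates over silence. `condition_on_previous_text`, `no_speech_threshold` and `hallucination_silence_threshold` are existing parameters of openai-whisper (the last one needs a recent release and `word_timestamps=True`). The helper names, the 2-second threshold and the file path are assumptions for illustration, and whether these settings recover the sentence at 6:36 is untested.

```python
def build_transcribe_options(language="fr", silence_threshold=2.0):
    """Hypothetical helper: keyword arguments for whisper's model.transcribe()."""
    return {
        "language": language,
        "task": "transcribe",
        "fp16": False,
        "beam_size": 5,
        "patience": 2,
        "best_of": 5,
        "verbose": False,
        # Word timestamps are required for hallucination_silence_threshold.
        "word_timestamps": True,
        # Do not feed the previous window's text back in as a prompt, so a
        # hallucination in one silence is less likely to carry into the next.
        "condition_on_previous_text": False,
        # Library default; windows with a higher no-speech probability
        # (and low average log-probability) are treated as silent.
        "no_speech_threshold": 0.6,
        # Skip silent stretches longer than this many seconds when a
        # possible hallucination is detected (assumed value, tune per video).
        "hallucination_silence_threshold": silence_threshold,
    }


def transcribe_file(path, model_name="medium"):
    """Hypothetical wrapper; imports whisper lazily so the sketch loads without it."""
    import whisper  # openai-whisper package

    model = whisper.load_model(model_name)
    return model.transcribe(path, **build_transcribe_options())
```

Calling `transcribe_file("audio.wav")` (placeholder path) returns the usual result dictionary with `"text"` and `"segments"`.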
-
@NormalFall0 Here is one more attempt using a quantized medium model as described in #2009 and speaker diarization. It is more resistant to hallucination, just FYI.
-
This kind of process can be obtained in an easier way like this:
-
The video has a lot of long silences. Whisper keeps hallucinating, and even worse, it skipped 10 seconds of speech: it was still hallucinating from the previous silence and carried on right through the second silence.
The video is here (https://www.youtube.com/watch?v=i1ZbRFvfjI8)
I tried the "medium" model, because I have always had better results with medium than with large.
Language: FR.
The biggest hallucinations happen from 4:14 (starting from 4:31) to 5:02, then from 5:57 to 8:14, where the pattern is "silence - talk - silence" but the silence is missed and replaced with hallucination.
Can someone try or tell me how I can improve the results?
If you want to compare, try to see if you get this part:
The missed sentence is at 6:36!
This sentence was skipped by both the medium model and the large model.
Can someone do better, or show me how to? Thanks. (You must at least find a way to make Whisper generate the sentence at 6:36.)
I would also love to keep the SRT timings as they are, as much as possible.
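On the SRT timings: one way to keep them faithful is to write the SRT file directly from the `start` and `end` values of each entry in `result["segments"]`, without re-aligning anything afterwards. A minimal sketch, assuming segments are dictionaries with `start`, `end` (seconds) and `text` keys as returned by `model.transcribe()`; the function names are hypothetical:

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments):
    """Build SRT text from segments, using their timings unchanged."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = srt_timestamp(seg["start"])
        end = srt_timestamp(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    # Blocks are separated by one blank line, as the SRT format expects.
    return "\n".join(blocks)
```

For example, `srt_timestamp(396.0)` gives `00:06:36,000`, the position of the missing sentence.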