-
Here is one approach to handling transcription with multiple languages (sample source code in the link; see also the sketch below).
Other possibilities to consider:
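As a minimal sketch of the chunk-and-detect idea referenced above (this is my own illustration under assumptions, not the code from the link): split the audio into 30-second windows, detect each window's dominant language, then transcribe each window with that language forced. It assumes the openai-whisper package.

```python
# Sketch only: chunk the audio into Whisper's native 30-second window,
# detect the dominant language per chunk, then transcribe each chunk
# with that language forced. Assumes the openai-whisper package.
import whisper

model = whisper.load_model("large-v3")

def transcribe_mixed(path: str):
    audio = whisper.load_audio(path)
    sr = whisper.audio.SAMPLE_RATE
    window = 30 * sr
    segments = []
    for start in range(0, len(audio), window):
        chunk = audio[start:start + window]
        mel = whisper.log_mel_spectrogram(
            whisper.pad_or_trim(chunk), n_mels=model.dims.n_mels
        ).to(model.device)
        _, probs = model.detect_language(mel)   # per-language probabilities
        lang = max(probs, key=probs.get)        # most likely language for this chunk
        result = model.transcribe(chunk, language=lang)
        segments.append((start / sr, lang, result["text"]))
    return segments
```

The obvious trade-off is that a fixed 30-second window can cut across a language switch mid-chunk, so the detected language is only the dominant one per window.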
-
I have indeed implemented a long post-processing script just to unify the time slots between the speaker diarization, transcription, and translation. It works very well, but it breaks in some cases (e.g. when 2 speakers are talking at the same time), since each model gives different timestamps! I am just afraid that chunking based on the speaker-diarization timestamps will reduce the accuracy for sure. Anyway, it looks like this is the best I can do! Regarding AssemblyAI, I am looking for an offline solution, so that's not going to help much. Thank you!
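For anyone reading along, a minimal sketch of what that unification step can look like (the segment tuple shapes are my assumption, not the actual script): reduce both outputs to (start, end, ...) tuples, then assign each transcribed segment to the diarization turn it overlaps most.

```python
# Sketch: merge diarization and transcription by temporal overlap.
# Assumed formats: diar_turns -> (start, end, speaker),
# asr_segments -> (start, end, text), all times in seconds.
def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def merge(diar_turns, asr_segments):
    merged = []
    for s_start, s_end, text in asr_segments:
        # Pick the speaker turn with the largest overlap with this segment.
        best = max(diar_turns, key=lambda t: overlap(s_start, s_end, t[0], t[1]))
        speaker = best[2] if overlap(s_start, s_end, best[0], best[1]) > 0 else "UNKNOWN"
        merged.append((s_start, s_end, speaker, text))
    return merged
```

When two speakers talk at the same time, the argmax-by-overlap still assigns a single speaker, which is exactly the ambiguity described above.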
-
I'm currently using Whisper Large V3 and I'm encountering two main issues with the pipeline shared on HuggingFace:

1. If the audio contains 2 languages, the pipeline sometimes processes them without issue, but other times it requires me to select a single language. To work around this, I need to transcribe the audio in each language separately and then do some post-processing, which means I first need a way to detect the languages present in the audio.
2. For certain languages like Persian and Urdu (and possibly others), I must explicitly specify the language.

I am using the pipeline here, but there is no way I can detect the language, and checking the transcribe function here, I can't find a way to explicitly specify the language. I am not sure what to do in this case!
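On the detection point: the HuggingFace pipeline doesn't expose language detection directly, but the openai-whisper package does, so one option (a sketch; the file name is a placeholder) is to probe the file in 30-second windows and collect the languages seen:

```python
# Sketch: probe 30-second windows and collect the set of detected
# languages. Assumes the openai-whisper package; "input.wav" is a
# placeholder path.
import whisper

model = whisper.load_model("large-v3")
audio = whisper.load_audio("input.wav")
sr = whisper.audio.SAMPLE_RATE

languages = set()
for start in range(0, len(audio), 30 * sr):
    chunk = whisper.pad_or_trim(audio[start:start + 30 * sr])
    mel = whisper.log_mel_spectrogram(chunk, n_mels=model.dims.n_mels).to(model.device)
    _, probs = model.detect_language(mel)
    languages.add(max(probs, key=probs.get))

print(languages)  # e.g. {'fa', 'ur'} for a Persian/Urdu recording
```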
{ "detail": "Multiple languages detected when trying to predict the most likely target language for transcription. It is currently not supported to transcribe to different languages in a single batch. Please make sure to either force a single language by passing language=... or make sure all input audio is of the same language." }