Fine-tuning Whisper in more than one language #1432
Replies: 2 comments 18 replies
-
I don't think you can fine-tune in one click, but you can do it sequentially, one language after another.
-
As long as your language is included in Whisper's supported languages, it will be correctly encoded and decoded, so yes, it is language-independent. Regarding self.processor.tokenizer.batch_decode: it is used when computing the metrics for the ASR task, so it is correct to skip special tokens (you only want to compute the metric on the transcribed text itself).
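To illustrate why skipping special tokens matters for metrics, here is a minimal, library-free sketch. It uses a hand-rolled WER function and a regex stand-in for `skip_special_tokens=True` (in practice you would pass that flag to `batch_decode`); the token strings mimic Whisper's actual prefix format, but the helpers `strip_special` and `wer` are illustrative, not part of any library:

```python
import re

def strip_special(s):
    # Remove Whisper-style special tokens such as <|startoftranscript|>,
    # <|es|>, <|transcribe|>, <|notimestamps|>, <|endoftext|>.
    return re.sub(r"<\|[^|]*\|>", "", s).strip()

def wer(ref, hyp):
    """Word error rate via word-level Levenshtein distance."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (r[i - 1] != h[j - 1]))
    return d[len(r)][len(h)] / max(len(r), 1)

raw = "<|startoftranscript|><|es|><|transcribe|><|notimestamps|>hola mundo<|endoftext|>"
ref = "hola mundo"

print(wer(ref, raw))                 # inflated: special tokens corrupt the words
print(wer(ref, strip_special(raw)))  # 0.0 once special tokens are removed
```

The point: if the decoded prediction keeps the prefix tokens, they glue onto the first and last words and every comparison fails, so the metric no longer reflects transcription quality.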
-
Suppose I have a dataset in two or more languages (one of them under-represented in Whisper's pre-trained models), and I want to fine-tune on those languages to obtain a multilingual model while avoiding catastrophic forgetting. Is such fine-tuning possible?
Can I define the tokenizer and the processor without specifying the language?
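One way to think about this: Whisper's decoder prompt begins with the special prefix `<|startoftranscript|><|lang|><|transcribe|><|notimestamps|>`, so a multilingual dataset can carry the language token per sample rather than fixing one language in the tokenizer. The sketch below builds such label strings in pure Python to show the idea; the helper `build_labels` is hypothetical, and in a real setup you would instead call the tokenizer's per-sample language configuration (e.g. `set_prefix_tokens` in Hugging Face's `WhisperTokenizer`) when preparing each batch:

```python
# Whisper-style special tokens (actual format used by the model's decoder prompt).
SOT = "<|startoftranscript|>"
TASK = "<|transcribe|>"
NO_TS = "<|notimestamps|>"
EOT = "<|endoftext|>"

def build_labels(text, lang):
    """Hypothetical helper: prepend a per-sample language token to a transcript,
    mimicking how multilingual fine-tuning labels vary the <|lang|> token."""
    return f"{SOT}<|{lang}|>{TASK}{NO_TS}{text}{EOT}"

# A mixed-language dataset: each sample keeps its own language code.
samples = [("hola mundo", "es"), ("hello world", "en")]
labels = [build_labels(text, lang) for text, lang in samples]
for lab in labels:
    print(lab)
```

Because the language information lives in the labels (one token per sample), the processor itself does not need a single fixed language, which is what makes training on several languages in one run conceptually possible.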