Fine tuning opus nmt ar-en using my own dataset #77

theamato · 2023-02-20T10:59:49Z

Hi,

I want to fine-tune the opus nmt ar-en model using my own dataset, but I'm not sure what type of files my training data should be in. In the Hugging Face Marian tutorial (https://huggingface.co/docs/transformers/model_doc/marian) they just pass in lists of sentences, but I also read somewhere that I'm supposed to preprocess the data with SentencePiece first. Or is SentencePiece "built in" to the Marian tokenizer? All help is much appreciated.

jorgtied · 2023-03-06T15:34:25Z

I do fine-tuning directly with MarianNMT. Maybe you could ask at the transformers git repository how to do finetuning with their library? If you use OPUS-MT models and marian-nmt then you would need the subword tokenisation on the fine-tuning data as well.
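
For the transformers route, here is a minimal sketch, not a tested fine-tuning recipe. It assumes the training data is two line-aligned UTF-8 text files with one sentence per line; that layout and the helper names `read_parallel` and `encode_pairs` are hypothetical. `MarianTokenizer` loads the SentencePiece models that ship with the converted checkpoint and applies them to raw text, so a separate SentencePiece pass should not be needed on that route. With marian-nmt directly, as noted above, the subword segmentation has to be applied to the fine-tuning data beforehand.

```python
def read_parallel(src_path, tgt_path):
    """Read two line-aligned text files into (source, target) sentence pairs.

    Assumes one sentence per line and that line i of both files is a
    translation pair. Pairs with an empty side are dropped.
    """
    with open(src_path, encoding="utf-8") as f:
        src_lines = [line.rstrip("\n") for line in f]
    with open(tgt_path, encoding="utf-8") as f:
        tgt_lines = [line.rstrip("\n") for line in f]
    if len(src_lines) != len(tgt_lines):
        raise ValueError(
            f"files are not aligned: {len(src_lines)} vs {len(tgt_lines)} lines"
        )
    pairs = [(s.strip(), t.strip()) for s, t in zip(src_lines, tgt_lines)]
    return [(s, t) for s, t in pairs if s and t]


def encode_pairs(pairs, model_name="Helsinki-NLP/opus-mt-ar-en"):
    """Tokenize raw sentence pairs with the tokenizer bundled with the model.

    The checkpoint name is an assumption about which converted model is used.
    Requires the `transformers` and `sentencepiece` packages; imported lazily
    so that read_parallel() works without them.
    """
    from transformers import MarianTokenizer

    tokenizer = MarianTokenizer.from_pretrained(model_name)
    sources, targets = zip(*pairs)
    # Raw strings go in; the tokenizer applies SentencePiece internally.
    # `text_target` produces the `labels` field used as the training target.
    return tokenizer(
        list(sources),
        text_target=list(targets),
        truncation=True,
        padding=True,
    )
```

The result of `encode_pairs` contains `input_ids`, `attention_mask`, and `labels`, which is the shape a sequence-to-sequence training loop in transformers expects. How to set up that training loop is a question for the transformers repository.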

