How are the Helsinki models (in the transformers library) trained? #64

Open
Ahmath-Gadji opened this issue Jun 22, 2022 · 4 comments

@Ahmath-Gadji

Hello @jorgtied

It seems that there is no model for translating from French to Wolof, so I'm trying to train one from scratch with the Hugging Face library.
I want to use the same class (MarianMT) that you used for your translation models.
I'm having difficulties with this model because I don't know how to initialize the tokenizer (MarianTokenizer). It requires SentencePiece files with a ".spm" extension, but SentencePiece models are generally stored in a ".model" file, and I haven't seen a SentencePiece model saved as ".spm" anywhere. Could you please tell me how you initialized the tokenizer class for your models?

Also, I've seen tutorials on training translation models from scratch with Hugging Face, and apparently some people are struggling with it too. So any code snippets or resources that you used to train the Helsinki models (in Hugging Face) would be welcome as well.

Thank you in advance.
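
For readers with the same question, here is a minimal sketch of one way to lay out the files that MarianTokenizer looks for (`source.spm`, `target.spm`, `vocab.json`). It assumes that a ".spm" file is an ordinary SentencePiece model saved under a different name, and that the `.model`/`.vocab` pairs were already produced by SentencePiece training. The function name, the `fr`/`wo` prefixes, and the shared-vocabulary layout with `<pad>` appended last are illustrative assumptions, not the recipe used for the Helsinki models.

```python
import json
import shutil
from pathlib import Path


def read_pieces(vocab_path):
    """Return the pieces from a SentencePiece .vocab file ('piece<TAB>score' per line)."""
    pieces = []
    with open(vocab_path, encoding="utf-8") as fh:
        for line in fh:
            line = line.rstrip("\n")
            if line:
                pieces.append(line.split("\t")[0])
    return pieces


def prepare_marian_tokenizer_dir(src_prefix, tgt_prefix, out_dir):
    """Hypothetical helper: copy <prefix>.model files to source.spm/target.spm
    and write a shared vocab.json mapping each piece to an integer id."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Assumption: the .spm files are plain SentencePiece models, only renamed.
    shutil.copyfile(f"{src_prefix}.model", out / "source.spm")
    shutil.copyfile(f"{tgt_prefix}.model", out / "target.spm")

    # Assumption: one shared vocabulary; ids are assigned in first-seen order,
    # with </s> and <unk> first and <pad> added at the end.
    tokens = (
        ["</s>", "<unk>"]
        + read_pieces(f"{src_prefix}.vocab")
        + read_pieces(f"{tgt_prefix}.vocab")
        + ["<pad>"]
    )
    vocab = {}
    for token in tokens:
        vocab.setdefault(token, len(vocab))

    with open(out / "vocab.json", "w", encoding="utf-8") as fh:
        json.dump(vocab, fh, ensure_ascii=False, indent=2)
    return vocab


# Example call (paths are hypothetical):
# prepare_marian_tokenizer_dir("spm/fr", "spm/wo", "marian-fr-wo")
```

The resulting directory could then be tried with `MarianTokenizer(source_spm=..., target_spm=..., vocab=...)`. When training from scratch, the model configuration's `vocab_size` and `pad_token_id` would need to agree with this `vocab.json`.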

@jorgtied
Member

I never trained any models using the HF transformers library. All models are trained with marian-nmt and then converted to pytorch to make them available from HF. You could do the same if you like and I can give you some more information about how to do that. What are the tutorials that you looked at and what are other people struggling with?

@xyx361100238

Hello all,
I use MarianNMT and have the same question. Here is what I did:
1. Trained an en-zh model following examples/transformer.
2. Pre-processed the data with jieba and BPE.
3. Training finished and the test results are good.
4. Used convert_marian_to_pytorch.py to convert the model.

Questions:
1. I can't save the model as "*.spm" using SentencePiece.
2. How do I generate source.spm and target.spm? Or, what are the steps to train a model with MarianNMT so that it can be used the HF way (PyTorch)?

Thanks!

@Ahmath-Gadji
Author

Ahmath-Gadji commented Jul 19, 2022 via email

@jorgtied
Member

jorgtied commented Aug 8, 2022

I used recipes from https://github.com/Helsinki-NLP/OPUS-MT-train, but be aware that this is research code and might not work out of the box for you.
