
Could I fine-tune this model for Chinese datasets? #41

Open
asenasen123 opened this issue Aug 18, 2023 · 11 comments

Comments

@asenasen123

Could you please tell me how I can fine-tune it on my custom Chinese datasets?

@Muennighoff
Owner

Sure, if you want to finetune, you can follow some of what is outlined in this issue: #2

For asymmetric search (e.g. retrieval), you can also try https://huggingface.co/bigscience/sgpt-bloom-7b1-msmarco, which has seen a lot of Chinese during pretraining and might be good enough.
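
A minimal usage sketch for that model, assuming the position-weighted mean pooling from the SGPT paper (the model card has the exact recipe, and the 7B1 checkpoint needs substantial GPU memory):

```python
# Sketch: embedding Chinese text with bigscience/sgpt-bloom-7b1-msmarco.
# Pooling is SGPT's position-weighted mean; verify against the model card.
import torch
from transformers import AutoModel, AutoTokenizer

name = "bigscience/sgpt-bloom-7b1-msmarco"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name).eval()

texts = ["如何微调嵌入模型？", "这篇文档介绍了嵌入模型的微调方法。"]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state  # (batch, seq_len, dim)

# Position-weighted mean: token i gets weight i+1, padding gets weight 0.
weights = torch.arange(1, hidden.shape[1] + 1, dtype=hidden.dtype)[None, :, None]
mask = batch["attention_mask"][..., None].to(hidden.dtype)
embeddings = (hidden * weights * mask).sum(1) / (weights * mask).sum(1)

# Cosine similarity between the two embeddings.
print(torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0))
```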

@asenasen123
Author

> Sure, if you want to finetune, you can follow some of what is outlined in this issue: #2
> For asymmetric search (e.g. retrieval), you can also try https://huggingface.co/bigscience/sgpt-bloom-7b1-msmarco, which has seen a lot of Chinese during pretraining and might be good enough.

Do many SGPT models on Hugging Face support Chinese?

@asenasen123
Author

If I want to fine-tune the SGPT model, do I just change the dataset?

@Muennighoff
Owner

I think only the BLOOM ones perform well for Chinese.
Yes, you can just change the dataset.
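
If the training scripts referenced in #2 are hard to adapt, a rough stand-in (not the repo's own script) is contrastive fine-tuning with sentence-transformers on (query, positive passage) pairs; the Chinese examples, checkpoint, and hyperparameters below are placeholders:

```python
# Sketch: contrastive fine-tuning on Chinese (query, positive) pairs,
# using sentence-transformers as a stand-in for the scripts from #2.
from sentence_transformers import InputExample, SentenceTransformer, losses
from torch.utils.data import DataLoader

# A smaller SGPT checkpoint keeps the sketch runnable; swap in your target model.
model = SentenceTransformer("Muennighoff/SGPT-125M-weightedmean-msmarco-specb-bitfit")

# Placeholder Chinese pairs; replace with your custom dataset.
train_examples = [
    InputExample(texts=["什么是向量检索？", "向量检索通过比较嵌入的相似度来查找相关文档。"]),
    InputExample(texts=["如何评估检索模型？", "检索模型通常用 nDCG@10 等指标来评估。"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)  # in-batch negatives

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
model.save("sgpt-chinese-finetuned")
```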

@asenasen123
Author

> I think only the BLOOM ones perform well for Chinese. Yes, you can just change the dataset.

Which Chinese dataset should I evaluate the fine-tuned model on?

@Muennighoff
Owner

I would evaluate on the Chinese datasets in MTEB.
If you train a retrieval model, you can try the Chinese retrieval datasets from C-MTEB: https://huggingface.co/spaces/mteb/leaderboard

Also see embeddings-benchmark/mteb#134
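
For example, with the mteb package (the task names below are illustrative C-MTEB retrieval tasks; check the leaderboard and your mteb version for the current list):

```python
# Sketch: running C-MTEB Chinese retrieval tasks on a fine-tuned model.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sgpt-chinese-finetuned")    # your fine-tuned checkpoint
evaluation = MTEB(tasks=["DuRetrieval", "T2Retrieval"])  # example C-MTEB tasks
evaluation.run(model, output_folder="results/sgpt-chinese")
```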

@asenasen123
Author

> I would evaluate on the Chinese datasets in MTEB. If you train a retrieval model, you can try the Chinese retrieval datasets from C-MTEB: https://huggingface.co/spaces/mteb/leaderboard
> Also see embeddings-benchmark/mteb#134

Are the evaluation metrics also Pearson and Spearman correlations?

@Muennighoff
Owner

> Are the evaluation metrics also Pearson and Spearman correlations?

For retrieval datasets it's nDCG@10. But don't worry about the evaluation - if you use MTEB, it takes care of calculating the scores automatically.
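
For intuition, nDCG@10 compares the discounted gain of the model's top-10 ranking against the ideal ranking; a toy illustration (not MTEB's implementation):

```python
# Toy nDCG@10: relevance of the top-10 retrieved documents, 1 = relevant.
import math

def dcg(rels):
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels))

ranked = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]  # model's top-10 ranking
ideal = sorted(ranked, reverse=True)     # best possible ordering
print(dcg(ranked) / dcg(ideal))          # nDCG@10 ≈ 0.92
```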

@asenasen123
Author

> For retrieval datasets it's nDCG@10. But don't worry about the evaluation - if you use MTEB, it takes care of calculating the scores automatically.

Thank you very much!

@wilfoderek

> Sure, if you want to finetune, you can follow some of what is outlined in this issue: #2
> For asymmetric search (e.g. retrieval), you can also try https://huggingface.co/bigscience/sgpt-bloom-7b1-msmarco, which has seen a lot of Chinese during pretraining and might be good enough.

What about a Spanish fine-tune?

@Muennighoff
Owner

> What about a Spanish fine-tune?

Sure, you can do that too. https://huggingface.co/bigscience/sgpt-bloom-7b1-msmarco has also seen a lot of Spanish, so it may work well for you.
