
Megatron-LM fine-tuning: No such file or directory model_optim_rng.pt #14

Open
zxyscz opened this issue Aug 25, 2023 · 16 comments

@zxyscz commented Aug 25, 2023

I want to fine-tune with Megatron-LM, but when I run the process I get an error: No such file or directory: model_optim_rng.pt

@Muennighoff (Collaborator)
Hmm do you have --no_load_optim and --no_load_rng in your script?

model_optim_rng.pt files are not needed and not in the checkpoint I think
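For reference, those two flags go on the fine-tuning launch command. A minimal sketch, assuming a typical Megatron-LM-style launch; the script name and the `--load` path are placeholders, not taken from this thread:

```shell
# Placeholders: finetune.py and the --load path are examples only.
# The two --no_load_* flags skip restoring optimizer and RNG state,
# so the missing model_optim_rng.pt file is never requested.
torchrun finetune.py \
    --load /path/to/starcoderbase-megatron-checkpoint \
    --no_load_optim \
    --no_load_rng
```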

@zxyscz (Author) commented Aug 30, 2023

> Hmm do you have --no_load_optim and --no_load_rng in your script?
>
> model_optim_rng.pt files are not needed and not in the checkpoint I think

It runs now, but I don't know how to merge the many checkpoint partitions and convert them to Hugging Face format. Can you help me?

@Muennighoff (Collaborator)
This is the script for merging & converting:


Let me know if it does not work for you

@zxyscz (Author) commented Aug 30, 2023

Yes, but which branch should I git clone? Is it bigcode-project/Megatron-LM#40? When I used the mtf branch, it did not run.

@Muennighoff (Collaborator)
Yeah, that one. It is already merged into main, so you can probably also use the main branch. It was merged slightly after the mtf branch was created, hence the code is not in the mtf branch, but you can maybe also merge main into the mtf branch if you want to.

@zxyscz (Author) commented Aug 30, 2023

> Yeah, that one. It is already merged into main, so you can probably also use the main branch. It was merged slightly after the mtf branch was created, hence the code is not in the mtf branch, but you can maybe also merge main into the mtf branch if you want to.

thx

@zxyscz (Author) commented Aug 31, 2023

> Yeah, that one. It is already merged into main, so you can probably also use the main branch. It was merged slightly after the mtf branch was created, hence the code is not in the mtf branch, but you can maybe also merge main into the mtf branch if you want to.

I have a question: I find that the HumanEval (Python) pass@1 value drops a lot after fine-tuning.

@Muennighoff (Collaborator)
> Yeah, that one. It is already merged into main, so you can probably also use the main branch. It was merged slightly after the mtf branch was created, hence the code is not in the mtf branch, but you can maybe also merge main into the mtf branch if you want to.
>
> I have a question: I find that the HumanEval (Python) pass@1 value drops a lot after fine-tuning.

Yeah, that's why we only fine-tune for a few steps, e.g. OctoCoder is only fine-tuned for 2M tokens.

@zxyscz (Author) commented Aug 31, 2023

> Yeah, that one. It is already merged into main, so you can probably also use the main branch. It was merged slightly after the mtf branch was created, hence the code is not in the mtf branch, but you can maybe also merge main into the mtf branch if you want to.
>
> I have a question: I find that the HumanEval (Python) pass@1 value drops a lot after fine-tuning.
>
> Yeah, that's why we only fine-tune for a few steps, e.g. OctoCoder is only fine-tuned for 2M tokens.
In addition, I evaluated the starcoderbase checkpoint at step 25000; its HumanEval pass@1 was 25%, which is lower than 30%. Is it because the open-source starcoderbase Megatron-LM checkpoint is not the final checkpoint?

@Muennighoff (Collaborator)
What script are you using to evaluate it? That may explain the small difference.
It should be the final checkpoint.

@zxyscz (Author) commented Sep 1, 2023

> What script are you using to evaluate it? That may explain the small difference.
> It should be the final checkpoint.

First, I converted the checkpoint to HF format, then evaluated using greedy decoding.

@zxyscz (Author) commented Sep 1, 2023

I converted it to HF format like this; is it right? [screenshot]

@Muennighoff (Collaborator)
Yeah that looks correct. I think for pass@1 HumanEval StarCoder is evaluated using temperature=0.2. Also I would set n_samples=20.
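As a concrete illustration of those settings, here is a minimal sketch of sampling 20 completions per prompt with transformers' `generate()`. This is not the actual evaluation script; the tiny randomly initialised model and the prompt ids are stand-ins for the converted checkpoint and a tokenized HumanEval problem:

```python
# Sketch: the sampling settings suggested above (temperature=0.2, n_samples=20)
# expressed with transformers' generate(). A tiny random GPT-2 stands in for
# the converted StarCoder checkpoint.
import torch
from transformers import GPT2Config, GPT2LMHeadModel

model = GPT2LMHeadModel(GPT2Config(n_layer=2, n_head=2, n_embd=64, vocab_size=100))
model.eval()

prompt_ids = torch.tensor([[1, 2, 3]])  # stand-in for a tokenized prompt
samples = model.generate(
    prompt_ids,
    do_sample=True,           # sampling, not greedy decoding
    temperature=0.2,          # temperature suggested for pass@1
    num_return_sequences=20,  # n_samples=20 completions per problem
    max_new_tokens=8,
    pad_token_id=0,
)
print(samples.shape)  # 20 completions, each prompt + new tokens
```

With greedy decoding all 20 samples would be identical; low-temperature sampling keeps pass@1 estimates close to greedy while still giving usable pass@k statistics.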

@zxyscz (Author) commented Sep 1, 2023

> Yeah that looks correct. I think for pass@1 HumanEval StarCoder is evaluated using temperature=0.2. Also I would set n_samples=20.
I converted the Megatron model to HF as shown below, but loading the model is slow.
[screenshot]
How can I convert the model into many partitions like this?
[screenshot]

@Muennighoff (Collaborator)
You have to shard it into multiple files when saving it
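If the checkpoint was converted with transformers, one way to do this is the `max_shard_size` argument of `save_pretrained` (available in transformers >= 4.18). A sketch; the tiny randomly initialised GPT-2 below stands in for the converted checkpoint, which you would instead load with `AutoModelForCausalLM.from_pretrained("path/to/converted-hf-checkpoint")`:

```python
# Sketch: re-save a Hugging Face checkpoint split into multiple shard files
# via save_pretrained(..., max_shard_size=...). A tiny random GPT-2 stands in
# for the converted StarCoder model.
import os
import tempfile

from transformers import GPT2Config, GPT2LMHeadModel

# In practice: model = AutoModelForCausalLM.from_pretrained("path/to/converted-hf-checkpoint")
model = GPT2LMHeadModel(GPT2Config(n_layer=2, n_head=2, n_embd=64, vocab_size=1000))

out_dir = tempfile.mkdtemp()
# Weights are split so that no single file exceeds max_shard_size, and an
# index JSON maps every tensor name to the shard file that contains it.
model.save_pretrained(out_dir, max_shard_size="200KB")

weight_files = [f for f in os.listdir(out_dir) if f.endswith((".bin", ".safetensors"))]
print(sorted(os.listdir(out_dir)))
```

`from_pretrained` reads the index JSON and loads the shards one by one, which also keeps peak memory lower than loading one huge file.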

@zxyscz (Author) commented Sep 1, 2023

> You have to shard it into multiple files when saving it

How do I shard it into multiple files? Is there any code I can refer to?
