The tutorial on the official Torchtune website does not work #972
Thanks for opening this issue! Sorry that you ran into this - we're working on beefing up our documentation and also the interoperability between fine-tuning, eval, and inference (cc: @joecummings). For your questions specifically:

1. When you increase the number of GPUs, you're sharding the model and incurring more communication overhead. You can offset this by increasing your batch_size, which raises overall throughput. How much memory do you have available? You should be able to ramp up the batch_size decently.
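As a rough illustration of the advice above: assuming your fine-tuning config exposes a `batch_size` field (as torchtune's LoRA configs do), bumping it when moving to multiple GPUs might look like the fragment below. The specific values are illustrative, not recommendations:

```yaml
# finetune.yaml (fragment) - illustrative values only
batch_size: 8                    # e.g. doubled from 4 when going from 1 to 2 GPUs
gradient_accumulation_steps: 1   # can trade against batch_size if memory is tight
```

A larger per-step batch amortizes the inter-GPU communication cost over more samples, which is what recovers throughput.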
2. Yes, these are right.
3. You're currently using the meta-format checkpoints, so you need to update to

Let me know if this answers your questions.
@kartikayk Thanks for your response. I have updated my config to:

```yaml
model:
  _component_: torchtune.models.llama3.llama3_8b

checkpointer:
  _component_: torchtune.utils.FullModelMetaCheckpointer
  checkpoint_dir: Meta-Llama-3-8B
  checkpoint_files: [meta_model_0.pt]
  adapter_checkpoint: adapter_0.pt
  recipe_checkpoint: null
  output_dir: Meta-Llama-3-8B
  model_type: LLAMA3

# Tokenizer
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: Meta-Llama-3-8B/original/tokenizer.model

# Environment
device: cuda
dtype: bf16
seed: 217

# EleutherAI specific eval args
tasks: ["truthfulqa_mc2"]
limit: null
max_seq_length: 4096

# Quantization specific args
quantizer: null
```

But it still doesn't work with the output, and I have two questions:
I am now trying to use the fine-tuned model and the original model to generate text with the following yaml file:

```yaml
model:
  _component_: torchtune.models.llama3.llama3_8b

checkpointer:
  _component_: torchtune.utils.FullModelMetaCheckpointer
  checkpoint_dir: ./Meta-Llama-3-8B/original/
  checkpoint_files: [consolidated.00.pth]
  output_dir: ./Meta-Llama-3-8B/
  model_type: LLAMA3

device: cuda
dtype: bf16
seed: 1234

# Tokenizer arguments
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: Meta-Llama-3-8B/original/tokenizer.model

# Generation arguments; defaults taken from gpt-fast
prompt: "Hello, my name is"
max_new_tokens: 300
temperature: 0.6 # 0.8 and 0.6 are popular values to try
top_k: 300

quantizer: null
```

But it failed again with the output
@MaxwelsDonc thanks for the questions.

1. Yes, you should use the same tokenizer file from the original model. I'm not sure which config file you mean here, but `meta_model_0.pt` and `adapter_0.pt` are the expected files you should get out of a fine-tune. For your eval run you actually do not need `adapter_0.pt` at all and can remove it from your config, because `meta_model_0.pt` already contains your learned LoRA weights merged back into the original model.
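To see why the adapter file is redundant once the merge has happened, here is a hedged sketch of what "merging LoRA weights back" means mathematically. This is pure Python with toy nested-list matrices, not torchtune's actual implementation:

```python
def matmul(x, y):
    """Naive matrix multiply for small nested-list matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def merge_lora(w, lora_a, lora_b, alpha, rank):
    """Fold LoRA adapters into a base weight: W' = W + (alpha / rank) * (B @ A)."""
    delta = matmul(lora_b, lora_a)
    scale = alpha / rank
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

# Toy 2x2 base weight with a rank-1 adapter
w = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 2.0]]    # shape (rank, in_features)
lora_b = [[1.0], [1.0]]  # shape (out_features, rank)
merged = merge_lora(w, lora_a, lora_b, alpha=1.0, rank=1)
print(merged)  # [[2.0, 2.0], [1.0, 3.0]]
```

After the merge, the checkpoint behaves like an ordinary dense model, which is why eval only needs `meta_model_0.pt`.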
2. This actually appears to be coming from the Eleuther side of things: see here. cc @joecummings, any idea what might be causing this?
3. Can you share the full stack trace and the command you ran here? That'll make it easier to pinpoint where exactly the error is coming from.
@ebsmothers sure, here is my yaml file:

```yaml
model:
  _component_: torchtune.models.llama3.llama3_8b

checkpointer:
  _component_: torchtune.utils.FullModelMetaCheckpointer
  checkpoint_dir: ./Meta-Llama-3-8B/original/
  checkpoint_files: [consolidated.00.pth]
  output_dir: ./Meta-Llama-3-8B/
  model_type: LLAMA3

device: cuda
dtype: bf16
seed: 1234

# Tokenizer arguments
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: Meta-Llama-3-8B/original/tokenizer.model

# Generation arguments; defaults taken from gpt-fast
prompt: "Hello, my name is"
max_new_tokens: 300
temperature: 0.6 # 0.8 and 0.6 are popular values to try
top_k: 300

quantizer: null
```
and my files are organized as follows:

Then I run the command and get the following output:

```text
2024-05-15:00:21:41,084 WARNING [loading.py:546] Using the latest cached version of the module from /home/zzh/.cache/huggingface/modules/evaluate_modules/metrics/evaluate-metric--exact_match/009c8b5313309ea5b135d526433d5ee76508ba1554cbe88310a30f85bb57ec88 (last modified on Tue May 14 00:53:06 2024) since it couldn't be found locally at evaluate-metric--exact_match, or remotely on the Hugging Face Hub.
2024-05-15:00:21:41,088 INFO [GPTQ.py:56] lm_eval is not installed, GPTQ may not be usable
Traceback (most recent call last):
  File "/home/zzh/miniconda3/envs/torchtune/bin/tune", line 8, in <module>
    sys.exit(main())
  File "/home/zzh/miniconda3/envs/torchtune/lib/python3.8/site-packages/torchtune/_cli/tune.py", line 49, in main
    parser.run(args)
  File "/home/zzh/miniconda3/envs/torchtune/lib/python3.8/site-packages/torchtune/_cli/tune.py", line 43, in run
    args.func(args)
  File "/home/zzh/miniconda3/envs/torchtune/lib/python3.8/site-packages/torchtune/_cli/run.py", line 179, in _run_cmd
    self._run_single_device(args)
  File "/home/zzh/miniconda3/envs/torchtune/lib/python3.8/site-packages/torchtune/_cli/run.py", line 93, in _run_single_device
    runpy.run_path(str(args.recipe), run_name="__main__")
  File "/home/zzh/miniconda3/envs/torchtune/lib/python3.8/runpy.py", line 264, in run_path
    code, fname = _get_code_from_file(run_name, path_name)
  File "/home/zzh/miniconda3/envs/torchtune/lib/python3.8/runpy.py", line 234, in _get_code_from_file
    with io.open_code(decoded_path) as f:
FileNotFoundError: [Errno 2] No such file or directory: '/home/zzh/Documents/llama3 build/generation'
```

But I have run the command
@MaxwelsDonc thanks for the additional info. Can you try changing
@ebsmothers The problem has been solved by changing the network. I don't know why the network would affect the code.
I am trying to train my own model based on Llama 3. I followed the tutorial on the official Torchtune website (https://pytorch.org/torchtune/main/tutorials/llama3.html) and encountered some issues. Below is a detailed description of the steps I followed and the problems I encountered.
Download
I ran the following command, and the model was downloaded to my local laptop. The model directory is `./Meta-Llama-3-8B`.

```shell
tune download meta-llama/Meta-Llama-3-8B-Instruct \
  --output-dir Meta-Llama-3-8B \
  --hf-token ***
```
The downloaded model structure is shown below:
Then I ran the command given on the website:

```shell
tune run lora_finetune_single_device --config llama3/8B_lora_single_device
```

and it failed because the default model path is expected to be `/tmp/Meta-Llama-3-8B/`. So I ran `tune cp llama3/8B_lora_single_device ./finetune.yaml` and modified the `finetune.yaml` file as follows:

It works, and I fine-tuned the model on the alpaca_cleaned_dataset. Finally I got two files, `meta_model_0.pt` and `adapter_0.pt`, in the directory `./Meta-Llama-3-8B/finetune`.

Evaluation
Then I wanted to evaluate the fine-tuned model using EleutherAI's Eval Harness, as described on the official website. So I ran `tune cp eleuther_evaluation ./eval_config.yaml` and modified the file as follows:

I ran `tune run eleuther_eval --config ./eval_config.yaml` and it failed because we do not have any config files, so I copied all config files into `./Meta-Llama-3-8B/finetune/` and reran `tune run eleuther_eval --config ./eval_config.yaml`, but it failed again with the following output:

I thought it might be a fine-tuning problem, so I modified `eval_config.yaml` to evaluate the original model, but it failed again with the same exception.

My questions
1. In the process I tried to use two GPUs rather than one by running `tune run --nproc_per_node 2 lora_finetune_distributed --config finetune.yaml`, but I got a longer training time, and I do not know why.
2. Are the two files I got (`meta_model_0.pt` and `adapter_0.pt`) the correct outputs (is the fine-tune output only two files)?
3. Why do I get the same problem when I try to evaluate both the original and the fine-tuned models? I am trying to conduct experiments on Llama 2 to avoid the problem.