The tutorial on the official Torchtune website does not work #972

Closed
MaxwelsDonc opened this issue May 13, 2024 · 7 comments

@MaxwelsDonc

I am trying to train my own model based on llama 3. I followed the tutorial on the official Torchtune website (https://pytorch.org/torchtune/main/tutorials/llama3.html) and encountered some issues. Below, I will provide a detailed description of the steps I followed and the problems I encountered.

Download

I ran the following command, and the model was downloaded to my local laptop. The model directory is "./Meta-Llama-3-8B".

tune download meta-llama/Meta-Llama-3-8B-Instruct \
    --output-dir Meta-Llama-3-8B \
    --hf-token  ***

The downloaded model directory structure is shown below:

./Meta-Llama-3-8B/
|- original/
    |- consolidated.00.pth
    |- tokenizer.model
|- config.json
|- generation_config.json
|- model.safetensors.index.json 
|- special_tokens_map.json
|- tokenizer_config.json
|- tokenizer.json
|- USE_POLICY.md
|- README.md
|- .gitattributes
|- LICENSE

Then I ran the command tune run lora_finetune_single_device --config llama3/8B_lora_single_device given on the website, and it failed because the default checkpoint path in that config is /tmp/Meta-Llama-3-8B/. So I ran the command tune cp llama3/8B_lora_single_device ./finetune.yaml and modified the finetune.yaml file as follows:

model:
  _component_: torchtune.models.llama3.lora_llama3_8b
  lora_attn_modules: ['q_proj', 'v_proj']
  apply_lora_to_mlp: False
  apply_lora_to_output: False
  lora_rank: 8
  lora_alpha: 16

# Tokenizer
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: Meta-Llama-3-8B/original/tokenizer.model

checkpointer:
  _component_: torchtune.utils.FullModelMetaCheckpointer
  checkpoint_dir: Meta-Llama-3-8B/original/
  checkpoint_files: [
    consolidated.00.pth
  ]
  recipe_checkpoint: null
  output_dir: Meta-Llama-3-8B/finetune
  model_type: LLAMA3
resume_from_checkpoint: False

# Dataset and Sampler
dataset:
  _component_: torchtune.datasets.alpaca_cleaned_dataset
  train_on_input: True
seed: null
shuffle: True
batch_size: 2

# Optimizer and Scheduler
optimizer:
  _component_: torch.optim.AdamW
  weight_decay: 0.01
  lr: 3e-4
lr_scheduler:
  _component_: torchtune.modules.get_cosine_schedule_with_warmup
  num_warmup_steps: 100

loss:
  _component_: torch.nn.CrossEntropyLoss

# Training
epochs: 1
max_steps_per_epoch: null
gradient_accumulation_steps: 64
compile: False

# Logging
output_dir: Meta-Llama-3-8B/
metric_logger:
  _component_: torchtune.utils.metric_logging.DiskLogger
  log_dir: ${output_dir}
log_every_n_steps: null

# Environment
device: cuda
dtype: bf16
enable_activation_checkpointing: True

# Profiler (disabled)
profiler:
  _component_: torchtune.utils.profiler
  enabled: False
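
With this config in place, the fine-tune is launched with the same recipe, just pointing at the local file (a sketch of the command, assuming finetune.yaml sits in the current directory):

tune run lora_finetune_single_device --config ./finetune.yaml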

It works, and I fine-tuned the model on the alpaca_cleaned_dataset. In the end I got two files, meta_model_0.pt and adapter_0.pt, in the directory ./Meta-Llama-3-8B/finetune.

Evaluation

Then I wanted to evaluate the fine-tuned model using EleutherAI’s Eval Harness, as described on the official website. So I ran the command tune cp eleuther_evaluation ./eval_config.yaml and modified the file as follows:

model:
  _component_: torchtune.models.llama3.llama3_7b

checkpointer:
  _component_: torchtune.utils.FullModelHFCheckpointer
  checkpoint_dir: Meta-Llama-3-8B/finetune
  checkpoint_files: [
    meta_model_0.pt
  ]
  recipe_checkpoint: null
  output_dir: Meta-Llama-3-8B/finetune
  model_type: LLAMA3

# Tokenizer
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: Meta-Llama-3-8B/original/tokenizer.model # Fine-tuning does not generate a new tokenizer, so I reuse the original one.

# Environment
device: cuda
dtype: bf16
seed: 217

# EleutherAI specific eval args
tasks: ["truthfulqa_mc2"]
limit: null
max_seq_length: 4096

# Quantization specific args
quantizer: null

I ran tune run eleuther_eval --config ./eval_config.yaml and it failed because there were no config files in the checkpoint directory, so I copied all the config files into ./Meta-Llama-3-8B/finetune/ and reran the command tune run eleuther_eval --config ./eval_config.yaml. It failed again with the following output:

exception: error converting the state dict. found unexpected key: "tok_embeddings.weight". please make sure you're loading a checkpoint with the right format.

I thought it might be a problem with the fine-tuned model, so I modified the eval_config.yaml to evaluate the original model instead, but it failed again with the same exception.

My questions

1. In the process I tried to use two GPUs rather than one by running the command tune run --nproc_per_node 2 lora_finetune_distributed --config finetune.yaml, but I got a longer training time. I do not know why.

2. I do not know whether the two files (meta_model_0.pt and adapter_0.pt) I got are the correct outputs (is the fine-tune output only two files?).

3. Why do I get the same problem when I try to evaluate both the original and the fine-tuned models? I am trying to conduct experiments with Llama 2 to avoid the problem.

@kartikayk
Contributor

Thanks for opening this issue! Sorry you ran into these problems - we're working on beefing up our documentation and also the interoperability between fine-tuning, eval, and inference (cc: @joecummings)

For your questions specifically:

In the process I tried to use two GPUs rather than one by running the command tune run --nproc_per_node 2 lora_finetune_distributed --config finetune.yaml, but I got a longer training time. I do not know why.

When you increase the number of GPUs, you're sharding the model and incurring more communication overhead. You can offset this by increasing your batch_size and increasing overall throughput. How much memory do you have available? You should be able to ramp up the batch_size decently.
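
If memory allows, you can also bump the batch size straight from the command line (torchtune configs accept key=value overrides; the exact value here is just an illustration):

tune run --nproc_per_node 2 lora_finetune_distributed --config finetune.yaml batch_size=8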

I do not know whether the two files (meta_model_0.pt and adapter_0.pt) I got are the correct outputs (is the fine-tune output only two files?).

Yes, these are right. meta_model_0.pt contains the LoRA weights merged back into the base model, and adapter_0.pt has just the LoRA params.
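
As a rough sanity check (sizes are approximate and depend on the dtype the checkpoint is saved in), the merged meta_model_0.pt should be on the order of the original consolidated.00.pth (~16 GB for an 8B model in bf16), while adapter_0.pt only holds the small rank-8 LoRA matrices:

ls -lh Meta-Llama-3-8B/finetune/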

Why do I get the same problem when I try to evaluate both the original and the fine-tuned models? I am trying to conduct experiments with Llama 2 to avoid the problem.

You're currently using the meta format checkpoints and so you need to update to FullModelMetaCheckpointer. You can find more details here.

Let me know if this answers your questions.

@MaxwelsDonc
Author

@kartikayk Thanks for your response. I have updated to FullModelMetaCheckpointer with a YAML file like this:

model:
  _component_: torchtune.models.llama3.llama3_8b

checkpointer:
  _component_: torchtune.utils.FullModelMetaCheckpointer
  checkpoint_dir: Meta-Llama-3-8B
  checkpoint_files: [meta_model_0.pt]
  adapter_checkpoint: adapter_0.pt
  recipe_checkpoint: null
  output_dir: Meta-Llama-3-8B
  model_type: LLAMA3

# Tokenizer
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: Meta-Llama-3-8B/original/tokenizer.model

# Environment
device: cuda
dtype: bf16
seed: 217

# EleutherAI specific eval args
tasks: ["truthfulqa_mc2"]
limit: null
max_seq_length: 4096

# Quantization specific args
quantizer: null

But it still doesn't work, failing with the output AssertionError: aggregation named 'bypass' conflicts with existing registered aggregation!

And I have two questions:

  1. Is my YAML file correct? Should I use the tokenizer.model and config files from the original model? I am not getting any newly generated files except for meta_model_0.pt and adapter_0.pt.

  2. What's the meaning of the AssertionError: aggregation named 'bypass' conflicts with existing registered aggregation!, and how can I fix it?

@MaxwelsDonc
Author

I am now trying to use the fine-tuned model and the original model to generate text using the following YAML file:

model:
  _component_: torchtune.models.llama3.llama3_8b

checkpointer:
  _component_: torchtune.utils.FullModelMetaCheckpointer
  checkpoint_dir: ./Meta-Llama-3-8B/original/
  checkpoint_files: [consolidated.00.pth]
  output_dir: ./Meta-Llama-3-8B/
  model_type: LLAMA3

device: cuda
dtype: bf16

seed: 1234

# Tokenizer arguments
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: Meta-Llama-3-8B/original/tokenizer.model

# Generation arguments; defaults taken from gpt-fast
prompt: "Hello, my name is"
max_new_tokens: 300
temperature: 0.6 # 0.8 and 0.6 are popular values to try
top_k: 300

quantizer: null

But it failed again with the output [Errno 2] No such file or directory: '<project path>/generation'. I do not know why.

@ebsmothers
Contributor

@MaxwelsDonc thanks for the questions.

Is my YAML file correct? Should I use the tokenizer.model and config files from the original model? I am not getting any newly generated files except for meta_model_0.pt and adapter_0.pt.

Yes, you should use the same tokenizer file from the original model. I'm not sure what config file you mean here, but meta_model_0.pt and adapter_0.pt are the expected files you should get out of a fine-tune. For your eval run, you actually do not need to use adapter_0.pt at all and can remove it from your config. This is because meta_model_0.pt already contains your learned LoRA weights merged back into the original model.
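
Concretely, the checkpointer block from your config above can be trimmed to roughly the following (same paths as in your YAML, just without the adapter line):

checkpointer:
  _component_: torchtune.utils.FullModelMetaCheckpointer
  checkpoint_dir: Meta-Llama-3-8B
  # adapter_checkpoint is not needed: the LoRA weights are already merged into meta_model_0.pt
  checkpoint_files: [meta_model_0.pt]
  recipe_checkpoint: null
  output_dir: Meta-Llama-3-8B
  model_type: LLAMA3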

What's the meaning of the AssertionError: aggregation named 'bypass' conflicts with existing registered aggregation!, and how can I fix it ?

This actually appears to be coming from the Eleuther side of things: see here. cc @joecummings, any idea what might be causing this?

But it failed again with the output [Errno 2] No such file or directory: '/generation'. I do not know why.

Can you share the full stack trace and the command you ran here? That'll make it easier to pinpoint where exactly the error is coming from.

@MaxwelsDonc
Author

Can you share the full stack trace and the command you ran here? That'll make it easier to pinpoint where exactly the error is coming from.

@ebsmothers sure, here is my yaml file:

model:
  _component_: torchtune.models.llama3.llama3_8b

checkpointer:
  _component_: torchtune.utils.FullModelMetaCheckpointer
  checkpoint_dir: ./Meta-Llama-3-8B/original/
  checkpoint_files: [consolidated.00.pth]
  output_dir: ./Meta-Llama-3-8B/
  model_type: LLAMA3

device: cuda
dtype: bf16

seed: 1234

# Tokenizer arguments
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: Meta-Llama-3-8B/original/tokenizer.model

# Generation arguments; defaults taken from gpt-fast
prompt: "Hello, my name is"
max_new_tokens: 300
temperature: 0.6 # 0.8 and 0.6 are popular values to try
top_k: 300

quantizer: null

and my files are organized as follows:

./Meta-Llama-3-8B/
|- original/
    |- consolidated.00.pth
    |- tokenizer.model
    |- config.json
    |- generation_config.json
    |- model.safetensors.index.json 
    |- special_tokens_map.json
    |- tokenizer_config.json
    |- tokenizer.json
|- adapter_0.pt
|- meta_model_0.pt
|- config.json
|- generation_config.json
|- model.safetensors.index.json 
|- special_tokens_map.json
|- tokenizer_config.json
|- tokenizer.json
|- USE_POLICY.md
|- README.md
|- .gitattributes
|- LICENSE

Then I ran the command tune run generation --config generation_config.yaml and got the following output:

2024-05-15:00:21:41,084 WARNING  [loading.py:546] Using the latest cached version of the module from /home/zzh/.cache/huggingface/modules/evaluate_modules/metrics/evaluate-metric--exact_match/009c8b5313309ea5b135d526433d5ee76508ba1554cbe88310a30f85bb57ec88 (last modified on Tue May 14 00:53:06 2024) since it couldn't be found locally at evaluate-metric--exact_match, or remotely on the Hugging Face Hub.
2024-05-15:00:21:41,088 INFO     [GPTQ.py:56] lm_eval is not installed, GPTQ may not be usable
Traceback (most recent call last):
  File "/home/zzh/miniconda3/envs/torchtune/bin/tune", line 8, in <module>
    sys.exit(main())
  File "/home/zzh/miniconda3/envs/torchtune/lib/python3.8/site-packages/torchtune/_cli/tune.py", line 49, in main
    parser.run(args)
  File "/home/zzh/miniconda3/envs/torchtune/lib/python3.8/site-packages/torchtune/_cli/tune.py", line 43, in run
    args.func(args)
  File "/home/zzh/miniconda3/envs/torchtune/lib/python3.8/site-packages/torchtune/_cli/run.py", line 179, in _run_cmd
    self._run_single_device(args)
  File "/home/zzh/miniconda3/envs/torchtune/lib/python3.8/site-packages/torchtune/_cli/run.py", line 93, in _run_single_device
    runpy.run_path(str(args.recipe), run_name="__main__")
  File "/home/zzh/miniconda3/envs/torchtune/lib/python3.8/runpy.py", line 264, in run_path
    code, fname = _get_code_from_file(run_name, path_name)
  File "/home/zzh/miniconda3/envs/torchtune/lib/python3.8/runpy.py", line 234, in _get_code_from_file
    with io.open_code(decoded_path) as f:
FileNotFoundError: [Errno 2] No such file or directory: '/home/zzh/Documents/llama3 build/generation'

But I have already run the command pip install "lm_eval==0.4.*", so I do not know why I still encounter this problem.

@ebsmothers
Contributor

@MaxwelsDonc thanks for the additional info. Can you try changing tune run generation to tune run generate? (You can also run tune ls to see the names for various recipes and configs)
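
In other words, something along these lines (assuming the config file name from the command above):

# list the available recipes and their built-in configs
tune ls
# run the generation recipe with the local config
tune run generate --config ./generation_config.yaml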

@MaxwelsDonc
Author

@MaxwelsDonc thanks for the additional info. Can you try changing tune run generation to tune run generate? (You can also run tune ls to see the names for various recipes and configs)

@ebsmothers The problem has been solved by changing the network. I do not know why the network would affect the code.
