Phi-3 conversation format, example training script and perplexity metric #1582

Open · wants to merge 7 commits into main
Conversation

brianfitzgerald (Contributor)

Adds a Phi-3 conversation template and an example script to demonstrate how to fine-tune on Alpaca-format datasets with the Phi-3 pretraining format.

I've also added a Perplexity metric. It's a variant of the Hugging Face evaluate Perplexity metric, since that implementation both re-tokenizes the inputs and loads a separate copy of the LLM inside the metric. This version instead uses the already-loaded model and tokenizer, and the already-tokenized validation samples, as the inputs for scoring perplexity.
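For context, here is a minimal sketch of that approach (illustrative only, not the exact code in this PR; the function name, batching, and device handling are assumptions). It reuses the loaded causal LM plus the pre-tokenized validation samples, accumulates next-token negative log-likelihood, and exponentiates the mean:

import math
import torch

@torch.no_grad()
def score_perplexity(model, tokenized_samples, device="cuda"):
    # tokenized_samples: iterable of dicts with an "input_ids" tensor of
    # shape (1, seq_len), i.e. the already-tokenized validation set.
    total_nll, total_tokens = 0.0, 0
    model.eval()
    for sample in tokenized_samples:
        input_ids = sample["input_ids"].to(device)
        # With labels == input_ids, HF causal LMs return the mean shifted
        # next-token cross-entropy for the sequence.
        loss = model(input_ids=input_ids, labels=input_ids).loss
        n_targets = input_ids.numel() - 1  # number of next-token targets
        total_nll += loss.item() * n_targets
        total_tokens += n_targets
    return math.exp(total_nll / max(total_tokens, 1))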

How has this been tested?

  • Trained an example LoRA with the provided config.
  • Unit test is provided for the Perplexity metric.

Full list of changes

  • Added Phi-3 example script.
  • Added Phi-3 conversation template (see the format sketch after this list).
  • Added Phi-3 Alpaca prompt format.
  • Added Perplexity metric and a unit test for it.
  • Fixed an issue with caching dataset splits when split_size is greater than 1, i.e. an absolute number of samples rather than a fraction.
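For reference, the conversation format rendered by the Phi-3 tokenizer's bundled chat template looks roughly like the sketch below. The messages are made up, and the exact rendering (BOS handling, trailing tokens) comes from the tokenizer, so treat the printed shape as approximate:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct", trust_remote_code=True
)
messages = [
    {"role": "user", "content": "Give three uses for a paperclip."},
    {"role": "assistant", "content": "Hold paper, reset a router, improvise a hook."},
]
# Render without tokenizing to inspect the raw conversation format.
print(tokenizer.apply_chat_template(messages, tokenize=False))
# Roughly:
# <|user|>
# Give three uses for a paperclip.<|end|>
# <|assistant|>
# Hold paper, reset a router, improvise a hook.<|end|>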

Social Handles (Optional)

https://twitter.com/bfitzgerald242
https://brianfitzgerald.xyz/

Thanks for the review!

@winglian (Collaborator) left a comment

Amazing! 👍

I can help get the tests/linter passing tomorrow

@winglian (Collaborator)

@brianfitzgerald I made some additional fixes; can you check whether everything still looks correct? Thanks!

@brianfitzgerald (Contributor, Author)

LGTM!

@hammoudhasan

This LGTM, but while testing it out I noticed there may be an issue with Phi-3 and flash-attention. On a 4xA100 node, the following warning is emitted when training Phi-3 (I tested other models as well; this occurs only with Phi-3):

[2024-05-23 14:20:16,328] [WARNING] [transformers_modules.microsoft.Phi-3-mini-4k-instruct.5fa34190089f0ee40f9cce3cafc396b89b2e5e83.modeling_phi3.warning_once:329] [PID:1805189] You are not running the flash-attention implementation, expect numerical differences.

This might be related to huggingface/transformers#30547
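As a diagnostic sketch (outside of axolotl, and not the fix applied in this PR): loading the model with flash-attention requested explicitly makes transformers fail loudly if the flash-attn package or kernel isn't usable, instead of silently falling back to the eager path that triggers the warning above. This assumes a recent transformers, the flash-attn package installed, and an Ampere-or-newer GPU; the dtype choice is just an example.

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
)
# If loading succeeds, the config should report the flash-attention path.
print(model.config._attn_implementation)  # expected: "flash_attention_2"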

@e-p-armstrong

I tried out this branch and seem to have run into an issue where it hangs forever when tokenizing prompts at the start of a training run. It gets through a few datasets, then hangs on one of size 1.


Control+C is also very slow / hangs; I need to press it twice.

Config:

base_model: microsoft/Phi-3-mini-4k-instruct
trust_remote_code: true
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer
chat_template: phi_3


load_in_8bit: false
load_in_4bit: false
strict: false

datasets:
  - path: json
    data_files: pretraining_vision.json
    ds_type: json
    type: completion
  - path: json
    data_files: simplified_data_rag_VISION.jsonl
    ds_type: json
    type: sharegpt
  - path: json
    data_files: simplified_data_no_rag_VISION.jsonl
    ds_type: json
    type: sharegpt
  - path: json
    data_files: simplified_data_rag_OPENENDED_VISION.jsonl
    ds_type: json
    type: sharegpt
  - path: json
    data_files: simplified_data_no_rag_OPENENDED_VISION.jsonl
    ds_type: json
    type: sharegpt
  - path: json
    data_files: general_assistant_split_0.jsonl
    ds_type: json
    type: sharegpt
  - path: json
    data_files: pretraining_wiki.json
    ds_type: json
    type: completion
  - path: json
    data_files: simplified_data_rag_WIKI.jsonl
    ds_type: json
    type: sharegpt
  - path: json
    data_files: simplified_data_no_rag_WIKI.jsonl
    ds_type: json
    type: sharegpt
  - path: json
    data_files: simplified_data_rag_OPENENDED_WIKI.jsonl
    ds_type: json
    type: sharegpt
  - path: json
    data_files: simplified_data_no_rag_OPENENDED_WIKI.jsonl
    ds_type: json
    type: sharegpt
  - path: json
    data_files: general_assistant_split_1.jsonl
    ds_type: json
    type: sharegpt
  - path: json
    data_files: pretraining_api.json
    ds_type: json
    type: completion
  - path: json
    data_files: simplified_data_rag_API.jsonl
    ds_type: json
    type: sharegpt
  - path: json
    data_files: simplified_data_no_rag_API.jsonl
    ds_type: json
    type: sharegpt
  - path: json
    data_files: simplified_data_rag_OPENENDED_API.jsonl
    ds_type: json
    type: sharegpt
  - path: json
    data_files: simplified_data_no_rag_OPENENDED_API.jsonl
    ds_type: json
    type: sharegpt
  - path: json
    data_files: general_assistant_split_2.jsonl
    ds_type: json
    type: sharegpt
  - path: json
    data_files: pretraining_docs.json
    ds_type: json
    type: completion
  - path: json
    data_files: simplified_data_rag_DOCS.jsonl
    ds_type: json
    type: sharegpt
  - path: json
    data_files: simplified_data_no_rag_DOCS.jsonl
    ds_type: json
    type: sharegpt
  - path: json
    data_files: simplified_data_rag_OPENENDED_DOCS.jsonl
    ds_type: json
    type: sharegpt
  - path: json
    data_files: simplified_data_no_rag_OPENENDED_DOCS.jsonl
    ds_type: json
    type: sharegpt
dataset_prepared_path: last_run_prepared
output_dir: ./verus-out

sequence_len: 3000
sample_packing: true
pad_to_sequence_len: true

wandb_project: verus-phi3-experiment-2
wandb_entity:
wandb_watch:
wandb_run_id:
wandb_log_model:

gradient_accumulation_steps: 6
micro_batch_size: 1
eval_batch_size: 1
num_epochs: 5
optimizer: galore_adamw_8bit
lr_scheduler: cosine
learning_rate: 0.0000035
cosine_min_lr_ratio: 0
weight_decay: 0 # no weight decay to maximize fact memorization (thanks cgato!)
# adamw hyperparams
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 0.00000001
# Gradient clipping max norm
max_grad_norm: 1.0
noisy_embedding_alpha: 0 # no noisy embedding to ensure maximal memorization 

optim_args:
# For Galore Optimizers the following optim_args are available
    rank: 256 # type: int
    update_proj_gap: 200  # type: int
    scale: 0.25  # type: float
    proj_type: "std" # type: str, default = std

optim_target_modules: 
  - mlp
  - self_attn
train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: false

gradient_checkpointing: unsloth
early_stopping_patience:
resume_from_checkpoint: 
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 10
auto_resume_from_checkpoints: false
eval_steps: 10
saves_per_epoch: 1
eval_sample_packing: false
save_total_limit: 4
debug:
deepspeed: deepspeed_configs/zero2.json

@hammoudhasan

By the way, my issue here was resolved when I turned off sample packing (sample_packing: false in the config). Maybe Phi-3 sample packing isn't compatible with flash-attention. @brianfitzgerald
