Phi-3 conversation format, example training script and perplexity metric #1582

Open · wants to merge 7 commits into main
Conversation

brianfitzgerald (Contributor)

Adds a Phi-3 conversation template and an example script to demonstrate how to fine-tune on Alpaca-format datasets with the Phi-3 pretraining format.

I've also added a Perplexity metric. It's a variant of the Hugging Face evaluate Perplexity metric, since that implementation both re-tokenizes the inputs and loads a separate copy of the LLM inside the metric. This version instead uses the already-loaded model and tokenizer, and the already-tokenized validation samples, as the inputs for scoring perplexity.
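For context, here is a minimal sketch of that approach (illustrative only, not the exact code in this PR; the function name, batching, and device handling are assumptions). It reuses the loaded causal LM plus the pre-tokenized validation samples, accumulates next-token negative log-likelihood, and exponentiates the mean:

import math
import torch

@torch.no_grad()
def score_perplexity(model, tokenized_samples, device="cuda"):
    # tokenized_samples: iterable of dicts with an "input_ids" tensor of
    # shape (1, seq_len), i.e. the already-tokenized validation set.
    total_nll, total_tokens = 0.0, 0
    model.eval()
    for sample in tokenized_samples:
        input_ids = sample["input_ids"].to(device)
        # With labels == input_ids, HF causal LMs return the mean shifted
        # next-token cross-entropy for the sequence.
        loss = model(input_ids=input_ids, labels=input_ids).loss
        n_targets = input_ids.numel() - 1  # number of next-token targets
        total_nll += loss.item() * n_targets
        total_tokens += n_targets
    return math.exp(total_nll / max(total_tokens, 1))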

How has this been tested?

  • Trained an example LoRA with the provided config.
  • Unit test is provided for the Perplexity metric.

Full list of changes

  • Added Phi-3 example script.
  • Added Phi-3 conversation template (see the format sketch after this list).
  • Added Phi-3 Alpaca prompt format.
  • Added Perplexity metric and a unit test for it.
  • Fixed an issue with caching dataset splits when split_size is greater than 1, i.e. an absolute number of samples rather than a fraction.
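For reference, the conversation format rendered by the Phi-3 tokenizer's bundled chat template looks roughly like the sketch below. The messages are made up, and the exact rendering (BOS handling, trailing tokens) comes from the tokenizer, so treat the printed shape as approximate:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct", trust_remote_code=True
)
messages = [
    {"role": "user", "content": "Give three uses for a paperclip."},
    {"role": "assistant", "content": "Hold paper, reset a router, improvise a hook."},
]
# Render without tokenizing to inspect the raw conversation format.
print(tokenizer.apply_chat_template(messages, tokenize=False))
# Roughly:
# <|user|>
# Give three uses for a paperclip.<|end|>
# <|assistant|>
# Hold paper, reset a router, improvise a hook.<|end|>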

Social Handles (Optional)

https://twitter.com/bfitzgerald242
https://brianfitzgerald.xyz/

Thanks for the review!

@winglian (Collaborator) left a comment

Amazing! 👍

I can help get the tests/linter passing tomorrow

@winglian (Collaborator)

@brianfitzgerald I made some additional fixes; can you check whether everything still looks correct? Thanks!

@brianfitzgerald (Contributor, Author)

LGTM!

@hammoudhasan

This LGTM, but while testing it out I noticed there may be an issue with Phi-3 and flash-attention. On a 4xA100 node, the following warning is emitted when training Phi-3 (I tested other models as well; this occurs only with Phi-3):

[2024-05-23 14:20:16,328] [WARNING] [transformers_modules.microsoft.Phi-3-mini-4k-instruct.5fa34190089f0ee40f9cce3cafc396b89b2e5e83.modeling_phi3.warning_once:329] [PID:1805189] You are not running the flash-attention implementation, expect numerical differences.

This might be related to huggingface/transformers#30547
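As a diagnostic sketch (outside of axolotl, and not the fix applied in this PR): loading the model with flash-attention requested explicitly makes transformers fail loudly if the flash-attn package or kernel isn't usable, instead of silently falling back to the eager path that triggers the warning above. This assumes a recent transformers, the flash-attn package installed, and an Ampere-or-newer GPU; the dtype choice is just an example.

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
)
# If loading succeeds, the config should report the flash-attention path.
print(model.config._attn_implementation)  # expected: "flash_attention_2"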

@e-p-armstrong

I tried out this branch and seem to have run into an issue where it hangs forever when tokenizing prompts at the start of a training run. It gets through a few datasets, then hangs on one of size 1.


Control+C is also very slow / hangs; I need to press it twice.

Config:

base_model: microsoft/Phi-3-mini-4k-instruct
trust_remote_code: true
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer
chat_template: phi_3


load_in_8bit: false
load_in_4bit: false
strict: false

datasets:
  - path: json
    data_files: pretraining_vision.json
    ds_type: json
    type: completion
  - path: json
    data_files: simplified_data_rag_VISION.jsonl
    ds_type: json
    type: sharegpt
  - path: json
    data_files: simplified_data_no_rag_VISION.jsonl
    ds_type: json
    type: sharegpt
  - path: json
    data_files: simplified_data_rag_OPENENDED_VISION.jsonl
    ds_type: json
    type: sharegpt
  - path: json
    data_files: simplified_data_no_rag_OPENENDED_VISION.jsonl
    ds_type: json
    type: sharegpt
  - path: json
    data_files: general_assistant_split_0.jsonl
    ds_type: json
    type: sharegpt
  - path: json
    data_files: pretraining_wiki.json
    ds_type: json
    type: completion
  - path: json
    data_files: simplified_data_rag_WIKI.jsonl
    ds_type: json
    type: sharegpt
  - path: json
    data_files: simplified_data_no_rag_WIKI.jsonl
    ds_type: json
    type: sharegpt
  - path: json
    data_files: simplified_data_rag_OPENENDED_WIKI.jsonl
    ds_type: json
    type: sharegpt
  - path: json
    data_files: simplified_data_no_rag_OPENENDED_WIKI.jsonl
    ds_type: json
    type: sharegpt
  - path: json
    data_files: general_assistant_split_1.jsonl
    ds_type: json
    type: sharegpt
  - path: json
    data_files: pretraining_api.json
    ds_type: json
    type: completion
  - path: json
    data_files: simplified_data_rag_API.jsonl
    ds_type: json
    type: sharegpt
  - path: json
    data_files: simplified_data_no_rag_API.jsonl
    ds_type: json
    type: sharegpt
  - path: json
    data_files: simplified_data_rag_OPENENDED_API.jsonl
    ds_type: json
    type: sharegpt
  - path: json
    data_files: simplified_data_no_rag_OPENENDED_API.jsonl
    ds_type: json
    type: sharegpt
  - path: json
    data_files: general_assistant_split_2.jsonl
    ds_type: json
    type: sharegpt
  - path: json
    data_files: pretraining_docs.json
    ds_type: json
    type: completion
  - path: json
    data_files: simplified_data_rag_DOCS.jsonl
    ds_type: json
    type: sharegpt
  - path: json
    data_files: simplified_data_no_rag_DOCS.jsonl
    ds_type: json
    type: sharegpt
  - path: json
    data_files: simplified_data_rag_OPENENDED_DOCS.jsonl
    ds_type: json
    type: sharegpt
  - path: json
    data_files: simplified_data_no_rag_OPENENDED_DOCS.jsonl
    ds_type: json
    type: sharegpt
dataset_prepared_path: last_run_prepared
output_dir: ./verus-out

sequence_len: 3000
sample_packing: true
pad_to_sequence_len: true

wandb_project: verus-phi3-experiment-2
wandb_entity:
wandb_watch:
wandb_run_id:
wandb_log_model:

gradient_accumulation_steps: 6
micro_batch_size: 1
eval_batch_size: 1
num_epochs: 5
optimizer: galore_adamw_8bit
lr_scheduler: cosine
learning_rate: 0.0000035
cosine_min_lr_ratio: 0
weight_decay: 0 # no weight decay to maximize fact memorization (thanks cgato!)
# adamw hyperparams
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 0.00000001
# Gradient clipping max norm
max_grad_norm: 1.0
noisy_embedding_alpha: 0 # no noisy embedding to ensure maximal memorization 

optim_args:
# For Galore Optimizers the following optim_args are available
    rank: 256 # type: int
    update_proj_gap: 200  # type: int
    scale: 0.25  # type: float
    proj_type: "std" # type: str, default = std

optim_target_modules: 
  - mlp
  - self_attn
train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: false

gradient_checkpointing: unsloth
early_stopping_patience:
resume_from_checkpoint: 
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 10
auto_resume_from_checkpoints: false
eval_steps: 10
saves_per_epoch: 1
eval_sample_packing: false
save_total_limit: 4
debug:
deepspeed: deepspeed_configs/zero2.json

@hammoudhasan

By the way, my issue here was resolved when I turned off sample packing (sample_packing: false in the config). Maybe Phi-3 sample packing isn't compatible with flash-attention. @brianfitzgerald
