# NeurIPS 2023 LLM Efficiency Challenge Quickstart Guide

The NeurIPS 2023 LLM Efficiency Challenge is a competition focused on training one LLM for 24 hours on a single GPU; the team with the best LLM gets to present their results at NeurIPS 2023.

This quickstart guide illustrates the main steps for getting started with Lit-GPT, which was selected as the competition's official starter kit.

## Competition Facts

### Permitted GPUs

- 1x A100 (40 GB RAM)
- 1x RTX 4090 (24 GB RAM)

### Permitted models

- All transformer-based LLM base models that are not finetuned yet.

The subset of Lit-GPT models supported in this competition is listed in the table below. Per the rules of the challenge, it does not include models that have already been finetuned or otherwise aligned.

| Models in Lit-GPT | Reference |
| --- | --- |
| Meta AI Llama 2 Base | Touvron et al. 2023 |
| TII UAE Falcon Base | TII 2023 |
| OpenLM Research OpenLLaMA | Geng & Liu 2023 |
| EleutherAI Pythia | Biderman et al. 2023 |
| StabilityAI StableLM Base | Stability AI 2023 |

### Permitted datasets

Any open-source dataset is allowed. Originally, per the competition rules, datasets that utilize "generated content" from other LLMs were not permitted. However, the rules were later softened to also allow LLM-generated datasets, provided those datasets are made available and their use does not violate the usage restrictions and guidelines of the LLM that generated them. If you plan to use a dataset that is not explicitly listed on the challenge website, or want to use LLM-generated data, it is recommended to reach out to the organizers and confirm that it is in line with the competition rules.

Examples of permitted datasets include the Databricks Dolly 15k dataset, which is used later in this guide.

You are also allowed to create your own datasets, provided they are made publicly accessible under an open-source license and comply with the LLM-generated-content rules described above.

Helpful competition rules relevant to the dataset choice:

- The maximum prompt/completion length the models are expected to handle is 2048 tokens.
- The evaluation will be on English texts only.

### Submission deadline

- October 25, 2023 (please check the official website in case of updates)

## Lit-GPT Setup

Use the following steps to set up the Lit-GPT repository on your machine.

```bash
git clone https://github.com/Lightning-AI/lit-gpt
cd lit-gpt
pip install -r requirements.txt tokenizers sentencepiece huggingface_hub
```
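Before moving on, it can be worth verifying that PyTorch, which is installed via requirements.txt above, can actually see your GPU. A minimal check:

```bash
# Print the name of GPU 0 if CUDA is available, otherwise a short notice.
python -c "import torch; print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'no GPU visible')"
```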

## Downloading Model Checkpoints

This section explains how to download the StableLM 3B Base model, one of the smallest models supported in Lit-GPT (an even smaller supported model is Pythia, which starts at 70M parameters). The downloaded and converted checkpoints will occupy approximately 28 GB of disk space.

```bash
python scripts/download.py \
  --repo_id stabilityai/stablelm-base-alpha-3b

python scripts/convert_hf_checkpoint.py \
  --checkpoint_dir checkpoints/stabilityai/stablelm-base-alpha-3b
```
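To confirm the download and conversion succeeded, you can list the checkpoint directory. The exact file names depend on your Lit-GPT version, but you should see the converted weights (for example, a lit_model.pth file) next to the tokenizer and config files:

```bash
# Inspect the converted checkpoint directory.
ls -lh checkpoints/stabilityai/stablelm-base-alpha-3b
```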

While StableLM 3B Base is useful as a first starter model to set things up, you may want to use the more capable Falcon 7B or Llama 2 7B/13B models later. See the `download_*` tutorials in Lit-GPT to download other model checkpoints.
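For example, downloading and converting Falcon 7B follows the same pattern as above; this is a sketch, and the `download_falcon` tutorial remains the authoritative reference:

```bash
# Download the Falcon 7B base model from Hugging Face and convert it
# to the Lit-GPT checkpoint format.
python scripts/download.py \
  --repo_id tiiuae/falcon-7b

python scripts/convert_hf_checkpoint.py \
  --checkpoint_dir checkpoints/tiiuae/falcon-7b
```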

After downloading and converting the model checkpoint, you can test the model via the following command:

```bash
python generate/base.py \
  --prompt "LLM efficiency competitions are fun, because" \
  --checkpoint_dir checkpoints/stabilityai/stablelm-base-alpha-3b
```

## Downloading and Preparing Datasets

The following command will download and preprocess the Dolly 15k dataset for the StableLM 3B Base model:

```bash
python scripts/prepare_dolly.py \
  --checkpoint_dir checkpoints/stabilityai/stablelm-base-alpha-3b \
  --destination_path data/dolly-stablelm3b
```

> **Note**
>
> The preprocessed dataset is specific to the StableLM 3B model. If you use a different model like Falcon or Llama 2 later, you'll need to prepare the dataset with that model's checkpoint directory, because each model uses a different tokenizer; see the sketch below.
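For instance, if you switch to the Falcon 7B checkpoint from earlier, you would re-run the preparation step against that directory. A sketch, where the destination path is just an illustrative name:

```bash
# Re-tokenize Dolly with Falcon's tokenizer by pointing at the Falcon checkpoint.
python scripts/prepare_dolly.py \
  --checkpoint_dir checkpoints/tiiuae/falcon-7b \
  --destination_path data/dolly-falcon7b
```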

## Finetuning

Low-rank Adaptation (LoRA) is a good choice for a first finetuning run. The Dolly dataset has ~15k samples, and the finetuning might take half an hour.

To accelerate this for testing purposes, edit the `finetune/lora.py` script and change `max_iters = 50000` to `max_iters = 500` at the top of the file.
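If you prefer to do this from the command line, a one-liner along the following lines works, assuming the assignment appears in the file exactly as `max_iters = 50000`:

```bash
# Patch max_iters in place for a quick test run.
sed -i 's/max_iters = 50000/max_iters = 500/' finetune/lora.py
```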

> **Note**
>
> The Dolly dataset has relatively long samples, which could result in out-of-memory issues. The maximum context length used for the evaluation, according to the official competition rules, is 2,048 tokens. Hence, it's highly recommended to prepare the dataset with a fixed maximum sequence length, for example via `--max_seq_length 2048`, as spelled out below.
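Spelled out for the StableLM 3B setup used in this guide, that preparation command would look as follows:

```bash
# Re-prepare the dataset with samples truncated to the 2,048-token
# evaluation limit from the competition rules.
python scripts/prepare_dolly.py \
  --checkpoint_dir checkpoints/stabilityai/stablelm-base-alpha-3b \
  --destination_path data/dolly-stablelm3b \
  --max_seq_length 2048
```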

The following command finetunes the model:

```bash
# CUDA_VISIBLE_DEVICES selects which GPU index to run on; adjust it
# (or drop it) to match your machine.
CUDA_VISIBLE_DEVICES=2 python finetune/lora.py \
  --data_dir data/dolly-stablelm3b \
  --checkpoint_dir "checkpoints/stabilityai/stablelm-base-alpha-3b" \
  --out_dir "out/stablelm3b/dolly/lora/experiment1" \
  --precision "bf16-true"
```

With 500 iterations, this takes approximately 1-2 minutes on an A100 and uses 26.30 GB of GPU memory.

If you are using an RTX 4090, change `micro_batch_size = 4` to `micro_batch_size = 1` in `finetune/lora.py` so that the model only uses 12.01 GB of memory; see the sketch below.
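As with `max_iters`, you can patch this from the shell, assuming the assignment is spelled `micro_batch_size = 4` in your copy of the script:

```bash
# Reduce the micro batch size to fit into 24 GB of GPU memory.
sed -i 's/micro_batch_size = 4/micro_batch_size = 1/' finetune/lora.py
```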

(More finetuning settings are explained in the Lit-GPT finetuning tutorials.)

## Local Evaluation

The official competition evaluation will use a small subset of HELM tasks, which includes BigBench (general), MMLU (knowledge), TruthfulQA (knowledge and harm in a multiple-choice format), CNN/DailyMail (news summarization), GSM8K (math), and BBQ (bias).

HELM is currently also being integrated into Lit-GPT to evaluate LLMs before submission.

However, a tool with a more convenient interface is EleutherAI's Evaluation Harness, which contains several tasks that overlap with HELM, for example BigBench, TruthfulQA, and GSM8K. We can set up the Evaluation Harness as follows:

```bash
pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git@master
```

And then we can use it via the following command:

```bash
python eval/lm_eval_harness.py \
  --checkpoint_dir "checkpoints/stabilityai/stablelm-base-alpha-3b" \
  --precision "bf16-true" \
  --eval_tasks "[truthfulqa_mc,gsm8k]" \
  --batch_size 4 \
  --save_filepath "results-stablelm-3b.json"
```

(You can find the full task list in the Evaluation Harness task table.)
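To see which task names your installed harness version accepts, you can also print its task registry. This sketch assumes the `ALL_TASKS` attribute of the harness releases from that era, which may differ in newer versions:

```bash
# Print all registered Evaluation Harness task names, one per line.
python -c "from lm_eval import tasks; print('\n'.join(sorted(tasks.ALL_TASKS)))"
```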

To evaluate a LoRA-finetuned model, you first need to merge the LoRA weights with the base model to create a new checkpoint file:

```bash
python scripts/merge_lora.py \
  --checkpoint_dir "checkpoints/stabilityai/stablelm-base-alpha-3b/" \
  --lora_path "out/stablelm3b/dolly/lora/experiment1/lit_model_lora_finetuned.pth" \
  --out_dir "out/lora_merged/stablelm-base-alpha-3b/"

cp checkpoints/stabilityai/stablelm-base-alpha-3b/*.json \
  out/lora_merged/stablelm-base-alpha-3b/
```

For more information on LoRA weight merging, please see the Merging LoRA Weights section of the LoRA finetuning documentation.

After merging the weights, we can use the `lm_eval_harness.py` script as before; the only difference is that we now use the new checkpoint folder containing the merged LoRA model:

```bash
python eval/lm_eval_harness.py \
  --checkpoint_dir "out/lora_merged/stablelm-base-alpha-3b" \
  --precision "bf16-true" \
  --eval_tasks "[truthfulqa_mc,gsm8k]" \
  --batch_size 4 \
  --save_filepath "results-stablelm-3b-lora.json"
```
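Note that the command above writes to a different file name than the base-model run, so the earlier results are not overwritten and the two can be compared. To pretty-print the saved metrics, a sketch that assumes the harness's usual top-level `results` key:

```bash
# Pretty-print the metrics section of the saved results file.
python -c "import json; r = json.load(open('results-stablelm-3b-lora.json')); print(json.dumps(r.get('results', r), indent=2))"
```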

## Submission

You will be required to submit a Docker image of your solution. Fortunately, the organizers provide a GitHub repository with the exact submission steps, along with a toy-submission setup guide for testing your model locally before submitting.
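As a rough local smoke test before following the organizers' guide, the workflow boils down to building and running the image. The image name, port mapping, and Dockerfile layout below are illustrative assumptions, not the organizers' exact setup:

```bash
# Build the submission image from the current directory's Dockerfile
# and run it with GPU access (requires the NVIDIA Container Toolkit).
docker build -t my-submission .
docker run --gpus all --rm -p 8080:80 my-submission
```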

## Additional Information & Resources