WangChanGLM 🐘 - The Multilingual Instruction-Following Model

Blog | Codes | Demo

WangChanGLM is a multilingual, instruction-finetuned Facebook XGLM-7.5B using open-source, commercially permissible datasets (LAION OIG chip2 and infill_dbpedia, DataBricks Dolly v2, OpenAI TL;DR, and Hello-SimpleAI HC3; about 400k examples), released under CC-BY-SA 4.0. The models are trained to perform a subset of instruction-following tasks we found most relevant, namely reading comprehension, brainstorming, and creative writing. We provide the weights for a model finetuned on an English-only dataset (wangchanglm-7.5B-sft-en) and another checkpoint further finetuned on a Google-Translated Thai dataset (wangchanglm-7.5B-sft-enth). We perform Vicuna-style evaluation using both humans and ChatGPT (in our case, gpt-3.5-turbo since we are still on the waitlist for gpt-4) and observe some discrepancies between the two types of annotators. All training and evaluation code is shared under the Apache-2.0 license on our GitHub, as well as datasets and model weights on HuggingFace. In a similar manner to Dolly v2, we use only open-source, commercially permissive pretrained models and datasets; our models are neither restricted by a non-commercial clause like models that use LLaMA as a base nor by a non-compete clause like models that use self-instruct datasets from ChatGPT. See our live demo here.

Models

We provide various versions of our models as follows:

- pythainlp/wangchanglm-7.5B-sft-en: finetuned on the English-only dataset
- pythainlp/wangchanglm-7.5B-sft-enth: further finetuned on the Google-Translated Thai dataset

Sharded versions used in demo:

Training Sets

We provide our training sets as follows:

- pythainlp/final_training_set_v1: the instruction-following training set used in the finetuning commands below
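As a quick sanity check, the training set can be pulled from the HuggingFace Hub with the datasets library; this is a minimal sketch for inspection only, not part of the training scripts, and the split and column names are assumptions.

from datasets import load_dataset

# Download the instruction-following training set from the HuggingFace Hub.
dataset = load_dataset("pythainlp/final_training_set_v1")

print(dataset)              # available splits and their sizes
print(dataset["train"][0])  # inspect one example (split/column names may differ)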

Finetuning

Multi-world LoRA

We finetuned XGLM-7.5B on 4 V100 GPUs (32GB VRAM) with the hyperparameters described in script/train_sft_peft_multi_world.py.

# effective batch size = 128 (4 GPUs * 1 batch size * 32 gradient accumulation)
python -m torch.distributed.launch --nproc_per_node=4 train_sft_peft_multi_world.py \
--per_device_train_batch_size 1 --gradient_accumulation_steps 32 \
--wandb_project your_project_name \
--model_name facebook/xglm-7.5B \
--dataset_name pythainlp/final_training_set_v1 \
--adapter_name save_adapter_to

The adapter is merged into the main weights with the merge script from lvwerra/trl.
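For reference, the same merge can also be done directly with peft's merge_and_unload(); the sketch below assumes the adapter directory save_adapter_to from the command above and a hypothetical output directory, and is not the repository's exact script.

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model and attach the trained LoRA adapter.
base_model = AutoModelForCausalLM.from_pretrained("facebook/xglm-7.5B")
model = PeftModel.from_pretrained(base_model, "save_adapter_to")

# Fold the low-rank updates into the base weights and drop the adapter wrappers.
merged = model.merge_and_unload()

# Save the merged checkpoint (output path is hypothetical).
merged.save_pretrained("wangchanglm-7.5B-sft-merged")
AutoTokenizer.from_pretrained("facebook/xglm-7.5B").save_pretrained("wangchanglm-7.5B-sft-merged")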

Single-world LoRA

It is possible to finetune XGLM-7.5B on a single 32GB-VRAM GPU, or on multiple GPUs with smaller VRAM, with the hyperparameters described in script/train_sft_peft_single_world.py.

# effective batch size = 128 (1 GPU * 2 batch size * 64 gradient accumulation)
python train_sft_peft_single_world.py \
--per_device_train_batch_size 2 --gradient_accumulation_steps 64 \
--wandb_project your_project_name \
--model_name facebook/xglm-7.5B \
--dataset_name pythainlp/final_training_set_v1 \
--adapter_name save_adapter_to
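For illustration, a minimal PEFT/LoRA setup that fits XGLM-7.5B on a single 32GB GPU might look like the sketch below; the rank, alpha, dropout, and target modules are assumptions, not necessarily the hyperparameters used in script/train_sft_peft_single_world.py.

import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load the frozen base model in half precision to fit a single 32GB GPU.
model = AutoModelForCausalLM.from_pretrained(
    "facebook/xglm-7.5B",
    torch_dtype=torch.float16,
    device_map="auto",
)

# LoRA adapters on the attention projections (values below are assumptions).
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the low-rank adapters are trainable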

Full-finetuning

We also provide a script for full finetuning, which we experimented with on a smaller model and a different set of training data.

python -m torch.distributed.launch --nproc_per_node=8 train_sft.py \
--per_device_train_batch_size=8 --per_device_eval_batch_size=8 --gradient_accumulation_steps=16 \
--model_name=facebook/xglm-1.7B --bf16 --deepspeed=../config/sft_deepspeed_config.json
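The command above delegates memory optimization to DeepSpeed via ../config/sft_deepspeed_config.json. A minimal configuration consistent with those flags might look like the sketch below; the ZeRO stage and values are assumptions, not the repository's actual config file.

import json

# Assumed minimal DeepSpeed config: bf16 training with ZeRO stage 2,
# mirroring the batch-size flags passed on the command line.
deepspeed_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 2},
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 16,
}

with open("sft_deepspeed_config.json", "w") as f:
    json.dump(deepspeed_config, f, indent=2)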

Inference

We performed inference on the OpenAssistant prompts using hyperparameters described in script/generate_huggingface_answer.py.

python generate_huggingface_answer.py --input_fname ../data/oasst1_gpt35turbo_answer.csv \
--model_name pythainlp/wangchanglm-7.5B-sft-en \
--tokenizer_name pythainlp/wangchanglm-7.5B-sft-en \
--output_fname ../data/oasst1_wangchang_sft_en_only_answer_answer.csv 
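For a single prompt, inference with the released checkpoint can be sketched as follows; the prompt template and generation hyperparameters here are assumptions, not those of generate_huggingface_answer.py.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "pythainlp/wangchanglm-7.5B-sft-en"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

# Prompt template is an assumption for illustration.
prompt = "<human>: Write a short story about an elephant.\n<bot>: "
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs, max_new_tokens=256, do_sample=True, top_p=0.95, temperature=0.9
)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))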

Evaluation

We evaluated each pair of model answers using gpt-3.5-turbo as described in script/eval_vicuna_style.py. The entire inference and evaluation pipeline is stored in script/infer_and_eval.sh. The human questionnaires are stored in data/human_questionnaire.
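The pairwise judging can be sketched as below with the OpenAI client (openai>=1.0); the judge prompt wording and scoring scale are assumptions, not the exact ones in script/eval_vicuna_style.py.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge(question: str, answer_a: str, answer_b: str) -> str:
    """Ask gpt-3.5-turbo to score two candidate answers to the same question."""
    prompt = (
        f"Question: {question}\n\n"
        f"Assistant A: {answer_a}\n\n"
        f"Assistant B: {answer_b}\n\n"
        "Rate each assistant's helpfulness, relevance, and accuracy on a 1-10 scale, "
        "then explain briefly. Output 'A: <score>, B: <score>' followed by the explanation."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic judging
    )
    return response.choices[0].message.content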

Environmental Impact

Experiments were conducted using a private infrastructure with a carbon efficiency of 0.432 kgCO2eq/kWh. A cumulative 500 hours of computation was performed on Tesla V100-SXM2-32GB hardware (TDP of 300W). Total emissions are estimated at 64.8 kgCO2eq (500 h × 0.3 kW × 0.432 kgCO2eq/kWh), of which 0 percent was directly offset. Estimations were conducted using the Machine Learning Impact calculator presented in lacoste2019quantifying.

Bibtex

@software{charin_polpanumas_2023_7878101,
  author       = {Charin Polpanumas and
                  Wannaphong Phatthiyaphaibun and
                  Patomporn Payoungkhamdee and
                  Peerat Limkonchotiwat and
                  Lalita Lowphansirikul and
                  Can Udomcharoenchaikit and
                  Titipat Achakulwisut and
                  Ekapol Chuangsuwanich and
                  Sarana Nutanong},
  title        = {{WangChanGLM🐘 — The Multilingual Instruction-Following Model}},
  month        = apr,
  year         = 2023,
  publisher    = {Zenodo},
  version      = {v0.1},
  doi          = {10.5281/zenodo.7878101},
  url          = {https://doi.org/10.5281/zenodo.7878101}
}

Acknowledgements

We would like to thank Hugging Face for the open-source infrastructure and ecosystem they have built, especially lvwerra of the trl repository. We give our appreciation to the open-source finetuning pioneers that came before us, including but not limited to Alpaca, Alpaca-LoRA, GPT4All, OpenAssistant, Koala, Vicuna, and Dolly.

License

The source code is licensed under the Apache-2.0 license. The model weights are licensed under CC-BY-SA 4.0. Finetuning datasets are sourced from LAION OIG chip2 and infill_dbpedia (Apache-2.0), DataBricks Dolly v2 (Apache-2.0), OpenAI TL;DR (MIT), and Hello-SimpleAI HC3 (CC-BY-SA).