
Fine-tuning Llama2-7b, BERT base and XLNet base LLMs for classifying emails for Deutsche Bahn

Hi, welcome!

In this repo, we perform parameter-efficient fine-tuning (PEFT) of Meta's Llama2-7b, BERT base and XLNet base language models on the Enron email corpus to demonstrate a proof of concept for email classification at Deutsche Bahn.

The workflow is as follows:

  1. The Enron email corpus is an open dataset of more than 500,000 raw emails. We work on a subset of ~1,700 pre-labeled emails available here - https://data.world/brianray/enron-email-dataset - and another subset of ~12,000 emails labeled with ChatGPT (GPT-3.5).
  2. We narrow the labels down to a single layer, i.e. we consider only 'Cat_1_level_1' and 'Cat_1_level_2' for our purpose. The original (layered) labeling methodology is explained in detail here - https://datascience.stackexchange.com/questions/92341/how-to-read-the-labeled-enron-dataset-categories/92737#92737
  3. Regular expressions are used to clean the heavily unstructured email bodies in the labeled dataset, and the email subject and body are lemmatized (see the cleaning sketch after this list). The dataset is split into training (~1,400) and testing (~300) subsets. The final training dataset is hosted on Hugging Face here - https://huggingface.co/datasets/neelblabla/enron_labeled_email-llama2-7b_finetuning
  4. The following models and methodologies are then used:
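
A minimal cleaning sketch for step 3, assuming hypothetical column names (`subject`, `body`) and an NLTK-based lemmatizer; the actual preprocessing code in this repo may differ:

```python
import re
import pandas as pd
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)
lemmatizer = WordNetLemmatizer()

def clean_email(text: str) -> str:
    """Strip header lines, addresses, URLs and noise, then lowercase and lemmatize."""
    text = re.sub(r"(?im)^(from|to|cc|bcc|sent|subject|date):.*$", " ", text)  # forwarded header lines
    text = re.sub(r"\S+@\S+", " ", text)                                       # email addresses
    text = re.sub(r"http\S+|www\.\S+", " ", text)                              # URLs
    text = re.sub(r"[^a-zA-Z\s]", " ", text)                                   # non-alphabetic characters
    text = re.sub(r"\s+", " ", text).strip().lower()
    return " ".join(lemmatizer.lemmatize(tok) for tok in text.split())

# Hypothetical file and column names, for illustration only:
df = pd.read_csv("enron_labeled_subset.csv")
df["clean_text"] = (df["subject"].fillna("") + " " + df["body"].fillna("")).map(clean_email)
```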

Llama2-7b

  1. Prompts are generated from the final training dataset and parameter-efficient fine-tuning of Llama2-7b is performed on it (see the sketch after this list). The adapter parameters are then merged into the base model, and the final fine-tuned model is hosted in the following Hugging Face repo - https://huggingface.co/neelblabla/email-classification-llama2-7b-peft
  2. The model is then evaluated on the testing dataset (with the maximum length of input test prompts restricted to 100): 65% of the model-generated responses contain strong indicative signals matching the human labels of the emails.
  3. We observe that the model's performance degrades when the maximum input prompt length is increased to 180, 200, 300, etc.
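
A minimal sketch of the fine-tuning step, assuming a LoRA recipe with 4-bit quantization via `trl`'s `SFTTrainer` and a prompt column named `text`; the actual hyperparameters are in the accompanying code, and argument names vary across `trl` releases:

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig
from trl import SFTTrainer

base_model = "meta-llama/Llama-2-7b-hf"
dataset = load_dataset("neelblabla/enron_labeled_email-llama2-7b_finetuning", split="train")

tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token

# 4-bit quantization keeps the 7B model within a single-GPU memory budget (an assumption,
# not necessarily the configuration used for the published checkpoint).
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained(base_model, quantization_config=bnb_config, device_map="auto")

lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                         target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=lora_config,
    dataset_text_field="text",          # assumes the prompt column is named "text"
    max_seq_length=512,
    tokenizer=tokenizer,
    args=TrainingArguments(output_dir="llama2-7b-email-peft",
                           per_device_train_batch_size=4,
                           num_train_epochs=3,
                           learning_rate=2e-4,
                           fp16=True),
)
trainer.train()
trainer.model.save_pretrained("llama2-7b-email-peft")  # saves the LoRA adapters; merging is a separate step
```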

BERT_base_uncased-110m

  1. We employ a pre-trained BERT base model containing 110 million parameters.
  2. The email content is tokenized using BERT's pre-trained tokenizer.
  3. To enhance training efficiency, we freeze the first 8 layers of the model, focusing our fine-tuning efforts exclusively on the remaining 4 layers using our training data.
  4. The specific regularization and optimization methods and hyperparameters are detailed in the accompanying code.
  5. Additionally, we introduce a classifier layer atop the BERT model, tailored to the number of output classes required for the task (a minimal sketch follows this list).
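
A minimal sketch of the BERT setup described above; the number of output classes and the dropout value are placeholders, not the values used in the accompanying code:

```python
import torch
from torch import nn
from transformers import BertModel, BertTokenizerFast

NUM_CLASSES = 6  # hypothetical; set to the number of Cat_1 labels in the dataset

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

class BertEmailClassifier(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        # Freeze the embeddings and the first 8 of 12 encoder layers;
        # only layers 9-12 and the classifier head receive gradient updates.
        for param in self.bert.embeddings.parameters():
            param.requires_grad = False
        for layer in self.bert.encoder.layer[:8]:
            for param in layer.parameters():
                param.requires_grad = False
        self.dropout = nn.Dropout(0.1)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled = outputs.pooler_output           # [CLS] representation
        return self.classifier(self.dropout(pooled))

model = BertEmailClassifier(NUM_CLASSES)
batch = tokenizer(["Please review the attached contract."], padding=True,
                  truncation=True, max_length=256, return_tensors="pt")
logits = model(batch["input_ids"], batch["attention_mask"])
```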

XLNet_base-110m

The workflow mirrors the BERT setup above: XLNet's pre-trained tokenizer, the same layer-freezing scheme, and a classifier layer on top (a minimal sketch of the swap follows).
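
A minimal sketch of the swap, assuming the same freezing scheme is applied to `xlnet-base-cased`:

```python
from transformers import XLNetModel, XLNetTokenizerFast

tokenizer = XLNetTokenizerFast.from_pretrained("xlnet-base-cased")
xlnet = XLNetModel.from_pretrained("xlnet-base-cased")

# XLNet base also stacks 12 transformer blocks, exposed as `.layer`;
# freeze the first 8 and fine-tune the rest plus a task-specific head.
for layer in xlnet.layer[:8]:
    for param in layer.parameters():
        param.requires_grad = False
```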


PEFT Fine-Tuned Llama2-7b Model - https://huggingface.co/neelblabla/email-classification-llama2-7b-peft

Labeled Enron Dataset for Fine-Tuning Llama2-7b - https://huggingface.co/datasets/neelblabla/enron_labeled_email-llama2-7b_finetuning
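
A quick way to try the published artifacts, assuming the model repo ships tokenizer files and the dataset exposes a `text` column (both assumptions):

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "neelblabla/email-classification-llama2-7b-peft"
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)  # assumes tokenizer files are in the repo

dataset = load_dataset("neelblabla/enron_labeled_email-llama2-7b_finetuning", split="train")
prompt = dataset[0]["text"]                          # "text" is an assumed column name

# Mirror the evaluation setup: cap the input prompt length at 100.
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=100).to(model.device)
output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```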

