Retrieval Augmented Generation (RAG) Based Document ChatBot

Deployment Steps

These steps have been tested on the Ohio Supercomputer Center (OSC) system. If you are not using OSC, expect to follow similar but slightly different steps to get the project running, for example installing a different CUDA-enabled build of PyTorch. OSC uses a Bash shell, so if you are on Windows the shell commands below will need to be adapted.
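When adapting these steps to a system other than OSC, it can help to confirm what GPU stack the current shell actually sees before and after creating the environments. The following is a minimal sketch, not part of the repository: the function name `check_gpu_stack` is hypothetical, and it assumes `python` on the PATH is the interpreter of the active conda environment.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the repository): report whether an NVIDIA
# driver and a CUDA-enabled PyTorch are visible from the current shell.
check_gpu_stack() {
    # nvidia-smi ships with the NVIDIA driver; its absence usually means
    # no GPU was allocated or the driver is not on PATH.
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi --query-gpu=name,driver_version --format=csv,noheader \
            || echo "nvidia-smi: present but failed to query a GPU"
    else
        echo "nvidia-smi: not found"
    fi

    # Ask PyTorch directly whether it can reach CUDA, if it is installed.
    if command -v python >/dev/null 2>&1 && python -c "import torch" >/dev/null 2>&1; then
        python -c "import torch; print('torch', torch.__version__, 'cuda available:', torch.cuda.is_available())"
    else
        echo "torch: not importable in the current environment"
    fi
}

check_gpu_stack
```

If PyTorch reports that CUDA is not available on a node that does have a GPU, the installed PyTorch build likely does not match the local CUDA version.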

Synthetic Data Generation Pipeline

Clone the GitHub repository

git clone https://github.com/EthanGlenwright775/CSE-5914-Answer-Bot.git

Load necessary OSC modules

module load miniconda3 cuda

Create a new conda environment for the pipeline

conda env create -f pipeline_environment.yml

Request resources from OSC

sinteractive -A <project_name> -g 1

Reload the necessary modules on the partition granted by OSC

module load miniconda3 cuda

Activate the environment

conda activate pipeline_1

Modify the arguments in pipeline.sh

article_index: starting point of article db from which to pull articles
article_count: number of articles from which to generate questions
thread_count: number of threads to use
q_eval_threshold: minimum BERT score necessary for QA pairs to be kept in training data
ouput_directory: output directory for training data files
training_file: name of training data file
validation_file: name of validation data file
testing_file: name of testing data file
article_db: article database from which to pull articles (cnn_news, daily_mail, cc_news)
qa_gen_method: methodology by which to generate QA pairs (rephrase, summarize)
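Since article_db and qa_gen_method each accept only the values listed above, a quick check before launching a long run can catch typos early. This is a minimal sketch under stated assumptions: the helpers `is_allowed` and `is_count` are hypothetical and not part of the repository, the example values are placeholders, and how pipeline.sh itself stores its arguments is not shown here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the first argument equals one of
# the remaining arguments (the allowed choices).
is_allowed() {
    local value="$1"
    shift
    local choice
    for choice in "$@"; do
        if [ "$value" = "$choice" ]; then
            return 0
        fi
    done
    return 1
}

# Hypothetical helper: succeed only if the argument is a non-negative integer.
is_count() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Placeholder values; substitute the ones you set in pipeline.sh.
article_db="cnn_news"
qa_gen_method="summarize"
article_count="100"
thread_count="4"

is_allowed "$article_db" cnn_news daily_mail cc_news \
    || echo "unknown article_db: $article_db"
is_allowed "$qa_gen_method" rephrase summarize \
    || echo "unknown qa_gen_method: $qa_gen_method"
is_count "$article_count" || echo "article_count must be a non-negative integer"
is_count "$thread_count" || echo "thread_count must be a non-negative integer"
```

With the placeholder values above the script prints nothing; any message indicates a value that does not match the options documented in this list.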

Run pipeline

bash pipeline.sh

Review the generated data in the output directory that you specified in pipeline.sh

Model Training and Interface

Clone the GitHub repository

git clone https://github.com/EthanGlenwright775/CSE-5914-Answer-Bot.git

Load necessary OSC modules

module load miniconda3 cuda

Create a new conda environment for training

conda env create -f training_environment.yml

Request resources from OSC

sinteractive -A <project_name> -g 1

Reload the necessary modules on the partition granted by OSC

module load miniconda3 cuda

Activate the environment

conda activate training_1

Modify the arguments in trainSeq2Seq.sh

Run training

bash trainSeq2Seq.sh

After training has completed, modify and then run interfaceSeq2Seq.sh

bash interfaceSeq2Seq.sh

Submit articles and ask questions to the model!

Acknowledgements

Credit to Dr. Eric Fosler-Lussier and Amad Hussain for their guidance throughout this project. Additional credit to Amad Hussain for providing portions of the repository, including lightning_t5_trainer.py, training_environment.yml, and trainSeq2Seq.sh.
