Genomic GPT

An experiment in finetuning pretrained LLMs (OPT-2.7B and GPT-175B) on protein sequence -> protein function pairs scraped from the UniProt database, restricted to annotations at confidence levels 4 and 5.

Uses ColossalAI to optimize local finetuning of the OPT model on dual RTX 4090s.
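As an illustration of the confidence-level restriction described above, here is a minimal sketch of filtering a UniProt TSV export by annotation score. The column name `Annotation`, the helper `filter_by_annotation_score`, and the sample rows (with placeholder entry IDs) are assumptions made for this example; the repository's own preprocessing may work differently.

```python
import csv
import io


def filter_by_annotation_score(tsv_text, min_score=4.0, score_column="Annotation"):
    """Return the rows whose annotation score is at least min_score.

    The column name "Annotation" is an assumption about the TSV export.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    kept = []
    for row in reader:
        try:
            score = float(row[score_column])
        except (KeyError, ValueError, TypeError):
            continue  # skip rows with a missing or malformed score
        if score >= min_score:
            kept.append(row)
    return kept


# Made-up sample rows; the entry IDs are placeholders, not real accessions.
sample = (
    "Entry\tAnnotation\tSequence\n"
    "X0001\t5.0\tMKTAYIAK\n"
    "X0002\t2.0\tMSDNE\n"
    "X0003\t4.0\tMALWMR\n"
)

rows = filter_by_annotation_score(sample)
print([r["Entry"] for r in rows])
```

Only the rows scored 4 or 5 survive the filter; the row scored 2 is dropped.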

Prerequisites

You must install ColossalAI, building it from source according to its installation instructions.

You should also have at least 40 GB of VRAM if you wish to finetune the OPT model locally. Note that while ColossalAI provides orders-of-magnitude speedups in certain finetuning situations, its design means it cannot offload weights to RAM. Therefore, the entire model must fit in GPU memory.
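A rough back-of-the-envelope estimate shows where a figure of about 40 GB can come from. The sketch below assumes mixed-precision training with the Adam optimizer, which is commonly budgeted at 16 bytes per parameter (fp16 weights and gradients, plus fp32 master weights and two fp32 optimizer states). This README does not state the optimizer or precision actually used, so treat the numbers as an illustration only; activations and framework overhead are not included.

```python
def estimate_training_memory_gb(num_params, bytes_per_param=16):
    """Estimate memory for model state during finetuning, in GB (10^9 bytes).

    Assumed per-parameter breakdown for mixed-precision Adam:
      2 bytes  fp16 weights
      2 bytes  fp16 gradients
      4 bytes  fp32 master weights
      4 bytes  fp32 Adam momentum
      4 bytes  fp32 Adam variance
    Activations and framework overhead are NOT included.
    """
    return num_params * bytes_per_param / 1e9


opt_2_7b = estimate_training_memory_gb(2.7e9)
print(f"OPT-2.7B model state: about {opt_2_7b:.1f} GB")
```

Under these assumptions the model state alone comes to roughly 43 GB, which fits within the combined 48 GB of two 24 GB cards only if it is sharded across them.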

Usage

# make_dataset.py writes a JSONL file in the correct format to the data directory
# the --gpt flag determines whether to preprocess for GPT-3 or OPT
./make_dataset.py -f <path to raw tsv> -o <path to output> --gpt

# run the training script
./run_clm.sh
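To give a feel for the dataset step above, here is a minimal sketch of turning a sequence/function TSV into JSONL. It is not the code in make_dataset.py: the function `rows_to_jsonl`, the column names `Sequence` and `Function [CC]`, the `" -> "` separator, and the two record layouts (a prompt/completion pair for GPT-3 style finetuning, a single `text` field for causal language modeling) are all assumptions chosen for illustration.

```python
import csv
import io
import json


def rows_to_jsonl(tsv_text, gpt=False, seq_col="Sequence", func_col="Function [CC]"):
    """Convert TSV rows into JSONL, one JSON object per line.

    Hypothetical layouts (assumptions, not the repository's actual format):
      gpt=True  -> {"prompt": <sequence>, "completion": " <function>"}
      gpt=False -> {"text": "<sequence> -> <function>"}
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    lines = []
    for row in reader:
        seq = (row.get(seq_col) or "").strip()
        func = (row.get(func_col) or "").strip()
        if not seq or not func:
            continue  # drop rows missing either half of the pair
        if gpt:
            record = {"prompt": seq, "completion": " " + func}
        else:
            record = {"text": f"{seq} -> {func}"}
        lines.append(json.dumps(record))
    return "\n".join(lines)


# Made-up sample data; the second row has no function text and is skipped.
sample = (
    "Entry\tSequence\tFunction [CC]\n"
    "X0001\tMKTAYIAK\tFUNCTION: Example kinase.\n"
    "X0002\tMSDNE\t\n"
)

print(rows_to_jsonl(sample))
print(rows_to_jsonl(sample, gpt=True))
```

Keeping one JSON object per line means the output can be streamed or truncated for quick tests without parsing the whole file.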

The original checkpoint was trained on more than 1.5 million protein sequence -> protein function annotation pairs scraped from UniProt. This repository provides only a small 200-row version of the dataset, in the rawdata directory, for testing purposes.


ianmkim/genomicGPT
