1. Evaluation datasets for the PPI, DDI and ChemProt tasks:

  1. Dataset for PPI
  2. Dataset for DDI
  3. Dataset for ChemProt

2. Augmented evaluation datasets for the PPI, DDI and ChemProt tasks:

  1. Dataset for PPI
  2. Dataset for DDI
  3. Dataset for ChemProt

In these augmented datasets we also include the original data instances, i.e., each file is organized as alternating lines:
line N-1: original data
line N: augmented data
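
As a quick illustration, the following is a minimal Python sketch (not part of this repository; the file name `augmented_ppi.tsv` is only a placeholder) for reading such a file back into (original, augmented) pairs:

```python
# Minimal sketch: read an augmented dataset whose lines alternate between
# original and augmented instances. "augmented_ppi.tsv" is a placeholder name.
def read_pairs(path):
    with open(path, encoding="utf-8") as f:
        lines = [line.rstrip("\n") for line in f]
    # line N-1 holds the original instance, line N its augmented counterpart
    return list(zip(lines[0::2], lines[1::2]))

if __name__ == "__main__":
    for original, augmented in read_pairs("augmented_ppi.tsv")[:3]:
        print("original :", original)
        print("augmented:", augmented)
```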

3. Contrastive pre-training procedure:

We implement our project with TensorFlow 1.15 and use a pre-trained BioBERT/PubMedBERT model as the initial model for contrastive pre-training. The contrastive pre-training can be run with the code below, where:

$TASK_NAME is 'aimed', 'ddi13' or 'chemprot';
$BERT_DIR is the path where the pre-trained BERT model is stored;
$RE_DIR is the path to the contrastive learning dataset;
$OUTPUT_DIR is the path where the contrastively pre-trained BERT model will be stored.

```bash
TASK_NAME="task_name"
BERT_DIR="./biobert_v1.1_pubmed"

RE_DIR="./REdata/contrastive_pre-training_dataset/"
OUTPUT_DIR="./REoutput/model_output"

# sweep the number of contrastive pre-training epochs
for i in 2 4 6 8 10
do
	python run_re_cp.py --task_name=$TASK_NAME --do_train=true --do_eval=false --do_predict=false \
		--vocab_file=$BERT_DIR/vocab.txt --bert_config_file=$BERT_DIR/bert_config.json \
		--init_checkpoint=$BERT_DIR/model.ckpt-1000000 --max_seq_length=128 --train_batch_size=256 \
		--learning_rate=2e-5 --num_train_epochs=${i} --do_lower_case=false \
		--data_dir=${RE_DIR} --output_dir=${OUTPUT_DIR} --model_name="cl_pretraining"
done
```

The datasets for contrastive pre-training are available at:

  1. Contrastive pre-training dataset for PPI
  2. Contrastive pre-training dataset for DDI
  3. Contrastive pre-training dataset for ChemProt

4. Fine-tuning of BERT model:

After the contrastive pre-training, we can then fine-tune the BERT model on the evaluation sets of PPI, DDI and ChemProt; note that $BERT_DIR now points to the contrastively pre-trained model produced in step 3:

```bash
TASK_NAME="task_name"
BERT_DIR="./REoutput/contrastive_pre_trained_model"

RE_DIR="./REdata/aimed/"
OUTPUT_DIR="./REoutput/model_output_folder"

# sweep the number of fine-tuning epochs; the inner loop runs the 10 cross-validation folds
for s in 2 4 6 8 10
do
	for i in {1..10}
	do
		python run_re.py --task_name=$TASK_NAME --do_train=true --do_eval=false --do_predict=true \
			--vocab_file=$BERT_DIR/vocab.txt --bert_config_file=$BERT_DIR/bert_config.json \
			--init_checkpoint=$BERT_DIR/model.ckpt-1528 --max_seq_length=128 --train_batch_size=32 \
			--learning_rate=2e-5 --num_train_epochs=${s} --do_lower_case=false \
			--data_dir=${RE_DIR}${i} --output_dir=${OUTPUT_DIR}${i}
	done

	# evaluate the predictions across the 10 folds for this epoch setting
	python ./biocodes/re_eval.py --output_path=${OUTPUT_DIR} --answer_path=${RE_DIR} \
		--fold_number=10 --step=${s} --task_name="aimed_constrastive_learning"
done
```

For the PPI task we use 10-fold cross-validation (as shown above); for DDI and ChemProt this is not needed.
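
To make the expected data layout explicit, here is a small illustrative Python check (not part of the repository) of the fold directories that the fine-tuning loop above reads from (`${RE_DIR}${i}` with `RE_DIR="./REdata/aimed/"` and `i` from 1 to 10):

```python
# Illustrative only: the fine-tuning loop expects one sub-folder per
# cross-validation fold, ./REdata/aimed/1 ... ./REdata/aimed/10.
import os

for fold in range(1, 11):
    fold_dir = os.path.join("./REdata/aimed", str(fold))
    print(fold_dir, "found" if os.path.isdir(fold_dir) else "missing")
```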