1. Evaluation datasets for the PPI, DDI and ChemProt tasks:

  1. Dataset for PPI
  2. Dataset for DDI
  3. Dataset for ChemProt

2. Augmented evaluation datasets for the PPI, DDI and ChemProt tasks:

  1. Dataset for PPI
  2. Dataset for DDI
  3. Dataset for ChemProt

In these augmented datasets we also include the original data instances, i.e., each file is organized as alternating lines:
line N-1: original data
line N: augmented data
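
As a quick illustration, the following is a minimal Python sketch (not part of this repository; the file name `augmented_ppi.tsv` is only a placeholder) for reading such a file back into (original, augmented) pairs:

```python
# Minimal sketch: read an augmented dataset whose lines alternate between
# original and augmented instances. "augmented_ppi.tsv" is a placeholder name.
def read_pairs(path):
    with open(path, encoding="utf-8") as f:
        lines = [line.rstrip("\n") for line in f]
    # line N-1 holds the original instance, line N its augmented counterpart
    return list(zip(lines[0::2], lines[1::2]))

if __name__ == "__main__":
    for original, augmented in read_pairs("augmented_ppi.tsv")[:3]:
        print("original :", original)
        print("augmented:", augmented)
```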

3. Contrastive pre-training procedure:

We implement our project with TensorFlow 1.15 and use a pre-trained BioBERT/PubMedBERT model as the initial model for contrastive pre-training. The contrastive pre-training can be run with the code below, where:

$TASK_NAME is 'aimed', 'ddi13' or 'chemprot';
$BERT_DIR is the path where the pre-trained BERT model is stored;
$RE_DIR is the path to the contrastive learning dataset;
$OUTPUT_DIR is the path where the contrastively pre-trained BERT model will be stored.

```bash
TASK_NAME="task_name"
BERT_DIR="./biobert_v1.1_pubmed"

RE_DIR="./REdata/contrastive_pre-training_dataset/"
OUTPUT_DIR="./REoutput/model_output"

# sweep the number of contrastive pre-training epochs
for i in 2 4 6 8 10
do
	python run_re_cp.py --task_name=$TASK_NAME --do_train=true --do_eval=false --do_predict=false \
		--vocab_file=$BERT_DIR/vocab.txt --bert_config_file=$BERT_DIR/bert_config.json \
		--init_checkpoint=$BERT_DIR/model.ckpt-1000000 --max_seq_length=128 --train_batch_size=256 \
		--learning_rate=2e-5 --num_train_epochs=${i} --do_lower_case=false \
		--data_dir=${RE_DIR} --output_dir=${OUTPUT_DIR} --model_name="cl_pretraining"
done
```

The datasets for contrastive pre-training are available at:

  1. Contrastive pre-training dataset for PPI
  2. Contrastive pre-training dataset for DDI
  3. Contrastive pre-training dataset for ChemProt

4. Fine-tuning of BERT model:

After the contrastive pre-training, we can then fine-tune the BERT model on the evaluation sets of PPI, DDI and ChemProt; note that $BERT_DIR now points to the contrastively pre-trained model produced in step 3:

```bash
TASK_NAME="task_name"
BERT_DIR="./REoutput/contrastive_pre_trained_model"

RE_DIR="./REdata/aimed/"
OUTPUT_DIR="./REoutput/model_output_folder"

# sweep the number of fine-tuning epochs; the inner loop runs the 10 cross-validation folds
for s in 2 4 6 8 10
do
	for i in {1..10}
	do
		python run_re.py --task_name=$TASK_NAME --do_train=true --do_eval=false --do_predict=true \
			--vocab_file=$BERT_DIR/vocab.txt --bert_config_file=$BERT_DIR/bert_config.json \
			--init_checkpoint=$BERT_DIR/model.ckpt-1528 --max_seq_length=128 --train_batch_size=32 \
			--learning_rate=2e-5 --num_train_epochs=${s} --do_lower_case=false \
			--data_dir=${RE_DIR}${i} --output_dir=${OUTPUT_DIR}${i}
	done

	# evaluate the predictions across the 10 folds for this epoch setting
	python ./biocodes/re_eval.py --output_path=${OUTPUT_DIR} --answer_path=${RE_DIR} \
		--fold_number=10 --step=${s} --task_name="aimed_constrastive_learning"
done
```

For the PPI task we use 10-fold cross-validation (as shown above); for DDI and ChemProt this is not needed.
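
To make the expected data layout explicit, here is a small illustrative Python check (not part of the repository) of the fold directories that the fine-tuning loop above reads from (`${RE_DIR}${i}` with `RE_DIR="./REdata/aimed/"` and `i` from 1 to 10):

```python
# Illustrative only: the fine-tuning loop expects one sub-folder per
# cross-validation fold, ./REdata/aimed/1 ... ./REdata/aimed/10.
import os

for fold in range(1, 11):
    fold_dir = os.path.join("./REdata/aimed", str(fold))
    print(fold_dir, "found" if os.path.isdir(fold_dir) else "missing")
```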