This repo contains the source code and dataset for the PhenoTagger.
PhenoTagger is a hybrid method that combines dictionary and deep learning-based methods to recognize Human Phenotype Ontology (HPO) concepts in unstructured biomedical text. It is an ontology-driven method that requires no manually labeled training data, since annotating a large-scale training dataset covering all classes of HPO concepts is expensive and unrealistic. Please refer to our paper for more details:
- Use Transformers instead of keras-bert to load deep learning models.
- Use Tensorflow.keras instead of Keras.
- Add a PubMedBERT model.
- Re-train phenotype models using the newest version of HPO (hp/releases/2023-10-09).
- Add negation and uncertainty detection using NegBio.
(2022-11-24):
- Build an app demo for PhenoTagger.
(2022-05-10):
- Fix some bugs to speed up the processing time.
- Add a Bioformer model (a lightweight BERT for the biomedical domain).
- Re-train phenotype models using the newest version of HPO (hp/releases/2022-04-14)
- Dependency package
- Data and model preparation
- Instructions for tagging text with PhenoTagger
- Instructions for training PhenoTagger
- Performance on HPO GSC+
- Citing PhenoTagger
PhenoTagger has been tested with Python 3.10 on CentOS and uses the following dependencies on both CPU and GPU:
To install all dependencies automatically, run:
$ pip install -r requirements.txt
To run this code, you need to create a model folder named "models" in the PhenoTagger folder, then download the model files (four trained models for HPO concept recognition are released, i.e., CNN, Bioformer, BioBERT, and PubMedBERT) into the model folder.
- First download original files of the pre-trained language models (PLMs): Bioformer, BioBERT, PubMedBERT
- Then download the fine-tuned model files for HPO here
The corpora used in the experiments are provided in /data/corpus.zip. Please unzip the file if you need to use them.
You can use our trained PhenoTagger to identify HPO concepts in biomedical texts with the PhenoTagger_tagging.py file.
The file requires 2 parameters:
- --input, -i, help="the folder with input files"
- --output, -o, help="output folder to save the tagged results"
The input files can be in BioC (XML) or PubTator (tab-delimited text) format (click here to see our format descriptions). There are some examples in the /example/ folder.
Example:
$ python PhenoTagger_tagging.py -i ../example/input/ -o ../example/output/
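If you want to post-process the tagged results programmatically, a small reader for the PubTator format can help. The sketch below is not part of PhenoTagger; the sample document, offsets, and concept type string are illustrative assumptions based on the standard PubTator layout (`PMID|t|title`, `PMID|a|abstract`, then tab-delimited annotation lines).

```python
# Minimal PubTator reader (a sketch, not part of PhenoTagger).
# Assumed layout: "PMID|t|title", "PMID|a|abstract", then annotation lines
# of the form "PMID<TAB>start<TAB>end<TAB>mention<TAB>type<TAB>concept_id".
def parse_pubtator(text):
    docs = {}
    for line in text.strip().splitlines():
        if "|t|" in line or "|a|" in line:
            pmid, section, content = line.split("|", 2)
            docs.setdefault(pmid, {"text": {}, "annotations": []})
            docs[pmid]["text"][section] = content
        elif line.strip():
            pmid, start, end, mention, conc_type, concept_id = line.split("\t")
            docs[pmid]["annotations"].append(
                {"start": int(start), "end": int(end),
                 "mention": mention, "type": conc_type, "id": concept_id})
    return docs

# Illustrative sample (PMID, offsets, and type are made up for this sketch).
sample = """\
12345|t|A case of short stature
12345|a|The patient presented with short stature and seizures.
12345\t10\t23\tshort stature\tPhenotype\tHP:0004322
"""
docs = parse_pubtator(sample)
print(docs["12345"]["annotations"][0]["id"])  # HP:0004322
```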
We also provide some optional parameters in the PhenoTagger_tagging.py file for different user requirements.
para_set={
'model_type':'bioformer', # four deep learning models are provided: cnn, bioformer, biobert, or pubmedbert
'onlyLongest':False, # False: return overlapping concepts; True: only return the longest concept among overlapping concepts
'abbrRecog':True, # False: don't identify abbreviations; True: identify abbreviations
'negation': True, # True: negation detection; False: no negation detection
'ML_Threshold':0.95, # the threshold of the deep learning model
}
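To make the 'ML_Threshold' and 'onlyLongest' options concrete, here is a minimal sketch (not PhenoTagger's actual implementation) of the filtering they describe: predictions below the threshold are dropped, and optionally only the longest span among overlapping mentions is kept. The HPO IDs and offsets in the example are illustrative.

```python
# Conceptual sketch of the 'ML_Threshold' and 'onlyLongest' options
# (not PhenoTagger's actual code).
def filter_concepts(concepts, threshold=0.95, only_longest=False):
    # concepts: list of (start, end, hpo_id, score) tuples
    kept = [c for c in concepts if c[3] >= threshold]
    if not only_longest:
        return kept
    # Sort longest-first; keep a span only if it overlaps nothing kept so far.
    kept.sort(key=lambda c: c[1] - c[0], reverse=True)
    result = []
    for start, end, hpo_id, score in kept:
        if all(end <= s or start >= e for s, e, *_ in result):
            result.append((start, end, hpo_id, score))
    return sorted(result)

# Illustrative predictions: a mention, a nested shorter mention, and a
# low-confidence one (IDs and offsets are made up for this sketch).
preds = [(0, 13, "HP:0004322", 0.99),
         (6, 13, "HP:0001510", 0.97),
         (20, 28, "HP:0001250", 0.60)]
print(filter_concepts(preds, only_longest=True))
# [(0, 13, 'HP:0004322', 0.99)]
```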
The Build_dict.py file requires 3 parameters:
- --input, -i, help="input the ontology .obo file"
- --output, -o, help="the output folder of dictionary"
- --rootnode, -r, help="input the root node of the ontology"
Example:
$ python Build_dict.py -i ../ontology/hp.obo -o ../dict/ -r HP:0000118
After the program is finished, 6 files will be generated in the output folder.
- id_word_map.json
- lable.vocab
- noabb_lemma.dic
- obo.json
- word_id_map.json
- alt_hpoid.json
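The JSON files above can be loaded with the standard library. The sketch below is illustrative only: the flat `{HPO id: term}` structure shown for id_word_map.json is an assumption, and the sample data is written to a temporary file rather than read from a real build output.

```python
# Sketch of loading a generated dictionary file (the flat id -> term
# structure of id_word_map.json is an assumption for illustration).
import json
import os
import tempfile

sample_map = {"HP:0004322": "short stature"}  # assumed structure, toy data
path = os.path.join(tempfile.mkdtemp(), "id_word_map.json")
with open(path, "w") as f:
    json.dump(sample_map, f)

with open(path) as f:
    id_word_map = json.load(f)
print(id_word_map.get("HP:0004322"))  # short stature
```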
The Build_distant_corpus.py file requires 4 parameters:
- --dict, -d, help="the input folder of the ontology dictionary"
- --fileneg, -f, help="the text file used to generate the negatives" (you can use our negative text "mutation_disease.txt")
- --negnum, -n, help="the number of negatives; we suggest using the same number as the positives."
- --output, -o, help="the output folder of the distantly-supervised training dataset"
Example:
$ python Build_distant_corpus.py -d ../dict/ -f ../data/mutation_disease.txt -n 10000 -o ../data/distant_train_data/
After the program is finished, 3 files will be generated in the outpath:
- distant_train.conll (distantly-supervised training data)
- distant_train_pos.conll (distantly-supervised training positives)
- distant_train_neg.conll (distantly-supervised training negatives)
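A CoNLL-style file like these can be read sentence by sentence. The exact column layout of distant_train.conll is an assumption in the sketch below (one `token<TAB>label` pair per line, blank lines separating sentences, BIO-style labels); it is not taken from the repository.

```python
# Sketch of a CoNLL-style reader (the token<TAB>label layout and BIO-style
# labels are assumptions, not the repository's documented format).
def read_conll(lines):
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:                       # blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
        else:
            token, label = line.split("\t")
            current.append((token, label))
    if current:
        sentences.append(current)
    return sentences

# Toy two-sentence example with made-up BIO labels.
sample = ["short\tB-HP:0004322", "stature\tI-HP:0004322", "", "normal\tO"]
print(len(read_conll(sample)))  # 2
```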
The PhenoTagger_training.py file requires 4 parameters:
- --trainfile, -t, help="the training file"
- --devfile, -d, help="the development set file. If no development file is provided, training stops after the specified number of epochs."
- --modeltype, -m, help="the deep learning model type (cnn, biobert, pubmedbert, or bioformer)"
- --output, -o, help="the output folder of the model"
Example:
$ python PhenoTagger_training.py -t ../data/distant_train_data/distant_train.conll -d ../data/corpus/GSC/GSCplus_dev_gold.tsv -m bioformer -o ../models/
After the program is finished, 2 files will be generated in the output folder:
- cnn.h5/biobert.h5 (the trained model)
- cnn_dev_temp.tsv/biobert_dev_temp.tsv (the prediction results of the development set, if you input a development set file)
If you're using PhenoTagger, please cite:
- Ling Luo, Shankai Yan, Po-Ting Lai, Daniel Veltri, Andrew Oler, Sandhya Xirasagar, Rajarshi Ghosh, Morgan Similuk, Peter N Robinson, Zhiyong Lu. PhenoTagger: A Hybrid Method for Phenotype Concept Recognition using Human Phenotype Ontology. Bioinformatics, Volume 37, Issue 13, 1 July 2021, Pages 1884–1890.