This repo contains the source code and dataset for the PhenoTagger.
PhenoTagger is a hybrid method that combines dictionary and deep learning-based methods to recognize Human Phenotype Ontology (HPO) concepts in unstructured biomedical text. It is an ontology-driven method that requires no manually labeled training data, since annotating a large-scale training dataset covering all classes of HPO concepts is expensive and unrealistic. Please refer to our paper for more details:
- Use Transformers instead of keras-bert to load deep learning models.
- Use Tensorflow.keras instead of Keras.
- Add a PubMedBERT model.
- Re-train phenotype models using the newest version of HPO (hp/releases/2023-10-09).
- Add negation and uncertainty detection using NegBio.
(2022-11-24):
- Build an app demo for PhenoTagger.
(2022-05-10):
- Fix some bugs to speed up the processing time.
- Add a Bioformer model (a lightweight BERT for the biomedical domain).
- Re-train phenotype models using the newest version of HPO (hp/releases/2022-04-14)
- Dependency package
- Data and model preparation
- Instructions for tagging text with PhenoTagger
- Instructions for training PhenoTagger
- Performance on HPO GSC+
- Citing PhenoTagger
PhenoTagger has been tested with Python 3.10 on CentOS and uses the following dependencies on both CPU and GPU:
To install all dependencies automatically, run:
$ pip install -r requirements.txt
To run this code, you need to create a model folder named "models" in the PhenoTagger folder, then download the model files (four trained models for HPO concept recognition are released, i.e., CNN, Bioformer, BioBERT, and PubMedBERT) into the model folder.
- First download original files of the pre-trained language models (PLMs): Bioformer, BioBERT, PubMedBERT
- Then download the fine-tuned model files for HPO here
The corpora used in the experiments are provided in /data/corpus.zip. Please unzip the file if you need to use them.
You can use our trained PhenoTagger to identify HPO concepts in biomedical texts with the PhenoTagger_tagging.py file.
The file requires 2 parameters:
- --input, -i, help="the folder with input files"
- --output, -o, help="output folder to save the tagged results"
The input files can be in BioC (XML) or PubTator (tab-delimited text) format (click here to see our format descriptions). There are some examples in the /example/ folder.
Example:
$ python PhenoTagger_tagging.py -i ../example/input/ -o ../example/output/
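If you want to post-process the tagged results programmatically, a small reader for the PubTator format can help. The sketch below is not part of PhenoTagger; the sample document, offsets, and concept type string are illustrative assumptions based on the standard PubTator layout (`PMID|t|title`, `PMID|a|abstract`, then tab-delimited annotation lines).

```python
# Minimal PubTator reader (a sketch, not part of PhenoTagger).
# Assumed layout: "PMID|t|title", "PMID|a|abstract", then annotation lines
# of the form "PMID<TAB>start<TAB>end<TAB>mention<TAB>type<TAB>concept_id".
def parse_pubtator(text):
    docs = {}
    for line in text.strip().splitlines():
        if "|t|" in line or "|a|" in line:
            pmid, section, content = line.split("|", 2)
            docs.setdefault(pmid, {"text": {}, "annotations": []})
            docs[pmid]["text"][section] = content
        elif line.strip():
            pmid, start, end, mention, conc_type, concept_id = line.split("\t")
            docs[pmid]["annotations"].append(
                {"start": int(start), "end": int(end),
                 "mention": mention, "type": conc_type, "id": concept_id})
    return docs

# Illustrative sample (PMID, offsets, and type are made up for this sketch).
sample = """\
12345|t|A case of short stature
12345|a|The patient presented with short stature and seizures.
12345\t10\t23\tshort stature\tPhenotype\tHP:0004322
"""
docs = parse_pubtator(sample)
print(docs["12345"]["annotations"][0]["id"])  # HP:0004322
```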
We also provide some optional parameters in the PhenoTagger_tagging.py file for different user requirements.
para_set={
'model_type':'bioformer', # four deep learning models are provided: cnn, bioformer, biobert, or pubmedbert
'onlyLongest':False, # False: return overlapping concepts; True: only return the longest concept among overlapping concepts
'abbrRecog':True, # False: don't identify abbreviations; True: identify abbreviations
'negation': True, # True: negation detection; False: no negation detection
'ML_Threshold':0.95, # the threshold of the deep learning model
}
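To make the 'ML_Threshold' and 'onlyLongest' options concrete, here is a minimal sketch (not PhenoTagger's actual implementation) of the filtering they describe: predictions below the threshold are dropped, and optionally only the longest span among overlapping mentions is kept. The HPO IDs and offsets in the example are illustrative.

```python
# Conceptual sketch of the 'ML_Threshold' and 'onlyLongest' options
# (not PhenoTagger's actual code).
def filter_concepts(concepts, threshold=0.95, only_longest=False):
    # concepts: list of (start, end, hpo_id, score) tuples
    kept = [c for c in concepts if c[3] >= threshold]
    if not only_longest:
        return kept
    # Sort longest-first; keep a span only if it overlaps nothing kept so far.
    kept.sort(key=lambda c: c[1] - c[0], reverse=True)
    result = []
    for start, end, hpo_id, score in kept:
        if all(end <= s or start >= e for s, e, *_ in result):
            result.append((start, end, hpo_id, score))
    return sorted(result)

# Illustrative predictions: a mention, a nested shorter mention, and a
# low-confidence one (IDs and offsets are made up for this sketch).
preds = [(0, 13, "HP:0004322", 0.99),
         (6, 13, "HP:0001510", 0.97),
         (20, 28, "HP:0001250", 0.60)]
print(filter_concepts(preds, only_longest=True))
# [(0, 13, 'HP:0004322', 0.99)]
```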
The Build_dict.py file requires 3 parameters:
- --input, -i, help="input the ontology .obo file"
- --output, -o, help="the output folder of dictionary"
- --rootnode, -r, help="input the root node of the ontology"
Example:
$ python Build_dict.py -i ../ontology/hp.obo -o ../dict/ -r HP:0000118
After the program is finished, 6 files will be generated in the output folder.
- id_word_map.json
- lable.vocab
- noabb_lemma.dic
- obo.json
- word_id_map.json
- alt_hpoid.json
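The JSON files above can be loaded with the standard library. The sketch below is illustrative only: the flat `{HPO id: term}` structure shown for id_word_map.json is an assumption, and the sample data is written to a temporary file rather than read from a real build output.

```python
# Sketch of loading a generated dictionary file (the flat id -> term
# structure of id_word_map.json is an assumption for illustration).
import json
import os
import tempfile

sample_map = {"HP:0004322": "short stature"}  # assumed structure, toy data
path = os.path.join(tempfile.mkdtemp(), "id_word_map.json")
with open(path, "w") as f:
    json.dump(sample_map, f)

with open(path) as f:
    id_word_map = json.load(f)
print(id_word_map.get("HP:0004322"))  # short stature
```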
The Build_distant_corpus.py file requires 4 parameters:
- --dict, -d, help="the input folder of the ontology dictionary"
- --fileneg, -f, help="the text file used to generate the negatives" (you can use our negative text "mutation_disease.txt")
- --negnum, -n, help="the number of negatives; we suggest using the same number as the positives."
- --output, -o, help="the output folder of the distantly-supervised training dataset"
Example:
$ python Build_distant_corpus.py -d ../dict/ -f ../data/mutation_disease.txt -n 10000 -o ../data/distant_train_data/
After the program is finished, 3 files will be generated in the outpath:
- distant_train.conll (distantly-supervised training data)
- distant_train_pos.conll (distantly-supervised training positives)
- distant_train_neg.conll (distantly-supervised training negatives)
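A CoNLL-style file like these can be read sentence by sentence. The exact column layout of distant_train.conll is an assumption in the sketch below (one `token<TAB>label` pair per line, blank lines separating sentences, BIO-style labels); it is not taken from the repository.

```python
# Sketch of a CoNLL-style reader (the token<TAB>label layout and BIO-style
# labels are assumptions, not the repository's documented format).
def read_conll(lines):
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:                       # blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
        else:
            token, label = line.split("\t")
            current.append((token, label))
    if current:
        sentences.append(current)
    return sentences

# Toy two-sentence example with made-up BIO labels.
sample = ["short\tB-HP:0004322", "stature\tI-HP:0004322", "", "normal\tO"]
print(len(read_conll(sample)))  # 2
```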
The PhenoTagger_training.py file requires 4 parameters:
- --trainfile, -t, help="the training file"
- --devfile, -d, help="the development set file. If no development file is provided, training stops after the specified number of epochs."
- --modeltype, -m, help="the deep learning model type (cnn, biobert, pubmedbert, or bioformer)"
- --output, -o, help="the output folder of the model"
Example:
$ python PhenoTagger_training.py -t ../data/distant_train_data/distant_train.conll -d ../data/corpus/GSC/GSCplus_dev_gold.tsv -m bioformer -o ../models/
After the program is finished, 2 files will be generated in the output folder:
- cnn.h5/biobert.h5 (the trained model)
- cnn_dev_temp.tsv/biobert_dev_temp.tsv (the prediction results of the development set, if you input a development set file)
If you're using PhenoTagger, please cite:
- Ling Luo, Shankai Yan, Po-Ting Lai, Daniel Veltri, Andrew Oler, Sandhya Xirasagar, Rajarshi Ghosh, Morgan Similuk, Peter N Robinson, Zhiyong Lu. PhenoTagger: A Hybrid Method for Phenotype Concept Recognition using Human Phenotype Ontology. Bioinformatics, Volume 37, Issue 13, 1 July 2021, Pages 1884–1890.