
DengueDrugRep


Drug repurposing for dengue using a biomedical knowledge graph and graph neural networks


About the project

Dengue is a viral infection transmitted to humans through the bite of Aedes mosquitoes. It is a neglected tropical disease that mainly affects poor populations without access to safe water, sanitation, and high-quality healthcare. Currently, there is no specific treatment for dengue, and care focuses on relieving symptoms. Therefore, there is an urgent need to find new drugs to treat this disease.

The goal of this project is to predict new repurposed drugs for dengue using a biomedical knowledge graph and graph neural networks. A knowledge graph (KG) is a heterogeneous network with different types of nodes and edges that incorporate semantic information. A KG is composed of a set of triplets (subject, predicate, object) that represent relationships between entities. For example, a drug-disease triplet represents the relationship between a drug (subject) and a disease (object) through a predicate (e.g., treats, causes, etc.). The advantage of using a KG is that it allows the integration of different types of data from different sources.
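As a minimal illustration of this data structure, a KG can be represented in Python as a collection of (subject, predicate, object) tuples; the entity and predicate names below are made up for illustration only:

# A toy knowledge graph expressed as (subject, predicate, object) triplets.
# The entity and predicate names are made up for illustration only.
triplets = [
    ("Compound::drug_A", "treats", "Disease::dengue"),
    ("Compound::drug_A", "binds", "Gene::gene_X"),
    ("Gene::gene_X", "associated_with", "Disease::dengue"),
]

# Each triplet links a subject entity to an object entity through a predicate,
# e.g., a drug (subject) "treats" (predicate) a disease (object).
for subject, predicate, obj in triplets:
    print(f"{subject} --[{predicate}]--> {obj}")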

Graph neural networks (GNNs) are a class of neural networks that can learn from graph data. GNNs have been used to solve different tasks in KGs, such as node classification, link prediction, and entity alignment. The drug repurposing problem can be formulated as a link prediction task in a KG. The goal is to predict new drug-disease associations for Dengue.

The following figure shows the general workflow of this project:


Figure 1. DengueDrugRep workflow.

Dataset

The DRKG is a large-scale biomedical KG that integrates information from six existing databases: DrugBank, Hetionet, the Global Network of Biomedical Relationships (GNBR), STRING, IntAct, and DGIdb. This KG contains 97,238 nodes belonging to 13 entity types (e.g., drugs, diseases, genes, etc.) and 5,874,257 triplets belonging to 107 edge types. The DRKG also contains 24,313 compounds from 17 different databases (the list of database names is available in the Names_datasources_compounds_DRKG.csv file).

The following figure shows a schematic representation of the DRKG:


Figure 2. Interactions in the DRKG. The number next to an edge indicates the number of relation-types for that entity-pair in the KG. Obtained from [2].

The PyKEEN library implements the DRKG as part of its datasets, so it is possible to load the DRKG directly from the library.

The DRKG was split into training, validation, and test sets. The training set contains 4,699,405 triplets, the validation set contains 587,426 triplets, and the test set contains 587,426 triplets. The training partition was used to train the models, and the validation partition was used to evaluate the models. The test dataset was used to make predictions on unseen data.
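As an illustration, the following sketch loads the DRKG through PyKEEN and inspects its pre-defined splits (the attribute names follow PyKEEN's Dataset API):

# Load the DRKG from PyKEEN (downloaded on first use) with its default
# training/validation/testing split.
from pykeen.datasets import DRKG

dataset = DRKG()

print(f"Entities:  {dataset.num_entities}")
print(f"Relations: {dataset.num_relations}")
print(f"Training triplets:   {dataset.training.num_triples}")
print(f"Validation triplets: {dataset.validation.num_triples}")
print(f"Testing triplets:    {dataset.testing.num_triples}")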

Exploratory data analysis

In this project, I focused on the drug-disease relationships in the DRKG. So, the first step was to identify the predicates that represent these relationships, obtaining the following list:

Drug-disease predicates
DRUGBANK::treats::Compound:Disease
GNBR::C::Compound:Disease
GNBR::J::Compound:Disease
GNBR::Mp::Compound:Disease
GNBR::Pa::Compound:Disease
GNBR::Pr::Compound:Disease
GNBR::Sa::Compound:Disease
GNBR::T::Compound:Disease
Hetionet::CpD::Compound:Disease
Hetionet::CtD::Compound:Disease

More details about the predicates, their provenance and meaning are available in the relation_glossary.tsv file and the DRKG GitHub repository.
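As an illustrative sketch of this filtering step (assuming the raw DRKG distribution file drkg.tsv, a headerless TSV with head/relation/tail columns, is available locally):

import pandas as pd

# Load the raw DRKG triplets; the file name and column layout are assumptions
# about the DRKG distribution (a headerless TSV of head, relation, tail).
triples = pd.read_csv("drkg.tsv", sep="\t", header=None,
                      names=["head", "relation", "tail"])

# Keep only the predicates that connect compounds to diseases.
drug_disease = triples[triples["relation"].str.endswith("Compound:Disease")]
print(drug_disease["relation"].value_counts())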

Next, I explored the number of compounds per database in the DRKG. The following figure shows the results:


Figure 3. Distribution of compounds per database in the DRKG.

The code for this part is available in the Python script EDA_DRKG_compounds_names.py.
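As a rough sketch of how such a count could be reproduced (the parsing below assumes that the source database can be read from the prefix of each compound identifier, e.g., Compound::DB00001 for DrugBank; the ID format is an assumption, and the actual logic lives in the script above):

import pandas as pd

# Collect all compound nodes from the head and tail columns of the DRKG TSV
# (same assumed drkg.tsv file as above) and count them per source database.
triples = pd.read_csv("drkg.tsv", sep="\t", header=None,
                      names=["head", "relation", "tail"])
nodes = pd.concat([triples["head"], triples["tail"]]).drop_duplicates()
compounds = nodes[nodes.str.startswith("Compound::")]

# Assumption: the database of origin is encoded in the identifier prefix,
# e.g., "Compound::DB00001" (DrugBank) vs. "Compound::CHEMBL25" (ChEMBL).
sources = compounds.str.extract(r"Compound::([A-Za-z]+)")[0]
print(sources.value_counts())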

Graph neural network models

In general, GNNs represent the entities and relationships of a KG as vectors in a low-dimensional space (embeddings). These vectors are then scored to predict new triplets; the scoring function can be based on distance or similarity measures, depending on the type of GNN. During training, a loss function measures the difference between the predicted and the true triplets, and the goal is to minimize it. In addition, a negative generator creates false triplets to train the model by replacing the subject, predicate, or object of a true triplet with a random entity of the same type.
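As a concrete example of a scoring function, DistMult (one of the models used in this project) scores a triplet through a simple multiplicative interaction between the head, relation, and tail embeddings; a minimal NumPy sketch with toy vectors:

import numpy as np

def distmult_score(head, relation, tail):
    # DistMult score: sum of the element-wise product of the head, relation,
    # and tail embeddings. Higher scores indicate more plausible triplets.
    return float(np.sum(head * relation * tail))

rng = np.random.default_rng(0)
dim = 8  # toy embedding dimension
h, r, t = rng.normal(size=dim), rng.normal(size=dim), rng.normal(size=dim)
print(distmult_score(h, r, t))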

The following figure illustrates the general structure of a knowledge graph neural network (KGNN):


Figure 4. Anatomy of a Knowledge Graph Neural Network. Obtained from [1].

For this project, four GNN algorithms, namely PairRE, DistMult, ERMLP, and TransR, were trained to predict new drug-disease associations using the Drug Repurposing Knowledge Graph (DRKG). These algorithms are implemented in the PyKEEN library. The models were trained using the Margin Ranking Loss function and a random seed of 1235. The remaining hyperparameters were left at the library's default values.

First, the models were trained for 50 epochs with a general evaluation procedure that used all the triplets in the DRKG; in this way, the evaluation results reflected the link prediction performance across all entity pairs in the KG. Next, new models were trained for 10 epochs with a drug repurposing evaluation procedure that used only the triplets involving drugs and diseases, which showed the link prediction performance for the task of predicting new drug-disease associations.
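A minimal sketch of how one of these models could be trained with the PyKEEN pipeline under the settings described above (shown here for PairRE and the 50-epoch run; the exact configuration used in the notebook may differ):

from pykeen.pipeline import pipeline

# Train one of the four models (here PairRE) on the DRKG with the
# Margin Ranking Loss and the random seed reported above; all other
# hyperparameters are left at PyKEEN's defaults.
result = pipeline(
    dataset="DRKG",
    model="PairRE",
    loss="marginranking",
    random_seed=1235,
    training_kwargs=dict(num_epochs=50),
)

# Persist the trained model and its metadata, e.g., into the Models/ folder.
result.save_to_directory("Models/PairRE_DRKG")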

You can find the code for this part in the Jupyter notebook Training_KGNN_models_Pykeen.ipynb.

The trained models are available through Zenodo with the following DOI: 10.5281/zenodo.10010151.

Evaluation

The KGNN models were evaluated: a) intrinsically, within the scope of the knowledge graph and its defined triplets, and b) externally, against a ground truth (drugs in clinical trials to treat dengue) to understand their predictive power over real-world information.

Before running the evaluation scripts, you should download the trained models from Zenodo and save them in a folder called Models/.

Internal evaluation

Two standard rank-based metrics were used to measure each KGNN model’s intrinsic performance on link prediction:

  • Adjusted Mean Rank (AMR): the ratio of the Mean Rank to the Expected Mean Rank, assessing a model’s performance independently of the underlying set size. It lies on the open interval (0, 2), where lower is better.
  • Hits@k: the fraction of times the correct or “true” entity appears among the top-k entities in the ranked list. Its value lies between 0 and 1, and larger values indicate a better model. For this project, I estimated hits@1, hits@3, hits@5, and hits@10.

All the internal evaluation metrics were calculated using the PyKEEN library. I reported the optimistic rank values for both the tail and head entities, which assume that, when multiple entities have the same score, the true entity occupies the first position among them. More details about how the evaluation of KGNN models works in PyKEEN can be found here.
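As a hedged sketch of how these rank-based metrics can be computed with PyKEEN's evaluator (the model file path below is hypothetical; it assumes a model trained and saved as in the training sketch above):

import torch
from pykeen.datasets import DRKG
from pykeen.evaluation import RankBasedEvaluator

dataset = DRKG()

# Hypothetical path: a model previously trained and saved with PyKEEN.
model = torch.load("Models/PairRE_DRKG/trained_model.pkl")

evaluator = RankBasedEvaluator()
results = evaluator.evaluate(
    model=model,
    mapped_triples=dataset.testing.mapped_triples,
    # Filter known true triplets from training/validation when ranking.
    additional_filter_triples=[
        dataset.training.mapped_triples,
        dataset.validation.mapped_triples,
    ],
)

print(results.get_metric("adjusted_mean_rank"))
print(results.get_metric("hits@10"))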

Check out the code for this part of the project in the Python script Int_performance_evaluation_KGNNs.py.

The results for this part are available in the Internal_evaluation folder.

External evaluation

To validate the KGNN models externally, I compared the ranked compound list predicted by each model against the drugs in clinical trials to treat dengue defined in the ground truth, using the following metrics:

  • First hit: the ranking position at which the compounds proposed by a KGNN model first match a compound from the ground truth database.
  • Median hit: the ranking position at which the compounds proposed by a KGNN model have matched 50% of the compounds from the ground truth database.
  • Last hit: the ranking position at which the compounds proposed by a KGNN model have matched all the compounds from the ground truth database.

For all these metrics, smaller values are better: a model with lower first, median, or last hit values than another matches the real-world compounds within fewer predictions.
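These three metrics can be computed directly from a ranked prediction list; a minimal sketch with hypothetical compound identifiers:

import statistics

# Ranked compound predictions from a KGNN model (best first) and the
# ground-truth drugs from clinical trials; the identifiers are hypothetical.
ranked_predictions = ["drug_D", "drug_A", "drug_F", "drug_C", "drug_B", "drug_E"]
ground_truth = {"drug_A", "drug_B", "drug_C"}

# 1-based ranking positions at which ground-truth drugs are recovered.
hit_positions = sorted(
    rank for rank, drug in enumerate(ranked_predictions, start=1)
    if drug in ground_truth
)

first_hit = hit_positions[0]                    # first ground-truth drug found
median_hit = statistics.median(hit_positions)   # ~50% of ground-truth drugs found
last_hit = hit_positions[-1]                    # all ground-truth drugs found
print(first_hit, median_hit, last_hit)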

The ground truth database was obtained from the ClinicalTrials.gov website. I searched for clinical trials that use drugs to treat dengue and found 21 clinical trials involving 16 drugs. I also looked up the IDs of these drugs in the 17 compound databases of the DRKG using the ChEMBL API and manual validation. The list of drugs in the ground truth database and their IDs in the DRKG compound databases is available in the dengue_validated_drugs_clin.csv file.

You can find the code for this part in the Python script External_performance_evaluation_KGNNs.py.

The results for this part are available in the External_evaluation and CompoundDisease_predictions folders.

How to set up the environment to run the code?

I used Pipenv to create a Python virtual environment, which allows the management of Python libraries and their dependencies. Each Pipenv virtual environment has a Pipfile with the names and versions of the libraries installed in the environment, and a Pipfile.lock, a JSON file that contains the versions of those libraries and their dependencies.

To create a Python virtual environment with libraries and dependencies required for this project, you should clone this GitHub repository, open a terminal, move to the folder containing this repository, and run the following commands:

# Install pipenv
$ pip install pipenv

# Create the Python virtual environment 
$ pipenv install

# Activate the Python virtual environment 
$ pipenv shell

You can find a detailed guide on how to use pipenv here.

Alternatively, you can create a conda virtual environment with the required libraries using the requirements.txt file. To do this, you should clone this GitHub repository, open a terminal, move to the folder containing this repository, and run the following commands:

# Create the conda virtual environment
$ conda env create --name denguedrugrep --file requirements.txt

# Activate the conda virtual environment
$ conda activate denguedrugrep

Structure of the repository

The main files and directories of this repository are:

File Description
Data/ Folder with a summary of the entities and relationships in the DRKG and a CSV file of the drugs in clinical trials to treat dengue
Scripts/ Folder with the Python scripts to train and evaluate the KGNN models
Results/ Folder to save performance metrics and other outputs of the KGNN models
img/ Folder with images and GIFs
DengueDrugRep.pdf Presentation with a detailed explanation of the project
Pipfile File with the names and versions of the packages installed in the Pipenv virtual environment
requirements.txt File with the names and versions of the packages required to create the conda virtual environment
Pipfile.lock JSON file with the versions of the packages and the dependencies required by each package

Credits

Further details

More details about the biological background of the project, the interpretation of the results, and ideas for further work are available in this presentation.

Contact

If you have comments or suggestions about this project, you can open an issue in this repository, or email me at sebasar1245@gamil.com.
