MichaelBatskinis95/Graph-DB-4-Hematological-Data

Graph-DB-4-Hematological-Data

A framework built on graph database technologies to facilitate the storage, retrieval, and exploration of hematological and biological graph data.

Workflow

Important Note

Pre-processing of the physiological data (see phys_data_preprocess.ipynb) and extraction of the computationally verified data (see External_sources.ipynb) were performed in Python.

1. Data Collection & Preparation

Final Dataset = Experimental Data + Computationally Verified Data

Experimental Data

Data on the metabolic, proteomic and physiological profiles of 6 G6PD-deficient donors and 1 control donor were retrieved.

Metabolic Data

7 weekly samplings, 295 metabolites

For each metabolite the following information was collected:

  • official name
  • ID in the KEGG database
  • metabolic pathway to which it belongs and
  • abundances in both G6PD-deficient and control donors

Physiological Data

7 weekly samplings, 83 physiological parameters

For each physiological parameter the following information was collected:

  • official name
  • abundances in both G6PD-deficient and control donors

Proteomic Profile

3 weekly pooled samplings, 934 proteins

For each protein the following information was collected:

  • official name
  • related gene
  • accession number (AC) in the UniProtKB/Swiss-Prot database
  • abundances in both G6PD-deficient and control donors

Computationally Verified Data

Data retrieved from the following databases were used for the qualitative enrichment of the Knowledge Graph:

  • STRING: 241 protein interactions between G6PD and other related proteins
  • STITCH: 453 protein-chemical or chemical-chemical interactions
  • Ensembl: 39 records about diseases related to G6PD or proteins closely associated with it
  • Human Protein Atlas (HPA): 27 additional diseases

Data Pre-processing and Curation

The following issues were fixed in this stage:
✅ missing values
✅ entries with insufficient information
✅ duplicate entries from different sources
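
The three curation steps listed above can be sketched in plain Python. This is a minimal illustration, not the notebooks' actual code: the record layout (`name`, `id`, `abundances`, `source`), the sample values, and the choice to drop (rather than impute) records with missing abundances are all assumptions made for the example.

```python
def curate(records):
    """Drop incomplete records and keep one record per identifier."""
    seen_ids = set()
    cleaned = []
    for rec in records:
        # Insufficient information: no name or no identifier to build a node from.
        if not rec.get("name") or not rec.get("id"):
            continue
        # Missing values: any abundance recorded as None (assumption: drop the record).
        if any(value is None for value in rec.get("abundances", [])):
            continue
        # Duplicates from different sources: keep the first record seen per identifier.
        if rec["id"] in seen_ids:
            continue
        seen_ids.add(rec["id"])
        cleaned.append(rec)
    return cleaned


# Hypothetical records; identifiers and abundances are invented.
records = [
    {"name": "met_1", "id": "M001", "abundances": [1.0, 1.2], "source": "experiment"},
    {"name": "met_1", "id": "M001", "abundances": [1.0, 1.2], "source": "STITCH"},
    {"name": "met_2", "id": "M002", "abundances": [0.8, None], "source": "experiment"},
    {"name": "", "id": "M003", "abundances": [0.5, 0.6], "source": "experiment"},
]
cleaned = curate(records)
```

Only the first `met_1` record survives here: the second is a duplicate, `met_2` has a missing value, and the last record has no name.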

2. Graph Database for Bio/Hematological Networks

The knowledge graph was constructed in the Neo4j graph database. It highlights the associations between different bio/hematological parameters related to the G6PD enzyme, which is the key component of the biological issue under investigation.

2.1 Query Requirements

The first step towards the construction of the bio/hematological data networks was to determine user requirements. Those requirements will drive the construction of a knowledge graph that could explain interactions or, better yet, reveal potential associations between different parameters.

First Group of Requirements: Biologically Converged Parameters

  • Inspecting inter- and intra-parameter associations in all possible combinations → Gives insight into interactions within one data type or between different data types

  • Determination of crucial parameters → Using centrality algorithms to identify the most influential nodes (hub nodes) of the case-study system

  • Identification of converged metabolites based on the storage timeline of RBCs → Can be achieved by dividing the seven samplings into three storage-based groups (early, mid, and late storage) and then performing correlation analysis (e.g. using the Pearson Similarity algorithm) to identify the most converged components.


Second Group of Requirements: Data Visualization and Subnetworks Representation

  • Graph representation based on specific properties of the case study system → Utilizing filtering tools to gain insight about specific components

  • Detection of communities → Can be achieved by performing community detection analysis (e.g. using the Louvain method)
  • Focusing on clusters/subnetworks → Refers to the subsequent analysis after the detection of communities.

Third Group of Requirements: Comparative Analysis of Donors’ Metabolic Profile

  • Comparing donors’ metabolic profiles in pairs → Answering this question could highlight either the homogeneity or the heterogeneity of the system, since all donors were tested under the same conditions. It can be achieved by pairwise comparison of the donors' metabolic profiles.

  • Investigating the impact of storage on RBCs’ metabolic profile → The purpose of this query is to gain insight into the effect of storage on RBC vitality and functionality. Comparing the in vivo system of each donor (D0) with the in vitro system (D7 – D42) could reveal the critical storage period at which RBC functionality starts to deteriorate.

2.2 The Graph Data Model

In total, the Bio/Hematological Data Network consists of 933 nodes, divided into 9 general groups, and 87,790 relationships of 15 distinct types. The proposed graph data model consists of the following node types: compounds (divided into 33 subgroups according to the metabolic pathway they belong to), physiological parameters, proteomics, donor, STITCH data, STRING data, Ensembl, disease and G6PD. An infographic of the graph data model, showing which node types interact with which, is shown below:

2.3 Bio/Hematological Data Analysis

After all necessary data had been imported into the network, graph-related algorithms were used to select the most statistically significant parameters of the network. The process consisted of three steps: 1) finding a suitable approach to explore the available data, 2) setting a proper threshold so that the outcome would be accurate enough, and 3) isolating the biologically converged intra- and inter-parameter relationships.

Approach

Two algorithms were applied during the statistical analysis: Pearson Similarity algorithm and Cosine Similarity algorithm.

Pearson Similarity Algorithm

Applied to characterize significant intra- and inter-parameter associations between different datasets, such as:

  • Compound Similarities
  • Metabolites associated with G6PD
  • Physiological Parameter – Compound Similarities
  • Protein Similarities
  • Protein – Compounds Similarities

$$ \text{Pearson Similarity}(A,B) = {cov(A,B) \over {\sigma_{A} \times \sigma_{B}}} = {{\sum_{i=1}^n (A_{i}-\overline A)(B_{i}-\overline B)} \over {\sqrt{\sum_{i=1}^n (A_{i}-\overline A)^2} \times \sqrt{\sum_{i=1}^n (B_{i}-\overline B)^2}}} $$
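
As a concrete reading of the definition (covariance divided by the product of the standard deviations), here is a minimal pure-Python sketch. The function name and the sample abundance vectors are invented for illustration; in the project itself the scores were computed with graph algorithms inside the database, not with this code.

```python
import math


def pearson_similarity(a, b):
    """Pearson correlation between two equal-length abundance vectors."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("vectors must have the same length (at least 2)")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    # Numerator: sum of products of deviations from the means.
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    # Denominator: square root of the product of the summed squared deviations.
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((y - mean_b) ** 2 for y in b)
    if ss_a == 0 or ss_b == 0:
        raise ValueError("a constant vector has no defined correlation")
    return cov / math.sqrt(ss_a * ss_b)


# Made-up abundances of three metabolites over seven weekly samplings.
met_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
met_b = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0]  # rises with met_a
met_c = [7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0]     # falls as met_a rises
```

With these vectors, `pearson_similarity(met_a, met_b)` is 1 (perfect positive correlation) and `pearson_similarity(met_a, met_c)` is -1.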

Cosine Similarity Algorithm

Used for:

  • Comparison of donors’ metabolic profile
  • Effect of storage time on G6PD- donors’ RBCs

$$ \text{Cosine Similarity}(A,B) = {A \cdot B \over {\lVert \mathbf{A} \rVert \times \lVert \mathbf{B} \rVert}} = {{\sum_{i=1}^n A_{i} \times B_{i}} \over {\sqrt{\sum_{i=1}^n A_{i}^2} \times \sqrt{\sum_{i=1}^n B_{i}^2}}} $$
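
The pairwise comparison of donors' metabolic profiles can be sketched as follows. The donor names and three-metabolite profiles are invented; real profiles would hold one abundance per metabolite, and the project computed these scores inside the graph database rather than with this code.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical metabolic profiles (one abundance per metabolite).
profiles = {
    "donor_1": [1.0, 2.0, 3.0],
    "donor_2": [2.0, 4.0, 6.0],  # same proportions as donor_1
    "donor_3": [3.0, 0.0, 0.0],
}

# Score every unordered pair of donors exactly once.
pairwise = {
    (d1, d2): cosine_similarity(profiles[d1], profiles[d2])
    for d1, d2 in combinations(sorted(profiles), 2)
}
```

Because cosine similarity ignores vector magnitude, `donor_1` and `donor_2` score 1 even though their absolute abundances differ. The same function applies to the storage question by comparing a donor's D0 profile with each later sampling.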

Setting the Threshold

Once the Pearson Similarity scores had been computed, the most significant intra- and inter-parameter correlations were filtered. To that end, a threshold was set to distinguish statistically significant associations. The threshold value varied from case to case, depending on the size of the dataset or the number of samplings. This step was applied only where the Pearson Similarity algorithm was used, since Cosine Similarity served only to estimate the percentage of identity between the compared groups. The following figure shows the thresholds for all intra- and inter-parameter associations.

Filtering biologically converged correlations

Applying the threshold mentioned above excluded the least significant associations between different node types from further analysis. However, a stricter approach was needed to filter biologically converged correlations. For this reason, a repeatability score was applied. As its name suggests, the repeatability score counts how many times an event occurs. In our case, the event tested was the correlation between two variables. Therefore, if a pair of variables passed the repeatability score, the relationship between them was considered biologically converged. This process identified the following converged relationships:

  • metabolites related to G6PD (r.s* ≥ 4 (max 7))
  • biologically converged correlations between metabolites (r.s ≥ 4 (max 7))
  • biologically converged relationships between metabolites and physiological parameters (r.s ≥ 25 % of theoretically possible combinations**)

*r.s: repeatability score
**theoretically possible combinations: 7 (samplings for physiological data) × 7 (samplings for metabolic data) = 49
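
One possible reading of the repeatability score is sketched below: count how many of the repeated correlation estimates for a pair reach the threshold, then compare that count with the cut-off. This is an assumption, since the text does not state what constitutes a repetition; the threshold of 0.90 and the scores are invented. The last lines work out what "25 % of 49 combinations" means as an integer count.

```python
import math


def repeatability_score(scores, threshold):
    """Number of repetitions in which the pair's score reaches the threshold."""
    return sum(1 for s in scores if s >= threshold)


def min_score_for_fraction(fraction, possible):
    """Smallest integer count that is at least `fraction` of `possible`."""
    return math.ceil(fraction * possible)


# Seven hypothetical Pearson scores for one metabolite pair (max r.s = 7).
pair_scores = [0.95, 0.91, 0.40, 0.93, 0.97, 0.10, -0.20]
rs = repeatability_score(pair_scores, threshold=0.90)  # threshold is an assumed value
is_converged = rs >= 4

# Metabolite vs physiological parameter: 7 x 7 = 49 possible combinations,
# and 25 % of 49 is 12.25, so at least 13 combinations must pass.
needed = min_score_for_fraction(0.25, 7 * 7)
```

Here four of the seven scores reach the threshold, so the pair just meets the r.s ≥ 4 criterion.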

2.4 The final Knowledge Graph

By assembling the outcomes of Sections 2.1, 2.2 and 2.3, the final knowledge graph can be generated. The bio/hematological data network can be described as a network of two layers. The first layer consists of the pre-processed experimental data along with all correlations mentioned in Section 2.3, while the second layer includes external data sources (nodes, relationships, and properties) that extend the breadth and depth of the knowledge graph by adding more detailed information on proteins and metabolites related, directly or indirectly, to G6PD.

3. Data Exploration for Bio/Hematological Networks

To facilitate exploration of the bio/hematological data, we adopted the GraphXR tool, which provides effective visualization capabilities, especially for users without an IT background. Using GraphXR we applied several graph-related techniques to highlight significant inter- and intra-parameter associations, identify crucial components and discover communities that are formed within different subgraphs.

3.1 Investigating Intra- and Inter- Parameter Associations

The bio/hematological data network was set up to investigate homologous and heterologous correlations between different components and to answer a set of questions related to this biological issue. A first approach to exploration is therefore to inspect specific relationships of the graph, depending on the question we want to answer. A representative case is to collect and display all G6PD-related metabolites along with the compounds they are highly correlated with. To do so, we need to display only the relationship types regarding:

a) G6PD-related components (relationship type: associated with) and

b) biologically converged metabolites (relationship type: bio converged compounds)

Next, since we are interested in metabolites closely associated with G6PD and their first neighbours, we apply neighbour-tracing techniques to select all unnecessary components of the subnetwork and remove them from the graph space. As a result, only G6PD-related components and their closest associates remain on the graph space. Part of the outcome of this exploration analysis is presented below. All nodes are coloured according to the node type to which they belong. Relationships between G6PD (the central node, coloured blue) and metabolites are shown in salmon pink, while biologically converged associations of G6PD-related compounds are shown in blue. Alternatively, the desired graph can be displayed using Cypher queries.
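
The selection logic described above (keep G6PD's "associated with" relationships, then keep only the "bio converged compounds" relationships that touch a G6PD-related metabolite) can be sketched on a small in-memory edge list. This is an illustration of the filtering idea only, not GraphXR's or Neo4j's API; the node names and the third relationship type are invented.

```python
def g6pd_subgraph(edges, center="G6PD"):
    """Keep G6PD-related metabolites plus their converged first neighbours."""
    # Step 1: metabolites directly associated with the central node.
    related = set()
    for src, rel, dst in edges:
        if rel == "associated with" and center in (src, dst):
            related.add(dst if src == center else src)

    # Step 2: keep the association edges and any converged edge
    # that has at least one end in the related set.
    kept = []
    for src, rel, dst in edges:
        if rel == "associated with" and center in (src, dst):
            kept.append((src, rel, dst))
        elif rel == "bio converged compounds" and (src in related or dst in related):
            kept.append((src, rel, dst))
    return kept


# Hypothetical edges as (source, relationship type, target) triples.
edges = [
    ("G6PD", "associated with", "met_1"),
    ("G6PD", "associated with", "met_2"),
    ("met_1", "bio converged compounds", "met_3"),
    ("met_4", "bio converged compounds", "met_5"),  # not adjacent to a G6PD-related metabolite
    ("met_2", "some other relationship", "met_6"),  # relationship type not displayed
]
subgraph = g6pd_subgraph(edges)
```

The result keeps the first three edges and drops the last two, which mirrors removing unnecessary components from the graph space.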

3.2 Identifying Crucial Parameters

To highlight the most influential components of any case-study network displayed on the graph space, several centrality measures of the network need to be estimated, so that any finding derived from them is more trustworthy. A case of major importance is to identify the most crucial components of the metabolic profile, or of its interconnections with the physiological and proteomic profiles, of G6PD- donors. To characterize such components, Betweenness Centrality (BC) and Closeness Centrality (CC) were used as a guide. The resulting BC and CC values of the case-study network were normalized to the range [0,1], graph entities with insignificant values (BC < 0.10 and CC < 0.10) were excluded, and the most significant ones were visualized in Tableau. The following figure presents the normalized BC and CC values of the most significant metabolites. Metabolites are considered crucial for the network when they have relatively high BC and CC scores. Such components can be characterized as central nodes of the biologically converged compounds, indicating that they might play some role in the metabolic profile of G6PD- donors.

The above tree map presents the most biologically converged metabolites that are related to G6PD deficiency, according to their BC and CC scores. The colour scale follows the CC score: the darker the colour of a component, the higher its CC score. The size of each box follows the BC score: the bigger the box of a component, the higher its BC score.
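
The normalization and exclusion step applied to the BC and CC values before visualization can be sketched as follows. Two assumptions are made here: min-max scaling is used to reach the [0,1] range (the exact normalization is not stated), and a node is excluded only when both normalized scores fall below 0.10, which is a literal reading of "BC < 0.10 and CC < 0.10". The raw centrality values are invented.

```python
def min_max(values):
    """Rescale a {node: value} mapping linearly to the range [0, 1]."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        return {node: 0.0 for node in values}
    return {node: (v - lo) / (hi - lo) for node, v in values.items()}


def significant_nodes(bc, cc, cutoff=0.10):
    """Drop nodes whose normalized BC and CC are both below the cutoff."""
    bc_n, cc_n = min_max(bc), min_max(cc)
    return {
        node: (bc_n[node], cc_n[node])
        for node in bc
        if not (bc_n[node] < cutoff and cc_n[node] < cutoff)
    }


# Hypothetical raw centrality scores for three metabolites.
bc = {"met_1": 120.0, "met_2": 5.0, "met_3": 0.0}
cc = {"met_1": 0.50, "met_2": 0.40, "met_3": 0.30}
kept = significant_nodes(bc, cc)
```

In this example `met_3` is removed because both of its normalized scores are 0, while `met_2` is kept: its BC is low but its normalized CC is well above the cutoff.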

Another informative example of such an exploration analysis is the characterization of crucial parameters amongst metabolites and physiological parameters. Since computing the centralities, pre-processing, and preparing for visualization via heatmaps are the same as in the first case, we discuss only the outcome of the analysis. The following figure presents the outcome of the exploration analysis conducted to characterize the most significant G6PD-related components. It displays the most biologically converged components as packed bubbles, according to their BC and CC scores. The colour scale follows the CC score: the darker the colour of a component, the higher its CC score. The size of each circle follows the BC score: the bigger the circle of a component, the higher its BC score.

Even though most of the displayed parameters have similar closeness values, some of them stand out due to their high betweenness. More specifically, mechanical fragility (MFI), osmotic fragility (MCF and MCF_37) and antioxidant capacity (TAC and TAC_UA) of stored RBCs appear to be the parameters most central to the network. This finding reflects some of the primary characteristics of RBCs, which relate to their resistance to mechanical and oxidative stress. At any time, these markers can give insight into RBC integrity, since high levels of MFI or MCF are related to RBC aging and, subsequently, hemolysis.

3.3 Detecting Communities

The last case study concerns the detection of communities in a graph and consists of two parts. The first part covers the process followed to detect the communities formed in the subnetwork of biologically converged components and to work with some of them. The Louvain method is used to identify the communities. The following figure presents a detailed walkthrough for detecting communities using GraphXR.

The second part covers the exploration analysis that can take place once a cluster has been selected. In this case we decided to work with the cluster formed around the Physiological Parameter “MFI_37”, which, as described in Section 3.2, is one of the crucial parameters for the biological issue that we studied. Figure 10 presents the cluster that was detected for the graph entity “MFI_37”, with emphasis on its first neighbours. Following such exploratory approaches, one can gain insight into potential effects between connected components.
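
As background to the Louvain method mentioned above: it searches for a partition of the graph that maximizes modularity, a score that rewards communities with more internal edges than expected by chance. The sketch below only computes that score for a given partition of an undirected, unweighted graph without self-loops; it is not the Louvain search itself and not GraphXR's implementation. The toy graph (two triangles joined by one edge) is invented.

```python
def modularity(edges, community_of):
    """Modularity Q of a partition of an undirected, unweighted graph."""
    m = len(edges)
    internal = {}  # number of edges with both ends inside each community
    degree = {}    # summed node degree of each community
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    # Q = sum over communities of (internal edge share - expected share).
    return sum(
        internal.get(c, 0) / m - (degree[c] / (2 * m)) ** 2
        for c in degree
    )


# Two triangles (a-b-c and d-e-f) joined by the single edge c-d.
edges = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("d", "e"), ("e", "f"), ("d", "f"),
    ("c", "d"),
]
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {node: 0 for node in "abcdef"}

q_two = modularity(edges, two_groups)
q_one = modularity(edges, one_group)
```

Splitting the graph into its two triangles gives a higher modularity than treating all six nodes as one community, which is the kind of partition a modularity-maximizing method is expected to prefer.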

About

Cypher Queries regarding Hematological Markers Network Markers DB in Neo4j
