IntersectOmics

A method to analyze multi-omics time series (or multiple-condition) datasets and uncover groups of biomolecules that behave similarly.

To combine the different omics layers, we need to convert each dataset to a non-parametric space. In other words, we need to remove the "memory" of the measuring process of each data type. To that end, we compare every pair of biomolecules within each data type and construct a graph from the results. The similarity metric is used as the edge weight, and community analysis on those weights finds clusters of biomolecules that behave similarly.

Multi-omics data

We will use a dataset from the following paper, which measured the transcriptomics and proteomics of springtail earth worm over several days after exposure to an insecticide.

The time series data has multiple omics layers and three replicates for each time point. Note that the data needs to be in the form of a table, where the index holds the names of the genes, proteins, or metabolites (biomolecules) and each column represents a sample, labeled with its associated metadata.
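As an illustration of the table layout described above, here is a minimal sketch using pandas. The gene names, the random values, and the choice of a two-level column index (`day`, `replicate`) are assumptions made for this example only; the package may expect the sample metadata to be encoded differently.

```python
import numpy as np
import pandas as pd

# Hypothetical layout: 4 time points with 3 replicates each.
timepoints = [0, 1, 2, 3]
replicates = [1, 2, 3]
columns = pd.MultiIndex.from_product(
    [timepoints, replicates], names=["day", "replicate"]
)

# Hypothetical biomolecule names; the values are random placeholders.
genes = ["gene_a", "gene_b", "gene_c"]
rng = np.random.default_rng(0)
transcriptomics = pd.DataFrame(
    rng.normal(loc=10.0, scale=1.0, size=(len(genes), len(columns))),
    index=pd.Index(genes, name="biomolecule"),
    columns=columns,
)

# The three replicate measurements of one biomolecule at one time point.
day0_gene_a = transcriptomics.loc["gene_a"].xs(0, level="day")
print(transcriptomics.shape)
print(day0_gene_a)
```

Keeping the time point and replicate in the column labels makes it straightforward to pull out all replicates of a biomolecule at a given time point, which is what the replicate-aware correlation needs.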

Correlation with Replicates

When performing correlation analysis with replicates, you are usually forced to take the mean of the samples. By doing so, you lose information about the variability of the samples, which may lead to false negatives. We propose a method that is more robust, but more computationally intensive: bootstrapping random values drawn from distributions fitted to the replicates.

To this end, this package enables the user to perform bootstrap analysis on two vectors that contain known replicates. The process may be summarized as follows:

Given two vectors with replicates, fit a normal distribution at each time point (see plot above)
Loop n times, each time sampling a single value at each time point
Calculate the correlation between the two sampled vectors
Take the mean of the correlations and combine the p-values, correcting for multiple testing

The reported p-value is combined using the Pearson method. See the scipy documentation for combine_pvalues.
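The steps above can be sketched as follows. This is a minimal illustration, not the package's implementation: the function name `bootstrap_correlation`, its parameters, and the input shape (time points by replicates) are assumptions, and the clipping of p-values is a numerical safeguard added for this example. It only combines the p-values with `scipy.stats.combine_pvalues`; any further multiple-testing correction across biomolecule pairs is left out.

```python
import numpy as np
from scipy import stats


def bootstrap_correlation(x, y, n_boot=200, seed=0):
    """Hypothetical helper. x and y have shape (n_timepoints, n_replicates),
    with at least two replicates per time point."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Step 1: fit a normal distribution (mean, sample std) at each time point.
    x_mu, x_sd = x.mean(axis=1), x.std(axis=1, ddof=1)
    y_mu, y_sd = y.mean(axis=1), y.std(axis=1, ddof=1)

    rhos = np.empty(n_boot)
    pvals = np.empty(n_boot)
    for i in range(n_boot):
        # Step 2: sample a single value at each time point.
        x_sample = rng.normal(x_mu, x_sd)
        y_sample = rng.normal(y_mu, y_sd)
        # Step 3: correlate the two sampled vectors (Spearman here).
        rhos[i], pvals[i] = stats.spearmanr(x_sample, y_sample)

    # Step 4: average the correlations and combine the p-values.
    # Clipping avoids log(0) inside the combination (an assumption of this sketch).
    pvals = np.clip(pvals, 1e-12, 1.0 - 1e-12)
    _, combined_p = stats.combine_pvalues(pvals, method="pearson")
    return rhos.mean(), combined_p


# Two made-up profiles: 6 time points, 3 replicates, same increasing trend.
offsets = np.array([[-0.1, 0.0, 0.1]])
x = np.arange(6)[:, None] * 10.0 + offsets
y = np.arange(6)[:, None] * 5.0 + offsets
mean_rho, combined_p = bootstrap_correlation(x, y)
print(mean_rho, combined_p)
```

Note that the exact combined p-value depends on the installed scipy version, so the value printed here should be checked against the documentation of that version.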

TODO: As of now, we use a normal distribution, but more appropriate distributions should be used depending on the data type. For example, RNA-seq should use a Poisson distribution instead. To that end, we should run R packages to process the data and extract better distribution parameters.

Below are the supported correlation types:

Spearman

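The default. Spearman correlation is rank-based, so it works well when two variables have a monotonic but curvilinear relationship rather than a strictly linear one. In our example time series dataset, an increase in one biomolecule could correspond to a decrease in another, and the relationship is not necessarily linear, so Spearman correlation is the most appropriate method.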

Pearson

TODO

Euclidean

TODO

Turning the Results to a Graph

The graph represents the pairwise similarity between the biomolecules of each omics layer. In our example, we have a transcriptomics graph and a proteomics graph. Each node is either a gene or a protein, and each edge represents the correlation score.
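A minimal sketch of this step with networkx, assuming the pairwise scores are already available as a square pandas DataFrame. The function name `correlation_graph`, the `min_corr` threshold, and the example values are hypothetical. With the default threshold of 0, only positive correlations become edges, in line with the section on ignoring anti-correlation.

```python
import itertools

import networkx as nx
import pandas as pd


def correlation_graph(corr: pd.DataFrame, min_corr: float = 0.0) -> nx.Graph:
    """Hypothetical helper: one node per biomolecule, one weighted edge per
    pair whose correlation score is above min_corr."""
    graph = nx.Graph()
    graph.add_nodes_from(corr.index)
    for a, b in itertools.combinations(corr.index, 2):
        score = float(corr.loc[a, b])
        if score > min_corr:
            graph.add_edge(a, b, weight=score)
    return graph


# Made-up correlation scores for three genes.
names = ["gene_a", "gene_b", "gene_c"]
corr = pd.DataFrame(
    [[1.0, 0.9, -0.8],
     [0.9, 1.0, -0.7],
     [-0.8, -0.7, 1.0]],
    index=names,
    columns=names,
)
graph = correlation_graph(corr)
print(list(graph.edges(data=True)))
```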

Ignoring the Anti-Correlation

The goal of the analysis is to find collections of genes that behave similarly. If both anticorrelations and correlations are used to construct the graph, you would get local subgraphs of highly connected biomolecules, but you would not be able to distinguish between those that behave similarly and those that do not. Below is a small example of three biomolecules that are anticorrelated to each other. By nature of the

Graph Intersection

Now that we have one graph for each omics layer, we combine them by taking the intersection of the graphs. This means that we keep an edge only if it exists in all of the graphs. Note that the nodes also need to be present in each of the graphs and must have the same names. If not, the result will be orphan nodes that will be removed from the resulting graph.
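The intersection can be sketched by hand with networkx, as below. The function name `intersect_graphs` and the example graphs are hypothetical, and taking the minimum weight across layers as the consensus weight is an assumption of this sketch; the package may combine the weights differently. Because only shared edges are added, nodes left without any edge never enter the result.

```python
import networkx as nx


def intersect_graphs(graphs):
    """Hypothetical helper: keep an edge only if every graph contains it."""
    graphs = list(graphs)
    first, rest = graphs[0], graphs[1:]
    consensus = nx.Graph()
    for u, v, data in first.edges(data=True):
        if all(g.has_edge(u, v) for g in rest):
            weights = [data.get("weight", 1.0)]
            weights += [g[u][v].get("weight", 1.0) for g in rest]
            # Assumption: the weakest layer decides the consensus weight.
            consensus.add_edge(u, v, weight=min(weights))
    return consensus


# Made-up layers; node names must match across layers.
transcript_graph = nx.Graph()
transcript_graph.add_weighted_edges_from(
    [("a", "b", 0.9), ("b", "c", 0.8), ("c", "d", 0.7)]
)
protein_graph = nx.Graph()
protein_graph.add_weighted_edges_from(
    [("a", "b", 0.6), ("b", "c", 0.95), ("a", "d", 0.5)]
)

consensus = intersect_graphs([transcript_graph, protein_graph])
print(sorted(consensus.edges(data="weight")))
```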

Community Analysis

Now that we have a consensus graph, we can analyze the results and extract groups of biomolecules across the omics layers that behave similarly. Note that each layer may not behave the same, but each group would

To that end, we use community analysis to detect groups of nodes that are well connected. We use the correlation metric of our choice as a numerical value of closeness between two nodes.
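As a minimal sketch of weighted community detection, the example below uses greedy modularity maximization from networkx. The source does not state which community algorithm the package uses, so this choice, the node names, and the edge weights are assumptions made for illustration.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Made-up consensus graph: two tightly connected triangles joined by one weak edge.
consensus = nx.Graph()
consensus.add_weighted_edges_from(
    [
        ("a", "b", 0.9), ("b", "c", 0.8), ("a", "c", 0.85),
        ("d", "e", 0.9), ("e", "f", 0.8), ("d", "f", 0.85),
        ("c", "d", 0.1),
    ]
)

# The edge weights (correlation scores) act as the closeness between nodes.
found = greedy_modularity_communities(consensus, weight="weight")
communities = sorted(sorted(group) for group in found)
for group in communities:
    print(group)
```

Other algorithms that accept edge weights could be substituted in the same way; the important part is that the correlation score is passed as the weight.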

Result

The result is a collection of proteins, genes, and metabolites that are grouped together because they behave similarly. Note that in the example below, the proteins and genes behave the same over time, but this is not always the case. Here is an example of a single community in the above graph.

Inspiration

The idea of converting multiple data types to a non-parametric space and performing an intersection study was inspired by Nikolay Oskolkov and the following GitHub repository. The added features address limitations I found when implementing the method with time series data with replicates.

Melclic/intersectomics
