Package to study time series (or other series) multi-omics patterns of expression with replicates

Melclic/intersectomics


IntersectOmics

Method to analyze multi-omics time series (or multi-condition) datasets and uncover groups of similarly behaving biomolecules.

To combine the different omics layers, we need to convert each data type to a non-parametric space. In other words, we need a way to remove the "memory" of the measuring process of each data type. To that end, we compare every pair of biomolecules within each data type and construct a graph from the results. The similarity metric is used as the edge weight, and community analysis on those weights finds clusters of biomolecules that behave similarly.

Multi-omics data

We will use a dataset from the following paper, which measured the transcriptomics and proteomics of the springtail earth worm over several days after exposure to an insecticide.

The time series data has multiple omics layers and three replicates for each time point. Note that the data must be in the form of a table whose index holds the names of the genes, proteins, or metabolites (biomolecules) and whose columns represent the metadata associated with each sample.

example_input_data
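As a rough illustration of that table layout, here is a minimal sketch with pandas. The biomolecule names, the time point and replicate labels, and the choice of a column MultiIndex are all hypothetical; the package may expect the sample metadata to be encoded differently.

```python
import numpy as np
import pandas as pd

# Hypothetical layout: 2 time points x 3 replicates, stored as a column MultiIndex.
columns = pd.MultiIndex.from_product(
    [["day_0", "day_2"], ["rep_1", "rep_2", "rep_3"]],
    names=["timepoint", "replicate"],
)
rng = np.random.default_rng(0)
transcriptomics = pd.DataFrame(
    rng.normal(loc=10.0, scale=1.0, size=(3, 6)),  # made-up values
    index=pd.Index(["gene_a", "gene_b", "gene_c"], name="biomolecule"),
    columns=columns,
)

# Group the replicate columns by time point to get a mean and spread per time point.
by_time = transcriptomics.T.groupby(level="timepoint")
means = by_time.mean().T
stds = by_time.std(ddof=1).T
```

Each omics layer would get its own table of this shape, with matching biomolecule names across layers.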

Correlation with Replicates

When performing correlation analysis with replicates, you typically have to take the mean of the replicates. By doing so, you lose information about the variability of the samples, which may lead to false negatives. We propose a method that is more robust, but more computationally intensive: bootstrapping random values drawn from distributions fitted to the replicates.

example_bootstrap

To this end, this package enables the user to perform bootstrap analysis on two vectors that contain known replicates. The process may be summarized as follows:

  1. Given two vectors with replicates, fit a normal distribution at each time point (see plot above)
  2. Loop n times, each time sampling a single value at each time point for both vectors
  3. Calculate the correlation between the two sampled vectors
  4. Take the mean of the correlations and combine the p-values, correcting for multiple testing

The reported p-value is combined using Pearson's method. See the SciPy documentation for scipy.stats.combine_pvalues.

TODO: As of now, we use a normal distribution, but more appropriate distributions should be used depending on the data type. For example, RNA-seq should use a Poisson distribution instead. To that end, we should run R packages to process the data and extract better distribution parameters.
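The four steps above can be sketched as follows. This is an illustration under stated assumptions, not the package's implementation: the function name `bootstrap_spearman`, its parameters, and the clipping of p-values away from exactly 0 and 1 (so that the logarithms inside the combination stay finite) are all hypothetical additions.

```python
import numpy as np
from scipy import stats

def bootstrap_spearman(x_reps, y_reps, n_boot=200, seed=0):
    """Hypothetical helper. x_reps, y_reps: shape (n_timepoints, n_replicates)."""
    rng = np.random.default_rng(seed)
    x_reps = np.asarray(x_reps, dtype=float)
    y_reps = np.asarray(y_reps, dtype=float)
    # Step 1: fit a normal distribution per time point (mean and sample std).
    x_mu, x_sd = x_reps.mean(axis=1), x_reps.std(axis=1, ddof=1)
    y_mu, y_sd = y_reps.mean(axis=1), y_reps.std(axis=1, ddof=1)
    rhos, pvals = [], []
    for _ in range(n_boot):
        # Step 2: draw one value per time point from each fitted distribution.
        x = rng.normal(x_mu, x_sd)
        y = rng.normal(y_mu, y_sd)
        # Step 3: correlate the two sampled series.
        rho, p = stats.spearmanr(x, y)
        rhos.append(rho)
        pvals.append(p)
    # Step 4: average the correlations and combine the p-values.
    # Assumption: clip so that p == 0 or p == 1 does not produce infinities.
    pvals = np.clip(pvals, 1e-12, 1.0 - 1e-12)
    _, p_combined = stats.combine_pvalues(pvals, method="pearson")
    return float(np.mean(rhos)), float(p_combined)

# Usage with made-up data: 6 time points, 3 replicates, y roughly 2 * x.
t = np.arange(6, dtype=float)
offsets = np.array([-0.05, 0.0, 0.05])
rho_mean, p_comb = bootstrap_spearman(t[:, None] + offsets, 2 * t[:, None] + offsets, n_boot=50)
```

Sampling from the fitted distributions, rather than correlating the means once, lets the replicate variability influence both the average correlation and the combined p-value.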

Below are the supported correlation types:

Spearman

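The default. Spearman correlation compares ranks rather than raw values, so it works well when two variables have a monotonic but not necessarily linear (curvilinear) relationship. In our example time series dataset, an increase in one biomolecule could correspond to a decrease in another, which makes Spearman correlation the most appropriate method.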

Pearson

TODO

Euclidean

TODO

Turning the Results to a Graph

The graph represents the pairwise similarity between biomolecules within each omics layer. In our example, we have a transcriptomics graph and a proteomics graph. Each node is either a gene or a protein, and each edge carries the correlation score.

protein_spearman_graph
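A minimal sketch of building such a graph for one layer with networkx is shown below. For brevity it correlates single profiles with a plain Spearman test instead of the bootstrap procedure, and the function name `correlation_graph` and the `alpha` significance cutoff used to decide which edges to keep are assumptions, not the package's API.

```python
import itertools

import networkx as nx
from scipy import stats

def correlation_graph(profiles, alpha=0.05):
    """Hypothetical helper. profiles: dict of biomolecule name -> values over time."""
    graph = nx.Graph()
    graph.add_nodes_from(profiles)
    for a, b in itertools.combinations(profiles, 2):
        rho, p = stats.spearmanr(profiles[a], profiles[b])
        # Assumption: only significant correlations become edges.
        if p < alpha:
            graph.add_edge(a, b, weight=float(rho))
    return graph

# Made-up protein profiles over six time points.
proteins = {
    "prot_a": [1, 2, 3, 4, 5, 6],
    "prot_b": [2, 4, 5, 8, 9, 12],
    "prot_c": [6, 5, 4, 3, 2, 1],
}
protein_graph = correlation_graph(proteins)
```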

Ignoring the Anti-Correlation

The goal of the analysis is to find collections of genes that behave similarly. If both anticorrelations and correlations are used to construct the graph, you get local subgraphs of highly connected biomolecules, but you cannot distinguish between those that behave similarly and those that do not. Below is a small example of three biomolecules that are anticorrelated to each other. By nature of the

anticorrelation_graph

anticorrelation_mistake
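One way to ignore anticorrelations is to drop every edge with a non-positive weight before any further analysis. The sketch below uses a hypothetical three-node graph (not the one pictured above) and a hypothetical helper name, `keep_positive_edges`.

```python
import networkx as nx

def keep_positive_edges(graph):
    """Hypothetical helper: copy `graph` without edges whose weight is <= 0."""
    positive = nx.Graph()
    positive.add_nodes_from(graph.nodes(data=True))
    positive.add_edges_from(
        (u, v, d) for u, v, d in graph.edges(data=True) if d["weight"] > 0
    )
    return positive

# Made-up scores: A is anticorrelated with B and C, while B and C are correlated.
g = nx.Graph()
g.add_weighted_edges_from([("A", "B", -0.9), ("A", "C", -0.8), ("B", "C", 0.85)])
pos = keep_positive_edges(g)
```

If edges were kept based on the absolute correlation, A, B, and C would form a fully connected triangle even though A behaves oppositely to the other two; after filtering, only the B-C edge remains.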

Graph Intersection

Now that we have one graph per omics layer, we combine them by taking the intersection of the graphs. This means that we keep an edge only if it exists in all of the graphs. Note that the nodes also need to be present in each of the graphs and must have the same names. If not, they end up as orphan nodes, which are removed from the resulting graph.

graph_intersection
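The intersection can be sketched as below. The helper name `intersect_graphs` is hypothetical, and so is the choice to keep the minimum weight across layers for a shared edge: the text above does not say how the per-layer scores are merged.

```python
import networkx as nx

def intersect_graphs(graphs, combine=min):
    """Hypothetical helper: keep only edges that exist in every graph."""
    first, *rest = graphs
    result = nx.Graph()
    for u, v, data in first.edges(data=True):
        if all(g.has_edge(u, v) for g in rest):
            weights = [data["weight"]] + [g[u][v]["weight"] for g in rest]
            # Assumption: merge the per-layer scores with `combine` (min by default).
            result.add_edge(u, v, weight=combine(weights))
    # Nodes are only created through shared edges, so orphan nodes never appear.
    return result

# Made-up layers that share node names.
rna = nx.Graph()
rna.add_weighted_edges_from([("x", "y", 0.9), ("y", "z", 0.8)])
prot = nx.Graph()
prot.add_weighted_edges_from([("x", "y", 0.7), ("x", "w", 0.6)])
consensus = intersect_graphs([rna, prot])
```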

Community Analysis

Now that we have a consensus graph, we can analyze the results and extract groups of biomolecules that behave similarly. Note that each layer may not behave the same, but each group would

To that end, we use community analysis to detect groups of nodes that are well connected. The correlation metric of our choice serves as the numerical measure of closeness between two nodes.

G_inter_example
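As an illustration, the sketch below runs weighted community detection on a small made-up consensus graph. It uses networkx's greedy modularity algorithm, which is an assumption: the text does not specify which community detection algorithm the package uses.

```python
import networkx as nx
from networkx.algorithms import community

# Made-up consensus graph: two tight triangles joined by one weak edge.
g = nx.Graph()
g.add_weighted_edges_from([
    ("g1", "g2", 0.9), ("g2", "g3", 0.9), ("g1", "g3", 0.9),
    ("g4", "g5", 0.9), ("g5", "g6", 0.9), ("g4", "g6", 0.9),
    ("g3", "g4", 0.1),
])

# Edge weights (the correlation scores) drive the modularity computation.
communities = community.greedy_modularity_communities(g, weight="weight")
groups = sorted(sorted(c) for c in communities)
```

Here the weak g3-g4 edge is not enough to merge the two triangles, so each triangle is reported as its own community.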

Result

The result is a collection of proteins, genes, and metabolites that are grouped together because they behave the same. Note that in the example below, the proteins and genes behave the same over time, but this is not always the case. Here is an example of a single community in the above graph.

result

Inspiration

The idea of converting multiple data types to a non-parametric space and performing an intersection study was inspired by Nikolay Oskolkov and the following GitHub repository. The added features address limitations I found when implementing the method on time series data with replicates.
