Correcting for experiment-specific variability in expression compendia can remove underlying signals

Alexandra J Lee, YoSon Park, Georgia Doing, Deborah A Hogan and Casey S Greene

University of Pennsylvania, Dartmouth College

This repository stores data and analysis modules to simulate compendia of gene expression data and measure the effect of technical sources of variation on our ability to extract an underlying biological signal.

Motivation: In the last two decades, scientists working in different labs have assayed gene expression from millions of samples. These experiments can be combined into a compendium and used to extract novel biological patterns. However, combining different experiments introduces technical variance, which could distort biological patterns and lead to misinterpretation. As the scale and prevalence of these compendia increase, it becomes crucial to evaluate how integrating multiple experiments affects our ability to detect biological patterns.

Objective: To determine the extent to which underlying biological structures are masked by technical variation, via simulation of multi-experiment compendia.

Method: We used a generative multi-layer neural network to simulate a compendium of P. aeruginosa gene expression experiments. We performed a pairwise comparison of the simulated compendium versus the simulated compendium containing varying numbers of sources of technical variation.

Results: We found that it was difficult to detect the original biological structure of interest in a compendium containing some sources of technical variation unless we applied batch correction. Interestingly, as the number of sources of variation increased, it became easier to detect the original biological structure without correction. Furthermore, when we applied batch correction, it reduced our power to detect the biological structure of interest.

Conclusion: When combining some sources of technical variation, it is best to perform batch correction. However, as the number of sources increases, batch correction becomes unnecessary and indeed harms our ability to extract biological patterns.

Citation: For more details about the analysis, see our paper published in GigaScience. The paper should be cited as:

Alexandra J Lee, YoSon Park, Georgia Doing, Deborah A Hogan, Casey S Greene, Correcting for experiment-specific variability in expression compendia can remove underlying signals, GigaScience, Volume 9, Issue 11, November 2020, giaa117, https://doi.org/10.1093/gigascience/giaa117

Analysis Modules

There are 2 analyses using the Pseudomonas dataset in the Pseudomonas directory and 2 analyses using the recount2 dataset in the Human directory:

Name	Description
Pseudomonas_sample_lvl_sim	Analysis notebook applying sample-level gene expression simulation to P. aeruginosa data
Pseudomonas_experiment_lvl_sim	Analysis notebook applying experiment-level gene expression simulation to P. aeruginosa data
Human_sample_lvl_sim	Analysis notebook applying sample-level gene expression simulation to human (recount2) data
Human_experiment_lvl_sim	Analysis notebook applying experiment-level gene expression simulation to human (recount2) data

Usage

How to run notebooks from simulate-expression-compendia

Operating Systems: Mac OS, Linux

To run the notebooks in this repository, first set up your local repository and environment:

Download and install Git Large File Storage (Git LFS), GitHub's large file tracker.
Install Miniconda.
Clone the simulate-expression-compendia repository by running the following command in the terminal:

git clone https://github.com/greenelab/simulate-expression-compendia.git

Note: Git automatically detects the LFS-tracked files and clones them via HTTP.

Navigate into the cloned repo by running the following command in the terminal:

cd simulate-expression-compendia

Set up the conda environment by running the following commands in the terminal:

# conda version 4.6.12
conda env create -f environment.yml

conda activate simulate_expression_compendia

pip install -e .

Navigate to either the Pseudomonas or the Human directory and run the notebooks.

How to analyze your own data

To run this simulation on your own gene expression data, perform the following steps.

First you need to set up your local repository and environment:

Download and install Git Large File Storage (Git LFS), GitHub's large file tracker.
Install Miniconda.
Clone the simulate-expression-compendia repository by running the following command in the terminal:

git clone https://github.com/greenelab/simulate-expression-compendia.git

Note: Git automatically detects the LFS-tracked files and clones them via HTTP.

Navigate into the cloned repo by running the following command in the terminal:

cd simulate-expression-compendia

Set up the conda environment by running the following commands in the terminal:

# conda version 4.6.12
conda env create -f environment.yml

conda activate simulate_expression_compendia

pip install -e .

Create a new analysis folder in the main directory. This folder plays the same role as the Pseudomonas directory.
Copy Pseudomonas_sample_lvl_sim.ipynb or Pseudomonas_experiment_lvl_sim.ipynb into your analysis folder, depending on whether you would like to use the sample-level (see simulate_by_random_sampling()) or experiment-level (see simulate_by_latent_transformation()) simulation approach.
Within your analysis folder, create a data/ directory with input/ and metadata/ subdirectories.
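
The folder layout described in these steps can be created by hand or with a short script. The sketch below is a minimal example using only the Python standard library. The function name `create_analysis_dirs` and the folder name `MyAnalysis` are hypothetical placeholders, not part of this repository.

```python
from pathlib import Path


def create_analysis_dirs(repo_root, analysis_name):
    """Create <analysis_name>/data/input and <analysis_name>/data/metadata.

    Hypothetical helper for illustration; returns the analysis folder path.
    """
    analysis_dir = Path(repo_root) / analysis_name
    for sub in ("input", "metadata"):
        # parents=True also creates the analysis folder and data/ if missing;
        # exist_ok=True makes the call safe to repeat.
        (analysis_dir / "data" / sub).mkdir(parents=True, exist_ok=True)
    return analysis_dir


# Example call from the repository root ("MyAnalysis" is a placeholder name):
# create_analysis_dirs(".", "MyAnalysis")
```

After running it, copy the chosen notebook into the new folder as described above.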

Next, modify the code for your analysis:

Create a configuration file in configs/ using the parameters outlined below.
Update the analysis notebooks to use your config file (see below) and your input file.
Add your gene expression data file to the data/input/ directory. Your data is expected to be stored as a tab-delimited dataset with samples as rows and genes as columns. Your input data is also expected to be 0-1 normalized per gene. If your data needs to be normalized or transposed, there are functions to do this in ponyo/utils.
Add your metadata file to the data/metadata/ directory. Your metadata is expected to be stored as a tab-delimited file with sample ids matching the gene expression dataset in one column and experiment ids in another.
Run the notebooks.
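
As a rough illustration of the expected input format, the sketch below transposes a genes-by-samples table, scales each gene to the 0-1 range, and checks that every sample id has a metadata entry. It uses only the standard library and is not the ponyo implementation; the function names and the toy values are assumptions made for illustration.

```python
def transpose(table):
    """Swap rows and columns of a rectangular list of lists."""
    return [list(col) for col in zip(*table)]


def normalize_per_gene(samples_by_genes):
    """Min-max scale each column (gene) to the range 0-1.

    Genes with constant expression are set to 0.0 to avoid dividing by
    zero (a choice made for this sketch only).
    """
    scaled_columns = []
    for col in transpose(samples_by_genes):
        lo, hi = min(col), max(col)
        span = hi - lo
        scaled_columns.append([(v - lo) / span if span else 0.0 for v in col])
    return transpose(scaled_columns)


def missing_from_metadata(sample_ids, metadata_sample_ids):
    """Return expression sample ids that have no metadata entry."""
    known = set(metadata_sample_ids)
    return [s for s in sample_ids if s not in known]


# Toy data: rows are genes and columns are samples, so transpose first.
genes_by_samples = [
    [2.0, 4.0, 6.0],     # geneA across sample1..sample3
    [10.0, 10.0, 30.0],  # geneB across sample1..sample3
]
samples_by_genes = transpose(genes_by_samples)
normalized = normalize_per_gene(samples_by_genes)
```

For real datasets, the functions in ponyo/utils mentioned above are the better starting point; this sketch only makes the expected orientation and scaling concrete.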

Additional customization

Further customization can be accomplished by doing the following:

The apply_correction_io function in the generate_data_parallel.py file can be modified to use a different correction method.
If your data needs additional pre-processing steps, these can be added as modules in the pipeline.py file and called in the analysis notebook.
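
To illustrate what a replacement correction step might look like, the sketch below mean-centers each gene within each experiment. This is a deliberately simple stand-in, not limma or ComBat, and it does not reproduce the signature of apply_correction_io, which is not shown here. The function name and the list-based data layout are assumptions.

```python
from collections import defaultdict


def center_by_experiment(expression, experiment_ids):
    """Subtract each experiment's per-gene mean from its samples.

    expression: list of samples, each a list of gene values.
    experiment_ids: one experiment label per sample, in the same order.
    Hypothetical helper for illustration only.
    """
    if len(expression) != len(experiment_ids):
        raise ValueError("Need exactly one experiment id per sample")

    # Group sample indices by experiment label.
    groups = defaultdict(list)
    for idx, exp_id in enumerate(experiment_ids):
        groups[exp_id].append(idx)

    corrected = [None] * len(expression)
    for indices in groups.values():
        n_genes = len(expression[indices[0]])
        # Per-gene mean within this experiment.
        means = [
            sum(expression[i][g] for i in indices) / len(indices)
            for g in range(n_genes)
        ]
        for i in indices:
            corrected[i] = [v - m for v, m in zip(expression[i], means)]
    return corrected
```

Note that centering of this kind removes any biological difference that is confounded with experiment, which is the trade-off this repository is designed to measure.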

Configuration file

The table below lists the parameters required to run the analysis in this repository.

Note: Some of these parameters are required by the imported ponyo modules.

Name	Description
local_dir	str: Parent directory on local machine to store intermediate results.
scaler_transform_file	str: File name to store mapping from normalized to raw gene expression range. This intermediate file is generated by the `normalize_expression_data()` function from ponyo.
dataset_name	str: Name of the analysis directory, either "Human" or "Pseudomonas". If you created a new analysis directory, this is the name of that directory (see "How to analyze your own data").
simulation_type	str: "sample_lvl_sim" (simulated based on randomly sampling the latent space) or "experiment_lvl_sim" (simulation based on shifting in the latent space).
NN_architecture	str: Name of neural network architecture to use. Format 'NN__'.
learning_rate	float: Step size used for gradient descent. In other words, how quickly the method learns.
batch_size	int: Number of samples processed together in each training batch.
epochs	int: Number of times to train over the entire input dataset.
kappa	float: How fast to linearly ramp up KL loss.
intermediate_dim	int: Size of the hidden layer.
latent_dim	int: Size of the bottleneck layer.
epsilon_std	float: Standard deviation of the normal distribution used to sample the latent space.
validation_frac	float: Fraction of input samples held out for validation during VAE training.
num_simulated_samples	int: Simulate a compendium with this number of samples. Used if simulation_type == "sample_lvl_sim".
num_simulated_experiments	int: Simulate a compendium with this number of experiments. Used if simulation_type == "experiment_lvl_sim".
lst_num_experiments	list: List of different numbers of experiments to add to simulated compendium. These are the number of sources of technical variation that are added to the simulated compendium.
lst_num_partitions	list: List of different numbers of partitions to add to simulated compendium. These are the number of sources of technical variation that are added to the simulated compendium.
use_pca	bool: True to represent expression data by its top PCs before calculating SVCCA similarity.
num_PCs	int: Number of top PCs to use to represent expression data. Used if use_pca == True.
correction_method	str: Noise correction method to use. Either "limma" or "combat".
metadata_colname	str: Column header that contains sample id that maps expression data and metadata.
iterations	int: Number of simulations to run.
num_cores	int: Number of processing cores to use.
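
The sketch below shows one way such a file could be written and read back, assuming a two-column, tab-delimited layout of parameter name and value. Both the layout and every value shown are illustrative assumptions rather than the repository's defaults, and the helper names are hypothetical; consult the existing files in configs/ for the format ponyo actually expects.

```python
import ast
import csv


def write_example_config(path, params):
    """Write parameters as tab-delimited name/value rows (assumed layout)."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        for name, value in params.items():
            # repr() keeps strings quoted so they parse back as strings.
            writer.writerow([name, repr(value)])


def read_example_config(path):
    """Read name/value rows back, parsing values as Python literals."""
    params = {}
    with open(path, newline="") as handle:
        for name, value in csv.reader(handle, delimiter="\t"):
            params[name] = ast.literal_eval(value)
    return params


# Illustrative values only; choose values suited to your own data.
example_params = {
    "dataset_name": "MyAnalysis",
    "simulation_type": "sample_lvl_sim",
    "learning_rate": 0.001,
    "epochs": 10,
    "latent_dim": 30,
    "lst_num_experiments": [1, 2, 5, 10],
    "use_pca": True,
    "num_PCs": 10,
    "correction_method": "limma",
}
```

Parsing values with `ast.literal_eval` keeps lists, booleans, and numbers typed as described in the table without executing arbitrary code.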

Acknowledgements

We would like to thank YoSon Park, David Nicholson, Ben Heil and Ariel Hippen-Anderson for insightful discussions and code review.
