RVAgene: modeling gene expression dynamics

Overview

RVAgene is a recurrent variational autoencoder that models gene expression dynamics in single-cell or bulk data. Read the paper here.

Requirements

Python 3
numpy, matplotlib, pytorch, scikit-learn, tqdm
GPU (optional)

Quickstart

Jupyter notebook demo of RVAgene: demo.ipynb
Jupyter notebook demo of processing bulk or single-cell data and running RVAgene: single_cell_demo.ipynb
Or on the command line:
- python gen_synthetic_data.py <dataset_name>, e.g. python gen_synthetic_data.py demosim
- python train_and_gen.py <dataset_name>, e.g. python train_and_gen.py demosim

Contents

data: example synthetic gene expression time series data with 6 inherent clusters
rvagene: code for the recurrent variational autoencoder
train_and_gen.py: demonstrates training RVAgene, unsupervised clustering on the latent space using K-means, and generating new gene expression data by sampling and decoding points from the latent space
gen_synthetic_data.py: generates synthetic data with cluster structure as described in the paper
figs: figures generated by the demo code
demo.ipynb: demonstrates the whole process: synthetic data generation, RVAgene training, clustering on the latent space, and cluster-specific generation of new data
single_cell_demo.ipynb: demo of processing bulk or single-cell data and running RVAgene
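
To make the shape of the example data concrete, here is a minimal sketch of clustered time series: one smooth curve per cluster plus Gaussian noise per gene. It is an illustration only, not the generator described in the paper or implemented in gen_synthetic_data.py. The function name make_synthetic_genes, the sine-wave cluster shapes, and all default values are assumptions chosen for the example.

```python
import math
import random


def make_synthetic_genes(n_clusters=6, genes_per_cluster=20,
                         sequence_length=10, noise_sd=0.1, seed=0):
    """Return (series, labels): noisy copies of one smooth curve per cluster."""
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    series, labels = [], []
    for c in range(n_clusters):
        # Hypothetical cluster shape: a sine wave whose phase depends on the cluster.
        phase = 2 * math.pi * c / n_clusters
        for _ in range(genes_per_cluster):
            gene = [
                math.sin(2 * math.pi * t / sequence_length + phase)
                + rng.gauss(0.0, noise_sd)
                for t in range(sequence_length)
            ]
            series.append(gene)  # one value per timepoint, i.e. 1 feature
            labels.append(c)     # ground-truth cluster, useful for checking clustering
    return series, labels


series, labels = make_synthetic_genes()
print(len(series), len(series[0]), sorted(set(labels)))
```

Each gene here is a list of sequence_length values, matching number_of_features = 1 in the parameter list below.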

Model and training parameters:

sequence_length: length of the input sequence
number_of_features: number of features per timepoint per gene, i.e. 1
hidden_size: hidden size of the RNN
hidden_layer_depth: number of layers in RNN (1 is enough)
latent_length: latent vector length
batch_size: number of genes in a single batch. IMPORTANT: the last batch is dropped if batch_size is not a divisor of the number of training genes.
learning_rate: learning rate used for training
n_epochs: number of training iterations (epochs)
dropout_rate: probability of a node being dropped out
optimizer: Adam or SGD optimizer used to minimize the loss function
loss: SmoothL1Loss / MSELoss / ReconLoss / any custom loss that inherits from the _Loss class
cuda: boolean, whether to run on the GPU
print_every: number of iterations after which the loss is printed within each epoch
clip: boolean, whether to apply gradient clipping to prevent exploding gradients
max_grad_norm: maximum gradient norm; gradients are clipped to this norm if clipping is enabled
dload: directory where trained models are saved
log_file: file to log the training loss to; default None, i.e. STDOUT
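
Two of the parameters above are easy to misjudge, so here is a small plain-Python sketch of their arithmetic. It is not RVAgene's code: the helper names genes_used_per_epoch and clip_by_global_norm are hypothetical, and the clipping function assumes the common convention of rescaling all gradients so that their combined L2 norm does not exceed max_grad_norm. In PyTorch the analogous features are the drop_last option of DataLoader and torch.nn.utils.clip_grad_norm_.

```python
import math


def genes_used_per_epoch(n_genes, batch_size):
    """Number of genes seen per epoch when an incomplete last batch is dropped."""
    n_batches = n_genes // batch_size  # only full batches are kept
    return n_batches * batch_size


def clip_by_global_norm(grads, max_grad_norm):
    """Rescale a flat list of gradient values so their L2 norm is at most max_grad_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_grad_norm or total_norm == 0.0:
        return list(grads)  # already within the limit: leave unchanged
    scale = max_grad_norm / total_norm
    return [g * scale for g in grads]


# 120 training genes with batch_size 32: 3 full batches, 24 genes fall in the dropped batch.
print(genes_used_per_epoch(120, 32))
# 30 divides 120, so every gene is used.
print(genes_used_per_epoch(120, 30))
# A gradient of norm 5 is scaled down to norm 1; its direction is preserved.
print(clip_by_global_norm([3.0, 4.0], max_grad_norm=1.0))
```

Choosing a batch_size that divides the number of training genes avoids silently leaving genes out of each epoch.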

Datasets

scRNA-seq data used in the paper: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE65525 (Klein et al., 2015)
Bulk RNA-seq data used in the paper: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE98622 (Liu et al., 2017)

Acknowledgments

Thanks to the open-source implementation of a recurrent VAE at https://github.com/tejaslodaya/timeseries-clustering-vae
Relevant research works are cited in the paper.
