maclean-lab/RVAgene

RVAgene: modeling gene expression dynamics

Overview

RVAgene models gene expression dynamics in single-cell or bulk data. Read the paper here.

Requirements

  1. Python 3
  2. numpy, matplotlib, pytorch, scikit-learn, tqdm
  3. GPU (optional)
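
A quick way to confirm the requirements above before running the demos is sketched below. It uses only the standard library to look for each package, and assumes the usual import names (`torch` for pytorch, `sklearn` for scikit-learn). The helper names `missing_packages` and `cuda_available` are illustrative and not part of RVAgene.

```python
import importlib.util

# Import names for the packages listed above. "torch" and "sklearn" are the
# import names of the pytorch and scikit-learn distributions.
REQUIRED = ["numpy", "matplotlib", "torch", "sklearn", "tqdm"]


def missing_packages(names):
    """Return the names that cannot be located for import (top-level names only)."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def cuda_available():
    """Return True only if torch is installed and reports a usable GPU."""
    if importlib.util.find_spec("torch") is None:
        return False
    import torch  # imported lazily so the check works without torch installed
    return torch.cuda.is_available()
```

Calling `missing_packages(REQUIRED)` returns an empty list when everything is installed. Since the GPU is optional, `cuda_available()` returning `False` only means training would run on the CPU.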

Quickstart

  • Jupyter notebook demo of RVAgene here
  • Jupyter notebook demo of processing bulk single-cell data and running RVAgene here
  • Or on the command line:
    • Generate synthetic data: python gen_synthetic_data.py <dataset_name>, e.g. python gen_synthetic_data.py demosim
    • Train the model and generate new data: python train_and_gen.py <dataset_name>, e.g. python train_and_gen.py demosim

Contents

data : contains example synthetic gene expression time series data with 6 inherent clusters
rvagene : contains code for recurrent variational autoencoder
train_and_gen.py : code demonstrating how to train RVAgene, cluster genes in the latent space with K-means (unsupervised), and generate new gene expression data by sampling points from the latent space and decoding them.
gen_synthetic_data.py : code to generate synthetic data with cluster structure as described in the paper.
figs : contains figures generated by the demo code.
demo.ipynb : demonstration of the whole process: synthetic data generation, RVAgene training, clustering on the latent space, and generation of new cluster-specific data
single_cell_demo.ipynb : demo of processing bulk single-cell data and running RVAgene
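
The clustering and sampling steps mentioned above can be illustrated without the trained model. The sketch below is a dependency-free version of Lloyd's K-means algorithm applied to made-up 2D "latent" points, plus a helper that draws new latent points near a cluster centroid. It is not the code in train_and_gen.py: the function names, the Gaussian sampling scale, and the toy data are all assumptions, and decoding the sampled points back into expression profiles needs the trained RVAgene decoder, which is not shown.

```python
import math
import random


def kmeans(points, k, n_iter=50, seed=0):
    """Lloyd's algorithm on a list of equal-length float tuples.

    Returns (centroids, labels); labels[i] indexes the centroid nearest points[i].
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k distinct data points
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
        if new_labels == labels:
            break
        labels = new_labels
    return centroids, labels


def sample_latent_near(centroid, n, scale=0.1, seed=0):
    """Draw n latent points from an isotropic Gaussian centred on a centroid."""
    rng = random.Random(seed)
    return [tuple(rng.gauss(mu, scale) for mu in centroid) for _ in range(n)]


# Toy latent vectors: two well-separated groups of 20 points each (illustrative only).
rng = random.Random(1)
latent = [(rng.gauss(0, 0.1), rng.gauss(0, 0.1)) for _ in range(20)]
latent += [(rng.gauss(5, 0.1), rng.gauss(5, 0.1)) for _ in range(20)]

centroids, labels = kmeans(latent, k=2)
new_points = sample_latent_near(centroids[labels[0]], n=5)
print(len(set(labels[:20])), len(set(labels[20:])), len(new_points))
```

In the real workflow the latent vectors would come from the trained encoder, one per gene, and the sampled points would be passed through the decoder to obtain cluster-specific expression time series.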

Model and training parameters:

sequence_length: length of the input sequence
number_of_features: number of features per time point per gene, i.e. 1
hidden_size: hidden size of the RNN
hidden_layer_depth: number of layers in RNN (1 is enough)
latent_length: latent vector length
batch_size: number of genes in a single batch. IMPORTANT: the last batch is dropped if batch_size does not evenly divide the number of training genes.
learning_rate: the learning rate of the module
n_epochs: Number of iterations/epochs
dropout_rate: The probability of a node being dropped-out
optimizer: Adam / SGD optimizer used to minimize the loss function
loss: SmoothL1Loss / MSELoss / ReconLoss / any custom loss that inherits from the _Loss class
cuda: boolean; whether to run on the GPU
print_every: number of iterations between loss printouts within each epoch
clip: boolean; whether to apply gradient clipping to counter exploding gradients
max_grad_norm: maximum gradient norm, used when clipping is enabled
dload: directory where trained models are saved
log_file: file to which the training loss is logged; default None, i.e. STDOUT
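
Two of the parameters above lend themselves to a small worked example: the dropped last batch (batch_size) and norm-based gradient clipping (clip, max_grad_norm). The sketch below is plain Python with illustrative numbers. The helper names are hypothetical, and the clipping function assumes the common convention of rescaling all gradients so their overall L2 norm does not exceed max_grad_norm. The README itself does not spell out the exact clipping rule.

```python
import math


def batch_plan(n_genes, batch_size):
    """Return (n_full_batches, n_dropped_genes) when the last partial batch is dropped."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return n_genes // batch_size, n_genes % batch_size


def clip_by_norm(grads, max_grad_norm):
    """Rescale a flat list of gradient values so their L2 norm is at most max_grad_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_grad_norm:
        return list(grads)  # already within the limit, left unchanged
    scale = max_grad_norm / total_norm
    return [g * scale for g in grads]


print(batch_plan(1000, 64))  # (15, 40): 40 genes are left out of every epoch
print(batch_plan(1000, 50))  # (20, 0): nothing is dropped
print(clip_by_norm([3.0, 4.0], max_grad_norm=1.0))  # approximately [0.6, 0.8]
```

Choosing a batch_size that divides the number of training genes avoids silently excluding genes from training.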

Datasets

scRNA-seq data used in the paper: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE65525 (Klein et al., 2015)
Bulk RNA-seq data used in the paper: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE98622 (Liu et al., 2017)

Acknowledgments

  1. Thanks to the open-source implementation of a recurrent VAE at https://github.com/tejaslodaya/timeseries-clustering-vae
  2. Relevant research works, as cited in the paper.
