
DEN Logo

Deep Exploration in Sequence Generation

Code for training deep generative models of DNA, RNA and protein sequences in Keras. Genesis implements activation-maximizing generative neural networks (Deep Exploration Networks, or DENs), which are optimized with respect to a downstream fitness predictor. DENs explicitly maximize sequence diversity by sampling two independent sequence patterns in each forward pass during training and imposing a similarity penalty on those samples. DENs can optionally maintain confidence in the generated sequences by incorporating a variational autoencoder (VAE) to estimate sequence likelihood; the likelihood is approximated by importance sampling, and gradients are backpropagated from the VAE to the DEN using straight-through (ST) gradients.
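As a rough illustration of the diversity objective (a minimal numpy sketch; the function name, hinge margin and per-position agreement measure are assumptions, not the library's exact penalty):

```python
import numpy as np

def similarity_penalty(sample_a, sample_b, margin=0.5):
    """Hinge penalty on the agreement between two one-hot samples drawn
    from the generator in the same forward pass.

    sample_a, sample_b: (length, 4) one-hot nucleotide arrays.
    Agreement above `margin` is penalized, pushing the generator
    toward producing diverse sequences.
    """
    # Fraction of positions where both samples picked the same nucleotide.
    agreement = np.mean(np.sum(sample_a * sample_b, axis=-1))
    return max(0.0, agreement - margin)

# Two hypothetical 10-nt samples.
np.random.seed(0)
a = np.eye(4)[np.random.randint(0, 4, size=10)]
b = np.eye(4)[np.random.randint(0, 4, size=10)]
print(similarity_penalty(a, b))
```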

DENs are trained to jointly maximize sequence diversity and predicted fitness. The framework was first presented in an MLCB 2019* conference paper, "Deep exploration networks for rapid engineering of functional DNA sequences". An extensive description and analysis of DENs was published in Linder et al., Cell Systems 2020.

*1st Conference on Machine Learning in Computational Biology (MLCB 2019), Vancouver, Canada.

Contact jlinder2 (at) cs.washington.edu for any questions about the model or data.

Highlights

  • Deep generative neural networks for DNA, RNA & protein sequences.
  • Train the generator to maximize both diversity and fitness.
  • Fitness is evaluated by a user-supplied sequence-predictive model and cost function.
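As a concrete (hypothetical) illustration of what a user-supplied cost might look like, the sketch below scores the fitness predictor's outputs against a target value. The function names, output shape and target proportion are assumptions for this sketch, not the repository's actual interface:

```python
import keras.backend as K

def target_isoform_cost(target=0.9):
    """Illustrative fitness cost: squared deviation of the predictor's
    output (e.g. a predicted isoform proportion) from a target value."""
    def cost(predictor_outputs):
        # predictor_outputs: (batch, 1) predictions for the generated sequences.
        return K.mean(K.square(predictor_outputs - target))
    return cost
```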

Features

  • Implements deep convolutional and residual generative neural networks.
  • Supports vanilla, class-conditional and inverse-regression generators.
  • Generators support one-hot sampling, enabling end-to-end training via straight-through gradients (see the sketch after this list).
  • Maintains sequence likelihood during training by importance sampling of a pre-trained variational autoencoder.
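A minimal sketch of the straight-through trick for one-hot sampling, written against plain TensorFlow ops (not the repository's exact implementation):

```python
import tensorflow as tf

def straight_through_sample(logits):
    """logits: (batch, length, 4) nucleotide logits from the generator.

    Forward pass: a discrete one-hot sample per position.
    Backward pass: gradients flow through the softmax probabilities,
    skipping the non-differentiable sampling step (straight-through).
    """
    probs = tf.nn.softmax(logits, axis=-1)
    flat_logits = tf.reshape(logits, (-1, 4))
    # One categorical draw per sequence position.
    idx = tf.random.categorical(flat_logits, num_samples=1)[:, 0]
    one_hot = tf.reshape(tf.one_hot(idx, depth=4), tf.shape(probs))
    # Value is the one-hot sample; gradient is that of `probs`.
    return tf.stop_gradient(one_hot - probs) + probs
```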

Installation

Install by cloning or forking the GitHub repository:

git clone https://github.com/johli/genesis.git
cd genesis
python setup.py install

Required Packages

  • TensorFlow >= 1.13.1
  • Keras >= 2.2.4
  • SciPy >= 1.2.1
  • NumPy >= 1.16.2
  • Isolearn >= 0.2.0 (GitHub)

Saved Models

To aid reproducibility, we provide access to all trained models via the Google Drive link below:

Model Repository

Unzip each compressed model archive into the corresponding sub-folder of the analysis root folder in the GitHub repo. A brief inventory of the Google Drive repository follows:

apa/

Models for generating sequences with target or maximal APA isoform proportions and cleavage positions. Contains several model versions used in different benchmark comparisons.

splicing/

Models for generating sequences with target (differential) 5' splice donor usage proportions.

mpradragonn/

Models for maximizing the MPRA-DragoNN predictor (transcriptional activity).

dragonn/

Models for maximizing the DragoNN predictor (SPI1 transcription factor binding).

fbgan/

Contains pre-trained GAN and FB-GAN models used in benchmark comparisons.

gfp/

Contains training data, pre-trained models and generated results for the GFP benchmark comparison.

Training & Analysis Notebooks

The following Jupyter notebooks contain all of the training code and analyses from the paper. We used the following fitness predictors in our analyses: APARENT (Bogard et al., 2019), DragoNN (Kundaje Lab), MPRA-DragoNN (Movva et al., 2019) and our own cell line-specific splicing predictor. For some of the benchmarks, we use the Feedback-GAN code (Gupta et al., 2019; GitHub) and the CbAS code (Brookes et al., 2019; GitHub).

Alternative Polyadenylation

Training and evaluation of Exploration networks for engineering Alternative Polyadenylation signals.

Notebook 0a: APA VAE Training Script (beta = 0.15) (Not Annealed)
Notebook 0b: APA VAE Training Script (beta = 0.65) (Not Annealed | Annealed)
Notebook 0c: APA VAE Training Script (beta = 0.85) (Annealed)
(Note: The non-annealed version with beta = 0.65 is used in Notebook 6c below.)

Notebook 1a: Engineering APA Isoforms (ALIEN1 Library)
Notebook 1b: Engineering APA Isoforms (ALIEN2 Library)
Notebook 1c: Engineering APA Isoforms (TOMM5 Library)
Notebook 2a: Significance of Diversity Cost (ALIEN1 Library)
Notebook 2b: Significance of Diversity Cost (ALIEN2 Library)
Notebook 3a: PWMs or Straight-Through Approximation?
Notebook 3b: PWMs or Straight-Through Approximation? (Entropy Penalty)
Notebook 4: Engineering Cleavage Position (ALIEN1 Library)
Notebook 5: Inverse APA Isoform Regression (ALIEN1 Library)
Notebook 6a: Maximal APA Isoform (Sequence Diversity) (Earthmover)
Notebook 6b: Maximal APA Isoform (Latent Diversity) (Earthmover)
Notebook 6c: Evaluate Diversity Costs (Sequence & Latent)
Notebook 7a: Benchmark Comparison
Notebook 7b: Benchmark Comparison (Computational Cost)

Below are two notebooks that were not included in the main paper, but are kept here as additional analysis for interested users. The first notebook trains a Wasserstein-GAN on APA sequences from the ALIEN1 dataset. The second notebook trains a DEN that learns to produce a distribution of optimal GAN seeds which result in maximally strong, diverse APA sequences. It is a means of conditioning a pre-trained GAN.
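To sketch the idea behind the GANception setup (toy Keras code; the layer sizes, sequence length and model definitions are assumptions for illustration, and the real frozen GAN and predictor come from the saved-model archive):

```python
from keras.models import Model
from keras.layers import Input, Dense, Flatten, Reshape

# Toy stand-ins for the frozen, pre-trained GAN generator and fitness
# predictor (illustrative only; shapes and layer choices are assumptions).
z = Input(shape=(100,))
gan_generator = Model(z, Reshape((25, 4))(Dense(25 * 4)(z)))
gan_generator.trainable = False

s = Input(shape=(25, 4))
predictor = Model(s, Dense(1, activation='sigmoid')(Flatten()(s)))
predictor.trainable = False

# Trainable part: a small network mapping random noise to GAN latent seeds.
noise = Input(shape=(100,))
seed = Dense(100)(Dense(256, activation='relu')(noise))
fitness = predictor(gan_generator(seed))

# Maximize predicted fitness of the GAN's outputs by minimizing its negative.
seed_den = Model(noise, fitness)
seed_den.compile(optimizer='adam', loss=lambda y_true, y_pred: -y_pred)
```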

Extra 1: APA Sequence GAN (ALIEN1)
Extra 2: Max APA Isoform GANception (ALIEN1)

Alternative Polyadenylation (Likelihood-bounded)

Additional examples of engineering Alternative Polyadenylation signals using Likelihood-bounded Exploration Networks. We combine importance sampling of a variational autoencoder (VAE) with the straight-through approximation to propagate likelihood gradients to the generator.
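For reference, the likelihood bound rests on an importance-weighted estimate of log p(x) under the VAE. Below is a minimal numpy sketch of that estimator (the function and argument names are ours, not the repository's):

```python
import numpy as np

def importance_sampled_log_likelihood(log_px_given_z, log_pz, log_qz_given_x):
    """Importance-weighted estimate of log p(x) from K posterior samples.

    Each argument is a length-K array of log-densities for samples
    z_1..z_K drawn from the VAE encoder q(z|x):
        log p(x) ~= log (1/K) sum_k exp(log p(x|z_k) + log p(z_k) - log q(z_k|x))
    """
    log_w = log_px_given_z + log_pz - log_qz_given_x
    # log-sum-exp for numerical stability
    m = np.max(log_w)
    return m + np.log(np.mean(np.exp(log_w - m)))
```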

Notebook 0: Evaluate Variational Autoencoders (Not Annealed)
Notebook 0a: VAE Training Script (Weak APA - Not Annealed, beta = 1.0)
Notebook 0b: VAE Training Script (Strong APA - Not Annealed, beta = 1.0)
(Note: These non-annealed versions with beta = 1.0 are used in Notebooks 1 and 2 below.)

Notebook 0*: Evaluate Variational Autoencoders (Annealed)
Notebook 0a*: VAE Training Script (Weak APA - Annealed) (beta = 1.0 | beta = 1.125 | beta = 1.25 | beta = 1.5)
Notebook 0b*: VAE Training Script (Strong APA - Annealed) (beta = 1.0 | beta = 1.125 | beta = 1.25 | beta = 1.5)
(Note: These versions are not used in downstream analyses, but included to show that beta-annealing does not significantly improve separability between Strong / Weak APA test sets. Compare to non-annealed VAEs with beta = 1.0 above.)

Notebook 1: Evaluate Likelihood-bounded DENs (Weak VAE)
Notebook 1a/b/c/d: DEN Training Scripts (Weak VAE) (Only Fitness | Margin -2 | Margin 0 | Margin +2)
Notebook 2: Evaluate Likelihood-bounded DENs (Strong VAE)
Notebook 2a/b/c/d: DEN Training Scripts (Strong VAE) (Only Fitness | Margin -2 | Margin 0 | Margin +2)

Alternative Splicing

Training and evaluation of Exploration networks for engineering (differential) Alternative Splicing.

Notebook 1: Engineering Splicing Isoforms (HEK)
Notebook 2: Engineering De-Novo Splice Sites (HEK)
Notebook 3a: Differential - CHO vs. MCF7 (CNN Predictor)
Notebook 3b: Differential - CHO vs. MCF7 (Hexamer Regressor)
Notebook 3c: Differential - CHO vs. MCF7 (Both Predictors)

GFP

Evaluation of Likelihood-bounded DENs for engineering GFP variants. Here we combine importance sampling of a variational autoencoder (VAE) with the straight-through approximation to propagate likelihood gradients to the generator. The benchmarking test bed is adapted from Brookes et al., 2019 (GitHub).

Notebook 0: Importance-Sampled Train Set Likelihoods (VAE)
Notebook 1: Likelihood-bounded DEN Training
Notebook 2a: Plot Bar Chart Comparison
Notebook 2b: Plot Trajectory Comparison

SPI1 TF Binding (DragoNN)

Benchmark evaluation for the DragoNN fitness predictor.

Notebook 1a: Maximal TF Binding Score (Sequence Diversity)
Notebook 1b: Maximal TF Binding Score (Latent Diversity)
Notebook 2a: Benchmark Comparison
Notebook 2b: Benchmark Comparison (Computational Cost)

Transcriptional Activity (MPRA-DragoNN)

Benchmark evaluation for the MPRA-DragoNN fitness predictor.

Notebook 1: Maximal Transcriptional Activity
Notebook 2a: Benchmark Comparison
Notebook 2b: Benchmark Comparison (Computational Cost)

DEN Training GIFs

The following GIFs illustrate how the Deep Exploration Networks converge on generating maximally fit functional sequences while retaining sequence diversity. Throughout training, we track a set of randomly chosen input seeds and animate the corresponding generated sequences (with their fitness costs).

WARNING: The following GIFs contain flickering pixels/colors. Do not look at them if you are sensitive to such images.

Alternative Polyadenylation

The following GIF depicts a generator trained to produce maximally strong polyadenylation signals.

APA Max Isoform GIF

The next GIF illustrates a class-conditional generator trained to produce polyA sequences with target cleavage positions.

APA Max Cleavage GIF

Alternative Splicing

This GIF depicts a generator trained to maximize splicing at 5 distinct splice junctions.

De-Novo Splicing GIF
