bartonlab/paper-DMS-inference

Overview

This repository contains data and scripts for reproducing the results accompanying the manuscript

popDMS infers mutation effects from deep mutational scanning data

Zhenchen Hong1 and John P. Barton2,3,#

1 Department of Physics and Astronomy, University of California, Riverside
2 Department of Physics and Astronomy, University of Pittsburgh
3 Department of Computational and Systems Biology, University of Pittsburgh School of Medicine
# correspondence to jpbarton@pitt.edu

This work is currently available on bioRxiv at this link.

Contents

Scripts for generating and analyzing simulation data can be found in the simulation.ipynb notebook. Scripts for processing and analyzing deep mutational scanning data are contained in the data_analysis.ipynb notebook. Finally, scripts for analysis and figures contained in the manuscript are located in the figures.ipynb notebook.

Due to the large size and number of some files generated by the interim analysis of deep mutational scanning data, some data has been stored in compressed format on Zenodo. To access the full set of data, navigate to the Zenodo record, then download and extract the contents of the archives into the directory epistasis_inference/.

Software dependencies

Methods to infer epistasis are implemented in C++11 and make use of the GNU Scientific Library and Eigen.

We use Eigen version 3.4.0, which can be downloaded from this link. For epistasis inference, the downloaded archive should be unzipped into the ./epistasis_inference/ directory.

Running popDMS

popDMS uses codon counts in dms_tools format or sequence counts in MaveDB-HGVS format for input. For reference, this link demonstrates the format for codon counts, and this link shows an example file in MaveDB-HGVS format.

Running popDMS differs slightly depending on the format of the input data.

Using codon counts

When using codon counts as input, we require three variables: codon_counts_files, replicates, and times. Here codon_counts_files contains a list of file paths to codon counts files. For each file, there must be a corresponding entry in the list replicates that identifies which replicate the file belongs to, and an entry in the list times that gives the time (in numbers of generations) at which sequencing was performed to obtain this data. For examples, see data_analysis.ipynb.
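As a minimal sketch of how these three lists line up, the following assumes two replicates, each sequenced at two time points. The file paths, the generation times, and the helper check_codon_inputs are hypothetical and are not part of popDMS; the actual calls are shown in data_analysis.ipynb.

```python
# Hypothetical inputs: file names and generation times are made up for illustration.
codon_counts_files = [
    "data/rep1_pre_codoncounts.csv",
    "data/rep1_post_codoncounts.csv",
    "data/rep2_pre_codoncounts.csv",
    "data/rep2_post_codoncounts.csv",
]
replicates = [1, 1, 2, 2]  # replicate that each file belongs to
times = [0, 10, 0, 10]     # generations at which each file was sequenced


def check_codon_inputs(files, replicates, times):
    """Illustrative helper (not a popDMS function).

    Checks that the three lists are parallel, then groups (time, file)
    pairs by replicate, sorted by time within each replicate.
    """
    if not (len(files) == len(replicates) == len(times)):
        raise ValueError("files, replicates, and times must have the same length")
    grouped = {}
    for path, rep, t in zip(files, replicates, times):
        grouped.setdefault(rep, []).append((t, path))
    for rep in grouped:
        grouped[rep].sort()
    return grouped


grouped = check_codon_inputs(codon_counts_files, replicates, times)
for rep, entries in grouped.items():
    print(rep, [t for t, _ in entries])
```

Checking the lists up front catches the most likely mistake, a file with no matching replicate or time entry, before any counts are read.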

Using sequence counts

When using sequence counts, we require five variables: haplotype_counts_file, reference_sequence_file, n_replicates, time_points, and time_cols. The variable haplotype_counts_file gives the path to a file containing the sequence counts. To normalize the selection coefficients relative to a reference sequence, reference_sequence_file should provide the path to a file storing the reference sequence in plain text (for an example, see here). The total number of replicates is specified by n_replicates. For each replicate, the time(s) at which data was collected are given as a list in time_points. Finally, for each replicate, the variable time_cols points to the columns in the haplotype_counts_file that store sequence counts for each time. For examples, see data_analysis.ipynb.
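The relationship between these five variables can be sketched as follows. All paths, times, and column names are hypothetical, and the sketch assumes that time_cols refers to columns by name (the source does not say whether names or indices are expected). The helper check_sequence_inputs is illustrative and is not a popDMS function.

```python
# Hypothetical inputs: paths, times, and column names are made up for illustration.
haplotype_counts_file = "data/sequence_counts.csv"
reference_sequence_file = "data/reference_sequence.txt"
n_replicates = 2
time_points = [[0, 5], [0, 5]]  # one list of sampling times per replicate
time_cols = [
    ["rep1_t0", "rep1_t5"],     # count columns for replicate 1, in time order
    ["rep2_t0", "rep2_t5"],     # count columns for replicate 2, in time order
]


def check_sequence_inputs(n_replicates, time_points, time_cols):
    """Illustrative helper (not a popDMS function).

    Checks that there is one list of times and one list of columns per
    replicate, and returns a column -> time mapping for each replicate.
    """
    if len(time_points) != n_replicates or len(time_cols) != n_replicates:
        raise ValueError("need one entry in time_points and time_cols per replicate")
    col_to_time = []
    for rep, (ts, cols) in enumerate(zip(time_points, time_cols)):
        if len(ts) != len(cols):
            raise ValueError(f"replicate {rep}: times and columns differ in length")
        col_to_time.append(dict(zip(cols, ts)))
    return col_to_time


col_to_time = check_sequence_inputs(n_replicates, time_points, time_cols)
print(col_to_time)
```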

Interpreting the output

For both approaches, popDMS computes and saves the variant frequencies needed to calculate selection coefficients. Using these files, inference of the selection coefficients can be quickly rerun using the infer_independent (for codon counts) or infer_correlated (for sequence counts) methods. Both methods save a compressed comma-separated values (CSV) file containing the selection coefficients inferred at the optimal regularization strength, which is itself inferred from the data. The file can be unzipped to be viewed in plain text or with a program such as Microsoft Excel.

The columns of the selection coefficient file are:

  • site: Specifies the site at which the variant is observed, following the numbering of sites in the original input file
  • amino_acid: Specifies the amino acid (including stops)
  • WT_indicator: Set to True if the amino acid matches the reference at that site, and False otherwise
  • rep_x: Values in these columns give the selection coefficients inferred for each replicate independently, with replicates numbered starting from 0 (rep_0)
  • joint: Joint selection coefficients inferred across all replicates

This CSV file can be used for downstream analysis. We also provide a built-in plotting function fig_dms that takes a path to the CSV file as input and produces a heatmap of the inferred selection coefficients.
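For downstream analysis outside of fig_dms, the file can be read with standard tools once it has been unzipped. The sketch below uses only the Python standard library and a small in-memory stand-in for the output file. The numbers are invented for illustration, and the sketch assumes that sites are integers, that WT_indicator is written as the text True or False, and that there are two replicates. The function load_joint_coefficients is hypothetical and not part of popDMS.

```python
import csv
import io

# Toy stand-in for the output file; the values are invented for illustration only.
toy_csv = """site,amino_acid,WT_indicator,rep_0,rep_1,joint
1,M,True,0.0,0.0,0.0
1,A,False,-0.8,-0.6,-0.7
2,K,True,0.0,0.0,0.0
2,R,False,0.3,0.1,0.2
"""


def load_joint_coefficients(handle):
    """Return {(site, amino_acid): joint} for non-reference variants."""
    coefficients = {}
    for row in csv.DictReader(handle):
        if row["WT_indicator"] == "True":
            continue  # skip amino acids that match the reference
        key = (int(row["site"]), row["amino_acid"])
        coefficients[key] = float(row["joint"])
    return coefficients


# For a real file, pass an open text-mode file handle instead of the StringIO object.
joint = load_joint_coefficients(io.StringIO(toy_csv))
most_beneficial = max(joint, key=joint.get)
print(most_beneficial, joint[most_beneficial])
```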

Epistasis inference (all steps are merged into a single bash script that runs automatically)

The format of input data for epistasis inference is described in this file. Once data has been stored in this format, inference of epistatic interactions proceeds by running the shell script run_epistasis.sh in the epistasis_inference directory.

License

This repository is dual licensed as GPL-3.0 (source code) and CC0 1.0 (figures, documentation, and our presentation of the data).
