GitHub - tomsasani/vacv-ont-manuscript: Code used for figure generation and statistical analysis for https://elifesciences.org/articles/35453

Scripts for: Long read sequencing reveals poxvirus evolution through rapid homogenization of gene arrays

Python code used to characterize copy number and single nucleotide variation in an evolving population of vaccinia virus.

This code is archived on Zenodo.

Dependencies

python 2.7

pysam (v0.12.0.1)

numpy (v1.12.0)

pyfaidx (v0.4.9.2)

seaborn (v0.7.1)

Installation

git clone https://github.com/tomsasani/vacv-ont-manuscript.git
cd vacv-ont-manuscript

Usage

To get information about the arguments passed into a particular script, simply run:

$ python script.py -h

Scripts produce various plots and summary statistics, as described below:

scripts:
summary.py                      : text summary of ONT data
condensed.py                    : linked boxes representing allele frequencies within copies
hist.py                         : stacked bars representing copy number distributions
sb.py                           : stacked bars representing frequencies of allele combinations
array_combinations.py           : proportions of all K3L allele combinations in multicopy arrays
simulate_sequencing_errors.py   : distribute chemistry-specific sequencing errors throughout multicopy arrays

Reproducing figures

To reproduce figures in the manuscript that were generated from ONT sequencing data, first download relevant ONT FASTQ files from the SRA (accession SRP128569) or from Zenodo (DOI:10.5281/zenodo.1319732).

Then, align FASTQ using BWA-MEM. For example:

bwa mem -x ont2d -t $THREADS $REF $FASTQ | sambamba view -S -f bam /dev/stdin | sambamba sort -o $OUT.BAM /dev/stdin

Alternatively, download BAM files (aligned using the above command) from the SRA (SRP128569) or Zenodo.

More details regarding each subcommand are given below, but commands for reproducing figures are as follows.

Figure 2C: python hist.py --ref $REF --bams p5.bam p6.bam p7.bam p10.bam p15.bam p20.bam

Figure 2-figure supplement 3: python hist.py --ref $REF --bams p10.bam p15.bam p20.bam -cn1 6 -cn2 15

Figure 3: python sb.py --ref $REF --bams p5.bam p6.bam p7.bam p10.bam p15.bam p20.bam

Figure 3-figure supplement 1: python sb.py --ref $REF --bams p10.bam p15.bam p20.bam -rand

Figure 3-figure supplement 2: python sb.py --ref $REF --bams p15.r7.bam p15.bam p15.MOI.0.1.bam

Figure 3-figure supplement 3: python simulate_sequencing_errors.py --ref $REF --bam p15.r7.bam --chemistry r7 (repeat for p15.bam and p15.MOI.0.1.bam, which were sequenced with R9 and R9.4 chemistries, respectively)

Figure 3-figure supplement 4: python array_combinations.py --ref $REF --bams p10.bam p15.bam p20.bam -cn 3

Figure 4A: python condensed.py --ref $REF --bam p5.bam (repeat for p10, p15, and p20 BAM files)

Figure 4B: python condensed.py --ref $REF --bam p15.bam -mixed_only

Figure 6: python sb.py --ref $REF --bams p15.MOI.1.bam p15.MOI.0.1.bam p15.MOI.0.01.bam p15.MOI.0.001.bam

Figure 6A: python hist.py --ref $REF --bams p10.bam p15.bam p15.bhk.bam

Figure 6C: python sb.py --ref $REF --bams p10.bam p15.bam p15.bhk.bam

`summary.py`

python summary.py --ref reference> \
                  --bam sorted and indexed BAM file

This will print a summary of K3L CNV and K3L^H47R allele frequencies in a given population. Allele frequencies are calculated from BAM pileups.

For example, output might look like this for alignments from P5...

locus	variant_site	ref	alt	af	coverage (# reads)	coverage (depth)	cn_distrib
K3L   30490   T    C    0.1366    1190  2663    1:787 | 2:206 | 3:122 | 4:45 | 5:17 | 6:8 | 7:3 | 8:1 | 9:1 | 10:0 | 11:0 | 12:0 | 13:0 | 14:0 | 15:0

`hist.py`

Produces a stacked bar histogram of copy number within one or more populations of vaccinia. As seen in Figure 2C and 6A.

python hist.py --ref      reference \
               --bams     sorted and indexed BAM file(s) \
               -cn1       lower bound on copy number to plot (default = 1) \
               -cn2       upper bound on copy number to plot (default = 5) \
               -png       output plot as PNG (by default, output EPS) \
               -o         name of output file (by default, "hist")

`condensed.py`

Produces "condensed" plots of K3L^His47Arg allele frequency in each K3L copy within a population of vaccinia. As seen in Figure 3.

python condensed.py --ref      reference \
                    --bams     sorted and indexed BAM file(s) \
                    -rand      plot a random distribution of H47R alleles in the population \
                    -cn        plot genomes with up to this copy number (default = 5) \
                    -png       output plot as PNG (by default, output EPS) \
                    -o         name of output file (by default, "condensed")

`sb.py`

Produces a stacked bar representation of homogenous and "mixed" WT/K3L^His47Arg arrays within one or more populations of vaccinia. As seen in Figure 3, 5, and 6C.

python sb.py --ref      reference \
             --bams     sorted and indexed BAM file(s) \
             -rand      plot a uniform distribution of H47R alleles in the population \
             -cn        plot genomes with up to this copy number (default = 5) \
             -png       output plot as PNG (by default, output EPS) \
             -o         name of output file (by default, "stackedbar")

`array_combinations.py`

Produces a plot that illustrates the abundance of K3L arrays with every possible combination of K3L^WT/K3L^His47Arg copies (for a given copy number). As seen in Figure 3-figure supplement 3.

python array_combinations.py --ref      reference \
                             --bams     sorted and indexed BAM file(s) \
                             -cn        plot genomes with this copy number (default = 3) \
                             -png       output plot as PNG (by default, output EPS) \
                             -o         name of output file (by default, "array-combinations")

`simulate_sequencing_errors.py`

Iterates over a population of K3L arrays and converts mixed arrays to homogenous arrays. Then, introduces "sequencing errors" by switching K3L^WT and K3L^His47Arg alleles.

python simulate_sequencing_errors.py --ref        reference \
                                     --bam        sorted and indexed BAM file(s) \
                                     --chemistry  the flowcell chemistry used to generate reads in BAM [r7, r9, r9.4] \
                                     -png         output plot as PNG (by default, output EPS) \
                                     -o           name of output file (by default, "sim-errors")

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
LICENSE		LICENSE
README.md		README.md
array_combinations.py		array_combinations.py
condensed.py		condensed.py
extract_arrays.py		extract_arrays.py
hist.py		hist.py
print_summary.py		print_summary.py
sb.py		sb.py
simulate_sequencing_errors.py		simulate_sequencing_errors.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Scripts for: Long read sequencing reveals poxvirus evolution through rapid homogenization of gene arrays

Dependencies

Installation

Usage

Reproducing figures

`summary.py`

`hist.py`

`condensed.py`

`sb.py`

`array_combinations.py`

`simulate_sequencing_errors.py`

About

Uh oh!

Releases 2

Packages

Languages

License

tomsasani/vacv-ont-manuscript

Folders and files

Latest commit

History

Repository files navigation

Scripts for: Long read sequencing reveals poxvirus evolution through rapid homogenization of gene arrays

Dependencies

Installation

Usage

Reproducing figures

summary.py

hist.py

condensed.py

sb.py

array_combinations.py

simulate_sequencing_errors.py

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 2

Packages 0

Languages

`summary.py`

`hist.py`

`condensed.py`

`sb.py`

`array_combinations.py`

`simulate_sequencing_errors.py`

Packages