GitHub - michaelting/GGSOFT: Golden Gate DNA Assembly Size-specified Overhang Finding Tool

Branches Tags
Name		Name	Last commit message	Last commit date
Latest commit History 37 Commits
tests		tests
LICENSE		LICENSE
README		README
__init__.py		__init__.py
ggsoft.py		ggsoft.py
ggsoft.pyc		ggsoft.pyc
topxcombos.py		topxcombos.py
Repository files navigation

#=======================================================================#
# GGSOFT/README	v1.3							#
# Copyright (c) 2013 Michael Ting					#
# Released under the BSD 2-clause license. See LICENSE file.		#
# https://github.com/michaelting					#
#									#
# GGSOFT: Golden Gate DNA Assembly Size-specified Overhang Finding Tool #
#=======================================================================#

#==============#
# Introduction #
#==============#

GGSOFT is intended for use in finding DNA overhangs to be used in 
Golden Gate DNA Assembly methods, which utilize Type IIs restriction
enzyme digestion. Type IIs restriction enzymes cut remotely from their
recognition site without regards to the cutting sequence. The length of
overhangs produced through RE digestion varies from enzyme to enzyme.

GGSOFT explores overhang combinations for use in Golden Gate DNA
Assembly that maximizes the distance between all overhang pairs in an
attempt to prevent multiple overhangs from annealing in positions other
than intended. For purposes such as a one-pot reaction, too similar
overhangs can result in Golden Gate DNA fragments annealing together
in the wrong positions relative to one another.

#=================#
# Version History #
#=================#

* v1.3		- Added functionality for overhang exclusion lists from results.
		  Testing still needed for FASTA file input/output.

* v1.2		- Removed dependency on Biopython for FASTA parsing and reverse
		  complement in find_combos. Included reverse complement function
		  rev_comp and modified function process. Testing for output yielded
		  equality for P42212dna.fasta with bounds of 195 and 205 with
		  overhang size 3. Additional testing needed for ValueErrors in
		  FASTA file input/output.

* v1.1.3	- More rigorous test cases for the overhang scoring function.

* v1.1.2	- Added __init__.py file to specify GGSOFT package, updated documentation
		  with more specific licensing information.

* v1.1.1 	- MemoryError handling instructions for users. Added documentation
		  for MemoryErrors under Software Usage in README. Added more test
		  output examples.

* v1.1 		- Added topxcombos.py, allowing for retrieval of the top x percent
	  	  of scored overhang combinations. TopXCombos functionality enabled
	 	  in GGSOFT command-line usage. ggsoft.process now uses Bio.SeqIO to
	 	  parse FASTA files. Swapped optparse out for argparse in ggsoft.main.
	 	  Overhang scoring includes an exponential ID extension penalty for
	 	  overhang pairs. Overhang combinations containing palindromic
	 	  overhangs (e.g. 'GATC') are excluded from final results in
	 	  ggsoft.find_combos using Bio.Seq. Added option for more verbose
	 	  information when running GGSOFT.

* v1.0 		- Initial version of GGSOFT. Future work includes reducing computation 
	 	  time of construction of the scoring table and implementation of 
	 	  a heuristic to reduce memory usage during computation of overhang 
	 	  combinations.

#================#
# Files Included #
#================#

* ggsoft.py 	- runs the GGSOFT program given an input sequence, output file name,
	      	  overhang size, minimum fragment length, and maximum fragment length

* topxcombos.py	- retrieves the top X percent of scored overhang combinations
		  specified by user input	 
	 
#=============#
# Future Work #
#=============#

 - More specific handling of palindromic overhangs (homodimerization). Current
   handling of palindromic overhangs results in combinations with those
   overhangs being excluded completely from final results. May also want to
   explore user specification of whether to include palindromic overhangs or not.
 - Speed up scoring table construction - this may be done by constructing smaller tables
   first and splitting up larger overhangs into smaller problems. Log speedup?
 - Optimize the scoring function. Current scoring values were chosen arbitrarily on
   an absolute scale, but attempt to represent the relative values of each type of
   feature in overhangs.
 - Change scoring function to account for number of overhangs, allowing for
   combination comparisons across different numbers of overhangs.
 - Primer generation functionality in addition to overhang finding? 
 - Use heuristics to speed up finding overhang window region combinations.
   Could also implement parallelization using multiple threads for overhang scoring.
 - Use information from homology to optimize overhang locations across
   a family of related sequences
	 
#================#
# Software Usage #
#================#

Because memory usage follows O(4^2n), where n is the length of the overhang, overhangs of size 
greater than 4 can result in MemoryErrors when creating the overhang scoring table. MemoryErrors 
can also occur when choosing too small of a fragment size or too wide a range of fragment length
ranges due to the exponentially increasing number of overhang combinations to explore.

Fortunately, Type IIS enzyme overhangs rarely exceed 4bp and scoring table construction is
generally feasible. As for the choice of fragment size and ranges, if MemoryErrors are encountered,
one should increase the fragment size and/or decrease the range between the minimum and maximum
fragment sizes. For example, choosing a range of 180 to 220 may result in a MemoryError. To alleviate
this, one could reduce the range to 190 to 210 to see a speedup in calculations. However, choosing
too thin of a fragment size range may result in suboptimal overhang combinations. Results should
be inspected to ensure that overhangs predicted are suitable to one's needs.

Current scoring values are arbitrary but are designed to provide relative resolution between
overhang combinations. Further biological tests are needed to evaluate the optimality of
results produced by GGSOFT.

#============================#
# Required External Packages #
#============================#

argparse:	Included in the Python standard library for Python >= 2.7 and >= 3.2
		For earlier versions, argparse is provided as a separate package.
		More information can be found at https://pypi.python.org/pypi/argparse.

nose:		Python testing framework. Version 1.3.0. More information can be found at
		https://pypi.python.org/pypi/nose/1.3.0.

#======================#
# Removed Dependencies #
#======================#

Biopython:	Removed in v1.2. Originally used for FASTA file input and reverse 
		complement for DNA strings.

========================================================
***To see more information about command-line arguments:
========================================================

$ python ggsoft.py -h
$ python topxcombos.py -h

=======================================
***To run GGSOFT from the command line:
=======================================

$ python ggsoft.py in out m n k

in:	(str) the input FASTA file of a single DNA sequence to be constructed
out:	(str) the name of the output file you want to write results to
m:	(int) the minimum fragment size in base pairs
n:	(int) the maximum fragment size in base pairs
k:	(int) the overhang size in base pairs

Optional arguments:
-e exfile, --exclude exfile:
	exfile:		(str)	the name of the file with overhangs to be excluded from scoring
				combinations
-p percent topxfile, --percent percent topxfile:
	percent: 	(float) the percent of combos to retain from the output file from GGSOFT
	topxfile: 	(str)	the name of the file retaining the top x percent of combos
-v, --verbose
	Make the program more verbose

===========================================
***To run TopXCombos from the command line:
===========================================

$ python topxcombos.py infile outfile percent

infile:		(str)	the results file from a GGSOFT run
outfile:	(str) 	the name of the output file to contain the top X percent of
			overhang combinations from the infile
percent:	(float) the top X percent of combinations to retrieve, e.g. 23.2

#===========================#
# Formatting of Input Files #
#===========================#

===== Input Sequence =====
Input sequence files should contain a single DNA string in FASTA format.

===== Exclusion List File =====
Exclusion lists should be formatted with overhangs to be excluded on separate lines in
the exclusion list file. For example, if we wanted to exclude 'AATG','CATT', and 'CGAT'
from scored overhang combinations, we would format the exclusion list file as so:

AATG
CATT
CGAT

#============================#
# Formatting of Output Files #
#============================#

===== Output from GGSOFT =====
Output from ggsoft.py is written as such:

((202, 412, 617), (36, [TTCG, GGTA, CCAC]))
((202, 412, 622), (36, [TTCG, GGTA, CAAT]))
...
((204, 414, 622), (8, [CGGT, TAAC, CAAT]))

The sequence in the left half of the pair, (202, 412, 617), indicates the indices where
overhangs begin in the template sequence. 

**********************************************************************************
NOTE: In the output data, The template sequence is 0-INDEXED, while many other DNA
      software programs use 1-INDEXED notation.
**********************************************************************************

The right half, (36, [TTCG, GGTA, CCAC]), indicates that the overall score of the 
overhang combination is 36, and the overhangs resulting from the listed indices 
are [TTCG, GGTA, CCAC], meaning we have TTCG starting at index 202, GGTA starting 
at index 412, and CCAC starting at index 617 in our template sequence.

Using an exclusion list file will list the number of overhang combinations excluded
from the total number of overhang combinations examined.

===== Output from Top X Combos =====
Output from topxcombos.py is written as such:

Top 17 percent of combinations from ./tests/testggsoft_P42212.txt
261 of 1540 combinations selected
((202, 412, 617), (36, [TTCG, GGTA, CCAC]))
((202, 412, 622), (36, [TTCG, GGTA, CAAT]))
...
((194, 404, 610), (34, [TCAC, AAGA, TACT]))

topxcombos.py extracts the top x percent of lines from a ggsoft output file.

#============#
# Algorithms #
#============#

Given a user input of overhang size, determine optimal DNA overhang sequences using
a scoring function to maximize pairwise distances between all pairwise combinations
of overhangs, 4**n, where n is the length of the overhang.

Given an overhang size of 4, there are 4**4 = 256 possible unique overhangs.
Given an overhang size of 5, there are 4**5 = 1024 possible unique overhangs.

A scoring table is precalculated given the overhang length, which is used to 
accelerate the scoring calculation by storing pairwise overhand scores in memory.

For an overhang of size 4, a size 256*256 table is calculated.
For an overhang of size 5, a size 1024*1024 table is calculated.

Note that for overhangs of length greater than 4bp, computation time of the scoring
table becomes exponentially large. Computation time also increases with decreasing
fragment size (more fragments, more overhang combinations).

#==================#
# Overhang Scoring #
#==================#

The base scoring value is computed linearly by pairwise comparisons of bases at the
same indices in two different overhangs. Scores are added to the base score of 0.
For identical repeats, an exponentially increasing score is subtracted from the 
overall score based on the length of the identical section. Overhang combinations
with the highest score are considered the best candidates for usage in DNA assembly.

Pairwise comparisons score bases on transversions, transitions, and identicals.

Transversion: 	8
Transition:	1
Identical:	-10

Identical base penalty: 3^(L-1)
L = length of identical base regions between overhangs

The biological basis for this scoring scheme (relative to one another) is that we
want to avoid identical bases at identical positions in overhangs. Having two separate
overhangs with identical bases would make the overhangs inspecific, which can result
in DNA fragments annealing to the wrong locations during assembly. Both transitions
and transversions are scored higher than identicals because overhangs with differing
bases at the same positions are less likely to anneal in the wrong location (which
we are specifying). Transversions are score higher than transitions because similar
bases can still potentially anneal even if there is a mismatch. Pyrimidines, C and T,
and purings, A and G, have 3 and 2 hydrogen bonds, respectively, and are similar in 
chemical structure such that the overhangs AGTT and TTAA may still anneal even though
the GT is not a Watson-Crick base pair.

Take for example two overhangs ATGT and ACAG. The first pair, corresponding to the 
A in ATGT and the first A in ACAG, would be scored as an identical pair, producing
a score of -10. The second pair, corresponding to the first T in ATGT and the C in
ACAG, would be scored as a transition, adding 1 to the overall score of -10, producing
a score of -9. The third pair, GA, would also be scored as a transition, adding 1 to
the overall score of -9, producing a score of -8. The final pair, TG, would be scored
as a transversion, adding 8 to the overall score of -8, producing a final score of 0.