Golden Gate DNA Assembly Size-specified Overhang Finding Tool
License
michaelting/GGSOFT
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
#=======================================================================# # GGSOFT/README v1.3 # # Copyright (c) 2013 Michael Ting # # Released under the BSD 2-clause license. See LICENSE file. # # https://github.com/michaelting # # # # GGSOFT: Golden Gate DNA Assembly Size-specified Overhang Finding Tool # #=======================================================================# #==============# # Introduction # #==============# GGSOFT is intended for use in finding DNA overhangs to be used in Golden Gate DNA Assembly methods, which utilize Type IIs restriction enzyme digestion. Type IIs restriction enzymes cut remotely from their recognition site without regards to the cutting sequence. The length of overhangs produced through RE digestion varies from enzyme to enzyme. GGSOFT explores overhang combinations for use in Golden Gate DNA Assembly that maximizes the distance between all overhang pairs in an attempt to prevent multiple overhangs from annealing in positions other than intended. For purposes such as a one-pot reaction, too similar overhangs can result in Golden Gate DNA fragments annealing together in the wrong positions relative to one another. #=================# # Version History # #=================# * v1.3 - Added functionality for overhang exclusion lists from results. Testing still needed for FASTA file input/output. * v1.2 - Removed dependency on Biopython for FASTA parsing and reverse complement in find_combos. Included reverse complement function rev_comp and modified function process. Testing for output yielded equality for P42212dna.fasta with bounds of 195 and 205 with overhang size 3. Additional testing needed for ValueErrors in FASTA file input/output. * v1.1.3 - More rigorous test cases for the overhang scoring function. * v1.1.2 - Added __init__.py file to specify GGSOFT package, updated documentation with more specific licensing information. * v1.1.1 - MemoryError handling instructions for users. Added documentation for MemoryErrors under Software Usage in README. Added more test output examples. * v1.1 - Added topxcombos.py, allowing for retrieval of the top x percent of scored overhang combinations. TopXCombos functionality enabled in GGSOFT command-line usage. ggsoft.process now uses Bio.SeqIO to parse FASTA files. Swapped optparse out for argparse in ggsoft.main. Overhang scoring includes an exponential ID extension penalty for overhang pairs. Overhang combinations containing palindromic overhangs (e.g. 'GATC') are excluded from final results in ggsoft.find_combos using Bio.Seq. Added option for more verbose information when running GGSOFT. * v1.0 - Initial version of GGSOFT. Future work includes reducing computation time of construction of the scoring table and implementation of a heuristic to reduce memory usage during computation of overhang combinations. #================# # Files Included # #================# * ggsoft.py - runs the GGSOFT program given an input sequence, output file name, overhang size, minimum fragment length, and maximum fragment length * topxcombos.py - retrieves the top X percent of scored overhang combinations specified by user input #=============# # Future Work # #=============# - More specific handling of palindromic overhangs (homodimerization). Current handling of palindromic overhangs results in combinations with those overhangs being excluded completely from final results. May also want to explore user specification of whether to include palindromic overhangs or not. - Speed up scoring table construction - this may be done by constructing smaller tables first and splitting up larger overhangs into smaller problems. Log speedup? - Optimize the scoring function. Current scoring values were chosen arbitrarily on an absolute scale, but attempt to represent the relative values of each type of feature in overhangs. - Change scoring function to account for number of overhangs, allowing for combination comparisons across different numbers of overhangs. - Primer generation functionality in addition to overhang finding? - Use heuristics to speed up finding overhang window region combinations. Could also implement parallelization using multiple threads for overhang scoring. - Use information from homology to optimize overhang locations across a family of related sequences #================# # Software Usage # #================# Because memory usage follows O(4^2n), where n is the length of the overhang, overhangs of size greater than 4 can result in MemoryErrors when creating the overhang scoring table. MemoryErrors can also occur when choosing too small of a fragment size or too wide a range of fragment length ranges due to the exponentially increasing number of overhang combinations to explore. Fortunately, Type IIS enzyme overhangs rarely exceed 4bp and scoring table construction is generally feasible. As for the choice of fragment size and ranges, if MemoryErrors are encountered, one should increase the fragment size and/or decrease the range between the minimum and maximum fragment sizes. For example, choosing a range of 180 to 220 may result in a MemoryError. To alleviate this, one could reduce the range to 190 to 210 to see a speedup in calculations. However, choosing too thin of a fragment size range may result in suboptimal overhang combinations. Results should be inspected to ensure that overhangs predicted are suitable to one's needs. Current scoring values are arbitrary but are designed to provide relative resolution between overhang combinations. Further biological tests are needed to evaluate the optimality of results produced by GGSOFT. #============================# # Required External Packages # #============================# argparse: Included in the Python standard library for Python >= 2.7 and >= 3.2 For earlier versions, argparse is provided as a separate package. More information can be found at https://pypi.python.org/pypi/argparse. nose: Python testing framework. Version 1.3.0. More information can be found at https://pypi.python.org/pypi/nose/1.3.0. #======================# # Removed Dependencies # #======================# Biopython: Removed in v1.2. Originally used for FASTA file input and reverse complement for DNA strings. ======================================================== ***To see more information about command-line arguments: ======================================================== $ python ggsoft.py -h $ python topxcombos.py -h ======================================= ***To run GGSOFT from the command line: ======================================= $ python ggsoft.py in out m n k in: (str) the input FASTA file of a single DNA sequence to be constructed out: (str) the name of the output file you want to write results to m: (int) the minimum fragment size in base pairs n: (int) the maximum fragment size in base pairs k: (int) the overhang size in base pairs Optional arguments: -e exfile, --exclude exfile: exfile: (str) the name of the file with overhangs to be excluded from scoring combinations -p percent topxfile, --percent percent topxfile: percent: (float) the percent of combos to retain from the output file from GGSOFT topxfile: (str) the name of the file retaining the top x percent of combos -v, --verbose Make the program more verbose =========================================== ***To run TopXCombos from the command line: =========================================== $ python topxcombos.py infile outfile percent infile: (str) the results file from a GGSOFT run outfile: (str) the name of the output file to contain the top X percent of overhang combinations from the infile percent: (float) the top X percent of combinations to retrieve, e.g. 23.2 #===========================# # Formatting of Input Files # #===========================# ===== Input Sequence ===== Input sequence files should contain a single DNA string in FASTA format. ===== Exclusion List File ===== Exclusion lists should be formatted with overhangs to be excluded on separate lines in the exclusion list file. For example, if we wanted to exclude 'AATG','CATT', and 'CGAT' from scored overhang combinations, we would format the exclusion list file as so: AATG CATT CGAT #============================# # Formatting of Output Files # #============================# ===== Output from GGSOFT ===== Output from ggsoft.py is written as such: ((202, 412, 617), (36, [TTCG, GGTA, CCAC])) ((202, 412, 622), (36, [TTCG, GGTA, CAAT])) ... ((204, 414, 622), (8, [CGGT, TAAC, CAAT])) The sequence in the left half of the pair, (202, 412, 617), indicates the indices where overhangs begin in the template sequence. ********************************************************************************** NOTE: In the output data, The template sequence is 0-INDEXED, while many other DNA software programs use 1-INDEXED notation. ********************************************************************************** The right half, (36, [TTCG, GGTA, CCAC]), indicates that the overall score of the overhang combination is 36, and the overhangs resulting from the listed indices are [TTCG, GGTA, CCAC], meaning we have TTCG starting at index 202, GGTA starting at index 412, and CCAC starting at index 617 in our template sequence. Using an exclusion list file will list the number of overhang combinations excluded from the total number of overhang combinations examined. ===== Output from Top X Combos ===== Output from topxcombos.py is written as such: Top 17 percent of combinations from ./tests/testggsoft_P42212.txt 261 of 1540 combinations selected ((202, 412, 617), (36, [TTCG, GGTA, CCAC])) ((202, 412, 622), (36, [TTCG, GGTA, CAAT])) ... ((194, 404, 610), (34, [TCAC, AAGA, TACT])) topxcombos.py extracts the top x percent of lines from a ggsoft output file. #============# # Algorithms # #============# Given a user input of overhang size, determine optimal DNA overhang sequences using a scoring function to maximize pairwise distances between all pairwise combinations of overhangs, 4**n, where n is the length of the overhang. Given an overhang size of 4, there are 4**4 = 256 possible unique overhangs. Given an overhang size of 5, there are 4**5 = 1024 possible unique overhangs. A scoring table is precalculated given the overhang length, which is used to accelerate the scoring calculation by storing pairwise overhand scores in memory. For an overhang of size 4, a size 256*256 table is calculated. For an overhang of size 5, a size 1024*1024 table is calculated. Note that for overhangs of length greater than 4bp, computation time of the scoring table becomes exponentially large. Computation time also increases with decreasing fragment size (more fragments, more overhang combinations). #==================# # Overhang Scoring # #==================# The base scoring value is computed linearly by pairwise comparisons of bases at the same indices in two different overhangs. Scores are added to the base score of 0. For identical repeats, an exponentially increasing score is subtracted from the overall score based on the length of the identical section. Overhang combinations with the highest score are considered the best candidates for usage in DNA assembly. Pairwise comparisons score bases on transversions, transitions, and identicals. Transversion: 8 Transition: 1 Identical: -10 Identical base penalty: 3^(L-1) L = length of identical base regions between overhangs The biological basis for this scoring scheme (relative to one another) is that we want to avoid identical bases at identical positions in overhangs. Having two separate overhangs with identical bases would make the overhangs inspecific, which can result in DNA fragments annealing to the wrong locations during assembly. Both transitions and transversions are scored higher than identicals because overhangs with differing bases at the same positions are less likely to anneal in the wrong location (which we are specifying). Transversions are score higher than transitions because similar bases can still potentially anneal even if there is a mismatch. Pyrimidines, C and T, and purings, A and G, have 3 and 2 hydrogen bonds, respectively, and are similar in chemical structure such that the overhangs AGTT and TTAA may still anneal even though the GT is not a Watson-Crick base pair. Take for example two overhangs ATGT and ACAG. The first pair, corresponding to the A in ATGT and the first A in ACAG, would be scored as an identical pair, producing a score of -10. The second pair, corresponding to the first T in ATGT and the C in ACAG, would be scored as a transition, adding 1 to the overall score of -10, producing a score of -9. The third pair, GA, would also be scored as a transition, adding 1 to the overall score of -9, producing a score of -8. The final pair, TG, would be scored as a transversion, adding 8 to the overall score of -8, producing a final score of 0.
About
Golden Gate DNA Assembly Size-specified Overhang Finding Tool
Resources
License
Stars
Watchers
Forks
Packages 0
No packages published