DNA as an information storage medium using either 4-base codon or binary representations of 256-ASCII
License
michaelting/ASCII_DNA_Translator
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
#=======================================================================# # ASCII Text to DNA Translator # # v1.1 # # Updated 30 August 2013 # # Copyright 2013 Michael Ting # # Released under the BSD 2-clause license. See LICENSE file. # # http://opensource.org/licenses/BSD-2-Clause # #=======================================================================# #================# # The Experiment # #================# The purpose of ASCIIcodons is to test the viability of DNA as an information storage medium. The experiment attempts to reconstruct information encoded in DNA, given the information- encoded DNA samples themselves (assuming the sequence information could be read) and a translation table between the DNA encoding scheme and 256-ASCII. An issue with encoding information in DNA is that DNA synthesis becomes increasingly difficult as the length of the DNA to be synthesized increases. To efficiently produce such encoding DNA, the DNA must be split into smaller oligos, which are much cheaper to produce and tend to be more accurate than longer sequences. However, producing much smaller oligos means we must also know how to place these oligos in the correct order to read out the correct information once assembled. To establish order, we need to rely on some secondary encoding scheme to indicate how the DNA pieces should be reconstructed. This was performed using DNA tags at the header of each oligo. Tags indicate global positioning of each oligo. Our dataset consisted of a set of "profile" information belonging to separate persons. Tags included a person ID number (one for each person) and an oligo ID number for positioning of information in each person's profile. Using these tags, a MapReduce-like algorithm is used to sort oligo sequencing reads and obtain a consensus for each of those reads (multiple reads of the same sequence can occur many times during sequencing). After obtaining a consensus sequence (which is asserted to have more accurate information than the individual oligos themselves), the oligo sequences are concatenated to produce a much longer DNA sequence, which when decoded, yields the original encoded information. #============# # Algorithms # #============# Data was encoded into DNA using 4-base codon translation, and subsequently chunked and ordered on a DNA CustomArray. The DNA from the CustomArray order was sequenced using Illumina MiSeq, with results found in two FASTQ read files. These reads were merged using SeqPrep to produce a cleaner merged sequencing FASTQ file. To reconstruct information from the sequencing reads, each DNA sequence contains a 28-base long header tag indicating person ID and oligo ID. Information is reconstructed in a manner similar to the MapReduce algorithm: 1. (SORT/MAP): Sequences from the FASTQ Illumina MiSeq sequencing file are read and sorted into buckets based on person ID and oligo ID in a dictionary data structure. Sequences with identical tags are placed in the same bucket, so we have unique buckets for each tag. Sequences are recorded by keeping track of the counts of base appearances at each index in the sequence through the use of a base counter. 2. (CONSENSUS/COMBINE): Each bucket containing a list of counters for each index is condensed into a consensus sequence by taking the maximum occurrence of a DNA base at each position in the sequence. Erroneous sequences are filtered out by using a count threshold (low counts imply erroneous sequences) 3. (CONDENSE/REDUCE): Oligos for each person are concatenated into a single condensed DNA file. The condensed file is translated from 4-base DNA codons to 256-character ASCII. Erroneous sequences are ignored when writing output files. #=======================================# # Future Improvements to the Experiment # #=======================================# - Parallelization of MapReduce-like algorithm would make the run much faster (as MapReduce is intended for parallelization). For the purposes of this experiment, the algorithm was intended to be run by a single user on a single machine. - The COMBINE step may benefit from more accurate alignment, such as with the use of an alignment method like Smith-Waterman, rather than relying on a regex pattern to fix the alignment. - Use DNA error-correction codes on the tags, messages, or both. This will help filter out erroneous reads due to synthesis or sequencing and eliminate junk data with incorrect tags. Synthesis and sequencing can introduce errors in sequence information in the form of base insertions/deletions or sequencing mis-reads due to homopolymers, chemical errors during sequencing, etc. - 4-Base Codon Encoding: A known issue is that standard English alphabet characters are encoded within the range of values 64-128, meaning using base 4 encoding creates a skewed distribution of DNA codons when sampling the alphabet. All alphabetical characters will begin with the same base in the initial base position, which can result in high GC content for long strings of English text, given that the first base is C or G. There is a quick, but temporary, fix for this - mapping T:0, A:1, G:2, C:3 - but these mappings are arbitrary and still skew the distribution towards AT. A better solution would be to randomize the mapping of codons to ASCII characters and/or reduce the number of bases used in the codons by eliminating unused ASCII characters. * Change the distribution of character mappings to be more uniform * Handle DNA strings of length indivisible by 4 * Allow reverse translation in 4 different reading frames #==================# # Encoding Methods # #==================# Two options are available: - 4-Base Codon Translation 256-character ASCII is used to translate to 4-base DNA codons, analogous to the quaternary numeral system where T:0, A:1, G:2, C:3. - Binary Intermediate Translation 256-character ASCII numeric encodings is converted to 8-bit wide binary, from which the binary code is converted one-to-one to DNA bases, where 0: (A or C), 1: (G or T). Bases are selected at random using probability distributions with equal weight for each base, and repetitive binary numerals result in forced alternating bases to reduce homopolymers and balance GC content. #=======# # Notes # #=======# DNA output from binaryDNA will be twice as long as the output from ASCIIcodons. ASCIIcodons encodes 256 characters using 4-base codons (4^4 distinct codons). Therefore, one character corresponds to 4 DNA bases. binaryDNA encodes one-to-one between binary and DNA bases, with 0 mapping to A or C and 1 mapping to G or T. Binary representations of ASCII codes are a byte (or 8 bits) long, so one character corresponds to 8 DNA bases. #====================# # Encoding Quick Use # #====================# Codons: Modify the "infile" variable in run_encode_codon.py to the name of the ASCII text file you want to convert into DNA. You may also choose to modify the outfile and check file names. Then run the command: $ python run_encode_codon.py Binary: Modify the "infile" variable in run_encode_binary.py to the name of the ASCII text file you want to convert into DNA. You may also choose to modify the outfile and check file names. Then run the command: $ python run_encode_binary.py Both programs will produce output files (outfile) and check files (check). #===================# # Using ASCIIcodons # #===================# Current object classes included: * TextToDNA * DNAToText To translate input files, start up the Python interpreter and instantiate one of the above object classes: >>> from ASCIIcodons import TextToDNA >>> obj = TextToDNA() To translate an ASCII text file to DNA, specify input and output files: >>> obj.translate("/path/to/ASCIItextfile","/path/to/outputfile") The output file will print to the screen, and can also be accessed from the specific path location. For reverse translation from DNA to ASCII text: >>> from ASCIIcodons import DNAToText >>> obj = DNAToText() >>> obj.translate("/path/to/dnatextfile","/path/to/outputfile") To use all object classes simultaneously: >>> from ASCIIcodons import * #=================# # Using binaryDNA # #=================# Current object classes include: * BinaryTextToDNA * DNAToBinaryText To translate input files, start up the Python interpreter and instantiate one of the above object classes: >>> from binaryDNA import BinaryTextToDNA >>> obj = BinaryTextToDNA() To Translate an ASCII text file to DNA through binary, specify input and output files: >>> obj.translate("/path/to/ASCIItextfile","/path/to/outputfile") The output file will print to the screen, and can also be accessed from the specific path location. For reverse translation from DNA to ASCII text through binary: >>> from binaryDNA import DNAToBinaryText >>> obj = DNAToBinaryText() >>> obj.translate("/path/to/dnatextfile","/path/to/outputfile") To use all object classes simultaneously: >>> from binaryDNA import * #=====================# # Formatting of files # #=====================# Input text files support the ASCII 256-character set. Input DNA files should be formatted as plain strings of upper-case DNA: ATGAGGATTTACGGGT CCAC ATCGAGACCCCA For 4-base codons, length of DNA strings should be a multiple of 4. Handling of DNA with lengths indivisible by 4 will be implemented in the future. For binary encoding, length of DNA strings should be a multiple of 8, as the binary enconding of DNA utilizes 8 bits for each character in 256-ASCII. #===================# # Chunker Quick Use # #===================# Codons: Modify the "infsta" variable in run_chunker_codon.py to the name of the FASTA file you want to convert into DNA. You may also choose to modify the cfname, mfname, and trname variables. Then run the command: $ python run_chunker_codon.py Binary: Modify the "infsta" variable in run_chunker_binary.py to the name of the FASTA file you want to convert into DNA. You may also choose to modify the cfname, mfname, and trname variables. Then run the command: $ python run_chunker_binary.py Both programs will produce a 150bp long chunk file (cfname), a chunk file with front adaptors removed (mfname), and a translation file as a check for correct encoding (trname). #=====================# # Codon Array Chunker # #=====================# - Run with contig and step size as 76 - TAG at head of dna is format $ _ _ # _ _ _ meaning 7 characters, so 28 DNA bases - To ensure mod 4, pad the ends of the DNA with 2 bases (106 --> 104, padded with AA at the end) - 104-28 = 76 for the chunk size - stuffer sequence is TGAC, corresponding to single quote ' Steps to replicate: $ python arraychunker.py infile.fasta outfile.txt 76 76 TGAC $ python >>> import os >>> infile = open("outfile.txt","r") >>> newfile = open("new.txt","w") >>> for line in infile: ... newfile.write("%s\n" % line[2:]) # to account for correct reading frame since adaptor is 22 bp ... newfile.flush() ... os.fsync(newfile.fileno()) >>> infile.close() >>> newfile.close() $ ls $ python >>> from ASCIIcodons import * >>> o = DNAToText() >>> o.translate("new.txt","trans.txt") #======================# # Binary Array Chunker # #======================# - Run with contig and step size as 48 - TAG at head of dna is format $ _ _ # _ _ _ meaning 7 characters, so 56 DNA bases - To ensure mod 8, pad the ends of the DNA with 2 bases (106 --> 104, padded with AA at the end) - 104-56 = 48 for the chunk size - stuffer sequence is TGAC, corresponding to single quote ' Steps to replicate: $ python binarraychunker.py infile.txt outfile.txt 48 48 ACGACTGT $ python >>> import os >>> infile = open("outfile.txt","r") >>> newfile = open("new.txt","w") >>> for line in infile: ... newfile.write("%s\n" % line[22:126]) # to account for correct reading frame since adaptor is 22 bp ... newfile.flush() # for huge files, may stop writing, so we need to flush I/O ... os.fsync(newfile.fileno()) >>> infile.close() >>> newfile.close() $ ls $ python >>> from binaryDNA import * >>> o = DNAToBinaryText() >>> o.translate("new.txt","trans.txt") #====================================# # Using MapReduce on Sequencing Data # #====================================# The algorithm can be run by executing get_unique_oligos.py. To modify the input sequencing file, modify the path of "parser" in main(). To modify the output location of the results, modify the path of "treepath" in main(). Results will be placed in a folder specified by "treepath".
About
DNA as an information storage medium using either 4-base codon or binary representations of 256-ASCII
Resources
License
Stars
Watchers
Forks
Packages 0
No packages published