GitHub - michaelting/ASCII_DNA_Translator: DNA as an information storage medium using either 4-base codon or binary representations of 256-ASCII

Branches Tags
Name		Name	Last commit message	Last commit date
Latest commit History 19 Commits
testfiles		testfiles
ASCIIcodons.py		ASCIIcodons.py
LICENSE		LICENSE
README		README
__init__.py		__init__.py
arraychunker.py		arraychunker.py
binarraychunker.py		binarraychunker.py
binaryDNA.py		binaryDNA.py
custarr_to_fastq.py		custarr_to_fastq.py
get_unique_oligos.py		get_unique_oligos.py
parse_fastq.py		parse_fastq.py
Repository files navigation

#=======================================================================#
# ASCII Text to DNA Translator						#
# v1.1									#
# Updated 30 August 2013						#
# Copyright 2013 Michael Ting						#
# Released under the BSD 2-clause license. See LICENSE file.		#
# http://opensource.org/licenses/BSD-2-Clause				#
#=======================================================================#

#================#
# The Experiment #
#================#

The purpose of ASCIIcodons is to test the viability of DNA
as an information storage medium. The experiment attempts to
reconstruct information encoded in DNA, given the information-
encoded DNA samples themselves (assuming the sequence information
could be read) and a translation table between the DNA encoding 
scheme and 256-ASCII.

An issue with encoding information in DNA is that DNA synthesis 
becomes increasingly difficult as the length of the DNA to be 
synthesized increases. To efficiently produce such encoding DNA,
the DNA must be split into smaller oligos, which are much cheaper
to produce and tend to be more accurate than longer sequences.
However, producing much smaller oligos means we must also know
how to place these oligos in the correct order to read out
the correct information once assembled.

To establish order, we need to rely on some secondary encoding 
scheme to indicate how the DNA pieces should be reconstructed. 
This was performed using DNA tags at the header of each oligo.
Tags indicate global positioning of each oligo.

Our dataset consisted of a set of "profile" information belonging
to separate persons. Tags included a person ID number (one for 
each person) and an oligo ID number for positioning of information
in each person's profile.

Using these tags, a MapReduce-like algorithm is used to sort
oligo sequencing reads and obtain a consensus for each of those 
reads (multiple reads of the same sequence can occur many 
times during sequencing). After obtaining a consensus sequence 
(which is asserted to have more accurate information than the 
individual oligos themselves), the oligo sequences are concatenated
to produce a much longer DNA sequence, which when decoded, yields 
the original encoded information.

#============#
# Algorithms #
#============#

Data was encoded into DNA using 4-base codon translation, and
subsequently chunked and ordered on a DNA CustomArray.

The DNA from the CustomArray order was sequenced using
Illumina MiSeq, with results found in two FASTQ read files. These
reads were merged using SeqPrep to produce a cleaner merged
sequencing FASTQ file.

To reconstruct information from the sequencing reads, each DNA
sequence contains a 28-base long header tag indicating person
ID and oligo ID. Information is reconstructed in a manner
similar to the MapReduce algorithm:

1. (SORT/MAP): 
   Sequences from the FASTQ Illumina MiSeq sequencing file are
   read and sorted into buckets based on person ID and oligo ID
   in a dictionary data structure. Sequences with identical tags 
   are placed in the same bucket, so we have unique buckets for
   each tag. Sequences are recorded by keeping track of the
   counts of base appearances at each index in the sequence
   through the use of a base counter.
2. (CONSENSUS/COMBINE):
   Each bucket containing a list of counters for each index is
   condensed into a consensus sequence by taking the maximum occurrence 
   of a DNA base at each position in the sequence. Erroneous sequences
   are filtered out by using a count threshold (low counts imply
   erroneous sequences)
3. (CONDENSE/REDUCE):
   Oligos for each person are concatenated into a single condensed
   DNA file. The condensed file is translated from 4-base DNA codons 
   to 256-character ASCII. Erroneous sequences are ignored when
   writing output files.

#=======================================#
# Future Improvements to the Experiment #
#=======================================#

- Parallelization of MapReduce-like algorithm would make the run
  much faster (as MapReduce is intended for parallelization). For
  the purposes of this experiment, the algorithm was intended to be
  run by a single user on a single machine.

- The COMBINE step may benefit from more accurate alignment,
  such as with the use of an alignment method like Smith-Waterman,
  rather than relying on a regex pattern to fix the alignment.
  
- Use DNA error-correction codes on the tags, messages, or both.
  This will help filter out erroneous reads due to synthesis or
  sequencing and eliminate junk data with incorrect tags.
  Synthesis and sequencing can introduce errors in sequence 
  information in the form of base insertions/deletions or 
  sequencing mis-reads due to homopolymers, chemical errors
  during sequencing, etc.

- 4-Base Codon Encoding:
  A known issue is that standard English alphabet characters are 
  encoded within the range of values 64-128, meaning using base 4
  encoding creates a skewed	distribution of DNA codons when sampling
  the alphabet. All alphabetical characters will begin with the same
  base in the initial base position, which can result in high GC 
  content for long strings of English text, given that the first base
  is C or G.

  There is a quick, but temporary, fix for this - mapping T:0, A:1, 
  G:2, C:3 - but these mappings are arbitrary and still skew the 
  distribution towards AT.

  A better solution would be to randomize the mapping of codons to
  ASCII characters and/or reduce the number of bases used in the 
  codons by eliminating unused ASCII characters.

  * Change the distribution of character mappings to be more uniform
  * Handle DNA strings of length indivisible by 4
  * Allow reverse translation in 4 different reading frames  
  
#==================#
# Encoding Methods #
#==================#

Two options are available:
 - 4-Base Codon Translation
	256-character ASCII is used to translate to 4-base DNA codons, 
	analogous to the quaternary numeral system where T:0, A:1, G:2, C:3.
 - Binary Intermediate Translation
	256-character ASCII numeric encodings is converted to 8-bit wide
	binary, from which the binary code is converted one-to-one to
	DNA bases, where 0: (A or C), 1: (G or T). Bases are selected
	at random using probability distributions with equal weight
	for each base, and repetitive binary numerals result in forced 
	alternating bases to reduce homopolymers and balance GC content.

#=======#
# Notes #
#=======#

DNA output from binaryDNA will be twice as long as the output from ASCIIcodons.

ASCIIcodons encodes 256 characters using 4-base codons (4^4 distinct codons).
Therefore, one character corresponds to 4 DNA bases.

binaryDNA encodes one-to-one between binary and DNA bases, with 0 mapping to
A or C and 1 mapping to G or T. Binary representations of ASCII codes are
a byte (or 8 bits) long, so one character corresponds to 8 DNA bases.

#====================#
# Encoding Quick Use #
#====================#

Codons:

Modify the "infile" variable in run_encode_codon.py to the name of the ASCII
text file you want to convert into DNA. You may also choose to modify the 
outfile and check file names. Then run the command:
	
    $ python run_encode_codon.py

Binary:

Modify the "infile" variable in run_encode_binary.py to the name of the ASCII
text file you want to convert into DNA. You may also choose to modify the
outfile and check file names. Then run the command:

    $ python run_encode_binary.py

Both programs will produce output files (outfile) and check files (check).

#===================#
# Using ASCIIcodons #
#===================#

Current object classes included:

* TextToDNA
* DNAToText

To translate input files, start up the Python interpreter and instantiate
one of the above object classes:

>>> from ASCIIcodons import TextToDNA
>>> obj = TextToDNA()

To translate an ASCII text file to DNA, specify input and output files:

>>> obj.translate("/path/to/ASCIItextfile","/path/to/outputfile")

The output file will print to the screen, and can also be accessed from the
specific path location.

For reverse translation from DNA to ASCII text:

>>> from ASCIIcodons import DNAToText
>>> obj = DNAToText()
>>> obj.translate("/path/to/dnatextfile","/path/to/outputfile")

To use all object classes simultaneously:

>>> from ASCIIcodons import *

#=================#
# Using binaryDNA #
#=================#

Current object classes include:

* BinaryTextToDNA
* DNAToBinaryText

To translate input files, start up the Python interpreter and instantiate
one of the above object classes:

>>> from binaryDNA import BinaryTextToDNA
>>> obj = BinaryTextToDNA()

To Translate an ASCII text file to DNA through binary, specify input and output files:

>>> obj.translate("/path/to/ASCIItextfile","/path/to/outputfile")

The output file will print to the screen, and can also be accessed from the
specific path location.

For reverse translation from DNA to ASCII text through binary:

>>> from binaryDNA import DNAToBinaryText
>>> obj = DNAToBinaryText()
>>> obj.translate("/path/to/dnatextfile","/path/to/outputfile")

To use all object classes simultaneously:

>>> from binaryDNA import *

#=====================#
# Formatting of files #
#=====================#

Input text files support the ASCII 256-character set.

Input DNA files should be formatted as plain strings of upper-case DNA:

	ATGAGGATTTACGGGT
	CCAC
	ATCGAGACCCCA

For 4-base codons, length of DNA strings should be a multiple of 4. Handling of DNA 
with lengths indivisible by 4 will be implemented in the future.

For binary encoding, length of DNA strings should be a multiple of 8, as the binary
enconding of DNA utilizes 8 bits for each character in 256-ASCII.

#===================#
# Chunker Quick Use #
#===================#

Codons:

Modify the "infsta" variable in run_chunker_codon.py to the name of the FASTA
file you want to convert into DNA. You may also choose to modify the 
cfname, mfname, and trname variables. Then run the command:
	
    $ python run_chunker_codon.py

Binary:

Modify the "infsta" variable in run_chunker_binary.py to the name of the FASTA
file you want to convert into DNA. You may also choose to modify the
cfname, mfname, and trname variables. Then run the command:

    $ python run_chunker_binary.py

Both programs will produce a 150bp long chunk file (cfname), a chunk file with
front adaptors removed (mfname), and a translation file as a check for correct
encoding (trname).

#=====================#
# Codon Array Chunker #
#=====================#

- Run with contig and step size as 76
- TAG at head of dna is format $ _ _ # _ _ _
  meaning 7 characters, so 28 DNA bases
- To ensure mod 4, pad the ends of the DNA with 2 bases (106 --> 104, padded with AA at the end)
- 104-28 = 76 for the chunk size
- stuffer sequence is TGAC, corresponding to single quote '

Steps to replicate:
$ python arraychunker.py infile.fasta outfile.txt 76 76 TGAC
$ python
>>> import os
>>> infile = open("outfile.txt","r")
>>> newfile = open("new.txt","w")
>>> for line in infile:
...     newfile.write("%s\n" % line[2:])    # to account for correct reading frame since adaptor is 22 bp
...	newfile.flush()
...	os.fsync(newfile.fileno())
>>>	infile.close()
>>>	newfile.close()
$ ls
$ python
>>> from ASCIIcodons import *
>>> o = DNAToText()
>>> o.translate("new.txt","trans.txt")

#======================#
# Binary Array Chunker #
#======================#

- Run with contig and step size as 48
- TAG at head of dna is format $ _ _ # _ _ _
  meaning 7 characters, so 56 DNA bases
- To ensure mod 8, pad the ends of the DNA with 2 bases (106 --> 104, padded with AA at the end)
- 104-56 = 48 for the chunk size
- stuffer sequence is TGAC, corresponding to single quote '

Steps to replicate:
$ python binarraychunker.py infile.txt outfile.txt 48 48 ACGACTGT
$ python
>>> import os
>>> infile = open("outfile.txt","r")
>>> newfile = open("new.txt","w")
>>> for line in infile:
...     newfile.write("%s\n" % line[22:126])    # to account for correct reading frame since adaptor is 22 bp
...		newfile.flush() 						# for huge files, may stop writing, so we need to flush I/O
...		os.fsync(newfile.fileno())
>>> infile.close()
>>> newfile.close()
$ ls
$ python
>>> from binaryDNA import *
>>> o = DNAToBinaryText()
>>> o.translate("new.txt","trans.txt")

#====================================#
# Using MapReduce on Sequencing Data #
#====================================#

The algorithm can be run by executing get_unique_oligos.py.

To modify the input sequencing file, modify the path of 
"parser" in main().

To modify the output location of the results, modify the
path of "treepath" in main().

Results will be placed in a folder specified by "treepath".