Script(s)/bots that do something with the Human Genome Project's nucleotide sequences
I've downloaded Genome Reference Consortium Human Build 38 patch release 1 from the National Center for Biotechnology Information (thanks @vogon for pointing me to this!). I was using the one from Project Gutenberg, but that was only build 34. The new download is in the FASTA format.
I know very little about DNA and the Human Genome
Project, but since Project Gutenberg has nucleotide
sequences from the Genome Project, I thought I'd try to come up with
interesting ways to look at them.
As best I can tell, they're in the FASTA format. I've taken a file
(starting with Chromosome 1) and stripped out the Project Gutenberg text at the top
as well as the first identification line, so that I'm left with only the nucleotides. There are large
sections that contain only the letter N, which (according to the FASTA format) seems to stand for unknown nucleotides. The other
characters map to Adenine,
Cytosine, [Guanine](http://en.wikipedia.org/wiki/Guanine), and
[Thymine](http://en.wikipedia.org/wiki/Thymine).
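As a rough illustration of the stripping step described above, here is a minimal sketch that skips FASTA identification lines (the ones starting with `>`) and joins the rest into one string. The function name `read_fasta_sequence` and the sample record are made up for this example, and uppercasing the letters is my own assumption (some FASTA files use lowercase for masked regions); it is not necessarily how the files in this project were prepared.

```python
import io
from collections import Counter


def read_fasta_sequence(handle):
    """Return the sequence letters from a FASTA file as one string.

    Header lines (starting with '>') and blank lines are skipped.
    Letters are uppercased, which is an assumption of this sketch.
    """
    parts = []
    for line in handle:
        line = line.strip()
        if not line or line.startswith(">"):
            continue
        parts.append(line.upper())
    return "".join(parts)


# Hypothetical sample record standing in for a real chromosome file.
sample = io.StringIO(">chr1 hypothetical sample record\nNNNNACGT\nacgtNN\n")
sequence = read_fasta_sequence(sample)

# Count how often each letter appears, including the unknown 'N' runs.
counts = Counter(sequence)
print(sequence)
print(counts)
```

For a real file, passing `open("some_chromosome.fa")` (a placeholder name) instead of the `StringIO` object should work the same way, though a whole chromosome will use a few hundred megabytes of memory as a single string.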
All of the sequences can be downloaded from Project Gutenberg so
I'm excluding them from this repository since they're rather large.
The first thing I've tried is building a Twitter bot that tweets images of portions of the DNA sequence. It takes 28,419 nucleotides at a time and builds an 840x840 image, mapping each nucleotide it finds to a colored 5x5 square. The bot tweets one section every hour. At that rate it will take about a year to finish all 248,564,422 nucleotides (8,760 images).
This is the image.py script.
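To make the square-per-nucleotide idea concrete, here is a small standard-library sketch. It is not the contents of image.py: the palette, the names `render_chunk` and `write_ppm`, and the choice of the plain PPM image format are all assumptions made for illustration. Note that an 840x840 image holds 168 x 168 = 28,224 squares of 5x5 pixels, so this sketch ignores anything in a chunk beyond that.

```python
# Hypothetical palette; the colors the bot actually uses are not stated here.
PALETTE = {
    "A": (0, 160, 0),
    "C": (0, 0, 200),
    "G": (230, 180, 0),
    "T": (200, 0, 0),
}
UNKNOWN = (128, 128, 128)  # used for 'N' and any other letter
BACKGROUND = (0, 0, 0)


def render_chunk(chunk, cells_per_row=168, cell=5):
    """Map each letter to a cell x cell colored square, left to right, top to bottom.

    Returns (size, rows), where rows is a size x size grid of RGB tuples.
    Letters that do not fit in the grid are ignored.
    """
    size = cells_per_row * cell
    rows = [[BACKGROUND] * size for _ in range(size)]
    for index, base in enumerate(chunk[: cells_per_row * cells_per_row]):
        color = PALETTE.get(base.upper(), UNKNOWN)
        top = (index // cells_per_row) * cell
        left = (index % cells_per_row) * cell
        for y in range(top, top + cell):
            rows[y][left:left + cell] = [color] * cell
    return size, rows


def write_ppm(path, size, rows):
    """Write the pixel grid as a binary PPM (P6) file."""
    with open(path, "wb") as out:
        out.write(f"P6\n{size} {size}\n255\n".encode("ascii"))
        for row in rows:
            out.write(bytes(channel for pixel in row for channel in pixel))


# Render a short made-up chunk at the sizes mentioned above (840x840, 5x5 squares).
size, rows = render_chunk("ACGTN" * 10)
print(size, len(rows), rows[0][0])
```

Calling `write_ppm("chunk.ppm", size, rows)` (the file name is a placeholder) would save the result; PPM is used here only because it needs no third-party library, and a real bot would more likely produce PNG with an imaging library.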