Final Project Ideas

Here are some possible final projects organized by difficulty. Feel free to propose something else. You are encouraged to work in pairs.

Challenging
- Horizontal Gene Transfer
- Missing K-mers
- Prokaryotic Gene Finding
- IMEter
- Longest ORF
Difficult
- IID
- Overlaps
- PWM
- Smith-Waterman
Complex
- Genome JSON
- Plausible Elvish
- Viterbi

Horizontal Gene Transfer

One way to find genes that have been recently added to a genome is to look for outliers in codon usage. Using a GenBank file for a complete bacterial genome (e.g. E. coli), compute the average codon usage of a genome and then compare individual genes to the genome as a whole. Outliers in codon usage bias could be genes that used to reside in a different genome.

Use regex to parse the GenBank file
Store codon usage frequency as dictionaries
Compare individual frequencies to the whole genome frequency
Use K-L or Manhattan distance for comparisons

Genome JSON

Make a JSON file format that models the structure of genes in a genome. Chromosomes contain genes. Genes contain transcripts. Transcripts have introns, exons, UTRs, and coding sequences. It's all very complicated. Once you've created an elegant model, populate it with some real data using GFF files and then answer a few questions about the entire genome.

Design a JSON structure that encapsulates all of the features
Make a program that reads GFF and populates the structure
Navigate the structure to produce a histogram of exon lengths, for example

IID

Local alignment statistics are an important model in bioinformatics because they are used by BLAST. One of the assumptions is that the letters of the sequences being aligned are IID (independent and identically distributed). How valid is this assumption? How independent are the letters from each other? How different is a single protein from the average? How different are single proteins from each other?

Get a large protein dataset (ask instructor for help if needed)
Read protein sequences into tables of amino acid frequencies
Use Kullback-Leibler or Manhattan distance to compare
Create randomly generated proteins as a null model

IMEter

Intron-mediated enhancement (IME) is the general phenomenon that introns somehow provide a general boost to gene expression. In Rose et al. 2008, the paper describes a simple method (IMEter) to determine how powerful an intron is. Recreate the code from the description in the paper.

Read the paper
Get intron data from instructor (unless you want to try doing this)
Train the IMEter with kmer frequencies
Test the IMEter with various experimentally-validated introns

Longest ORF

The public sequence database contains millions of proteins. And yet nobody really sequences proteins. Instead, they sequence mRNAs and derive the protein as the longest ORF. Create a program that reads in DNA/RNA sequences and reports the longest ORF as a protein. The program should optionally report 3-frame or 6-frame translations.

Create a function that reports all ORFs in a sequence
Choose the longest ORF
Download an mRNA dataset (get help from instructor if needed)
Compare your translations to the official translations

Missing K-mers

Which K-mers are missing in a genome?

Overlaps

Genome data is often described as a feature, which is just a part of a chromosome with a specific location. An exon is a feature, as is a SNP, or a ChIP-seq peak. Sometimes people ask "which exons overlap known mutations?" Write a couple programs that make these comparisons and note the difference in resources (time, CPU) required.

Make a random feature generator
Implement some feature comparison functions
- Simple lists
- Dictionaries
- Sorted lists
Compare time, memory, and programming effort

Plausible Elvish

When writing fantasy novels, authors must somehow come up with new names that sound like they come from a different language. Make a language generator using Nth-order Markov models. For example, generate Evlish words from a mixture of Finnish and Welsh (or whatever you think Elvish is supposed to look like).

Read books (e.g. Gutenberg Project) into tables of letter frequencies
Generate new words from the tables of frequencies

Prokaryotic Gene Finding

A convenient way to find genes in prokaryotic genomes is to say that any ORF > 100 aa codes for a protein. But how accurate is this practice? Are there a lot of fake genes in the genome because of this? Using simulations, how many genes in a genome are probably not real?

Simulate a prokaryotic genome
Make a histogram of ORF sizes
Use various cutoffs and report the number of fake genes found
Estimate the number of fake genes in E. coli (for example)

PWM

One of the most common ways to represent sequence patterns is the position weight matrix (PWM). Make a program that reads a file of sequences and outputs a PWM as an SVG (scalable vector graphic).

Download real sequences or make them up yourself (ask for help if needed)
Make a PWM data structure (2D list or list of dict)
Create SVG (without importing an SVG module)

Smith-Waterman

The Smith-Waterman algorithm is one of the classic bioinformatics algorithms for finding the maximum scoring pair between two sequences. Make a program that inputs two sequences and outputs the best alignment. Alternatively, you might do the very similar Needleman-Wunsch algorithm.

Provide command line options for match, mismatch, and gap scores
Report the score of the maximum alignment
Report the aligned sequences

Viterbi

The Viterbi algorithm is used frequently in bioinformatics. For example, it is used to find protein domains within a protein or to find genes in a genome. Write a simple Viterbi decoder.

Choose a sequence feature to model (ask instructor for help)
Decode some real sequences

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

finalprojects.md

finalprojects.md

Final Project Ideas

Horizontal Gene Transfer

Genome JSON

IID

IMEter

Longest ORF

Missing K-mers

Overlaps

Plausible Elvish

Prokaryotic Gene Finding

PWM

Smith-Waterman

Viterbi

Files

finalprojects.md

Latest commit

History

finalprojects.md

File metadata and controls

Final Project Ideas

Horizontal Gene Transfer

Genome JSON

IID

IMEter

Longest ORF

Missing K-mers

Overlaps

Plausible Elvish

Prokaryotic Gene Finding

PWM

Smith-Waterman

Viterbi